H-company releases Holo3.1, a local-first computer-use agent family with up to 35B parameters
H-company released Holo3.1 on June 2, 2026, a family of computer-use agents in four sizes — 0.8B, 4B, 9B, and 35B-A3B (a 35-billion-parameter Mixture-of-Experts with roughly 3 billion active per token) — with the explicit goal of running locally on hardware as small as Apple Silicon or as performance-tuned as an NVIDIA DGX Spark. The release is the first in the Holo line to ship quantized checkpoints — FP8, Q4 GGUF, and NVFP4 — optimized for local inference, and the company reports a 25%+ improvement over the prior generation in its own product harness.
What's new
H-company frames the release on the Hugging Face blog as "Holo3.1: Fast & Local Computer Use Agents." The post argues the upgrade focuses on production-grade robustness: "Holo3.1 improves robustness across the three dimensions that matter most in production: environments (web, desktop, mobile), agent frameworks, and deployment targets."
The headline numbers:
- Model sizes. "The Holo3.1 family is available in four sizes: Holo3.1-0.8B, Holo3.1-4B, Holo3.1-9B, Holo3.1-35B-A3B."
- AndroidWorld benchmark. "On AndroidWorld, our 35B-A3B model improves from 67% to 79.3%, while the smaller 4B and 9B variants improve from 58% to 72%."
- In-house harness. "Holo3.1 also delivers more than a 25% improvement over Holo3 when evaluated inside our Holotab product harness."
- Local-first quantization. "For the first time, we release quantized checkpoints optimized for local inference, including FP8, Q4 GGUF, and NVFP4."
- Throughput on DGX Spark. "on DGX Spark, NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16."
- Step latency. "cutting average step time from 6.8s to 3.3s."
On privacy: "execution stays fully private and local, with nothing leaving the user's network."
Context
Computer-use agents — models that take screenshots, plan actions, and emit clicks or keystrokes against a real OS — have been a pulled-forward frontier for the major labs through 2025 and 2026. Anthropic shipped computer use in the Claude tool family; OpenAI surfaced operator; both have pushed the bar on cloud-side benchmark scores. The friction has consistently been the same shape: latency to a hosted model, recurring per-action token cost, and the privacy ceiling that comes with sending screen content to a third-party API.
H-company's pitch with Holo3.1 attacks that friction directly. The 0.8B and 4B sizes can run on consumer-grade hardware, the 35B-A3B sparse design fits a single workstation-class accelerator, and the quantization recipes are tuned for local throughput rather than cloud cost.
Why it matters
The benchmark jump — AndroidWorld 67% → 79.3% on the flagship size — narrows the gap that local computer-use agents have carried versus hosted frontier offerings. The 6.8s → 3.3s step-time cut is the more practically meaningful number: latency is what makes computer-use agents unusable at desktop pace, and roughly halving it changes the texture of how an agent feels to drive.
Holo3.1 is a niche release in the sense that computer-use is still an emerging surface, but it is a real one. The fact that H-company is willing to publish concrete benchmark numbers, quantized checkpoints, and explicit local-deployment recipes is the substance worth tracking — the open-source half of the computer-use space has been thin on disciplined, comparable benchmarks, and this release adds one.
Corroborating sources
- Huggingface.co
https://huggingface.co/blog/Hcompany/holo31
“On AndroidWorld, our 35B-A3B model improves from 67% to 79.3%, while the smaller 4B and 9B variants improve from 58% to 72%.”