Together AI ships MiniMax M3 inference with sparse-attention speedups for 1M-token context
On June 2, 2026, Together AI published an engineering deep-dive on serving MiniMax's forthcoming M3 model on its inference cloud, claiming roughly order-of-magnitude speedups on the model's signature sparse-attention path. The post, authored by Yubo Wang, Michael Granado, Connor Li, Jue Wang, Brian Mak, Wei Gong, Hiral Jasani, Yineng Zhang, and Dan Fu, positions Together AI as the launch cloud partner for the open-weights release.
What's new
- Together AI describes MiniMax M3 as "an all-in-one model that brings together state-of-the-art coding performance, agentic workflow support, and native multimodal reasoning."
- The company says it is "the preferred cloud partner for MiniMax M3" and will host the open-weights model as a developer endpoint upon public release.
- Engineering claims: "MiniMax Sparse Attention (MSA)... brings a speed up of more than 9x in the prefilling stage and more than 15x in the decoding stage," achieved with KV-Block-Major sparse attention and paged-attention integration.
- Together reports "81–125% throughput improvements across different concurrency levels" once MSA is paired with its serving stack.
- The model is positioned to support "1M context while being highly economically friendly to serve," and Together is opening a waitlist for endpoint access.
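A percentage "improvement" is easy to misread as a multiplier. The short sketch below converts the quoted 81–125% range into throughput multipliers, assuming the figures mean relative gains over a baseline (the usual reading, though the quote itself does not spell that out). The function name is illustrative only.

```python
def improvement_to_multiplier(percent_improvement: float) -> float:
    """Convert a relative improvement in percent to a throughput multiplier.

    Assumes "X% improvement" means new = baseline * (1 + X / 100).
    """
    return 1.0 + percent_improvement / 100.0


# The range quoted in the post, read as a gain over baseline.
low = improvement_to_multiplier(81)
high = improvement_to_multiplier(125)
print(f"{low:.2f}x to {high:.2f}x baseline throughput")
```

Read this way, the serving-stack gain is roughly a doubling of throughput. It is a separate and much smaller figure than the 9x and 15x kernel-level speedups quoted for prefill and decode.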
Context
MiniMax has been one of the more aggressive Chinese frontier labs on long-context architecture, and M3 is its first all-in-one release that bundles coding, agentic, and multimodal reasoning into a single open-weights checkpoint. MSA, the sparse-attention scheme Together is benchmarking, is what makes 1M-token context economically defensible at serving time: the cost of full attention grows quadratically with sequence length, which is why most labs cap usable context far below the marketing numbers.
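To give the quadratic-cost point some scale, here is a back-of-the-envelope sketch that counts query-key score computations for causal dense attention versus a generic block-sparse scheme in which each query attends to a fixed budget of KV blocks. The block size and per-query block budget are hypothetical, chosen only for illustration. They are not MSA's actual configuration, which the post excerpt does not specify, and the count ignores the cost of selecting blocks.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Causal dense attention: token i scores against itself and every earlier token."""
    return seq_len * (seq_len + 1) // 2


def block_sparse_attention_pairs(seq_len: int, block_size: int, blocks_per_query: int) -> int:
    """Generic block-sparse attention: each query scores against at most
    block_size * blocks_per_query keys (a hypothetical fixed budget)."""
    budget = block_size * blocks_per_query
    # Early tokens have fewer than `budget` keys available, so they stay dense.
    short = min(seq_len, budget)
    early = short * (short + 1) // 2
    # Every remaining token is capped at the budget.
    late = max(0, seq_len - budget) * budget
    return early + late


if __name__ == "__main__":
    n = 1_000_000  # 1M-token context
    dense = dense_attention_pairs(n)
    # Hypothetical settings, NOT taken from MSA: 64-token blocks, 32 blocks per query.
    sparse = block_sparse_attention_pairs(n, block_size=64, blocks_per_query=32)
    print(f"dense score computations:  {dense:,}")
    print(f"sparse score computations: {sparse:,}")
    print(f"reduction factor:          {dense / sparse:.0f}x")
```

The reduction in score computations is a theoretical ceiling, not a prediction of wall-clock speedup. Memory traffic, block selection, and kernel efficiency all affect real performance, which is why measured figures such as the 9x and 15x in the post are the numbers that matter.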
Together AI's role here is the same one it has played with prior open-weights releases from Mistral, DeepSeek, and Meta: be the first cloud with a tuned kernel path so developers can actually run the model the day weights drop. The 9x prefill and 15x decode numbers, if they hold up under independent reproduction, are the kind of step-change that determines whether 1M-token apps become routine or stay novelty demos.
Why it matters
The inference-side announcement is more interesting than the model announcement itself. Frontier labs have spent two years racing to claim ever-longer context windows; the practical bottleneck has quietly become serving cost, not model capability. Together's claim that MSA plus its kernel stack delivers speedups of more than 9x in prefill and more than 15x in decode, the two dominant inference phases, is a direct attempt to make 1M context an everyday default rather than a premium tier. If the throughput gains translate into per-token pricing that undercuts the Anthropic and OpenAI 1M-context offerings, the open-weights track gets a credible long-context option for the first time.
For developers betting on retrieval-augmented or agent-loop workloads, having 1M rather than 200K tokens of usable context at sane cost is the difference between aggressive chunking and just handing the model the whole repository, document set, or session history. M3 alone won't decide that fight, but a tuned serving path for it is the kind of plumbing the open ecosystem has historically lacked.
Corroborating sources
- Together
https://www.together.ai/blog/serving-minimax-m3-for-efficient-inference-unlocking-1m-token-context-and-multimodality-without-regrets
“MiniMax M3 is an all-in-one model that brings together state-of-the-art coding performance, agentic workflow support, and native multimodal reasoning.”