Xiaomi's MiMo-V2.5-Pro-UltraSpeed Hits 1,000 Tokens Per Second on General-Purpose GPUs

Martin Holloway·Published 2 months ago·5 min read·Based on 1 source

Key Takeaways

On 8 June 2026, Xiaomi and TileRT released MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter model delivering 1,000 tokens per second on general-purpose GPUs.
The throughput is attributed to extreme model-system codesign — architecture and runtime decisions made in tandem rather than sequentially.
At 1,000 t/s, a 2,000-token response completes in approximately two seconds, making trillion-parameter models viable for interactive applications.
Full benchmarking details — GPU SKUs, batch sizes, concurrency curves — have not yet been publicly disclosed, limiting independent validation.
The codesign approach mirrors a broader structural shift in the inference stack, where software-layer optimisation on commodity hardware is closing the gap previously addressed only by custom silicon.

Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter language model capable of sustained generation at 1,000 tokens per second, developed in collaboration with inference runtime specialist TileRT. The announcement, published on 8 June 2026, marks a notable step in the ongoing effort to close the gap between model scale and deployment practicality.

What Was Built and How

At a trillion parameters, MiMo-V2.5-Pro-UltraSpeed sits firmly in the class of models that, until recently, were largely confined to research clusters or specialised accelerator hardware. The headline figure — 1,000 tokens per second — is an inference throughput number that matters precisely because it is achieved on general-purpose GPUs rather than purpose-built silicon such as Groq's LPUs or dedicated NPU arrays.

The performance is attributed to what Xiaomi and TileRT describe as extreme model-system codesign optimisation. That phrase covers a spectrum of techniques: kernel-level fusion, attention mechanism restructuring, KV-cache management, quantisation strategies tuned to the specific weight distribution of this model, and scheduler-level batching decisions made in concert with model architecture choices rather than after the fact. The codesign framing is deliberate — it signals that the throughput gains are not the result of a generic inference stack applied to a standard checkpoint, but of architecture and runtime decisions made in tandem from the outset.

TileRT's involvement is central to that story. TileRT is a runtime and compilation layer designed to extract memory-bandwidth efficiency and arithmetic throughput from GPU hardware that was not purpose-built for transformer inference. The collaboration with Xiaomi's MiMo team suggests that the model's architecture was shaped, at least in part, by what TileRT could exploit at the kernel level — a tighter feedback loop than the more typical pattern of training a model and then optimising its serving separately.
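
A back-of-envelope bound helps show why memory-bandwidth efficiency matters so much at this scale. In single-stream decoding, each generated token requires reading the active weights from GPU memory roughly once, so aggregate bandwidth divided by bytes read per token caps the token rate. The sketch below is illustrative only: every number in it (parameter counts, 8-bit weights, per-GPU bandwidth, GPU count) is a hypothetical placeholder, not a figure disclosed by Xiaomi or TileRT, and the announcement as reported here does not say whether the model is dense or sparsely activated. The estimate also ignores KV-cache reads, interconnect traffic, and compute limits.

```python
def decode_tokens_per_second(active_params, bytes_per_param,
                             bandwidth_bytes_per_s, num_gpus=1,
                             efficiency=1.0):
    """Upper bound on single-stream decode rate when weight reads dominate.

    active_params: parameters actually read per generated token
    bytes_per_param: storage size of one parameter (1.0 for 8-bit weights)
    bandwidth_bytes_per_s: memory bandwidth of one GPU
    efficiency: fraction of peak bandwidth the runtime actually achieves
    """
    if active_params <= 0 or bytes_per_param <= 0:
        raise ValueError("active_params and bytes_per_param must be positive")
    bytes_per_token = active_params * bytes_per_param
    aggregate_bandwidth = bandwidth_bytes_per_s * num_gpus * efficiency
    return aggregate_bandwidth / bytes_per_token


# All inputs below are hypothetical placeholders, not disclosed figures.
HYPOTHETICAL_BANDWIDTH = 3e12   # bytes/s per GPU (assumed)
HYPOTHETICAL_GPUS = 8           # assumed node size

# Case 1: every one of 1e12 parameters is read for each token.
dense_rate = decode_tokens_per_second(
    1e12, 1.0, HYPOTHETICAL_BANDWIDTH, HYPOTHETICAL_GPUS)

# Case 2: only 2e10 parameters are read per token (sparse activation assumed).
sparse_rate = decode_tokens_per_second(
    2e10, 1.0, HYPOTHETICAL_BANDWIDTH, HYPOTHETICAL_GPUS)

print(f"dense bound:  {dense_rate:.0f} tokens/s")
print(f"sparse bound: {sparse_rate:.0f} tokens/s")
```

Under these assumed inputs the bound moves from tens to over a thousand tokens per second purely by reducing the bytes touched per token. That is the kind of lever a jointly designed architecture and runtime would be expected to pull, though which levers were actually used here has not been disclosed.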

Why 1,000 Tokens Per Second Is Worth Examining

To put the number in operational context: at 1,000 tokens per second, a 2,000-token response — roughly the length of a detailed technical summary — completes in two seconds. That latency profile moves trillion-parameter models into territory where interactive and near-real-time applications become genuinely viable, not merely theoretical. Most production deployments of large models today operate at throughputs that require careful batching and load-balancing to keep latency acceptable at scale; a 1,000 t/s figure, if it holds under realistic batch and concurrency conditions, materially changes the economics of serving a model at this parameter count.
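
The arithmetic behind that two-second figure can be written out as a small helper. This is a minimal sketch: the function name, the optional time-to-first-token term, and the 50 tokens-per-second comparison rate are assumptions added for illustration, not numbers from the announcement.

```python
def generation_seconds(num_tokens, tokens_per_second, time_to_first_token=0.0):
    """Wall-clock time to produce a response at a steady decode rate.

    time_to_first_token covers prompt processing and queueing; it defaults
    to zero because the announcement gives no figure for it.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token + num_tokens / tokens_per_second


# The article's example: a 2,000-token response at 1,000 tokens per second.
headline = generation_seconds(2000, 1000)

# Hypothetical slower rate, chosen only to show the contrast.
slower = generation_seconds(2000, 50)

print(headline)  # 2.0
print(slower)    # 40.0
```

Any prompt-processing or queueing delay adds directly to these totals, which is one reason the undisclosed serving conditions matter.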

It is worth being precise about what the number does and does not tell us at this stage. Peak or sustained single-stream throughput on controlled hardware configurations is not the same as p95 latency under concurrent load with variable sequence lengths. Xiaomi and TileRT have not yet published a full inference benchmark breakdown — the kind of detail that would include batch sizes, GPU SKUs, memory footprint, and throughput-latency curves. That information matters for any engineering team evaluating deployment. The 1,000 t/s claim is the headline; the benchmarking substrate is the substance, and it is not yet fully disclosed.
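
To make the distinction between a headline rate and tail latency concrete, the sketch below computes a nearest-rank p95 over a set of request latencies. The latency values are made up for illustration and are not measurements of MiMo-V2.5-Pro-UltraSpeed. They stand in for a case where most requests finish at the headline speed and a few are delayed by contention.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in seconds: 18 requests at 2.0 s, two slow outliers.
latencies = [2.0] * 18 + [6.0, 9.0]

mean_latency = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

print(f"mean={mean_latency:.2f}s p50={p50:.1f}s p95={p95:.1f}s")
```

With this sample the median matches the two-second headline while the p95 is three times higher. A throughput-latency curve exposes that gap, and a single headline number cannot.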

The Broader Context: Codesign as a Structural Shift

The model-system codesign approach Xiaomi and TileRT are describing is not new in principle. There is a clear lineage here: Google's TPU programme, which began with the insight that serving production inference workloads demanded hardware co-evolved with the model graph, established this pattern at scale more than a decade ago. What is different in 2026 is that codesign discipline is propagating downward — from hyperscaler custom silicon to software-layer optimisation on commodity GPU hardware. The implication is that the performance ceiling for general-purpose GPU inference is higher than the standard stack suggests, provided the optimisation work is done at the right abstraction level.

We have seen this pattern before, when the early cloud infrastructure teams discovered that squeezing meaningful performance from off-the-shelf x86 required not just better software but a willingness to redesign the software with specific silicon characteristics in mind — cache hierarchies, NUMA topology, memory bus contention. The teams that did that work pulled meaningfully ahead of those applying generic optimisation. The TileRT collaboration looks structurally similar: runtime and model architecture evolving together, rather than independently.

What This Means for the MiMo Roadmap and the Competitive Field

MiMo-V2.5-Pro-UltraSpeed is the latest in Xiaomi's MiMo model series, which has progressively emphasised reasoning and coding capabilities alongside raw language generation. The UltraSpeed designation, combined with the TileRT partnership, indicates that inference efficiency has become a first-class design objective within the MiMo programme — not an afterthought addressed post-training.

That is relevant for how Xiaomi positions MiMo commercially. Xiaomi is primarily a consumer hardware company, but MiMo's publication history and the technical depth of its releases suggest an intent to be taken seriously as an AI infrastructure contributor. A trillion-parameter model at 1,000 t/s on general-purpose GPUs, if the full benchmark picture bears out the headline, is a meaningful engineering credential in that bid.

The competitive field is moving fast. Inference optimisation has become one of the most actively contested areas in the AI stack, with contributions from Mistral, Together AI, Fireworks AI, and Groq on the serving side, alongside runtime projects like vLLM, TensorRT-LLM, and SGLang. TileRT is a less publicly profiled entrant in that space, and the Xiaomi collaboration gives it a high-parameter, high-throughput reference deployment that will draw scrutiny from the inference engineering community.

For practitioners evaluating large-model deployment, the gap between frontier model capability and deployable inference efficiency has been one of the most persistent friction points in the field for the past three years. Every credible advance on that front, whether through speculative decoding, continuous batching, hardware-software codesign, or architectural choices like mixture-of-experts routing, reduces the cost and latency penalty of running at scale. On the available evidence, MiMo-V2.5-Pro-UltraSpeed is a genuine contribution to that effort, even if the full technical picture awaits further disclosure.

The next disclosures to watch for: hardware configuration and SKU details, batch-size and concurrency benchmarks, memory footprint per serving instance, and any independent reproduction of the 1,000 t/s figure by third-party inference teams.
