Xiaomi's MiMo-V2.5-Pro-UltraSpeed Hits 1,000 Tokens Per Second on General-Purpose GPUs

Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter language model capable of sustained generation at 1,000 tokens per second, developed in collaboration with inference runtime specialist TileRT. The announcement, published on 8 June 2026, marks a notable step in the ongoing effort to close the gap between model scale and deployment practicality.
What Was Built and How
At a trillion parameters, MiMo-V2.5-Pro-UltraSpeed sits firmly in the class of models that, until recently, were largely confined to research clusters or specialised accelerator hardware. The headline figure — 1,000 tokens per second — is an inference throughput number that matters precisely because it is achieved on general-purpose GPUs rather than purpose-built silicon such as Groq's LPUs or dedicated NPU arrays.
The performance is attributed to what Xiaomi and TileRT describe as extreme model-system codesign optimisation. That phrase covers a spectrum of techniques: kernel-level fusion, attention mechanism restructuring, KV-cache management, quantisation strategies tuned to the specific weight distribution of this model, and scheduler-level batching decisions made in concert with model architecture choices rather than after the fact. The codesign framing is deliberate — it signals that the throughput gains are not the result of a generic inference stack applied to a standard checkpoint, but of architecture and runtime decisions made in tandem from the outset.
TileRT's involvement is central to that story. TileRT is a runtime and compilation layer designed to extract memory-bandwidth efficiency and arithmetic throughput from GPU hardware that was not purpose-built for transformer inference. The collaboration with Xiaomi's MiMo team suggests that the model's architecture was shaped, at least in part, by what TileRT could exploit at the kernel level — a tighter feedback loop than the more typical pattern of training a model and then optimising its serving separately.
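A back-of-envelope sketch shows why memory-bandwidth efficiency dominates this discussion. It assumes the common approximation that single-stream decoding must read every active weight once per generated token, and it ignores KV-cache traffic, inter-GPU communication, and compute limits. The function name and all figures below are hypothetical illustrations, not disclosed properties of MiMo-V2.5-Pro-UltraSpeed or of any specific GPU.

```python
def decode_rate_upper_bound(active_params: float,
                            bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on single-stream tokens/second when decoding is
    memory-bandwidth bound: one full read of the active weights per token.

    All inputs are caller-supplied assumptions; nothing here reflects
    published MiMo or TileRT figures.
    """
    if active_params <= 0 or bytes_per_param <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical case: 1e12 weights all read per token, 2 bytes each,
# with 24e12 bytes/s of aggregate memory bandwidth across the serving node.
dense_bound = decode_rate_upper_bound(1e12, 2, 24e12)

# Hypothetical case: only 3e10 weights touched per token (sparse activation),
# stored at 1 byte each, on the same aggregate bandwidth.
sparse_quantised_bound = decode_rate_upper_bound(3e10, 1, 24e12)

print(dense_bound, sparse_quantised_bound)
```

Under these made-up inputs the ceiling moves by well over an order of magnitude, which is the kind of leverage that quantisation and architecture choices made alongside the runtime are meant to capture.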
Why 1,000 Tokens Per Second Is Worth Examining
To put the number in operational context: at 1,000 tokens per second, a 2,000-token response — roughly the length of a detailed technical summary — completes in two seconds. That latency profile moves trillion-parameter models into territory where interactive and near-real-time applications become genuinely viable, not merely theoretical. Most production deployments of large models today operate at throughputs that require careful batching and load-balancing to keep latency acceptable at scale; a 1,000 t/s figure, if it holds under realistic batch and concurrency conditions, materially changes the economics of serving a model at this parameter count.
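The arithmetic behind the two-second figure can be written out as a minimal sketch. The `prefill_seconds` parameter is an added assumption, included because end-to-end latency in practice also contains the time to process the prompt before the first token appears; the article's figure corresponds to leaving it at zero.

```python
def generation_seconds(num_tokens: int,
                       tokens_per_second: float,
                       prefill_seconds: float = 0.0) -> float:
    """Time to stream `num_tokens` at a steady decode rate.

    `prefill_seconds` is a hypothetical allowance for prompt processing;
    it is not a figure from the announcement.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prefill_seconds + num_tokens / tokens_per_second


# The case described in the text: 2,000 tokens at 1,000 tokens per second.
print(generation_seconds(2000, 1000))  # 2.0
```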
It is worth being precise about what the number does and does not tell us at this stage. Peak or sustained single-stream throughput on a controlled hardware configuration is not the same as p95 latency under concurrent load with variable sequence lengths. Xiaomi and TileRT have not yet published a full inference benchmark breakdown, the kind of detail that would include batch sizes, GPU SKUs, memory footprint, and throughput-latency curves. That information matters for any engineering team evaluating deployment. The 1,000 t/s claim is the headline; the benchmark methodology behind it is the substance, and it is not yet fully disclosed.
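The distinction between a headline rate and tail latency can be made concrete with a short sketch. It covers two things: how a throughput figure reads differently if it is an aggregate shared across concurrent streams, and how a p95 is taken from a set of measured latencies using the nearest-rank method. The sample latencies are synthetic values invented for illustration; they are not measurements of MiMo-V2.5-Pro-UltraSpeed.

```python
import math


def per_stream_rate(aggregate_tps: float, concurrent_streams: int) -> float:
    """Tokens/second each request sees if an aggregate rate is split evenly."""
    if concurrent_streams < 1:
        raise ValueError("concurrent_streams must be at least 1")
    return aggregate_tps / concurrent_streams


def nearest_rank_percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# If 1,000 t/s were an aggregate over 8 streams, each stream gets 125 t/s.
print(per_stream_rate(1000, 8))

# Synthetic request latencies in seconds: most are fast, two are slow.
synthetic_latencies = [2.0] * 18 + [6.0, 9.0]
print(nearest_rank_percentile(synthetic_latencies, 50))  # median
print(nearest_rank_percentile(synthetic_latencies, 95))  # tail
```

In this synthetic set the median request looks identical to the single-stream case while the p95 is three times slower, which is why throughput-latency curves and concurrency settings are needed to interpret a single headline number.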
The Broader Context: Codesign as a Structural Shift
The model-system codesign approach Xiaomi and TileRT are describing is not new in principle. There is a clear lineage here: Google's TPU programme, which began with the insight that serving production inference workloads demanded hardware co-evolved with the model graph, established this pattern at scale more than a decade ago. What is different in 2026 is that codesign discipline is propagating downward — from hyperscaler custom silicon to software-layer optimisation on commodity GPU hardware. The implication is that the performance ceiling for general-purpose GPU inference is higher than the standard stack suggests, provided the optimisation work is done at the right abstraction level.
The pattern has a precedent. Early cloud infrastructure teams discovered that squeezing real performance from off-the-shelf x86 required not just better software but a willingness to redesign that software around specific silicon characteristics: cache hierarchies, NUMA topology, memory bus contention. The teams that did that work pulled well ahead of those applying generic optimisation. The TileRT collaboration looks structurally similar: runtime and model architecture evolving together, rather than independently.
What This Means for the MiMo Roadmap and the Competitive Field
MiMo-V2.5-Pro-UltraSpeed is the latest in Xiaomi's MiMo model series, which has progressively emphasised reasoning and coding capabilities alongside raw language generation. The UltraSpeed designation, combined with the TileRT partnership, indicates that inference efficiency has become a first-class design objective within the MiMo programme — not an afterthought addressed post-training.
That is relevant for how Xiaomi positions MiMo commercially. Xiaomi is primarily a consumer hardware company, but MiMo's publication history and the technical depth of its releases suggest an intent to be taken seriously as an AI infrastructure contributor. A trillion-parameter model at 1,000 t/s on general-purpose GPUs, if the full benchmark picture bears out the headline, is a meaningful engineering credential in that bid.
The competitive field is moving fast. Inference optimisation has become one of the most actively contested areas in the AI stack, with contributions from Mistral, Together AI, Fireworks AI, and Groq on the serving side, alongside runtime projects like vLLM, TensorRT-LLM, and SGLang. TileRT is a less publicly profiled entrant in that space, and the Xiaomi collaboration gives it a high-parameter, high-throughput reference deployment that will draw scrutiny from the inference engineering community.
For practitioners evaluating large-model deployment, the gap between frontier model capability and deployable inference efficiency has been one of the most persistent friction points in the field for the past three years. Every credible advance on that front, whether through speculative decoding, continuous batching, hardware-software codesign, or architectural choices like mixture-of-experts routing, reduces the cost and latency penalty of running at scale. MiMo-V2.5-Pro-UltraSpeed, on the available evidence, is a genuine contribution to that effort, even if the full technical picture awaits further disclosure.
The next disclosures to watch for: hardware configuration and SKU details, batch-size and concurrency benchmarks, memory footprint per serving instance, and any independent reproduction of the 1,000 t/s figure by third-party inference teams.


