Mira Murati's Thinking Machines Lab Debuts Real-Time AI Models with 400ms Response Times

Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, announced its first interaction models on Monday, delivering the industry's fastest response times for conversational AI at 0.40 seconds. The company published its initial technical paper over the weekend and demonstrated performance that significantly outpaces existing voice models from OpenAI and Google.
The startup's TML-Interaction-Small model achieves response latencies nearly three times faster than OpenAI's GPT-realtime-2.0, which averages 1.18 seconds, and more than twice as fast as Google's Gemini Live model at 0.94 seconds. These gains come from a full-duplex communication architecture that processes audio and video in 200-millisecond chunks while listening continuously.
Technical Architecture
TML-Interaction-Small is a 276-billion-parameter Mixture-of-Experts model that activates only 12 billion parameters at any given moment, allowing real-time inference on current hardware. The system employs a dual-model approach: the primary interaction model handles real-time conversation while a background model runs in parallel for computationally intensive tasks such as web searches and tool use.
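To make the sparse-activation idea concrete, here is a minimal sketch of top-k Mixture-of-Experts routing. The expert count, dimensions, and gating scheme are illustrative assumptions; this shows the generic pattern such models use, not code from the company's paper.

```python
# Minimal sketch of top-k Mixture-of-Experts routing: the model holds many
# experts but touches only a few per token, so active parameters stay a
# small fraction of the total. All sizes here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 64   # total expert count (hypothetical)
TOP_K = 2          # experts activated per token (hypothetical)
D_MODEL = 128      # hidden dimension (hypothetical)

# Each expert is a simple feed-forward weight matrix in this toy example.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02  # gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through its top-k experts only."""
    logits = x @ router                      # score every expert
    top = np.argsort(logits)[-TOP_K:]        # keep the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only TOP_K of NUM_EXPERTS weight matrices are used for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
print(moe_forward(token).shape)  # (128,)
```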
This architectural choice addresses a fundamental constraint in conversational AI. Most existing models must complete their current response before processing new input, creating the stilted turn-taking pattern familiar to anyone who has used voice assistants. Thinking Machines Lab's full-duplex approach allows simultaneous speaking and listening, enabling interruptions, clarifications, and more natural conversation flow.
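As a rough illustration of what full-duplex control flow implies, the sketch below runs a listening loop and a speaking loop concurrently, with the speaker yielding the moment an interruption is detected. The 200-millisecond chunk size comes from the article; everything else (the queue, the event, the timings) is a hypothetical stand-in for real audio I/O.

```python
# Hedged sketch of a full-duplex conversation loop: one task continuously
# ingests fixed-size audio chunks while another streams a reply and can be
# cut off mid-utterance. This mirrors the described behavior, not TML's
# actual system.
import asyncio

CHUNK_MS = 200  # per the article: audio processed in 200-millisecond chunks

async def listen(inbox: asyncio.Queue, interrupt: asyncio.Event) -> None:
    """Continuously pull audio chunks; flag an interruption when speech is heard."""
    for i in range(10):                       # stand-in for a live microphone feed
        await asyncio.sleep(CHUNK_MS / 1000)  # one chunk arrives every 200 ms
        await inbox.put(f"chunk-{i}")
        if i == 2:                            # pretend the user starts talking here
            interrupt.set()

async def speak(interrupt: asyncio.Event) -> None:
    """Stream a response word by word, stopping the moment the user cuts in."""
    for word in "Sure, here is a longer answer to your original question".split():
        if interrupt.is_set():
            print("[speaker] interrupted, yielding the floor")
            return
        print(f"[speaker] {word}")
        await asyncio.sleep(0.1)

async def main() -> None:
    inbox: asyncio.Queue = asyncio.Queue()
    interrupt = asyncio.Event()
    # Listening and speaking run concurrently; neither blocks the other.
    await asyncio.gather(listen(inbox, interrupt), speak(interrupt))

asyncio.run(main())
```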
The performance advantages extend beyond speed. On the FD-bench interaction quality test, TML-Interaction-Small scored 77.8, substantially higher than the 46-54 range achieved by competing models from OpenAI and Google. The system also outperformed GPT-realtime-2.0 on the Audio MultiChallenge test, scoring 43.4 compared to OpenAI's 37.6.
Market Context and Competition
The launch arrives as major AI labs race to solve latency problems in voice interfaces. Current voice models typically require users to wait for complete responses before speaking again, creating unnatural pauses that break conversational rhythm. This limitation has constrained adoption of AI voice assistants in professional settings where interruptions and rapid exchanges are common.
We have seen this pattern before, when smartphone manufacturers competed on camera shutter speed and app launch times — seemingly small technical improvements that fundamentally changed user behavior. Response latency in conversational AI occupies similar territory: the difference between 1.2 seconds and 0.4 seconds shifts interaction from "using a tool" to something approaching natural conversation.
Thinking Machines Lab plans a limited research preview in the coming months, though no commercial timeline has been announced. The company acknowledged that larger, more capable versions of its interaction models currently run too slowly for real-time deployment, with plans to release these variants later in 2026 as computational efficiency improves.
Technical Tradeoffs and Limitations
The focus on speed requires architectural compromises that merit examination. The 12-billion active parameter limit, while enabling real-time performance, constrains the model's reasoning capabilities compared to larger language models that can bring hundreds of billions or trillions of parameters to bear on complex queries.
The dual-model architecture offers one solution: offloading research, analysis, and multi-step reasoning to the background model while keeping conversational responses fast. This approach mirrors how humans handle complex questions in real-time conversations — providing immediate acknowledgment and follow-up questions while processing deeper analysis in parallel.
However, this separation introduces coordination challenges. The interaction model must decide when to invoke the background model, how to integrate its results, and how to manage user expectations during longer processing times. These coordination problems become particularly complex in multi-turn conversations where context builds across both models.
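One plausible shape for that coordination, sketched under stated assumptions: the fast model decides per turn whether to invoke the slower background model, acknowledges the user immediately, and merges the result when it arrives. The routing heuristic and function names below are invented for illustration.

```python
# One possible shape for the coordination the article describes: the fast
# interaction model answers immediately, optionally handing heavy work to a
# background model and folding the result in when ready. Everything here
# (names, the keyword heuristic, the delays) is hypothetical.
import asyncio

async def background_model(query: str) -> str:
    """Stand-in for the slower model handling search and multi-step reasoning."""
    await asyncio.sleep(2.0)  # simulated long-running tool use
    return f"detailed findings for {query!r}"

def needs_background_work(query: str) -> bool:
    """Toy heuristic; a real router would be learned, not keyword-based."""
    return any(kw in query.lower() for kw in ("search", "compare", "analyze"))

async def handle_turn(query: str, context: list[str]) -> None:
    if needs_background_work(query):
        # Fire off the slow path without blocking the conversation...
        task = asyncio.create_task(background_model(query))
        print("[fast model] On it. Give me a moment while I look that up.")
        # ...keep the floor open (the user could speak here), then merge results.
        result = await task
        context.append(result)
        print(f"[fast model] Here's what I found: {result}")
    else:
        print(f"[fast model] Quick answer to {query!r}.")

asyncio.run(handle_turn("compare these two vendors", context=[]))
```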
Industry Implications
The announcement positions Thinking Machines Lab directly against established players in conversational AI, leveraging Murati's credibility and technical expertise from her tenure scaling ChatGPT and GPT-4 at OpenAI. The startup's technical paper and benchmarking approach suggest a research-focused culture similar to leading AI labs, though concentrated more narrowly on interaction quality.
For conversational AI adoption, the latency improvements could unlock use cases where current voice models fail. Customer service, collaborative work sessions, and educational applications all depend on the natural interruption patterns that 400ms latencies enable and 1+ second delays prevent.
The broader context here involves a shift from general-purpose language models toward specialized architectures optimized for specific interaction patterns. Thinking Machines Lab's approach represents one path: accepting parameter limitations in exchange for temporal performance that enables new interaction modalities.
Commercial success will depend on two things: whether the speed advantages translate into meaningfully better user experiences, and whether the company can scale its architecture to larger, more capable models without sacrificing the latency gains that define its competitive position. The planned research preview will provide initial data on both fronts, particularly on user preference between fast-but-focused responses and slower-but-comprehensive ones.
The technical achievements are clear. The market validation remains to be demonstrated.