Mira Murati’s Thinking Machines Lab Challenges OpenAI With Real-Time Response Model

Jensen Huang (NVIDIA) and Mira Murati (Thinking Machines). © Nvidia / Thinking Machines Labs

Until September 2024, Mira Murati was known as CTO of OpenAI; then she unexpectedly left the ChatGPT maker to found her own AI startup. Since then, little has been seen of her Thinking Machines Lab beyond a billion-dollar deal with Nvidia. Now the research company has unveiled a so-called “Interaction Model” that is intended to shape artificial intelligence in a fundamentally different way than existing systems. The model, named TML-Interaction-Small, processes audio, video, and text simultaneously and in real time, without relying on external control components. It is a research preview that, according to the company, demonstrates qualitatively new possibilities for human-machine collaboration.

The Problem with Today’s AI Models

Current language models operate according to a rigid principle: the user speaks or writes, the model waits, processes, and responds. This so-called “turn-based” model creates, in Thinking Machines Lab’s view, an artificial bottleneck in the collaboration between humans and AI. While the model is responding, it takes in no new information. While the user is speaking, the model waits idly.

“We believe we can solve this bandwidth bottleneck by making AI interactive in real time and across every modality. This enables AI interfaces to meet people where they are, rather than forcing people to adapt to AI interfaces.”

Many existing systems attempt to work around this problem through so-called “harnesses” — that is, by connecting external components such as speech recognition modules or conversation management systems. Thinking Machines Lab argues that this approach is fundamentally limited, because these auxiliary components are significantly less intelligent than the actual model.

The New Approach: Micro-Turns Instead of Conversation Rounds

The core principle of TML-Interaction-Small is a Multi-Stream Micro-Turn Design. Rather than waiting for complete conversation rounds, the model continuously processes inputs and generates outputs in blocks of 200 milliseconds each. Incoming and outgoing signals are treated as parallel data streams, not as a sequential series.

This design enables the model to simultaneously listen and speak, perceive pauses, recognize interruptions, and respond to visual cues — without the user having to say anything explicitly. The model thus has a direct sense of time and can independently decide when a response is appropriate.
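The announcement does not describe how such a loop is implemented, but the basic idea can be sketched. Everything below — the `Stream` and `EchoModel` stand-ins and the `micro_turn_loop` function — is a hypothetical illustration of how a fixed 200-millisecond cadence lets ingesting and emitting overlap, not Thinking Machines Lab's actual code.

```python
import time
from collections import deque

MICRO_TURN_MS = 200  # block size described in the article

class Stream:
    """Minimal stand-in for an input or output media stream."""
    def __init__(self, items=()):
        self._q = deque(items)
        self.out = []

    def poll(self):
        # Return whatever arrived since the last tick (here: one item or None).
        return self._q.popleft() if self._q else None

    def push(self, block):
        self.out.append(block)

class EchoModel:
    """Hypothetical model stub: emits output while still ingesting input."""
    def step(self, frames, elapsed_ms):
        audio = frames["audio"]
        # The model may speak, or stay silent (return None) for this block.
        return f"ack:{audio}" if audio is not None else None

def micro_turn_loop(model, audio_in, video_in, audio_out, steps, tick_s=0.2):
    """Each tick consumes the latest input frames and may emit an output
    block, so listening and speaking happen in the same loop iteration."""
    for _ in range(steps):
        start = time.monotonic()
        frames = {"audio": audio_in.poll(), "video": video_in.poll()}
        block = model.step(frames, elapsed_ms=MICRO_TURN_MS)
        if block is not None:
            audio_out.push(block)
        # Keep the cadence: sleep out the remainder of the tick budget.
        remaining = tick_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
```

Because input and output are just parallel streams polled on every tick, an interruption arriving mid-utterance is simply the next block of input, with no separate turn-taking controller.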

Key Capabilities at a Glance

  • Seamless conversation management: The model implicitly recognizes whether a user is thinking, has finished speaking, or is expecting a response — without separate control logic.
  • Verbal and visual interruptions: The model can intervene at any time, not only at the end of an utterance.
  • Simultaneous speaking: The user and the model can speak in parallel — for example, for real-time translations.
  • Time awareness: The model has a direct sense of elapsed time and can respond to it.
  • Parallel tool use: While the model is speaking and listening, it can simultaneously search the web, retrieve data, or generate user interfaces.

Technical Architecture

The model was trained from scratch and uses encoder-free early fusion of the various modalities. Audio signals are processed as dMel representations and fed in via a lightweight embedding layer. Video frames are divided into 40×40-pixel blocks and encoded through an hMLP module. All components are trained jointly with the central Transformer model, not separately.
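As a rough illustration of the video side, the split into 40×40-pixel blocks can be sketched in a few lines. The `patchify` helper below is an assumption for illustration only; the real model additionally runs each block through its hMLP embedding, which is omitted here.

```python
PATCH = 40  # patch size from the article

def patchify(frame, patch=PATCH):
    """Split an H×W frame (a list of pixel rows) into non-overlapping
    patch×patch blocks, each flattened into one token's worth of pixels.
    Hypothetical sketch: the real pipeline would embed each block next."""
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            block = [frame[py + dy][px + dx]
                     for dy in range(patch) for dx in range(patch)]
            tokens.append(block)
    return tokens
```

An 80×120 frame, for instance, tiles into 2×3 = 6 blocks of 1,600 pixels each; in an early-fusion setup these tokens would enter the same Transformer sequence as the audio and text tokens.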

For more demanding tasks that require deeper reasoning, the interaction model delegates to an asynchronous background model. This handles complex reasoning tasks, web searches, or agent-based workflows, while the interaction model remains in contact with the user and seamlessly integrates the results into the conversation.
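A minimal sketch of this delegation pattern, assuming hypothetical `background_reasoner` and `interaction_loop` functions: the foreground loop keeps producing micro-turns while the slow task runs, then folds its result into the conversation.

```python
import asyncio

async def background_reasoner(query):
    """Hypothetical slow reasoning or tool call running off the hot path."""
    await asyncio.sleep(0.05)  # stands in for seconds of deep reasoning
    return f"result for {query!r}"

async def interaction_loop(query):
    """The interaction model stays responsive while the background model
    works, then integrates the result into the ongoing stream."""
    task = asyncio.create_task(background_reasoner(query))
    transcript = []
    while not task.done():
        transcript.append("small-talk tick")  # model keeps the user engaged
        await asyncio.sleep(0.02)             # one foreground micro-turn
    transcript.append(task.result())
    return transcript
```

The key design point is that the expensive call never blocks the foreground loop; the interaction model only checks whether the background task has finished.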

Performance Comparison with Other Systems

In benchmark tests, TML-Interaction-Small performs significantly better than comparable systems from OpenAI and Google in terms of interaction quality and response speed. Particularly notable is the response latency: while GPT-Realtime-2.0 requires an average of 1.18 seconds to respond to a user utterance, the Thinking Machines Lab model comes in at 0.40 seconds.

       
Model                                     Response      FD-bench V1.5          Audio MultiChallenge
                                          Latency (s)   (Interaction Quality)  (Intelligence)
TML-Interaction-Small                     0.40          77.8                   43.4
GPT-Realtime-2.0 (minimal)                1.18          46.8                   37.6
GPT-Realtime-1.5                          0.59          48.3                   34.7
Gemini-3.1-Flash-Live (minimal)           0.57          54.3                   26.8
GPT-Realtime-2.0 (xhigh, with reasoning)  1.63          47.8                   48.5

In addition, new internal benchmarks were developed that measure capabilities no commercial model has previously mastered: for example, responding with precise timing to verbal or visual cues, and counting repetitions in a video without an explicit prompt. Competing models consistently performed poorly on these tasks or gave no response at all.

Limitations and Outlook

Thinking Machines Lab openly identifies several current limitations of the system. Very long sessions with continuous audio and video input quickly generate large context volumes, which makes management more difficult. A stable internet connection is absolutely essential for real-time processing, as the experience degrades significantly with a poor connection.

The current model is a Mixture-of-Experts model with 276 billion parameters, of which 12 billion are active at any given time. According to the company, larger models are still too slow for this real-time setting, but are expected to follow later in the year. The company is also announcing a research fellowship to invite the scientific community to develop new evaluation standards for interaction models.
