Thinking Machines Ships TML-Interaction-Small: Why 0.40s Changes the Voice AI Conversation
TL;DR
Thinking Machines' TML-Interaction-Small hits 0.40s turn-taking latency, roughly 3x faster than OpenAI's GPT-Realtime-2.0 at 1.18s, by scrapping the pipeline architecture entirely and letting the model learn interactivity at scale. Here's what that actually means.
On May 11, 2026, Thinking Machines Lab released their first model: TML-Interaction-Small.
Its turn-taking latency is 0.40 seconds. OpenAI’s GPT-Realtime-2.0 clocks 1.18 seconds. That’s not an incremental improvement. It’s a structural gap.
The more interesting question: why the difference? The answer isn’t in the engineering details. It’s in a more fundamental design decision.
The One-Second Problem
Every mainstream voice AI in 2026 runs on the same pipeline: automatic speech recognition (ASR) → language model inference → text-to-speech (TTS). This works fine for text interfaces. Put it inside a real conversation, though, and it carries an inescapable assumption: the system must wait for the user to finish speaking before it can begin processing.
That assumption creates latency. Not a network problem. Not insufficient compute. The architecture was designed to process one complete turn, then emit one complete response.
Researchers have patched around this for years. Voice activity detection (VAD), barge-in thresholds, turn-prediction classifiers. Each one is hand-coded logic trying to answer a question the architecture was never designed to handle: when has the user finished speaking?
These patches make voice AI usable. They never make it feel like a conversation.
What Thinking Machines Actually Built
Thinking Machines Lab was founded by Mira Murati, former OpenAI CTO, and John Schulman, former OpenAI researcher. Their first model doesn’t add another patch. It redesigns the foundation.
TML-Interaction-Small is a 276-billion-parameter mixture-of-experts model with 12 billion active parameters at inference time.
The key departure from every existing voice system: there is no pipeline. The model continuously processes audio, video, and text in parallel 200-millisecond chunks. Speaking, listening, deciding to interrupt, choosing to stay silent — these are token-level decisions made inside the model, not controlled by external rules.
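To make the chunked, decide-every-step idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Chunk` and `Action` types, `run_stream`, and especially `toy_policy`, which is only a placeholder so the loop runs. In the system described above, that decision is made by the trained model, not by a hand-written function, and Thinking Machines has not published an interface that looks like this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple

CHUNK_MS = 200  # chunk length quoted in the article


class Action(Enum):
    """Hypothetical per-chunk decisions; the real model emits tokens."""
    STAY_SILENT = "stay_silent"
    SPEAK = "speak"
    INTERRUPT = "interrupt"


@dataclass
class Chunk:
    """One slice of the input stream (all three modalities together)."""
    start_ms: int
    audio: bytes = b""
    video: bytes = b""
    text: str = ""


Policy = Callable[[List[Chunk]], Action]


def run_stream(chunks: Iterable[Chunk], policy: Policy) -> List[Tuple[int, Action]]:
    """Ask the policy for a decision after every chunk, not after every turn."""
    history: List[Chunk] = []
    decisions: List[Tuple[int, Action]] = []
    for chunk in chunks:
        history.append(chunk)
        decisions.append((chunk.start_ms, policy(history)))
    return decisions


def toy_policy(history: List[Chunk]) -> Action:
    """Placeholder for the learned model: speak once a question mark arrives."""
    return Action.SPEAK if history[-1].text.endswith("?") else Action.STAY_SILENT


stream = [Chunk(0, text="what time"), Chunk(CHUNK_MS, text="is it?")]
decisions = run_stream(stream, toy_policy)
```

The point of the sketch is the loop shape: there is no "wait for end of turn" step anywhere, only a decision at each chunk boundary that may or may not be to speak.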
Thinking Machines calls this category an “interaction model,” contrasting it with the “turn-based model” that dominates today’s voice APIs. The distinction isn’t speed. It’s whether interactivity is something the model learned, or something engineers wrote.
They also designed a two-part system: the interaction model handles the live conversation stream, while a background model handles asynchronous reasoning and tool calls. One keeps the conversation fluid; the other handles complexity.
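The split between a live loop and a background worker can be sketched with Python's `asyncio`. This is only an illustration of the general pattern, under stated assumptions: `background_reasoning`, `interaction_loop`, the query string, and the shortened timings are all hypothetical, and nothing here reflects how Thinking Machines actually connects its two models.

```python
import asyncio
from typing import List

CHUNK_S = 0.02  # shortened stand-in for a 200 ms chunk so the demo runs quickly


async def background_reasoning(query: str) -> str:
    """Hypothetical slow path: reasoning or a tool call that takes several chunks."""
    await asyncio.sleep(0.05)
    return f"result for {query!r}"


async def interaction_loop(n_chunks: int) -> List[str]:
    """Keep handling chunks while the background task runs; fold in its result when ready."""
    log: List[str] = []
    task = asyncio.create_task(background_reasoning("weather"))
    delivered = False
    for i in range(n_chunks):
        await asyncio.sleep(CHUNK_S)  # one chunk of live conversation
        if task.done() and not delivered:
            log.append(f"chunk {i}: deliver {task.result()}")
            delivered = True
        else:
            log.append(f"chunk {i}: keep conversing")
    if not delivered:
        # The stream ended before the background work finished.
        log.append(f"after stream: deliver {await task}")
    return log


log = asyncio.run(interaction_loop(6))
```

The live loop never blocks on the slow task. It checks for a finished result at each chunk boundary and otherwise carries on with the conversation.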
What the Numbers Say
| Model | Turn-taking Latency | FD-bench V1.5 |
|---|---|---|
| TML-Interaction-Small | 0.40s | 77.8 |
| Google Gemini-3.1-flash-live | 0.57s | ~42 |
| OpenAI GPT-Realtime-2.0 | 1.18s | 46.8 |
At 0.40 seconds, TML-Interaction-Small operates near the speed of natural human conversation. At 1.18 seconds, you feel the other party is thinking. The gap in how these two delays feel is much larger than the raw numbers suggest.
FD-bench V1.5 measures full-duplex interaction quality. TML’s score of 77.8 is about 1.7 times that of the next-best model in the table (46.8). On visual interaction tests including RepCount-A and ProactiveVideoQA, TML continues observing the user’s actions and responding in real time while still speaking. Other frontier models stay silent or answer incorrectly on the same tasks.
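For readers who want to check the ratios implied by the table, the arithmetic is short. The figures below are copied from the table. The Gemini FD-bench value is listed there only as approximately 42, so any ratio built on it is approximate too. The variable names are just for this sketch.

```python
# Figures copied from the table above: (latency in seconds, FD-bench V1.5 score).
results = {
    "TML-Interaction-Small": (0.40, 77.8),
    "Google Gemini-3.1-flash-live": (0.57, 42.0),  # score listed as "~42", approximate
    "OpenAI GPT-Realtime-2.0": (1.18, 46.8),
}

tml_latency, tml_score = results["TML-Interaction-Small"]

# How many times slower each model is, and how many times higher TML scores.
latency_ratio = {name: lat / tml_latency for name, (lat, _) in results.items()}
score_ratio = {name: tml_score / score for name, (_, score) in results.items()}

for name in results:
    print(f"{name}: {latency_ratio[name]:.2f}x latency, TML scores {score_ratio[name]:.2f}x")
```

On these numbers, GPT-Realtime-2.0 takes just under three times as long to take its turn, and TML's FD-bench score is between 1.6 and 1.9 times that of either competitor.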
TML also generates backchannel cues — “I see,” “mm-hmm” — without interrupting conversation flow. In traditional pipeline architectures, this is nearly impossible: the system must complete one full turn before outputting anything.
The Bitter Lesson, Again
In March 2019, reinforcement learning pioneer Richard Sutton published a short essay on his personal website titled “The Bitter Lesson.”
His central claim: across seventy years of AI research, the methods that ultimately win are general approaches that leverage computation at scale — not domain-specific knowledge that researchers encode by hand.
His example was chess. Researchers spent decades encoding chess knowledge: piece evaluation functions, positional heuristics, opening theory. This worked, for a while. Deep Blue beat Kasparov through deeper search and raw compute. Two decades later, AlphaZero went further, surpassing the strongest hand-tuned chess engines after a few hours of self-play, with no human chess knowledge beyond the rules.
Computer vision followed the same script. Hand-engineered features, edge detectors, histograms of oriented gradients: all mainstream until AlexNet. AlexNet had no better feature engineering. It learned its features from data and spent far more computation doing so.
Speech recognition too. Rule-based phoneme models, hidden Markov models, carefully tuned acoustic models. End-to-end neural networks replaced all of it, not because they were smarter about speech, but because they let computation handle what engineers had been doing manually.
Sutton’s conclusion: “We have to learn the bitter lesson that building in how we think we think does not work in the long run.”
Voice AI just reached the same inflection point.
Why Knowledge Engineering Keeps Losing
Traditional voice AI models conversation by asking: what patterns of human dialogue can be written as rules?
Silence over 300ms means the user finished speaking. Rising intonation means a question. Speech rate increase means emotion.
These rules work in lab conditions. They break in the real world. People pause while thinking. Some speak in monotone. Accents vary. Environments are noisy. Rules encode what engineers observed about conversation, not what conversation actually is.
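A fixed silence threshold is easy to write down, which is part of why it is so common and so brittle. The sketch below is a deliberately naive illustration. The 300 ms limit comes from the rule quoted above, while the 20 ms frame size, the function name, and the boolean voiced/unvoiced input are assumptions made for the example and do not describe any particular product's detector.

```python
from typing import List, Optional

FRAME_MS = 20           # assumed frame size for this sketch
SILENCE_LIMIT_MS = 300  # the threshold quoted in the text


def end_of_turn_ms(
    voiced: List[bool],
    frame_ms: int = FRAME_MS,
    silence_limit_ms: int = SILENCE_LIMIT_MS,
) -> Optional[int]:
    """Return the time (ms) at which a fixed-silence rule declares the turn over.

    Returns None if the rule never fires.
    """
    silent_ms = 0
    heard_speech = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_limit_ms:
                return (i + 1) * frame_ms
    return None


# One second of speech, a 400 ms pause to think, then another second of speech.
thinking_pause = [True] * 50 + [False] * 20 + [True] * 50
cutoff = end_of_turn_ms(thinking_pause)

# Same utterance with only a 200 ms pause: the rule stays quiet.
short_pause = [True] * 50 + [False] * 10 + [True] * 50
no_cutoff = end_of_turn_ms(short_pause)
```

With the 400 ms pause, the rule declares the turn finished while the speaker is still mid-thought, and the system would start responding over the second half of the sentence. Raising the threshold avoids that case but adds the same delay to every genuine end of turn.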
The deeper problem: conversation is dynamic. Its rhythm, signals, and intent emerge from context, not static patterns. No hand-coded threshold captures that.
TML-Interaction-Small doesn’t write rules. The model continuously infers what to do next within a 200ms stream. That inference capability was learned from data and compute, not specified by engineers.
This choice has short-term costs: training is more expensive, debugging is harder, and failure modes are less interpretable. But that’s exactly the trade Sutton described: accept short-term difficulty, let computation solve problems that hand-coded knowledge can’t.
Every time someone makes this bet and it works, it becomes another footnote to the same lesson.
What Comes Next
TML-Interaction-Small is currently a research preview available to a limited set of partners, with a wider release planned later in 2026. Thinking Machines has indicated larger models are coming, pending resolution of latency constraints at greater scale.
The competitive pressure is already visible. OpenAI’s GPT-Realtime-2.0 launched three days earlier and was immediately outperformed on latency. Google’s Gemini Live holds advantages in breadth — 380 voices across 75 languages — but trails on turn-taking speed.
The architectural choice facing voice AI is binary: keep patching the pipeline, or train a model where interactivity is native. That’s not just a technical decision. It’s a statement about whether you believe the Bitter Lesson applies here.
Murati and Schulman have placed their bet. 0.40 seconds is what it looks like so far.