Why WebRTC? The latency problem in voice AI
In this chapter, you will learn why transport protocol choice is the single most consequential infrastructure decision in voice AI, how to think about latency budgets, and why WebRTC — not WebSockets — is the only viable foundation for conversational AI that feels natural.
The 500-millisecond wall
Human conversation has a rhythm. Psycholinguistics research tells us that the average gap between conversational turns is roughly 200 milliseconds — and listeners start to perceive awkwardness at around 500ms of silence after they finish speaking. Beyond one second, the interaction feels broken.
This means a voice AI system has an absolute upper bound of roughly 500 milliseconds from the moment a user stops speaking to the moment they hear the first syllable of a response. Miss that window and the experience degrades from "talking to a smart assistant" to "talking to a robot on a bad phone line."
500ms sounds generous until you break it down.
Anatomy of the voice AI latency budget
Every millisecond matters. Here is where they go:
Audio capture and encoding (~50ms)
The user's microphone captures raw PCM audio. The browser or native client encodes it — typically Opus at 20ms frame sizes. Between capture latency, encoding, and the operating system's audio buffer, expect around 50ms before a single encoded packet is ready to leave the device.
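At a 20ms frame size, the packet rate and per-frame stakes are easy to compute. A quick sketch (the 32 kbps bitrate is an illustrative voice-quality figure, not a fixed Opus setting):

```python
# Opus framing arithmetic for a voice stream.
FRAME_MS = 20
packets_per_second = 1000 // FRAME_MS          # 50 packets/s

# Assume ~32 kbps, a plausible voice-quality Opus bitrate (illustrative).
bitrate_bps = 32_000
bytes_per_frame = bitrate_bps // 8 // packets_per_second

print(packets_per_second, bytes_per_frame)     # 50 80
```

Fifty small packets per second is the workload every transport decision in this chapter has to carry.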
Transport to the server (~10-500ms)
This is where protocol choice dominates. The encoded audio must travel from the user's device to the server running speech-to-text. Depending on the transport mechanism, this step alone can consume most of the budget — or barely register.
Speech-to-text (STT) processing (~100-300ms)
The STT model transcribes the audio into text. Streaming STT models like Deepgram Nova or OpenAI Whisper (streaming mode) produce partial results as audio arrives, but even the fastest models need 100-300ms to produce a confident final transcript of a complete utterance.
LLM inference (~50-200ms)
The transcript hits the language model. With streaming token generation, the first token can arrive in 50ms from a well-optimized provider, but real-world p50 latency for the first token is closer to 100-200ms depending on model size, prompt length, and provider load.
Text-to-speech (TTS) synthesis (~100-300ms)
The LLM's output tokens stream into a TTS engine. Modern streaming TTS (like Cartesia or ElevenLabs Turbo) begins synthesizing audio from partial text, but the first audible chunk still takes 100-300ms to produce.
Transport back to the user (~10-500ms)
The synthesized audio travels back to the user's device. Same transport considerations as step 2, in reverse.
Add those up. Even in the best case — 50 + 10 + 100 + 50 + 100 + 10 = 320ms — you are already using most of the budget. The model processing steps (STT, LLM, TTS) are largely fixed by the state of the art. The only variable you fully control as an infrastructure engineer is transport.
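The budget arithmetic above can be made explicit. A small sketch using the chapter's illustrative stage estimates (the numbers are the best-case figures quoted in each step, not measurements):

```python
# Best-case voice AI latency budget, in milliseconds (chapter's estimates).
BUDGET_MS = 500

best_case = {
    "capture_encode": 50,
    "transport_uplink": 10,      # WebRTC-class transport
    "stt": 100,
    "llm_first_token": 50,
    "tts_first_chunk": 100,
    "transport_downlink": 10,
}

total = sum(best_case.values())
headroom = BUDGET_MS - total
print(f"best case: {total}ms, headroom: {headroom}ms")   # 320ms, 180ms

# Swap in a WebSocket-class transport at 200ms each way (illustrative)
# and the same model pipeline blows the budget.
ws_case = dict(best_case, transport_uplink=200, transport_downlink=200)
print(sum(ws_case.values()))                             # 700ms
```

Everything except the two transport entries is fixed by the models; the transport entries are the only ones an infrastructure engineer gets to choose.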
Transport is the tax you pay twice
Notice that transport appears twice in the pipeline — once from user to server, once from server back to user. A transport layer that adds 200ms each way contributes 400ms to the total, blowing the entire budget before the AI models even run.
The WebSocket approach: why it falls short
WebSockets are the default choice for real-time web communication. They provide a full-duplex, persistent TCP connection. For chat applications, notifications, and collaborative editing, they are excellent. For streaming audio in a voice AI pipeline, they are architecturally wrong.
Here is why:
TCP guarantees ordered delivery. Every packet must arrive, and it must arrive in sequence. If packet 47 is lost, packets 48 through 200 queue up in a buffer waiting for the retransmission of 47. This is called head-of-line blocking. For a text chat, this is fine — you need every character. For audio, it is catastrophic. A single lost packet can stall the entire stream for a full round-trip retransmission, often 50-200ms.
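Head-of-line blocking is easy to model. A toy simulation, assuming 20ms frames, an idealized zero-delay network, and one retransmission round trip of 100ms, shows how a single lost packet delays every packet queued behind it:

```python
def tcp_delivery_times(n_packets, frame_ms=20, lost=None, rtt_ms=100):
    """Model in-order TCP delivery: a lost packet arrives one RTT late,
    and every later packet must wait behind it (head-of-line blocking)."""
    times = []
    stall_until = 0.0
    for seq in range(n_packets):
        sent = seq * frame_ms
        arrival = sent                   # idealized zero network delay
        if seq == lost:
            arrival = sent + rtt_ms      # retransmission round trip
        arrival = max(arrival, stall_until)  # in-order delivery constraint
        stall_until = arrival
        times.append(arrival)
    return times

clean = tcp_delivery_times(10)
lossy = tcp_delivery_times(10, lost=3)
# Packet 3 arrives 100ms late, and packets 4, 5, ... are held behind it.
print(lossy[3] - clean[3])   # 100.0
print(lossy[4] - clean[4])   # 80.0
```

One lost frame out of ten stalls half the stream, even though nine frames arrived on time.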
Buffering is required for smoothness. Because TCP delivery can be bursty (packets arriving in clumps after a retransmission), WebSocket-based audio systems must implement a jitter buffer — a delay intentionally added to smooth out uneven packet arrival. Typical jitter buffers add 50-150ms of latency.
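A jitter buffer is conceptually just a fixed playout delay: hold every frame for N milliseconds so that late arrivals still make their slot. A minimal sketch (the 100ms and 150ms delays are assumed mid-range values for illustration):

```python
def late_frames(arrival_ms, frame_ms=20, buffer_ms=100):
    """Schedule each frame for playout at send_time + buffer_ms.
    A frame that arrives after its playout slot is an audible glitch."""
    late = []
    for seq, arrived in enumerate(arrival_ms):
        slot = seq * frame_ms + buffer_ms
        if arrived > slot:
            late.append(seq)
    return late

# Frames sent every 20ms; frame 3 arrives 120ms late after a retransmission.
arrivals = [0, 20, 40, 180, 80, 100]
print(late_frames(arrivals))                  # [3] -> glitch
print(late_frames(arrivals, buffer_ms=150))   # []  -> smooth playback
```

The second call shows the trade: a deeper buffer absorbs the burst, but every frame, glitch or not, now carries 150ms of added latency.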
No built-in media awareness. WebSockets transport raw bytes. There is no concept of audio frames, timestamps, synchronization, or codec negotiation. Every voice AI team building on WebSockets ends up reimplementing a subset of what WebRTC provides out of the box — poorly.
No adaptive bitrate. When network conditions degrade, a WebSocket connection has no mechanism to reduce audio quality gracefully. It either keeps sending at full bitrate (causing congestion and loss) or the application must build its own bandwidth estimation — another hard problem solved by WebRTC natively.
The fundamental mismatch is this: TCP was designed for data where every byte matters and order is sacred. Audio is the opposite — a dropped 20ms frame is imperceptible, but a 200ms stall while waiting for retransmission is immediately noticeable. WebSockets inherit TCP's priorities, which are exactly backwards for voice.
The WebRTC approach: built for media
WebRTC (Web Real-Time Communication) was designed from the ground up for audio and video. It uses UDP as its transport layer, wrapped in protocols purpose-built for media delivery.
UDP has no head-of-line blocking. Each packet is independent. If packet 47 is lost, packets 48 onward arrive immediately. The application decides whether to request a retransmission or simply skip the lost frame. For audio at 50 packets per second, a single lost 20ms frame is usually imperceptible.
RTP provides media-aware framing. The Real-time Transport Protocol (RTP) runs over UDP and adds timestamps, sequence numbers, and payload type identification. The receiver knows exactly when each audio frame was generated, can detect gaps, and can compensate with packet loss concealment — a technique that synthesizes a plausible replacement for a lost frame rather than waiting for retransmission.
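The RTP header that carries this metadata is only 12 bytes. A sketch of packing and parsing the fields this paragraph names (sequence number, timestamp, payload type), following the RFC 3550 fixed-header layout; 111 is a commonly seen dynamic payload type for Opus, used here as an assumption:

```python
import struct

def build_rtp_header(seq, timestamp, ssrc, payload_type=111, marker=0):
    """Pack a minimal 12-byte RTP fixed header (RFC 3550)."""
    byte0 = 2 << 6                       # version=2, no padding/ext, CC=0
    byte1 = (marker << 7) | payload_type
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

def parse_rtp_header(data):
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

hdr = build_rtp_header(seq=47, timestamp=48_000, ssrc=0xDEADBEEF)
print(parse_rtp_header(hdr))
# The sequence number is what lets a receiver notice "47 never arrived"
# and trigger concealment instead of stalling the stream.
```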
NACK-based selective recovery. When a packet is lost and the receiver decides it is important enough to recover (perhaps a keyframe in video), it sends a NACK (Negative Acknowledgment) requesting just that specific packet. This is surgical retransmission, not TCP's "stop everything and wait" approach.
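The receiver's NACK decision reduces to a single comparison: can the retransmitted packet arrive before its playout deadline? A sketch of that logic (function name and thresholds are illustrative, not a real WebRTC API):

```python
def should_nack(now_ms, playout_deadline_ms, rtt_ms):
    """Request retransmission only if the repair can arrive in time;
    otherwise fall back to packet loss concealment."""
    return now_ms + rtt_ms < playout_deadline_ms

# 40ms of jitter buffer remaining, 30ms RTT: the repair arrives in time.
print(should_nack(now_ms=0, playout_deadline_ms=40, rtt_ms=30))   # True
# 80ms RTT: the repair would arrive after playout; conceal instead.
print(should_nack(now_ms=0, playout_deadline_ms=40, rtt_ms=80))   # False
```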
Adaptive bitrate via REMB and Transport-CC. WebRTC continuously estimates available bandwidth using receiver-side feedback. When the network degrades, the sender automatically reduces bitrate — lowering audio quality slightly rather than introducing stalls or packet loss. The user hears a brief drop in fidelity rather than a gap in conversation.
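WebRTC's real congestion controller (Google Congestion Control, which combines delay-based and loss-based estimation) is far more sophisticated, but the core principle of trading bitrate for continuity can be sketched with a toy loss-based rule. The thresholds and step sizes here are assumptions for illustration only:

```python
def adapt_bitrate(current_kbps, loss_fraction, min_kbps=6, max_kbps=128):
    """Toy sender-side rule in the spirit of loss-based congestion control:
    back off multiplicatively on heavy loss, probe upward gently when clean."""
    if loss_fraction > 0.10:       # heavy loss: cut bitrate hard
        current_kbps *= 0.5
    elif loss_fraction > 0.02:     # light loss: hold steady
        pass
    else:                          # clean network: probe upward 5%
        current_kbps *= 1.05
    return max(min_kbps, min(max_kbps, current_kbps))

rate = adapt_bitrate(64.0, loss_fraction=0.15)   # heavy loss -> 32.0
rate = adapt_bitrate(rate, loss_fraction=0.0)    # recovery   -> ~33.6
print(rate)
```

The user hears a moment of slightly duller audio at 32 kbps instead of a gap in the conversation, which is exactly the trade the paragraph above describes.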
Overkill for voice-only? No. Even if you only need audio, WebRTC's infrastructure gives you DTLS encryption, ICE for NAT traversal, automatic codec negotiation, echo cancellation, noise suppression, and automatic gain control — all battle-tested across billions of browser sessions.
Side-by-side: WebSocket vs WebRTC for voice AI
| Dimension | WebSocket | WebRTC |
|---|---|---|
| Transport protocol | TCP | UDP (via RTP/RTCP) |
| Typical one-way latency | 200-500ms (with jitter buffer) | 10-30ms |
| Head-of-line blocking | Yes — single lost packet stalls the stream | No — packets are independent |
| Packet loss handling | Full retransmission, in-order | NACK selective recovery + concealment |
| Jitter buffer | Required (50-150ms added latency) | Adaptive, typically 20-40ms |
| Adaptive bitrate | Must build yourself | Built-in (REMB / Transport-CC) |
| Encryption | TLS (transport only) | DTLS + SRTP (media encrypted) |
| NAT traversal | Not needed (client-initiated) | ICE / STUN / TURN built-in |
| Echo cancellation | Must build yourself | Browser-native |
| Codec negotiation | Must build yourself | SDP-based, automatic |
| Maturity for media | Repurposed data protocol | Purpose-built for audio/video |
The numbers that matter
The transport delta between WebSocket and WebRTC is typically 200-400ms round-trip. In a pipeline where STT + LLM + TTS already consumes 300-500ms, that delta is the difference between making the 500ms budget and missing it entirely.
Why this matters for voice AI specifically
You might wonder: if WebSockets add latency, can we just make the AI models faster? To some extent, yes — and model providers are racing to reduce inference time. But consider this:
- Model latency is a floor, not a variable. You cannot make Whisper transcribe audio faster than its model can process it, and you cannot make an LLM generate tokens faster than its architecture allows. These latencies represent physics and mathematics, not engineering sloppiness.
- Transport latency is pure waste. Unlike model inference, transport latency produces nothing. It is dead time — bytes sitting in a buffer, waiting. Every millisecond saved on transport is a millisecond you can give back to the AI models for better quality, or simply subtract from the total for a snappier experience.
- Perception is nonlinear. The difference between 400ms and 600ms total latency is not merely "50% slower." It is the difference between a conversation that feels like talking to a sharp colleague and one that feels like a satellite phone call. Users do not measure latency — they feel it.
WebRTC is not a marginal improvement over WebSockets for voice AI. It is a category change — from "workable but awkward" to "genuinely conversational." This is why LiveKit, and every serious voice AI infrastructure provider, builds on WebRTC.
Looking ahead
In the next chapter, we will explore how LiveKit routes WebRTC media through its Selective Forwarding Unit (SFU) — and why the SFU architecture is critical for scaling voice AI beyond a single conversation. The transport protocol gets packets off the device quickly; the SFU gets them where they need to go.