What is voice AI? Why WebRTC matters
In this chapter, you will learn how the voice AI pipeline works end to end, where every millisecond of latency goes, why WebRTC is the only transport protocol that delivers conversational-quality voice AI, and how LiveKit fits into the picture. This is the conceptual foundation for everything you will build in this course.
The voice AI pipeline
Every voice AI system — from a customer support bot to the dental receptionist you will build in this course — follows the same fundamental pipeline. The user speaks, the system thinks, the system speaks back.
That pipeline has three stages:
Speech-to-Text (STT)
Raw audio from the user's microphone is transcribed into text. Streaming STT models like Deepgram Nova-3 produce partial transcripts as audio arrives, so the system does not wait for the user to finish an entire sentence before starting to process it. Typical latency: 100-300ms for a confident final transcript.
Large Language Model (LLM)
The transcript is sent to an LLM — GPT-4o-mini, Claude, or similar — which generates a text response. With streaming token generation, the first token can arrive in 50-200ms. The LLM is where your agent's personality, knowledge, and decision-making live.
Text-to-Speech (TTS)
The LLM's text output is converted into audio. Modern streaming TTS engines like Cartesia or ElevenLabs begin synthesizing audio from partial text, producing the first audible chunk in 100-300ms. This is the voice your users hear.
This is the pipeline model: STT feeds LLM, LLM feeds TTS, each component is a separate service. It is the dominant architecture today, and it is what you will build with in this course.
Think of it like a relay race. The STT runner takes the baton (audio) and hands off text to the LLM runner, who hands off text to the TTS runner, who delivers audio back to the user. The total time is the sum of all three legs plus the time spent passing the baton — the transport.
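The relay can be sketched with plain Python generators. Everything here is illustrative — the stage functions, the hard-coded words, and the fake audio frames are stand-ins, not any real STT/LLM/TTS API — but the shape of the hand-off is the point: each stage starts consuming its input before the previous stage has finished producing it.

```python
from typing import Iterator

def stt(audio_chunks: Iterator[bytes]) -> Iterator[str]:
    """Illustrative STT stage: emit partial transcripts as audio arrives."""
    words = ["I'd", "like", "to", "book", "a", "cleaning"]
    for _chunk, word in zip(audio_chunks, words):
        yield word  # partial result — no waiting for the full sentence

def llm(transcript: str) -> Iterator[str]:
    """Illustrative LLM stage: stream response tokens one at a time."""
    for token in ["Sure,", "what", "day", "works", "for", "you?"]:
        yield token

def tts(tokens: Iterator[str]) -> Iterator[bytes]:
    """Illustrative TTS stage: synthesize audio from partial text."""
    for token in tokens:
        yield token.encode()  # stand-in for a 20ms audio frame

# The relay: audio in -> text -> text -> audio out,
# with each baton passed as soon as a partial result exists.
mic = iter([b"\x00" * 320] * 6)      # fake 20ms microphone frames
transcript = " ".join(stt(mic))      # "I'd like to book a cleaning"
audio_out = b" ".join(tts(llm(transcript)))
print(transcript)
print(audio_out.decode())
```

Because every stage is a generator, the first TTS frame can be produced while the LLM is still emitting tokens — that overlap is what keeps total latency close to the sum of each stage's *first-result* time rather than its *full-result* time.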
Pipeline models vs realtime models
There is an alternative architecture emerging: realtime (speech-to-speech) models. Instead of three separate stages, a single model accepts audio input and produces audio output directly. OpenAI's GPT-4o Realtime and Google's Gemini Live are examples.
| Dimension | Pipeline (STT + LLM + TTS) | Realtime (speech-to-speech) |
|---|---|---|
| Architecture | Three separate models in sequence | Single multimodal model |
| Latency | Sum of three stages (~300-800ms) | Single inference pass (~200-500ms) |
| Voice quality | Choose best-in-class TTS independently | Voice quality tied to the model |
| Model flexibility | Swap any component (use Deepgram STT with Claude LLM with Cartesia TTS) | Locked to one provider |
| Cost | Pay for three services | Pay for one (often more expensive per minute) |
| Maturity | Production-proven, well-understood | Emerging, rapid improvement |
| Emotional nuance | TTS determines expressiveness | Model can respond to tone of voice |
LiveKit supports both architectures. In this course, you will use the pipeline model because it gives you maximum control and flexibility — you can swap any component independently. In Course 2.3 (Realtime vs Pipeline), you will explore speech-to-speech models in depth.
The pipeline model is not going away
Even as realtime models improve, pipeline architectures remain dominant in production. The ability to choose the best STT, the best LLM, and the best TTS independently — and to upgrade each on its own schedule — is a powerful operational advantage. Most production voice AI systems today use the pipeline model.
The latency budget: 500ms or bust
Human conversation has a natural rhythm. The average gap between conversational turns is roughly 200 milliseconds. Listeners start perceiving awkwardness around 500ms of silence. Beyond one second, the interaction feels broken.
Your voice AI system has a total budget of roughly 500 milliseconds from the moment a user stops speaking to the moment they hear the first syllable of a response.
Here is where those milliseconds go:
| Stage | Best case | Typical | Notes |
|---|---|---|---|
| Audio capture + encoding | ~50ms | ~50ms | OS audio buffer + Opus encoding |
| Transport to server | ~10ms | ~200ms | Protocol-dependent — this is the key variable |
| STT processing | ~100ms | ~200ms | Streaming models, final transcript |
| LLM first token | ~50ms | ~150ms | Depends on model, prompt length, provider |
| TTS first audio chunk | ~100ms | ~200ms | Streaming synthesis from partial text |
| Transport to user | ~10ms | ~200ms | Same transport overhead, in reverse |
| Total | ~320ms | ~1000ms | Typical case blows the budget |
The model processing stages — STT, LLM, TTS — are largely determined by the state of the art. You can choose faster models, but you cannot cheat physics. The variable you fully control is transport, and it appears in the budget twice.
Transport is the tax you pay twice
A transport layer that adds 200ms each way contributes 400ms to the total. That alone nearly consumes the entire 500ms budget before the AI models even begin processing. This is why protocol choice is the single most impactful infrastructure decision in voice AI.
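The arithmetic is worth making explicit. This sketch just sums the per-stage figures from the table above and shows what happens when only the transport legs change:

```python
# Per-stage latency in ms, taken from the budget table above.
best = {"capture": 50, "transport_up": 10, "stt": 100,
        "llm_first_token": 50, "tts_first_chunk": 100, "transport_down": 10}
typical = {"capture": 50, "transport_up": 200, "stt": 200,
           "llm_first_token": 150, "tts_first_chunk": 200, "transport_down": 200}

BUDGET_MS = 500  # the conversational latency budget

def total(stages: dict) -> int:
    return sum(stages.values())

print(total(best))     # 320 — under budget
print(total(typical))  # 1000 — double the budget

# Transport appears twice. Fix only the two transport legs
# (200ms -> 10ms each way) and keep every model stage at "typical":
fast_transport = {**typical, "transport_up": 10, "transport_down": 10}
print(total(fast_transport))  # 620 — 380ms recovered without touching a model
```

The model stages did not change between the last two totals; swapping the transport alone recovers 380ms, which is why it is called a tax you pay twice.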
Why WebRTC, not WebSockets
If you completed Course 0.1 (LiveKit Architecture), you explored this topic in detail. Here is the summary for voice AI specifically.
WebSockets use TCP. TCP guarantees ordered delivery — if packet 47 is lost, packets 48 through 200 wait in a buffer for the retransmission. This head-of-line blocking can stall audio for 50-200ms. WebSocket audio systems also require a jitter buffer (50-150ms of added latency) to smooth out bursty packet arrival.
WebRTC uses UDP with RTP. Each packet is independent. A lost 20ms audio frame is imperceptible; a 200ms stall waiting for retransmission is not. WebRTC adds adaptive bitrate, echo cancellation, noise suppression, automatic gain control, and DTLS encryption — all purpose-built for media.
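A toy model makes the difference concrete. The timings here are assumptions chosen for illustration (20ms frames, a 150ms retransmission), not measurements of any particular network:

```python
# Toy model: ten 20ms audio frames; frame 3 is lost in transit,
# and a retransmission takes 150ms to arrive.
FRAME_MS = 20
RETRANSMIT_MS = 150
frames = list(range(10))
lost = {3}

# TCP (WebSocket): ordered delivery means every frame after the loss
# sits in a buffer until the retransmitted frame 3 arrives.
tcp_delay = [RETRANSMIT_MS if f >= 3 else 0 for f in frames]

# UDP/RTP (WebRTC): the lost frame is skipped; the codec conceals a
# single 20ms gap, and every later frame plays on schedule.
udp_delay = [0 for f in frames if f not in lost]

print(max(tcp_delay))  # worst-case stall over TCP: 150ms
print(max(udp_delay))  # worst-case stall over UDP: 0ms
print(len(udp_delay))  # 9 of 10 frames played; one 20ms gap concealed
```

The listener's experience differs accordingly: a 150ms stall mid-word is audible and jarring, while a single concealed 20ms gap is below the threshold of perception.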
| Dimension | WebSocket | WebRTC |
|---|---|---|
| One-way latency | 200-500ms | 10-30ms |
| Head-of-line blocking | Yes | No |
| Jitter buffer | 50-150ms added | 20-40ms adaptive |
| Echo cancellation | Build it yourself | Built-in |
| Adaptive bitrate | Build it yourself | Built-in |
The transport delta is 200-400ms round-trip. In a pipeline where STT + LLM + TTS already consumes 300-500ms, that delta determines whether you make the 500ms budget or miss it entirely.
Where LiveKit fits
LiveKit is the infrastructure layer that makes voice AI work at production scale. It provides three things:
1. WebRTC transport. LiveKit runs a Selective Forwarding Unit (SFU) that routes WebRTC media between participants. Your agent connects to a LiveKit Room as a participant, just like a human user would. Audio flows over WebRTC in both directions — low-latency, encrypted, with all the media-aware features described above.
2. The Agents framework. LiveKit's Python and TypeScript SDKs give you a high-level API for building voice AI agents. You define an Agent with instructions, configure STT/LLM/TTS models, and LiveKit handles the plumbing: audio capture, streaming to STT, feeding transcripts to the LLM, streaming LLM output to TTS, and delivering synthesized audio back to the user.
3. Cloud infrastructure. LiveKit Cloud runs the SFU, manages room lifecycle, handles scaling, and provides monitoring and debugging tools. You focus on your agent's logic; LiveKit handles the infrastructure.
Think of LiveKit as the phone system for your AI agent. Just as a traditional phone system handles the connection, audio routing, and call quality so a human receptionist can focus on helping callers, LiveKit handles WebRTC transport, media routing, and infrastructure so your AI agent can focus on the conversation. Your dental receptionist agent does not need to know anything about UDP packets, codec negotiation, or NAT traversal — LiveKit handles all of it.
The building blocks: Rooms, Participants, and Tracks
LiveKit organizes everything around three concepts you will use throughout this course:
Room — A virtual space where a conversation happens. When a caller connects to your dental receptionist, a Room is created. Every Room has a unique name and a lifecycle (created, active, closed).
Participant — Anyone in the Room. Your dental receptionist agent is a Participant. The human caller is a Participant. Each Participant has an identity and can publish or subscribe to media.
Track — A single stream of media. An audio Track carries voice. A Participant can publish audio Tracks (their microphone) and subscribe to other Participants' audio Tracks (to hear them). Your agent subscribes to the caller's audio Track (to listen) and publishes its own audio Track (to speak).
When someone calls your dental receptionist, here is what happens:
- A Room is created on LiveKit's servers
- The caller joins as a Participant and publishes an audio Track (their microphone)
- Your agent joins as a Participant, subscribes to the caller's audio Track, and publishes its own audio Track
- Audio flows: caller's Track goes to your agent's STT pipeline; your agent's TTS output goes back as a Track to the caller
- When the call ends, all Participants leave and the Room closes
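The call flow above can be modeled with a few plain dataclasses. These mirror the *concepts* — Room, Participant, Track — not LiveKit's actual SDK classes, whose names and signatures differ:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str        # e.g. "audio"
    publisher: str   # identity of the Participant who published it

@dataclass
class Participant:
    identity: str
    published: list = field(default_factory=list)
    subscribed: list = field(default_factory=list)

    def publish(self, kind: str) -> Track:
        track = Track(kind, self.identity)
        self.published.append(track)
        return track

@dataclass
class Room:
    name: str
    participants: list = field(default_factory=list)

# A caller connects: a Room is created and both sides join as Participants.
room = Room("dental-call-001")
caller = Participant("caller")
agent = Participant("receptionist-agent")
room.participants += [caller, agent]

# Each side publishes its audio and subscribes to the other's.
caller_mic = caller.publish("audio")     # caller's microphone
agent_voice = agent.publish("audio")     # agent's TTS output
agent.subscribed.append(caller_mic)      # agent listens (feeds STT)
caller.subscribed.append(agent_voice)    # caller hears the agent

print(len(room.participants))         # 2
print(agent.subscribed[0].publisher)  # "caller"
```

Note the symmetry: the agent is not a special server-side entity in this model — it joins, publishes, and subscribes exactly like the human caller does.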
This is the same architecture whether the caller is using a web browser, a mobile app, or a phone line. The transport is always WebRTC. The experience is always low-latency.
What you will build in this course
Over the next twelve chapters, you will build a complete dental clinic AI receptionist that:
- Greets callers warmly and identifies itself as Bright Smile Dental
- Answers frequently asked questions about the clinic
- Checks appointment availability using tool calls
- Books appointments by collecting patient name and preferred time
- Confirms bookings and handles edge cases gracefully
- Uses noise cancellation for clean audio in any environment
- Has behavioral tests that verify it works correctly
- Deploys to LiveKit Cloud with monitoring and debugging
You will start in the next chapter by installing the LiveKit CLI, scaffolding the project, and speaking your first words to your AI receptionist.
Looking ahead
In the next chapter, we will set up your development environment and build your first agent. You will go from zero to a running voice AI agent in under 20 minutes — and you will speak to it.