What is voice AI? Why WebRTC matters
In this chapter, you will learn how the voice AI pipeline works end to end, where every millisecond of latency goes, why WebRTC is the only transport protocol that delivers conversational-quality voice AI, and how LiveKit fits into the picture. This is the conceptual foundation for everything you will build in this course.
The voice AI pipeline
Every voice AI system — from a customer support bot to the dental receptionist you will build in this course — follows the same fundamental pipeline. The user speaks, the system thinks, the system speaks back.
That pipeline has three stages:
Speech-to-Text (STT)
Raw audio from the user's microphone is transcribed into text. Streaming STT models like Deepgram Nova-3 produce partial transcripts as audio arrives, so the system does not wait for the user to finish an entire sentence before starting to process it. Typical latency: 100-300ms for a confident final transcript.
Large Language Model (LLM)
The transcript is sent to an LLM — GPT-4o-mini, Claude, or similar — which generates a text response. With streaming token generation, the first token can arrive in 50-200ms. The LLM is where your agent's personality, knowledge, and decision-making live.
Text-to-Speech (TTS)
The LLM's text output is converted into audio. Modern streaming TTS engines like Cartesia or ElevenLabs begin synthesizing audio from partial text, producing the first audible chunk in 100-300ms. This is the voice your users hear.
This is the pipeline model: STT feeds LLM, LLM feeds TTS, each component is a separate service. It is the dominant architecture today, and it is what you will build with in this course.
Think of it like a relay race. The STT runner takes the baton (audio) and hands off text to the LLM runner, who hands off text to the TTS runner, who delivers audio back to the user. The total time is the sum of all three legs plus the time spent passing the baton — the transport.
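The relay can be sketched with plain Python generators. Everything here is illustrative — the stage functions, the hard-coded words, and the fake audio frames are stand-ins, not any real STT/LLM/TTS API — but the shape of the hand-off is the point: each stage starts consuming its input before the previous stage has finished producing it.

```python
from typing import Iterator

def stt(audio_chunks: Iterator[bytes]) -> Iterator[str]:
    """Illustrative STT stage: emit partial transcripts as audio arrives."""
    words = ["I'd", "like", "to", "book", "a", "cleaning"]
    for _chunk, word in zip(audio_chunks, words):
        yield word  # partial result — no waiting for the full sentence

def llm(transcript: str) -> Iterator[str]:
    """Illustrative LLM stage: stream response tokens one at a time."""
    for token in ["Sure,", "what", "day", "works", "for", "you?"]:
        yield token

def tts(tokens: Iterator[str]) -> Iterator[bytes]:
    """Illustrative TTS stage: synthesize audio from partial text."""
    for token in tokens:
        yield token.encode()  # stand-in for a 20ms audio frame

# The relay: audio in -> text -> text -> audio out,
# with each baton passed as soon as a partial result exists.
mic = iter([b"\x00" * 320] * 6)      # fake 20ms microphone frames
transcript = " ".join(stt(mic))      # "I'd like to book a cleaning"
audio_out = b" ".join(tts(llm(transcript)))
print(transcript)
print(audio_out.decode())
```

Because every stage is a generator, the first TTS frame can be produced while the LLM is still emitting tokens — that overlap is what keeps total latency close to the sum of each stage's *first-result* time rather than its *full-result* time.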
Pipeline models vs realtime models
There is an alternative architecture emerging: realtime (speech-to-speech) models. Instead of three separate stages, a single model accepts audio input and produces audio output directly. OpenAI's GPT-4o Realtime and Google's Gemini Live are examples.
| Dimension | Pipeline (STT + LLM + TTS) | Realtime (speech-to-speech) |
|---|---|---|
| Architecture | Three separate models in sequence | Single multimodal model |
| Latency | Sum of three stages (~300-800ms) | Single inference pass (~200-500ms) |
| Voice quality | Choose best-in-class TTS independently | Voice quality tied to the model |
| Model flexibility | Swap any component (use Deepgram STT with Claude LLM with Cartesia TTS) | Locked to one provider |
| Cost | Pay for three services | Pay for one (often more expensive per minute) |
| Maturity | Production-proven, well-understood | Emerging, rapid improvement |
| Emotional nuance | TTS determines expressiveness | Model can respond to tone of voice |
LiveKit supports both architectures. In this course, you will use the pipeline model because it gives you maximum control and flexibility — you can swap any component independently. In Course 2.3 (Realtime vs Pipeline), you will explore speech-to-speech models in depth.
The pipeline model is not going away
Even as realtime models improve, pipeline architectures remain dominant in production. The ability to choose the best STT, the best LLM, and the best TTS independently — and to upgrade each on its own schedule — is a powerful operational advantage. Most production voice AI systems today use the pipeline model.
The latency budget: 500ms or bust
Human conversation has a natural rhythm. The average gap between conversational turns is roughly 200 milliseconds. Listeners start perceiving awkwardness around 500ms of silence. Beyond one second, the interaction feels broken.
Your voice AI system has a total budget of roughly 500 milliseconds from the moment a user stops speaking to the moment they hear the first syllable of a response.
Here is where those milliseconds go:
| Stage | Best case | Typical | Notes |
|---|---|---|---|
| Audio capture + encoding | ~50ms | ~50ms | OS audio buffer + Opus encoding |
| Transport to server | ~10ms | ~200ms | Protocol-dependent — this is the key variable |
| STT processing | ~100ms | ~200ms | Streaming models, final transcript |
| LLM first token | ~50ms | ~150ms | Depends on model, prompt length, provider |
| TTS first audio chunk | ~100ms | ~200ms | Streaming synthesis from partial text |
| Transport to user | ~10ms | ~200ms | Same transport overhead, in reverse |
| Total | ~320ms | ~1000ms | Typical case blows the budget |
The model processing stages — STT, LLM, TTS — are largely determined by the state of the art. You can choose faster models, but you cannot cheat physics. The variable you fully control is transport, and it appears in the budget twice.
Transport is the tax you pay twice
A transport layer that adds 200ms each way contributes 400ms to the total. That alone nearly consumes the entire 500ms budget before the AI models even begin processing. This is why protocol choice is the single most impactful infrastructure decision in voice AI.
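The arithmetic is worth making explicit. This sketch just sums the per-stage figures from the table above and shows what happens when only the transport legs change:

```python
# Per-stage latency in ms, taken from the budget table above.
best = {"capture": 50, "transport_up": 10, "stt": 100,
        "llm_first_token": 50, "tts_first_chunk": 100, "transport_down": 10}
typical = {"capture": 50, "transport_up": 200, "stt": 200,
           "llm_first_token": 150, "tts_first_chunk": 200, "transport_down": 200}

BUDGET_MS = 500  # the conversational latency budget

def total(stages: dict) -> int:
    return sum(stages.values())

print(total(best))     # 320 — under budget
print(total(typical))  # 1000 — double the budget

# Transport appears twice. Fix only the two transport legs
# (200ms -> 10ms each way) and keep every model stage at "typical":
fast_transport = {**typical, "transport_up": 10, "transport_down": 10}
print(total(fast_transport))  # 620 — 380ms recovered without touching a model
```

The model stages did not change between the last two totals; swapping the transport alone recovers 380ms, which is why it is called a tax you pay twice.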
Why WebRTC, not WebSockets
If you completed Course 0.1 (LiveKit Architecture), you explored this topic in detail. Here is the summary for voice AI specifically.
WebSockets use TCP. TCP guarantees ordered delivery — if packet 47 is lost, packets 48 through 200 wait in a buffer for the retransmission. This head-of-line blocking can stall audio for 50-200ms. WebSocket audio systems also require a jitter buffer (50-150ms of added latency) to smooth out bursty packet arrival.
WebRTC uses UDP with RTP. Each packet is independent. A lost 20ms audio frame is imperceptible; a 200ms stall waiting for retransmission is not. WebRTC adds adaptive bitrate, echo cancellation, noise suppression, automatic gain control, and DTLS encryption — all purpose-built for media.
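A toy model makes the difference concrete. The timings here are assumptions chosen for illustration (20ms frames, a 150ms retransmission), not measurements of any particular network:

```python
# Toy model: ten 20ms audio frames; frame 3 is lost in transit,
# and a retransmission takes 150ms to arrive.
FRAME_MS = 20
RETRANSMIT_MS = 150
frames = list(range(10))
lost = {3}

# TCP (WebSocket): ordered delivery means every frame after the loss
# sits in a buffer until the retransmitted frame 3 arrives.
tcp_delay = [RETRANSMIT_MS if f >= 3 else 0 for f in frames]

# UDP/RTP (WebRTC): the lost frame is skipped; the codec conceals a
# single 20ms gap, and every later frame plays on schedule.
udp_delay = [0 for f in frames if f not in lost]

print(max(tcp_delay))  # worst-case stall over TCP: 150ms
print(max(udp_delay))  # worst-case stall over UDP: 0ms
print(len(udp_delay))  # 9 of 10 frames played; one 20ms gap concealed
```

The listener's experience differs accordingly: a 150ms stall mid-word is audible and jarring, while a single concealed 20ms gap is below the threshold of perception.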
| Dimension | WebSocket | WebRTC |
|---|---|---|
| One-way latency | 200-500ms | 10-30ms |
| Head-of-line blocking | Yes | No |
| Jitter buffer | 50-150ms added | 20-40ms adaptive |
| Echo cancellation | Build it yourself | Built-in |
| Adaptive bitrate | Build it yourself | Built-in |
The transport delta is 200-400ms round-trip. In a pipeline where STT + LLM + TTS already consumes 300-500ms, that delta determines whether you make the 500ms budget or miss it entirely.
Where LiveKit fits
LiveKit is the infrastructure layer that makes voice AI work at production scale. It provides three things:
1. WebRTC transport. LiveKit runs a Selective Forwarding Unit (SFU) that routes WebRTC media between participants. Your agent connects to a LiveKit Room as a participant, just like a human user would. Audio flows over WebRTC in both directions — low-latency, encrypted, with all the media-aware features described above.
2. The Agents framework. LiveKit's Python and TypeScript SDKs give you a high-level API for building voice AI agents. You define an Agent with instructions, configure STT/LLM/TTS models, and LiveKit handles the plumbing: audio capture, streaming to STT, feeding transcripts to the LLM, streaming LLM output to TTS, and delivering synthesized audio back to the user.
3. Cloud infrastructure. LiveKit Cloud runs the SFU, manages room lifecycle, handles scaling, and provides monitoring and debugging tools. You focus on your agent's logic; LiveKit handles the infrastructure.
Think of LiveKit as the phone system for your AI agent. Just as a traditional phone system handles the connection, audio routing, and call quality so a human receptionist can focus on helping callers, LiveKit handles WebRTC transport, media routing, and infrastructure so your AI agent can focus on the conversation. Your dental receptionist agent does not need to know anything about UDP packets, codec negotiation, or NAT traversal — LiveKit handles all of it.
The building blocks: Rooms, Participants, and Tracks
LiveKit organizes everything around three concepts you will use throughout this course:
Room — A virtual space where a conversation happens. When a caller connects to your dental receptionist, a Room is created. Every Room has a unique name and a lifecycle (created, active, closed).
Participant — Anyone in the Room. Your dental receptionist agent is a Participant. The human caller is a Participant. Each Participant has an identity and can publish or subscribe to media.
Track — A single stream of media. An audio Track carries voice. A Participant can publish audio Tracks (their microphone) and subscribe to other Participants' audio Tracks (to hear them). Your agent subscribes to the caller's audio Track (to listen) and publishes its own audio Track (to speak).
When someone calls your dental receptionist, here is what happens:
- A Room is created on LiveKit's servers
- The caller joins as a Participant and publishes an audio Track (their microphone)
- Your agent joins as a Participant, subscribes to the caller's audio Track, and publishes its own audio Track
- Audio flows: caller's Track goes to your agent's STT pipeline; your agent's TTS output goes back as a Track to the caller
- When the call ends, all Participants leave and the Room closes
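The call flow above can be modeled with a few plain dataclasses. These mirror the *concepts* — Room, Participant, Track — not LiveKit's actual SDK classes, whose names and signatures differ:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str        # e.g. "audio"
    publisher: str   # identity of the Participant who published it

@dataclass
class Participant:
    identity: str
    published: list = field(default_factory=list)
    subscribed: list = field(default_factory=list)

    def publish(self, kind: str) -> Track:
        track = Track(kind, self.identity)
        self.published.append(track)
        return track

@dataclass
class Room:
    name: str
    participants: list = field(default_factory=list)

# A caller connects: a Room is created and both sides join as Participants.
room = Room("dental-call-001")
caller = Participant("caller")
agent = Participant("receptionist-agent")
room.participants += [caller, agent]

# Each side publishes its audio and subscribes to the other's.
caller_mic = caller.publish("audio")     # caller's microphone
agent_voice = agent.publish("audio")     # agent's TTS output
agent.subscribed.append(caller_mic)      # agent listens (feeds STT)
caller.subscribed.append(agent_voice)    # caller hears the agent

print(len(room.participants))         # 2
print(agent.subscribed[0].publisher)  # "caller"
```

Note the symmetry: the agent is not a special server-side entity in this model — it joins, publishes, and subscribes exactly like the human caller does.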
This is the same architecture whether the caller is using a web browser, a mobile app, or a phone line. The transport is always WebRTC. The experience is always low-latency.
What you will build in this course
Over the next twelve chapters, you will build a complete dental clinic AI receptionist that:
- Greets callers warmly and identifies itself as Bright Smile Dental
- Answers frequently asked questions about the clinic
- Checks appointment availability using tool calls
- Books appointments by collecting patient name and preferred time
- Confirms bookings and handles edge cases gracefully
- Uses noise cancellation for clean audio in any environment
- Has behavioral tests that verify it works correctly
- Deploys to LiveKit Cloud with monitoring and debugging
You will start in the next chapter by installing the LiveKit CLI, scaffolding the project, and speaking your first words to your AI receptionist.
Looking ahead
In the next chapter, we will set up your development environment and build your first agent. You will go from zero to a running voice AI agent in under 20 minutes — and you will speak to it.