Chapter 1


Architecture comparison: pipeline vs realtime

Voice AI agents can be built with two fundamentally different architectures. The pipeline approach chains separate models together — speech-to-text, then a language model, then text-to-speech. The realtime approach uses a single end-to-end model that processes audio directly and produces audio output. Each architecture makes different tradeoffs around latency, control, cost, and capability. Understanding these tradeoffs is the foundation for every decision in this course.

The pipeline architecture

The pipeline model is the traditional approach to building voice AI. It breaks the problem into three discrete stages, each handled by a specialized model.

Pipeline architecture: User Audio → STT → LLM → TTS → Agent Audio

1. Speech-to-Text (STT)

The user's audio is transcribed into text by a dedicated STT model such as Deepgram Nova-3 or Google Chirp. Streaming STT produces partial transcripts as the user speaks, so the next stage can begin processing before the user finishes.

2. Large Language Model (LLM)

The transcribed text is sent to a text-based LLM like GPT-4o or Claude. The LLM generates a text response based on the conversation history and instructions. Streaming output means the first tokens arrive quickly.

3. Text-to-Speech (TTS)

The LLM's text output is converted back to audio by a TTS model such as Cartesia or ElevenLabs. Streaming TTS begins synthesizing audio as soon as the first sentence arrives from the LLM, without waiting for the full response.

What's happening

The pipeline architecture works because each stage streams into the next. The STT does not wait for the user to finish before sending partial transcripts. The LLM does not wait for the full transcript before generating tokens. The TTS does not wait for the full response before synthesizing audio. This pipelining is what makes the architecture viable for real-time conversation despite having three separate models in the chain.
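
The stage-to-stage streaming described above can be sketched with plain async generators. This is a toy illustration of the pipelining concept, not LiveKit code: each stage emits output as soon as its upstream produces a piece of input, rather than waiting for the whole input.

```python
# Toy illustration of pipelining (not LiveKit code): three async generator
# "stages" chained so each emits output before its upstream input finishes.
import asyncio


async def stt(audio_chunks):
    # Emit a partial transcript for every audio chunk as it arrives.
    async for chunk in audio_chunks:
        yield f"transcript({chunk})"


async def llm(transcripts):
    # Start generating tokens from the first partial transcript.
    async for text in transcripts:
        yield f"token({text})"


async def tts(tokens):
    # Synthesize audio per token instead of waiting for the full response.
    async for token in tokens:
        yield f"audio({token})"


async def mic():
    # Simulate user audio arriving chunk by chunk over time.
    for i in range(3):
        await asyncio.sleep(0)
        yield f"chunk{i}"


async def main():
    # Chain the stages; frames flow through all three concurrently.
    return [frame async for frame in tts(llm(stt(mic())))]


print(asyncio.run(main()))
```

The first audio frame comes out after only one chunk has passed through all three stages, which is exactly why streaming keeps the chained architecture responsive.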

In LiveKit, a pipeline agent looks like this:

pipeline_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram, openai, cartesia

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=session.room,
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
    )


if __name__ == "__main__":
    server.run()

pipelineAgent.ts (TypeScript)
import { Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({ instructions: "You are a helpful assistant." }),
      room: session.room,
      stt: new DeepgramSTT({ model: "nova-3" }),
      llm: new OpenAILLM({ model: "gpt-4o" }),
      tts: new CartesiaTTS(),
    });
  },
});

Three separate models, each independently configurable. This is the defining characteristic of the pipeline — you can swap any component without touching the others. Want to switch from Deepgram to Whisper for STT? Change one line:

swap_stt.py (Python)
# Before: Deepgram Nova-3
stt=deepgram.STT(model="nova-3"),

# After: OpenAI Whisper — only this line changes
stt=openai.STT(model="whisper-1"),

# The LLM and TTS stay exactly the same
llm=openai.LLM(model="gpt-4o"),
tts=cartesia.TTS(),

The same applies to any component. Switch the LLM from GPT-4o to Claude without touching STT or TTS. Switch TTS from Cartesia to ElevenLabs without touching STT or the LLM. Each component is an independent slot.
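
The "independent slot" idea can be sketched outside any framework as a plain mapping from provider names to factories. The provider names and tuples here are purely illustrative, not a real plugin registry:

```python
# Illustrative only: each pipeline slot is an independent factory keyed by
# a provider name, so swapping one slot never touches the others.
STT_PROVIDERS = {
    "deepgram": lambda: ("deepgram", "nova-3"),
    "whisper": lambda: ("openai", "whisper-1"),
}


def build_pipeline(stt_name: str) -> dict:
    # Changing the STT is a one-key change; the LLM and TTS slots stay fixed.
    return {
        "stt": STT_PROVIDERS[stt_name](),
        "llm": ("openai", "gpt-4o"),
        "tts": ("cartesia", "default"),
    }


print(build_pipeline("whisper")["stt"])  # ('openai', 'whisper-1')
```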

The realtime architecture

Realtime models take a fundamentally different approach. Instead of converting audio to text, processing text, and converting back to audio, a single model handles the entire flow end-to-end.

Realtime architecture: User Audio → Realtime Model → Agent Audio

The model receives raw audio input and produces raw audio output. There is no intermediate text representation. The model "hears" the user and "speaks" the response directly.

realtime_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.openai import realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )


if __name__ == "__main__":
    server.run()

realtimeAgent.ts (TypeScript)
import { Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({ instructions: "You are a helpful assistant." }),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
      }),
    });
  },
});

Notice what is missing: no stt and no tts parameters. The realtime model handles everything. You provide it as the llm and the framework routes audio directly to and from it.

Realtime models still use the llm parameter

Even though a realtime model is not a traditional text LLM, LiveKit's AgentSession accepts it via the llm parameter. The framework detects that it is a realtime model and adjusts the pipeline accordingly — skipping STT and TTS entirely and routing audio directly. This keeps the API surface simple.
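
A hypothetical sketch of how a framework might branch on the model type behind a single `llm` parameter. This is not LiveKit's actual implementation; the class and function names are made up to show the idea:

```python
# Hypothetical sketch (NOT LiveKit internals): one `llm` parameter accepts
# either model type, and the audio path is chosen by inspecting it.
class TextLLM:
    """Stands in for a text-based LLM plugin."""


class RealtimeModel:
    """Stands in for an end-to-end realtime model."""


def build_audio_path(llm, stt=None, tts=None):
    if isinstance(llm, RealtimeModel):
        # Route audio directly; STT and TTS are skipped entirely.
        return ["audio-in", "realtime-model", "audio-out"]
    # Otherwise assemble the full three-stage pipeline.
    return ["audio-in", "stt", "text-llm", "tts", "audio-out"]


print(build_audio_path(RealtimeModel()))
print(build_audio_path(TextLLM(), stt="nova-3", tts="cartesia"))
```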

Comparing the architectures

Here is a side-by-side comparison of the two approaches across the dimensions that matter most for production voice agents.

Dimension | Pipeline (STT + LLM + TTS) | Realtime (end-to-end)
--- | --- | ---
Latency | Higher — three model round-trips, even with streaming | Lower — single model, no intermediate steps
Component control | Full — swap STT, LLM, or TTS independently | Limited — the model is a single unit
Voice selection | Wide — choose from any TTS provider's voice library | Narrow — limited to the realtime model's built-in voices
Tool use | Mature — standard LLM function calling | Emerging — supported but with fewer patterns
Conversation nuance | Text-based — loses tone, emphasis, emotion from audio | Audio-native — preserves vocal nuance and prosody
Provider options | Many — mix and match across dozens of providers | Few — OpenAI Realtime, Gemini Live, and a small number of others
Cost | Variable — pay separately for each model | Bundled — single pricing, often higher per-minute
Interruption handling | Engineered — requires VAD and explicit logic | Native — the model handles interruptions naturally
Multilingual | Per-component — each model must support the language | Unified — the model handles language natively
Transparency | High — you can log transcripts at each stage | Lower — no intermediate text unless explicitly requested

What's happening

Neither architecture is universally better. The pipeline gives you maximum control and provider flexibility at the cost of higher latency and engineering complexity. Realtime models give you lower latency and more natural conversation dynamics at the cost of fewer customization options and provider lock-in. The right choice depends on your specific requirements — which is exactly what the rest of this course will help you determine.

How latency adds up in a pipeline

Understanding where time goes in each architecture helps explain the latency difference. In a pipeline, the total time from the user finishing their sentence to hearing the first syllable of the response is approximately:

Pipeline latency components: STT finalization + LLM first token + TTS first audio chunk

Typical values:

Stage | Typical latency | What contributes
--- | --- | ---
STT finalization | 200-400ms | Endpointing delay, final transcript processing
LLM first token | 200-500ms | Network round-trip, model inference startup
TTS first chunk | 100-300ms | Network round-trip, audio buffer fill
Total | 500-1200ms | Cumulative across three models

A realtime model collapses all three stages into one:

Stage | Typical latency | What contributes
--- | --- | ---
Model first audio | 200-500ms | Single network round-trip, model inference
Total | 200-500ms | Single model, no cascading delays
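
The totals in the tables above are simply the sums of the per-stage ranges. A quick calculation makes the comparison concrete:

```python
# Sum the per-stage latency ranges from the tables above (values in ms).
pipeline_stages = {
    "stt_finalization": (200, 400),
    "llm_first_token": (200, 500),
    "tts_first_chunk": (100, 300),
}
realtime_stages = {
    "model_first_audio": (200, 500),
}


def total_range(stages):
    # Low and high bounds accumulate independently across stages.
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print(total_range(pipeline_stages))  # (500, 1200)
print(total_range(realtime_stages))  # (200, 500)
```

Because the pipeline's bounds add stage by stage, even fast individual components compound into a noticeably longer worst case.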

These are typical ranges, not guarantees

Actual latency depends on model choice, network conditions, region, load, and configuration. A well-tuned pipeline with fast models can approach realtime latency. A poorly configured realtime model can be slower than expected. The numbers above represent typical production scenarios, not theoretical minimums.

When pipeline wins

The pipeline architecture is the better choice in several common scenarios:

  • You need a specific voice. TTS providers like Cartesia and ElevenLabs offer hundreds of voices with fine-grained control over speed, emotion, and style. Realtime models offer a handful of built-in voices.
  • You need the best LLM. Pipeline lets you use whatever text LLM is best for your use case — Claude for nuanced reasoning, GPT-4o for general capability, a fine-tuned model for domain-specific tasks. Realtime models are limited to what the provider offers.
  • You need full transcript logging. In regulated industries (healthcare, finance, legal), you may need a complete text transcript of every conversation. The pipeline produces transcripts naturally at the STT and LLM stages. Realtime models can produce transcripts but it is not their primary output.
  • You need heavy tool use. Complex multi-step tool calling with validation, retries, and chained operations is more mature and predictable in text-based LLMs.
  • You need to control cost. With a pipeline, you can use a cheaper STT, a smaller LLM, and a budget TTS to minimize cost. Realtime models have bundled pricing that you cannot optimize component by component.
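
The component-by-component cost lever in the last bullet can be sketched with a back-of-the-envelope calculation. The per-minute prices below are invented for illustration only; real pricing varies by provider and changes frequently:

```python
# Hypothetical per-minute prices, for illustration only — real provider
# pricing differs and changes over time.
pipeline_cost_per_min = {
    "stt": 0.005,   # budget STT
    "llm": 0.020,   # smaller text LLM
    "tts": 0.015,   # budget TTS
}
realtime_cost_per_min = 0.060  # bundled, assumed flat rate

# Each pipeline component can be optimized independently; the bundled
# realtime price is take-it-or-leave-it.
pipeline_total = sum(pipeline_cost_per_min.values())
print(f"pipeline: ${pipeline_total:.3f}/min, realtime: ${realtime_cost_per_min:.3f}/min")
```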

When realtime wins

Realtime models are the better choice in other scenarios:

  • Latency is critical. If you need the fastest possible response time — a customer service bot where every millisecond matters, or an interactive game character — realtime models have a fundamental advantage.
  • Natural conversation is the goal. Realtime models hear tone, pacing, and emphasis. They can respond with matching energy, pause naturally, and handle overlapping speech gracefully. Pipelines lose this nuance in the text conversion.
  • Simplicity matters. One model instead of three means fewer configuration decisions, fewer failure points, and fewer API keys to manage.
  • Interruption handling must be seamless. Realtime models handle barge-in (the user interrupting the agent mid-sentence) natively. Pipelines require explicit VAD configuration and careful interruption logic.
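
The "explicit VAD configuration and careful interruption logic" a pipeline needs can be sketched as a small state machine. This is a toy illustration of barge-in handling, not framework code:

```python
# Toy barge-in state machine: if VAD detects user speech while the agent
# is speaking, playback is cancelled so the user can take the floor.
class PipelineAgent:
    def __init__(self):
        self.state = "listening"
        self.events = []

    def on_agent_speech_start(self):
        self.state = "speaking"

    def on_vad_speech_detected(self):
        if self.state == "speaking":
            # Barge-in: stop audio output and discard the in-flight response.
            self.events.append("cancel_tts_playback")
            self.events.append("flush_llm_stream")
        self.state = "listening"


agent = PipelineAgent()
agent.on_agent_speech_start()
agent.on_vad_speech_detected()  # user interrupts mid-sentence
print(agent.events)  # ['cancel_tts_playback', 'flush_llm_stream']
```

A realtime model makes this machinery unnecessary because the model itself hears the overlapping speech and stops on its own.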


What comes next

You now understand the two architectures at a conceptual level. In the next chapter, you will build a fully optimized pipeline agent, exploring model selection, latency tuning, and the configuration options that make pipelines competitive. Then in Chapters 3 and 4, you will implement the same agent using OpenAI Realtime and Gemini Live, giving you hands-on experience with both approaches.

Build both, then decide

The most effective way to choose an architecture is to build both and compare them in your specific context. This course is structured to help you do exactly that — by the end, you will have pipeline, OpenAI Realtime, and Gemini Live implementations of the same agent, with benchmarks to compare them.

Concepts covered: pipeline model, realtime model, tradeoffs