Chapter 2

Pipeline implementation

In this chapter, you will build a production-quality pipeline agent with carefully selected models and latency optimizations. You already know that a pipeline chains STT, LLM, and TTS together. Now you will learn how to choose the right model for each stage, tune the configuration for minimum latency, and handle the edge cases that separate a demo from a production system.

The standard pipeline

Here is a complete pipeline agent with explicit model configuration for every stage:

pipeline_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a customer support agent for Acme Corp. "
              "Help customers with orders, returns, and product questions. "
              "Keep responses concise — two sentences maximum. "
              "Never use markdown or bullet points."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(model="nova-3"),
      llm=openai.LLM(model="gpt-4o-mini"),
      tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
      vad=silero.VAD.load(),
  )


if __name__ == "__main__":
  server.run()
pipelineAgent.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a customer support agent for Acme Corp. " +
          "Help customers with orders, returns, and product questions. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      stt: new DeepgramSTT({ model: "nova-3" }),
      llm: new OpenAILLM({ model: "gpt-4o-mini" }),
      tts: new CartesiaTTS({ voice: "79a125e8-cd45-4c13-8a67-188112f4dd22" }),
      vad: await SileroVAD.load(),
    });
  },
});

Each parameter represents a stage in the pipeline. Let us examine the model choices for each stage and why they matter.

Choosing your STT model

The STT model is the first link in the chain. Its speed and accuracy directly impact everything downstream — a slow STT delays the LLM, and a misheard word produces a wrong response.

| STT Provider | Model | Latency | Accuracy | Best for |
|---|---|---|---|---|
| Deepgram | nova-3 | Very low | High | General-purpose, English |
| Deepgram | nova-3-medical | Low | Very high (medical) | Healthcare terminology |
| Google | chirp-2 | Low | High | Multilingual support |
| OpenAI | whisper-large-v3 | Higher | Very high | Offline/batch processing |
stt_options.py (excerpt, Python)
# Fast and accurate for English conversations
stt=deepgram.STT(model="nova-3")

# With keyword boosting for domain terms
stt=deepgram.STT(
  model="nova-3",
  keywords=[("Acme Corp", 3.0), ("SKU", 2.0)],
)

# Multilingual with Google
from livekit.plugins import google
stt=google.STT(model="chirp-2", languages=["en", "es", "fr"])
sttOptions.ts (excerpt, TypeScript)
// Fast and accurate for English conversations
const stt = new DeepgramSTT({ model: "nova-3" });

// With keyword boosting for domain terms
const stt = new DeepgramSTT({
  model: "nova-3",
  keywords: [{ word: "Acme Corp", boost: 3.0 }, { word: "SKU", boost: 2.0 }],
});

// Multilingual with Google
import { GoogleSTT } from "@livekit/agents-plugin-google";
const stt = new GoogleSTT({ model: "chirp-2", languages: ["en", "es", "fr"] });

Keyword boosting is underrated

Domain-specific terms — product names, technical jargon, proper nouns — are the most common source of STT errors. Boosting these keywords costs nothing in latency but can dramatically improve accuracy. Add them early and update them as you learn what your users say.

Choosing your LLM

The LLM is the brain of your pipeline. It determines response quality, reasoning capability, and a significant portion of overall latency.

| LLM Provider | Model | First token | Quality | Best for |
|---|---|---|---|---|
| OpenAI | gpt-4o-mini | Fast | Good | High-volume, simple tasks |
| OpenAI | gpt-4o | Moderate | Very high | Complex reasoning, tool use |
| Anthropic | claude-sonnet-4-20250514 | Fast | Very high | Nuanced conversation, long context |
| Anthropic | claude-haiku-3-20250603 | Very fast | Good | Cost-optimized, simple tasks |
llm_options.py (excerpt, Python)
# Fast and cost-effective
llm=openai.LLM(model="gpt-4o-mini")

# High quality for complex conversations
llm=openai.LLM(model="gpt-4o")

# Using Anthropic models
from livekit.plugins import anthropic
llm=anthropic.LLM(model="claude-sonnet-4-20250514")

# With temperature control for more consistent responses
llm=openai.LLM(model="gpt-4o-mini", temperature=0.6)
llmOptions.ts (excerpt, TypeScript)
// Fast and cost-effective
const llm = new OpenAILLM({ model: "gpt-4o-mini" });

// High quality for complex conversations
const llm = new OpenAILLM({ model: "gpt-4o" });

// Using Anthropic models
import { AnthropicLLM } from "@livekit/agents-plugin-anthropic";
const llm = new AnthropicLLM({ model: "claude-sonnet-4-20250514" });

// With temperature control for more consistent responses
const llm = new OpenAILLM({ model: "gpt-4o-mini", temperature: 0.6 });
What's happening

For voice agents, first-token latency matters more than throughput. A model that generates the first token in 200ms but takes 3 seconds for the full response is better than one that takes 500ms for the first token but finishes in 2 seconds total. Why? Because TTS begins synthesizing as soon as the first complete sentence arrives. The user hears audio while the LLM is still generating.
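To make the tradeoff concrete, here is a back-of-envelope comparison using the numbers above. The assumption that the first sentence accounts for roughly 20% of the response is illustrative, not measured:

```python
# Compare the perceived latency of two hypothetical models using the
# example numbers from the text. With streaming TTS, what the user notices
# is roughly the time until the first complete sentence is ready.

def perceived_latency_ms(ttft_ms: float, total_ms: float,
                         first_sentence_share: float = 0.2) -> float:
    """Approximate time until the first sentence is done and TTS can start.

    Generation after the first token is assumed to be uniform, so the first
    sentence (a fixed share of the tokens) finishes partway through.
    """
    return ttft_ms + first_sentence_share * (total_ms - ttft_ms)

model_a = perceived_latency_ms(200, 3000)  # fast first token, slow overall
model_b = perceived_latency_ms(500, 2000)  # slow first token, fast overall
print(f"A: {model_a:.0f}ms, B: {model_b:.0f}ms")  # A starts speaking first
```

Despite finishing a full second later, model A gets audio to the user sooner, which is what the conversation feels like.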

Choosing your TTS

The TTS model determines how your agent sounds. It is often the most impactful choice for user experience — a natural-sounding voice with appropriate pacing makes the difference between an agent people want to talk to and one they hang up on.

| TTS Provider | Latency | Voice quality | Voice options | Best for |
|---|---|---|---|---|
| Cartesia | Very low | High | Large library | Lowest-latency production use |
| ElevenLabs | Low | Very high | Large library, voice cloning | Premium voice quality |
| OpenAI | Moderate | High | Limited set | Simple setup |
| Google | Low | Good | Many languages | Multilingual |
tts_options.py (excerpt, Python)
# Low-latency with Cartesia
tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22")

# With speed and emotion control
tts=cartesia.TTS(
  voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
  speed="normal",
  emotion=["positivity:high"],
)

# Premium quality with ElevenLabs
from livekit.plugins import elevenlabs
tts=elevenlabs.TTS(voice_id="pNInz6obpgDQGcFmaJgB")

# Simple setup with OpenAI
tts=openai.TTS(voice="nova")
ttsOptions.ts (excerpt, TypeScript)
// Low-latency with Cartesia
const tts = new CartesiaTTS({ voice: "79a125e8-cd45-4c13-8a67-188112f4dd22" });

// With speed and emotion control
const tts = new CartesiaTTS({
  voice: "79a125e8-cd45-4c13-8a67-188112f4dd22",
  speed: "normal",
  emotion: ["positivity:high"],
});

// Premium quality with ElevenLabs
import { ElevenLabsTTS } from "@livekit/agents-plugin-elevenlabs";
const tts = new ElevenLabsTTS({ voiceId: "pNInz6obpgDQGcFmaJgB" });

// Simple setup with OpenAI
import { OpenAITTS } from "@livekit/agents-plugin-openai";
const tts = new OpenAITTS({ voice: "nova" });

Latency tuning strategies

Model selection is the biggest lever for latency, but several other techniques can shave off meaningful milliseconds.

1. Minimize instructions length

Every token in your system prompt adds to LLM processing time. Shorter instructions mean faster first-token latency.

instructions.py (excerpt, Python)
# Too long — adds latency
agent=Agent(
  instructions="""You are a customer support agent for Acme Corp, a leading
  provider of innovative consumer products. Founded in 1985, Acme Corp has
  been serving customers worldwide with a commitment to quality and service.
  You should help customers with their orders, process returns according to
  our 30-day return policy, answer product questions based on our catalog,
  and escalate complex issues to human agents when necessary. Always maintain
  a professional but friendly tone. Never use technical jargon unless the
  customer uses it first. Always confirm order numbers before making changes.
  ...(500 more tokens)...""",
)

# Better — concise and specific
agent=Agent(
  instructions=(
      "You are Acme Corp customer support. Help with orders, returns, "
      "and product questions. Two sentences max. No markdown."
  ),
)
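As a rough sanity check, you can estimate prompt length with the common heuristic of about four characters per token for English text. A real count would use the model's tokenizer; this sketch only shows the order-of-magnitude difference:

```python
# Rough token estimate (~4 characters per token for English). Good enough
# to compare prompt lengths; use the model's tokenizer for exact counts.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

# Abbreviated versions of the two prompts from the example above.
long_prompt = (
    "You are a customer support agent for Acme Corp, a leading provider of "
    "innovative consumer products. Founded in 1985, Acme Corp has been "
    "serving customers worldwide with a commitment to quality and service. ..."
)
short_prompt = (
    "You are Acme Corp customer support. Help with orders, returns, "
    "and product questions. Two sentences max. No markdown."
)

print(approx_tokens(long_prompt), approx_tokens(short_prompt))
```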

2. Use region-aware deployment

Deploy your agent in the same region as your model providers. A pipeline agent calling Deepgram (US), OpenAI (US), and Cartesia (US) from a server in Europe adds 100-200ms of network latency per hop — and there are three hops.

Three hops means three penalties

In a pipeline, network latency is paid three times — once for STT, once for LLM, once for TTS. If each hop adds 50ms of unnecessary latency due to cross-region calls, you lose 150ms total. Deploy in the same region as your providers, or use providers with edge endpoints.

3. Tune VAD sensitivity

Voice Activity Detection (VAD) determines when the user has stopped speaking. Aggressive settings (a shorter silence threshold) start the response sooner but risk treating a natural mid-sentence pause as the end of the turn. Conservative settings are safer but add latency.

vad_tuning.py (excerpt, Python)
from livekit.plugins import silero

# Default — balanced
vad=silero.VAD.load()

# Faster response — shorter silence threshold
# Good for quick Q&A style interactions
vad=silero.VAD.load(
  min_silence_duration=0.3,  # seconds of silence before end-of-turn
)
vadTuning.ts (excerpt, TypeScript)
import { SileroVAD } from "@livekit/agents-plugin-silero";

// Default — balanced
const vad = await SileroVAD.load();

// Faster response — shorter silence threshold
const vad = await SileroVAD.load({
  minSilenceDuration: 0.3, // seconds of silence before end-of-turn
});

4. Enable LLM streaming

Streaming is typically enabled by default, but verify it. Non-streaming LLM calls wait for the entire response before passing text to TTS — a latency catastrophe for voice agents.

What's happening

With streaming enabled, the flow looks like this: the LLM emits tokens one at a time. As soon as enough tokens form a complete sentence (or a natural pause point), the TTS begins synthesizing audio for that sentence. The user hears the first sentence while the LLM is still generating the second. Without streaming, the user waits for the entire response to be generated before hearing anything.
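The sentence-chunking step can be sketched as a small buffer that accumulates streamed tokens and emits each complete sentence. Production frameworks handle this internally (and more robustly, e.g. around abbreviations and decimals); this is only an illustration of the idea:

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: terminal punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield each complete sentence,
    so TTS can start synthesizing before the LLM finishes generating."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := _SENTENCE_END.search(buffer)):
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

# Simulated token stream: the first sentence is ready for TTS immediately,
# while the "LLM" is still producing the second.
stream = ["Your ", "order ", "shipped ", "today. ", "It ", "arrives ", "Friday."]
for sentence in sentences_from_stream(stream):
    print(sentence)
```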

Putting it all together: optimized pipeline

Here is a fully optimized pipeline agent that applies all the tuning strategies:

optimized_pipeline.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are Acme Corp support. Help with orders, returns, "
              "and products. Two sentences max. No markdown."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(
          model="nova-3",
          keywords=[("Acme", 3.0), ("SKU", 2.0)],
      ),
      llm=openai.LLM(
          model="gpt-4o-mini",
          temperature=0.6,
      ),
      tts=cartesia.TTS(
          voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
          speed="normal",
      ),
      vad=silero.VAD.load(
          min_silence_duration=0.3,
      ),
  )


if __name__ == "__main__":
  server.run()
optimizedPipeline.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are Acme Corp support. Help with orders, returns, " +
          "and products. Two sentences max. No markdown.",
      }),
      room: session.room,
      stt: new DeepgramSTT({
        model: "nova-3",
        keywords: [
          { word: "Acme", boost: 3.0 },
          { word: "SKU", boost: 2.0 },
        ],
      }),
      llm: new OpenAILLM({
        model: "gpt-4o-mini",
        temperature: 0.6,
      }),
      tts: new CartesiaTTS({
        voice: "79a125e8-cd45-4c13-8a67-188112f4dd22",
        speed: "normal",
      }),
      vad: await SileroVAD.load({
        minSilenceDuration: 0.3,
      }),
    });
  },
});

This configuration targets a total end-to-end latency of 600-900ms: Deepgram Nova-3 for fast STT, GPT-4o-mini for fast first-token, Cartesia for fast audio synthesis, and aggressive VAD for quick turn detection.
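A rough latency budget makes that target concrete. The per-stage numbers below are illustrative assumptions, not measured values; the point is that stage latencies are additive, so every stage must stay fast:

```python
# Back-of-envelope latency budget for this configuration.
# Per-stage numbers are assumptions for illustration only.
budget_ms = {
    "vad_end_of_turn": 300,      # min_silence_duration=0.3 -> 300ms to commit
    "stt_final_transcript": 100, # streaming STT finalizes quickly
    "llm_first_sentence": 250,   # first token + first complete sentence
    "tts_first_audio": 100,      # time to first synthesized audio frame
}

total = sum(budget_ms.values())
print(f"estimated time-to-first-audio: {total}ms")
```

Note that the VAD silence threshold alone can be the single largest line item, which is why tuning it matters so much.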

Observing the pipeline

One major advantage of the pipeline architecture is observability. You can log what happens at every stage:

observable_pipeline.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  @session.on("user_input_transcribed")
  def on_transcript(transcript):
      print(f"[STT] User said: {transcript.text}")

  @session.on("agent_state_changed")
  def on_state(state: str):
      print(f"[Pipeline] Agent state: {state}")

  @session.on("conversation_item_added")
  def on_item(item):
      print(f"[LLM] Conversation item: {item}")

  await session.start(
      agent=Agent(
          instructions=(
              "You are Acme Corp support. Help with orders, returns, "
              "and products. Two sentences max. No markdown."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(model="nova-3"),
      llm=openai.LLM(model="gpt-4o-mini"),
      tts=cartesia.TTS(),
      vad=silero.VAD.load(),
  )


if __name__ == "__main__":
  server.run()
observablePipeline.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    session.on("userInputTranscribed", (transcript) => {
      console.log("[STT] User said:", transcript.text);
    });

    session.on("agentStateChanged", (state) => {
      console.log("[Pipeline] Agent state:", state);
    });

    session.on("conversationItemAdded", (item) => {
      console.log("[LLM] Conversation item:", item);
    });

    await session.start({
      agent: new Agent({
        instructions:
          "You are Acme Corp support. Help with orders, returns, " +
          "and products. Two sentences max. No markdown.",
      }),
      room: session.room,
      stt: new DeepgramSTT({ model: "nova-3" }),
      llm: new OpenAILLM({ model: "gpt-4o-mini" }),
      tts: new CartesiaTTS(),
      vad: await SileroVAD.load(),
    });
  },
});

Every user utterance, every agent state transition, every conversation item — all visible. This transparency is invaluable for debugging, analytics, and quality monitoring. You will see in the next chapter how realtime models provide less visibility by default.
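These events are also a natural place to hang latency metrics. The sketch below assumes you register its methods as handlers for the events shown above (exact event names and payloads may vary by framework version); it measures the gap between the user finishing and the agent starting to speak:

```python
import time
from typing import Optional

class TurnTimer:
    """Records the gap between end of user speech and start of agent speech.

    Wire on_user_transcript to the final-transcript event and on_agent_state
    to the agent-state event (hypothetical wiring, shown for illustration).
    """

    def __init__(self) -> None:
        self._turn_started: Optional[float] = None

    def on_user_transcript(self) -> None:
        # The final transcript marks the moment the user finished speaking.
        self._turn_started = time.monotonic()

    def on_agent_state(self, state: str) -> None:
        # "speaking" marks the first audio reaching the user.
        if state == "speaking" and self._turn_started is not None:
            latency_ms = (time.monotonic() - self._turn_started) * 1000
            print(f"[metrics] turn latency: {latency_ms:.0f}ms")
            self._turn_started = None
```

Logged over many sessions, this one number tells you whether your model and VAD choices are actually hitting the latency budget.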

Test your knowledge

Why is first-token latency more important than total generation time for LLMs in a voice AI pipeline?

What comes next

You have built an optimized pipeline agent with careful model selection, latency tuning, and observability. In the next chapter, you will implement the same agent using OpenAI's Realtime API — a single model that replaces the entire STT + LLM + TTS chain. The comparison will be illuminating.

Keep this agent running

As you build the realtime versions in Chapters 3 and 4, keep your pipeline agent available for comparison. In Chapter 6, you will benchmark all three implementations side by side.

Concepts covered: Pipeline setup, Model selection, Latency tuning