Pipeline implementation
In this chapter, you will build a production-quality pipeline agent with carefully selected models and latency optimizations. You already know that a pipeline chains STT, LLM, and TTS together. Now you will learn how to choose the right model for each stage, tune the configuration for minimum latency, and handle the edge cases that separate a demo from a production system.
The standard pipeline
Here is a complete pipeline agent with explicit model configuration for every stage:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a customer support agent for Acme Corp. "
                "Help customers with orders, returns, and product questions. "
                "Keep responses concise — two sentences maximum. "
                "Never use markdown or bullet points."
            ),
        ),
        room=session.room,
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
        vad=silero.VAD.load(),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a customer support agent for Acme Corp. " +
          "Help customers with orders, returns, and product questions. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      stt: new DeepgramSTT({ model: "nova-3" }),
      llm: new OpenAILLM({ model: "gpt-4o-mini" }),
      tts: new CartesiaTTS({ voice: "79a125e8-cd45-4c13-8a67-188112f4dd22" }),
      vad: await SileroVAD.load(),
    });
  },
});
```

Each parameter represents a stage in the pipeline. Let us examine the model choices for each stage and why they matter.
Choosing your STT model
The STT model is the first link in the chain. Its speed and accuracy directly impact everything downstream — a slow STT delays the LLM, and a misheard word produces a wrong response.
| STT Provider | Model | Latency | Accuracy | Best for |
|---|---|---|---|---|
| Deepgram | nova-3 | Very low | High | General-purpose, English |
| Deepgram | nova-3-medical | Low | Very high (medical) | Healthcare terminology |
| Google | chirp-2 | Low | High | Multilingual support |
| OpenAI | whisper-large-v3 | Higher | Very high | Offline/batch processing |
```python
# Fast and accurate for English conversations
stt=deepgram.STT(model="nova-3")

# With keyword boosting for domain terms
stt=deepgram.STT(
    model="nova-3",
    keywords=[("Acme Corp", 3.0), ("SKU", 2.0)],
)

# Multilingual with Google
from livekit.plugins import google
stt=google.STT(model="chirp-2", languages=["en", "es", "fr"])
```

```typescript
// Fast and accurate for English conversations
const stt = new DeepgramSTT({ model: "nova-3" });

// With keyword boosting for domain terms
const stt = new DeepgramSTT({
  model: "nova-3",
  keywords: [{ word: "Acme Corp", boost: 3.0 }, { word: "SKU", boost: 2.0 }],
});

// Multilingual with Google
import { GoogleSTT } from "@livekit/agents-plugin-google";
const stt = new GoogleSTT({ model: "chirp-2", languages: ["en", "es", "fr"] });
```

Keyword boosting is underrated
Domain-specific terms — product names, technical jargon, proper nouns — are the most common source of STT errors. Boosting these keywords costs nothing in latency but can dramatically improve accuracy. Add them early and update them as you learn what your users say.
Choosing your LLM
The LLM is the brain of your pipeline. It determines response quality, reasoning capability, and a significant portion of overall latency.
| LLM Provider | Model | First token | Quality | Best for |
|---|---|---|---|---|
| OpenAI | gpt-4o-mini | Fast | Good | High-volume, simple tasks |
| OpenAI | gpt-4o | Moderate | Very high | Complex reasoning, tool use |
| Anthropic | claude-sonnet-4-20250514 | Fast | Very high | Nuanced conversation, long context |
| Anthropic | claude-haiku-3-20250603 | Very fast | Good | Cost-optimized, simple tasks |
```python
# Fast and cost-effective
llm=openai.LLM(model="gpt-4o-mini")

# High quality for complex conversations
llm=openai.LLM(model="gpt-4o")

# Using Anthropic models
from livekit.plugins import anthropic
llm=anthropic.LLM(model="claude-sonnet-4-20250514")

# With temperature control for more consistent responses
llm=openai.LLM(model="gpt-4o-mini", temperature=0.6)
```

```typescript
// Fast and cost-effective
const llm = new OpenAILLM({ model: "gpt-4o-mini" });

// High quality for complex conversations
const llm = new OpenAILLM({ model: "gpt-4o" });

// Using Anthropic models
import { AnthropicLLM } from "@livekit/agents-plugin-anthropic";
const llm = new AnthropicLLM({ model: "claude-sonnet-4-20250514" });

// With temperature control for more consistent responses
const llm = new OpenAILLM({ model: "gpt-4o-mini", temperature: 0.6 });
```

For voice agents, first-token latency matters more than throughput. A model that generates the first token in 200ms but takes 3 seconds for the full response is better than one that takes 500ms for the first token but finishes in 2 seconds total. Why? Because TTS begins synthesizing as soon as the first complete sentence arrives. The user hears audio while the LLM is still generating.
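A back-of-envelope model makes this concrete. The numbers below are purely illustrative, and the assumption that the first sentence accounts for roughly 15% of post-first-token generation time is a rough guess, not a measured figure:

```python
# Illustrative model of time-to-first-audio. Assumes TTS can start as soon
# as the first sentence is complete, and that the first sentence takes
# ~15% of the post-first-token generation time (an assumption).

def time_to_first_audio(first_token_ms: float, total_gen_ms: float,
                        first_sentence_share: float = 0.15,
                        tts_first_byte_ms: float = 100.0) -> float:
    # Time spent generating the rest of the first sentence after token one
    first_sentence_ms = (total_gen_ms - first_token_ms) * first_sentence_share
    return first_token_ms + first_sentence_ms + tts_first_byte_ms

# 200 ms first token, 3 s total vs. 500 ms first token, 2 s total
fast_start = time_to_first_audio(200, 3000)
fast_finish = time_to_first_audio(500, 2000)
print(f"fast first token: {fast_start:.0f} ms")   # 720 ms
print(f"fast completion:  {fast_finish:.0f} ms")  # 825 ms
```

Under these assumptions, the model that finishes a full second later still gets audio to the user first.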
Choosing your TTS
The TTS model determines how your agent sounds. It is often the most impactful choice for user experience — a natural-sounding voice with appropriate pacing makes the difference between an agent people want to talk to and one they hang up on.
| TTS Provider | Latency | Voice quality | Voice options | Best for |
|---|---|---|---|---|
| Cartesia | Very low | High | Large library | Lowest-latency production use |
| ElevenLabs | Low | Very high | Large library, voice cloning | Premium voice quality |
| OpenAI | Moderate | High | Limited set | Simple setup |
| Google | Low | Good | Many languages | Multilingual |
```python
# Low-latency with Cartesia
tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22")

# With speed and emotion control
tts=cartesia.TTS(
    voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
    speed="normal",
    emotion=["positivity:high"],
)

# Premium quality with ElevenLabs
from livekit.plugins import elevenlabs
tts=elevenlabs.TTS(voice_id="pNInz6obpgDQGcFmaJgB")

# Simple setup with OpenAI
tts=openai.TTS(voice="nova")
```

```typescript
// Low-latency with Cartesia
const tts = new CartesiaTTS({ voice: "79a125e8-cd45-4c13-8a67-188112f4dd22" });

// With speed and emotion control
const tts = new CartesiaTTS({
  voice: "79a125e8-cd45-4c13-8a67-188112f4dd22",
  speed: "normal",
  emotion: ["positivity:high"],
});

// Premium quality with ElevenLabs
import { ElevenLabsTTS } from "@livekit/agents-plugin-elevenlabs";
const tts = new ElevenLabsTTS({ voiceId: "pNInz6obpgDQGcFmaJgB" });

// Simple setup with OpenAI
import { OpenAITTS } from "@livekit/agents-plugin-openai";
const tts = new OpenAITTS({ voice: "nova" });
```

Latency tuning strategies
Model selection is the biggest lever for latency, but several other techniques can shave off meaningful milliseconds.
1. Minimize instruction length
Every token in your system prompt adds to LLM processing time. Shorter instructions mean faster first-token latency.
```python
# Too long — adds latency
agent=Agent(
    instructions="""You are a customer support agent for Acme Corp, a leading
    provider of innovative consumer products. Founded in 1985, Acme Corp has
    been serving customers worldwide with a commitment to quality and service.
    You should help customers with their orders, process returns according to
    our 30-day return policy, answer product questions based on our catalog,
    and escalate complex issues to human agents when necessary. Always maintain
    a professional but friendly tone. Never use technical jargon unless the
    customer uses it first. Always confirm order numbers before making changes.
    ...(500 more tokens)...""",
)

# Better — concise and specific
agent=Agent(
    instructions=(
        "You are Acme Corp customer support. Help with orders, returns, "
        "and product questions. Two sentences max. No markdown."
    ),
)
```

2. Use region-aware deployment
Deploy your agent in the same region as your model providers. A pipeline agent calling Deepgram (US), OpenAI (US), and Cartesia (US) from a server in Europe adds 100-200ms of network latency per hop — and there are three hops.
Three hops means three penalties
In a pipeline, network latency is paid three times — once for STT, once for LLM, once for TTS. If each hop adds 50ms of unnecessary latency due to cross-region calls, you lose 150ms total. Deploy in the same region as your providers, or use providers with edge endpoints.
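The arithmetic is simple enough to sketch. The 50 ms per-hop figure below is the assumed cross-region overhead from the callout above, not a measured value:

```python
# Each pipeline stage pays the network round trip separately.
# 50 ms of avoidable cross-region latency per hop is an assumed figure.
stages = ["STT", "LLM", "TTS"]
extra_per_hop_ms = 50

total_penalty_ms = extra_per_hop_ms * len(stages)
print(f"Cross-region penalty: {total_penalty_ms} ms")  # 150 ms
```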
3. Tune VAD sensitivity
Voice Activity Detection determines when the user has stopped speaking. Aggressive settings (shorter silence threshold) start the response faster but risk cutting off mid-sentence pauses. Conservative settings are safer but add latency.
```python
from livekit.plugins import silero

# Default — balanced
vad=silero.VAD.load()

# Faster response — shorter silence threshold
# Good for quick Q&A style interactions
vad=silero.VAD.load(
    min_silence_duration=0.3,  # seconds of silence before end-of-turn
)
```

```typescript
import { SileroVAD } from "@livekit/agents-plugin-silero";

// Default — balanced
const vad = await SileroVAD.load();

// Faster response — shorter silence threshold
const vad = await SileroVAD.load({
  minSilenceDuration: 0.3, // seconds of silence before end-of-turn
});
```

4. Enable LLM streaming
Streaming is typically enabled by default, but verify it. Non-streaming LLM calls wait for the entire response before passing text to TTS — a latency catastrophe for voice agents.
With streaming enabled, the flow looks like this: the LLM emits tokens one at a time. As soon as enough tokens form a complete sentence (or a natural pause point), the TTS begins synthesizing audio for that sentence. The user hears the first sentence while the LLM is still generating the second. Without streaming, the user waits for the entire response to be generated before hearing anything.
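The sentence-boundary handoff can be sketched in isolation. This is an illustrative chunker, not the framework's actual implementation; production tokenizers handle abbreviations, numbers, and natural pause points far more carefully:

```python
import re

# A "sentence" ends with ., !, or ? followed by whitespace.
SENTENCE_END = re.compile(r'(.+?[.!?])\s+')

def sentence_chunks(tokens):
    """Accumulate streamed LLM tokens; yield each complete sentence
    as soon as it forms, so TTS can start synthesizing sentence one
    while the LLM is still generating sentence two."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (m := SENTENCE_END.match(buffer)):
            yield m.group(1)            # complete sentence → hand to TTS now
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()            # flush the tail at stream end

tokens = ["Your order ", "shipped today. ", "Tracking arrives ", "by email."]
print(list(sentence_chunks(tokens)))
# ['Your order shipped today.', 'Tracking arrives by email.']
```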
Putting it all together: optimized pipeline
Here is a fully optimized pipeline agent that applies all the tuning strategies:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are Acme Corp support. Help with orders, returns, "
                "and products. Two sentences max. No markdown."
            ),
        ),
        room=session.room,
        stt=deepgram.STT(
            model="nova-3",
            keywords=[("Acme", 3.0), ("SKU", 2.0)],
        ),
        llm=openai.LLM(
            model="gpt-4o-mini",
            temperature=0.6,
        ),
        tts=cartesia.TTS(
            voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
            speed="normal",
        ),
        vad=silero.VAD.load(
            min_silence_duration=0.3,
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are Acme Corp support. Help with orders, returns, " +
          "and products. Two sentences max. No markdown.",
      }),
      room: session.room,
      stt: new DeepgramSTT({
        model: "nova-3",
        keywords: [
          { word: "Acme", boost: 3.0 },
          { word: "SKU", boost: 2.0 },
        ],
      }),
      llm: new OpenAILLM({
        model: "gpt-4o-mini",
        temperature: 0.6,
      }),
      tts: new CartesiaTTS({
        voice: "79a125e8-cd45-4c13-8a67-188112f4dd22",
        speed: "normal",
      }),
      vad: await SileroVAD.load({
        minSilenceDuration: 0.3,
      }),
    });
  },
});
```

This configuration targets a total end-to-end latency of 600-900ms: Deepgram Nova-3 for fast STT, GPT-4o-mini for fast first-token, Cartesia for fast audio synthesis, and aggressive VAD for quick turn detection.
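One way to sanity-check that target is to decompose it into per-stage budgets. The figures below are assumed midpoints for illustration, not measured values:

```python
# Assumed per-stage latency budgets (illustrative, not measured)
budget_ms = {
    "VAD end-of-turn (min_silence_duration=0.3)": 300,
    "STT final transcript": 100,
    "LLM first sentence": 250,
    "TTS first audio byte": 100,
}
total_ms = sum(budget_ms.values())
print(f"Estimated time to first audio: {total_ms} ms")  # 750 ms
```

A 750 ms estimate lands comfortably inside the 600-900 ms target, with the VAD silence threshold as the single largest line item.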
Observing the pipeline
One major advantage of the pipeline architecture is observability. You can log what happens at every stage:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    @session.on("user_input_transcribed")
    def on_transcript(transcript):
        print(f"[STT] User said: {transcript.text}")

    @session.on("agent_state_changed")
    def on_state(state: str):
        print(f"[Pipeline] Agent state: {state}")

    @session.on("conversation_item_added")
    def on_item(item):
        print(f"[LLM] Conversation item: {item}")

    await session.start(
        agent=Agent(
            instructions=(
                "You are Acme Corp support. Help with orders, returns, "
                "and products. Two sentences max. No markdown."
            ),
        ),
        room=session.room,
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { SileroVAD } from "@livekit/agents-plugin-silero";

export default defineAgent({
  entry: async (session: RtcSession) => {
    session.on("userInputTranscribed", (transcript) => {
      console.log("[STT] User said:", transcript.text);
    });
    session.on("agentStateChanged", (state) => {
      console.log("[Pipeline] Agent state:", state);
    });
    session.on("conversationItemAdded", (item) => {
      console.log("[LLM] Conversation item:", item);
    });

    await session.start({
      agent: new Agent({
        instructions:
          "You are Acme Corp support. Help with orders, returns, " +
          "and products. Two sentences max. No markdown.",
      }),
      room: session.room,
      stt: new DeepgramSTT({ model: "nova-3" }),
      llm: new OpenAILLM({ model: "gpt-4o-mini" }),
      tts: new CartesiaTTS(),
      vad: await SileroVAD.load(),
    });
  },
});
```

Every user utterance, every agent state transition, every conversation item — all visible. This transparency is invaluable for debugging, analytics, and quality monitoring. You will see in the next chapter how realtime models provide less visibility by default.
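These event hooks also make per-stage latency measurable. The helper below is a generic sketch (not part of any SDK), and the labels passed to `mark` are simply strings chosen to mirror the event handlers above:

```python
import time

class StageTimer:
    """Minimal helper for timing the gap between pipeline events,
    e.g. final transcript received → agent begins speaking."""

    def __init__(self):
        self.marks: dict[str, float] = {}

    def mark(self, label: str) -> None:
        # Record a monotonic timestamp for this event label
        self.marks[label] = time.monotonic()

    def elapsed_ms(self, start: str, end: str) -> float:
        return (self.marks[end] - self.marks[start]) * 1000

# Call mark() from inside the event handlers shown above:
timer = StageTimer()
timer.mark("user_input_transcribed")  # e.g. in on_transcript
timer.mark("agent_state_changed")     # e.g. in on_state, on entering speaking
gap = timer.elapsed_ms("user_input_transcribed", "agent_state_changed")
print(f"transcript -> speaking: {gap:.1f} ms")
```

Logging this gap per turn gives you a live view of end-to-end response latency without any external tooling.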
Test your knowledge
Question 1 of 3
Why is first-token latency more important than total generation time for LLMs in a voice AI pipeline?
What comes next
You have built an optimized pipeline agent with careful model selection, latency tuning, and observability. In the next chapter, you will implement the same agent using OpenAI's Realtime API — a single model that replaces the entire STT + LLM + TTS chain. The comparison will be illuminating.
Keep this agent running
As you build the realtime versions in Chapters 3 and 4, keep your pipeline agent available for comparison. In Chapter 6, you will benchmark all three implementations side by side.