RAG architecture for voice
Your voice agent can hold a conversation, but it only knows what the LLM was trained on. Ask it about your company's return policy or a patient's medication history and it will hallucinate or refuse. Retrieval-augmented generation fixes this by fetching relevant documents at query time and injecting them into the prompt. The catch: voice is not chat. A chat user waits a few seconds for a response without blinking. A voice user hears dead air. Every millisecond you spend retrieving context is a millisecond of silence.
What you'll build
By the end of this chapter you will have a complete voice RAG agent that retrieves documents from a vector store, filters by relevance, and responds with grounded answers — all within a 400ms retrieval budget. You will understand exactly when RAG is the right pattern and when tools or static instructions are better.
Why voice RAG is different
In a text chatbot, RAG adds maybe two seconds of latency. The user sees a spinner and waits. In a voice agent, the user is on the other end of a live audio stream. The full response cycle — STT transcription, retrieval, LLM generation, TTS synthesis — must feel conversational. That means under two seconds total, and retrieval gets at most 400ms of that.
STT transcription
The user finishes speaking and Deepgram (or your STT provider) produces a transcript. This takes 100-300ms depending on utterance length and model.
Retrieval window (your 400ms budget)
Embed the query, search the vector database, filter results, and inject context into the prompt. This is where your RAG pipeline lives.
LLM generation
The model generates a response using the augmented prompt. Streaming output means TTS can start before generation finishes.
TTS synthesis
Text-to-speech converts the response to audio and streams it back to the user.
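To see how these stages add up, here is a back-of-the-envelope budget. The STT and retrieval figures come from this chapter; the LLM and TTS time-to-first-output numbers are illustrative assumptions — with streaming, the user hears audio after the first synthesized chunk, not after full generation:

```python
# Illustrative per-stage budgets in ms. STT and retrieval figures are from
# this chapter; the LLM and TTS numbers are assumed for illustration only.
BUDGET_MS = {
    "stt": 300,              # upper end of the 100-300ms transcription range
    "retrieval": 400,        # the retrieval window
    "llm_first_token": 500,  # assumed: time to first streamed token
    "tts_first_audio": 300,  # assumed: time to first synthesized audio chunk
}

def total_to_first_audio(budget: dict[str, int]) -> int:
    """Time until the user hears something: the stages run sequentially,
    but streaming means full LLM generation and TTS need not finish first."""
    return sum(budget.values())
```

Under these assumptions the user hears the first audio at roughly 1.5 seconds, which is why retrieval cannot afford more than its 400ms slice.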
The 400ms rule
Your entire retrieval pipeline — embedding the query, searching the vector store, filtering results — must complete within 400ms. Exceed this and the user hears an awkward pause. Pre-warm your database connections and embedding client before the first call arrives. Cold connections to pgvector or Pinecone can add 500ms on their own.
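One way to keep retrieval honest is to hard-cap it. The sketch below wraps any `search` call in a timeout — the wrapper is an illustrative pattern, not part of the LiveKit API — and falls back to an empty result set so the agent can still answer from general knowledge instead of leaving the caller in silence:

```python
import asyncio

RETRIEVAL_BUDGET_S = 0.4  # the 400ms retrieval budget

async def search_within_budget(vector_store, query: str, top_k: int = 3):
    """Run retrieval, but give up rather than let the user hear dead air."""
    try:
        return await asyncio.wait_for(
            vector_store.search(query, top_k=top_k),
            timeout=RETRIEVAL_BUDGET_S,
        )
    except asyncio.TimeoutError:
        # Degrade gracefully: no context is injected, and the agent's
        # instructions tell it how to respond without a source.
        return []
```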
When to use RAG vs alternatives
RAG is not always the right tool. Before building a retrieval pipeline, check whether a simpler approach solves your problem:
| Approach | Best for | Example |
|---|---|---|
| Static instructions | Small, stable knowledge (under ~4K tokens) | Business hours, greeting scripts, basic FAQ |
| Tool calls | Structured, real-time data | Account balances, order status, booking availability |
| RAG | Large, semi-structured knowledge bases | Policy documents, product catalogs, internal wikis |
| RAG + tool calls | Complex queries needing both retrieval and action | "Find the return policy and then initiate my refund" |
Start simple
If your knowledge fits in the system prompt, put it there. RAG adds latency and complexity. Only reach for it when your knowledge base exceeds what the context window can hold or when the information changes frequently enough that static instructions become stale.
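For example, a knowledge base this small belongs in the instructions themselves, with no retrieval pipeline at all. The company name and facts below are made up for illustration:

```python
# Hypothetical small, stable knowledge base — fits comfortably in the prompt.
BUSINESS_FACTS = """\
Hours: Mon-Fri 9am-6pm ET.
Returns: 30 days with receipt.
Support: offer to transfer the caller for billing questions.
"""

def build_instructions(facts: str) -> str:
    """Fold a small knowledge base directly into the system prompt."""
    return (
        "You are a helpful voice assistant for Acme Co. "
        "Answer only from the facts below; if a question falls outside "
        "them, say you will transfer the caller.\n\nFacts:\n" + facts
    )
```

When the facts change, you redeploy the prompt — acceptable for stable content, untenable for a thousand-page policy wiki.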
Build the retrieval hook
LiveKit Agents provides lifecycle hooks that intercept the conversation flow. The on_user_turn_completed hook fires after speech-to-text produces a transcript and before the LLM generates a response. This is your retrieval window.
from livekit.agents import Agent, AgentSession
class RAGAgent(Agent):
def __init__(self, vector_store):
super().__init__(
instructions=(
"You are a knowledge assistant. Answer questions using the "
"context provided. If the context does not contain the answer, "
"say you do not have that information."
),
)
self.vector_store = vector_store
async def on_user_turn_completed(self, turn_ctx):
query = turn_ctx.user_message
results = await self.vector_store.search(query, top_k=3)
context = "\n".join([r.text for r in results])
turn_ctx.add_system_message(f"Relevant context:\n{context}")
        await Agent.default.on_user_turn_completed(self, turn_ctx)

import { Agent, type AgentSession, type TurnContext } from "@livekit/agents";
class RAGAgent extends Agent {
private vectorStore: VectorStore;
constructor(vectorStore: VectorStore) {
super({
instructions:
"You are a knowledge assistant. Answer questions using the " +
"context provided. If the context does not contain the answer, " +
"say you do not have that information.",
});
this.vectorStore = vectorStore;
}
async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
const query = turnCtx.userMessage;
const results = await this.vectorStore.search(query, 3);
const context = results.map((r) => r.text).join("\n");
turnCtx.addSystemMessage(`Relevant context:\n${context}`);
await super.onUserTurnCompleted(turnCtx);
}
}

The on_user_turn_completed hook intercepts the conversation between the user's speech and the LLM's response. By calling turn_ctx.add_system_message, you inject retrieved documents into the prompt without modifying the conversation history the user sees. The call to Agent.default.on_user_turn_completed (Python) or super.onUserTurnCompleted (TypeScript) ensures the default behavior — sending the augmented prompt to the LLM — still executes.
Vector store abstraction
Your agent should not care which vector database sits behind the scenes. Define a clean interface and swap implementations freely:
from dataclasses import dataclass
from abc import ABC, abstractmethod
from openai import AsyncOpenAI
@dataclass
class SearchResult:
text: str
source: str
score: float
class VectorStore(ABC):
@abstractmethod
async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
...
class PgVectorStore(VectorStore):
def __init__(self, conn, openai_client: AsyncOpenAI):
self.conn = conn
self.openai = openai_client
async def _embed(self, text: str) -> list[float]:
response = await self.openai.embeddings.create(
model="text-embedding-3-small",
input=text,
)
return response.data[0].embedding
async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
embedding = await self._embed(query)
        # Assumes a driver that won't block the event loop (e.g. run sync
        # psycopg calls in a thread); blocking here eats the 400ms budget.
        cur = self.conn.cursor()
cur.execute(
"SELECT content, source, 1 - (embedding <=> %s) AS score "
"FROM documents ORDER BY embedding <=> %s LIMIT %s",
(embedding, embedding, top_k),
)
return [
SearchResult(text=row[0], source=row[1], score=row[2])
for row in cur.fetchall()
        ]

import OpenAI from "openai";
interface SearchResult {
text: string;
source: string;
score: number;
}
interface VectorStore {
search(query: string, topK?: number): Promise<SearchResult[]>;
}
class PineconeVectorStore implements VectorStore {
private openai: OpenAI;
private index: any;
constructor(openai: OpenAI, index: any) {
this.openai = openai;
this.index = index;
}
private async embed(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
});
return response.data[0].embedding;
}
async search(query: string, topK = 3): Promise<SearchResult[]> {
const embedding = await this.embed(query);
const results = await this.index.query({
vector: embedding,
topK,
includeMetadata: true,
});
return results.matches.map((m: any) => ({
text: m.metadata.content,
source: m.metadata.source,
score: m.score,
}));
}
}

Voice-optimized chunk sizes
When ingesting documents for a voice RAG agent, aim for chunks between 200 and 800 tokens. Smaller chunks give precise retrieval but may lack context. Larger chunks provide more context but risk including irrelevant information. For voice, prefer semantic chunking (splitting at paragraph or section boundaries) over fixed-size chunking — it produces more coherent results that the LLM can summarize conversationally.
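A minimal semantic chunker along these lines might split on blank lines and merge paragraphs up to the ceiling. Word count is used as a crude stand-in for a real tokenizer here; a production version would count tokens with tiktoken or similar:

```python
def semantic_chunks(text: str, max_tokens: int = 800) -> list[str]:
    """Split at paragraph boundaries, merging consecutive paragraphs
    until a chunk would exceed max_tokens (approximated by word count)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        size = len(para.split())
        # Flush the current chunk before it overflows the ceiling
        if current and count + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because splits only happen at paragraph boundaries, each chunk reads as a coherent unit the LLM can summarize aloud, rather than a fragment cut mid-sentence.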
Relevance filtering and hybrid search
Not every retrieval result is useful. Low-scoring results add noise to the prompt and confuse the model. Filter by a minimum similarity threshold, and consider combining keyword and semantic search for better coverage.
class FilteredRAGAgent(Agent):
def __init__(self, vector_store, min_score: float = 0.7):
super().__init__(
instructions="You are a knowledge assistant. Use provided context to answer questions.",
)
self.vector_store = vector_store
self.min_score = min_score
async def on_user_turn_completed(self, turn_ctx):
query = turn_ctx.user_message
results = await self.vector_store.search(query, top_k=5)
# Filter out low-relevance results
relevant = [r for r in results if r.score >= self.min_score]
if relevant:
context = "\n---\n".join(
[f"[Source: {r.source}]\n{r.text}" for r in relevant]
)
turn_ctx.add_system_message(f"Relevant context:\n{context}")
else:
turn_ctx.add_system_message(
"No relevant documents found. Answer based on general knowledge "
"and let the user know you could not find a specific source."
)
        await Agent.default.on_user_turn_completed(self, turn_ctx)

class FilteredRAGAgent extends Agent {
private vectorStore: VectorStore;
private minScore: number;
constructor(vectorStore: VectorStore, minScore = 0.7) {
super({
instructions:
"You are a knowledge assistant. Use provided context to answer questions.",
});
this.vectorStore = vectorStore;
this.minScore = minScore;
}
async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
const query = turnCtx.userMessage;
const results = await this.vectorStore.search(query, 5);
const relevant = results.filter((r) => r.score >= this.minScore);
if (relevant.length > 0) {
const context = relevant
.map((r) => `[Source: ${r.source}]\n${r.text}`)
.join("\n---\n");
turnCtx.addSystemMessage(`Relevant context:\n${context}`);
} else {
turnCtx.addSystemMessage(
"No relevant documents found. Answer based on general knowledge " +
"and let the user know you could not find a specific source."
);
}
await super.onUserTurnCompleted(turnCtx);
}
}

Hybrid search: keyword + semantic
Pure vector search can miss exact matches. If a user asks about "SKU-4829," semantic search may return results about products in general. Keyword search (BM25) handles exact matches but fails at paraphrasing. Hybrid search combines both using reciprocal rank fusion (RRF), which merges ranked lists by summing 1/(k + rank) scores — no training or tuning required.
from rank_bm25 import BM25Okapi
import re
class HybridVectorStore(PgVectorStore):
    def __init__(self, conn, openai_client, documents: list[dict]):
        # Reuse PgVectorStore's connection handling and _embed helper
        super().__init__(conn, openai_client)
        # Tokenize the corpus once up front for BM25 keyword scoring
        tokenized = [re.findall(r"\w+", doc["content"].lower()) for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents
async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
# Semantic search
embedding = await self._embed(query)
cur = self.conn.cursor()
cur.execute(
"SELECT id, content, source, 1 - (embedding <=> %s) AS score "
"FROM documents ORDER BY embedding <=> %s LIMIT %s",
(embedding, embedding, top_k * 3),
)
semantic_results = [
{"id": row[0], "content": row[1], "source": row[2]}
for row in cur.fetchall()
]
# Keyword search
tokens = re.findall(r"\w+", query.lower())
bm25_scores = self.bm25.get_scores(tokens)
top_indices = bm25_scores.argsort()[-top_k * 3:][::-1]
keyword_results = [
{**self.documents[i], "id": str(i)}
for i in top_indices if bm25_scores[i] > 0
]
# Reciprocal rank fusion
fused = self._rrf([semantic_results, keyword_results])
return [
SearchResult(text=r["content"], source=r["source"], score=r["rrf_score"])
for r in fused[:top_k]
]
def _rrf(self, result_lists, k=60):
scores, docs = {}, {}
for results in result_lists:
for rank, doc in enumerate(results):
doc_id = doc["id"]
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
docs[doc_id] = doc
sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)
        return [{**docs[did], "rrf_score": scores[did]} for did in sorted_ids]

For even higher quality, add a cross-encoder reranker after fusion. Retrieve 10-20 candidates with fast hybrid search, then use a reranker like Cohere's rerank-v3.5 to select the top 3. This adds 50-150ms, so monitor whether it fits within your latency budget for voice.
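A reranking pass along those lines might look like the sketch below. The `client` is assumed to be a Cohere client exposing `rerank`; treat the model name and response fields as assumptions to verify against Cohere's current docs:

```python
def rerank(client, query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Re-order hybrid-search candidates with a cross-encoder reranker.

    `client` is assumed to be a Cohere client; the "rerank-v3.5" model
    name and the shape of the response are per Cohere's documentation
    at the time of writing.
    """
    response = client.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Each result carries the index of the original document; map the
    # ranked indices back onto the candidate texts.
    return [candidates[r.index] for r in response.results]
```

Feed it the 10-20 fused candidates and inject only the returned top 3 into the prompt — the reranker call is one more network round trip, so time it against the 400ms budget.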
Voice latency constraint
The entire retrieval pipeline — embedding, search, fusion, and optional reranking — must complete within your 400ms budget. If hybrid plus reranking exceeds this, drop the reranker for real-time voice and reserve it for text-based follow-ups.
Complete voice RAG agent
Putting it all together — here is a full working example with AgentSession, STT, LLM, and TTS:
from livekit.agents import AgentSession, RoomInputOptions, Agent
from livekit.plugins import deepgram, openai as oai
from openai import AsyncOpenAI
class VoiceRAGAgent(Agent):
def __init__(self, vector_store, min_score: float = 0.72):
super().__init__(
instructions=(
"You are a knowledge assistant. Answer questions using the "
"context provided. If the context does not contain the answer, "
"say you do not have that information. Cite your sources naturally — "
"'According to our return policy...' not '[Source 1]'."
),
)
self.vector_store = vector_store
self.min_score = min_score
async def on_user_turn_completed(self, turn_ctx):
query = turn_ctx.user_message
results = await self.vector_store.search(query, top_k=5)
relevant = [r for r in results if r.score >= self.min_score]
if relevant:
context = "\n---\n".join(
[f"[Source: {r.source}]\n{r.text}" for r in relevant]
)
turn_ctx.add_system_message(f"Relevant context:\n{context}")
else:
turn_ctx.add_system_message(
"No relevant documents found for this query. "
"Tell the user you do not have specific information on that topic."
)
await Agent.default.on_user_turn_completed(self, turn_ctx)
async def entrypoint(ctx):
# Pre-warm connections before the first call
openai_client = AsyncOpenAI()
    # db_connection is assumed to be an already-open pgvector Postgres connection
    vector_store = PgVectorStore(conn=db_connection, openai_client=openai_client)
agent = VoiceRAGAgent(vector_store=vector_store, min_score=0.72)
session = AgentSession(
stt=deepgram.STT(),
llm=oai.LLM(model="gpt-4o"),
tts=oai.TTS(),
)
await session.start(
agent=agent,
room=ctx.room,
room_input_options=RoomInputOptions(),
    )

import { AgentSession, RoomInputOptions, Agent, type TurnContext } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM, OpenAITTS } from "@livekit/agents-plugin-openai";
import OpenAI from "openai";
class VoiceRAGAgent extends Agent {
private vectorStore: VectorStore;
private minScore: number;
constructor(vectorStore: VectorStore, minScore = 0.72) {
super({
instructions:
"You are a knowledge assistant. Answer questions using the " +
"context provided. If the context does not contain the answer, " +
"say you do not have that information. Cite your sources naturally — " +
"'According to our return policy...' not '[Source 1]'.",
});
this.vectorStore = vectorStore;
this.minScore = minScore;
}
async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
const query = turnCtx.userMessage;
const results = await this.vectorStore.search(query, 5);
const relevant = results.filter((r) => r.score >= this.minScore);
if (relevant.length > 0) {
const context = relevant
.map((r) => `[Source: ${r.source}]\n${r.text}`)
.join("\n---\n");
turnCtx.addSystemMessage(`Relevant context:\n${context}`);
} else {
turnCtx.addSystemMessage(
"No relevant documents found for this query. " +
"Tell the user you do not have specific information on that topic."
);
}
await super.onUserTurnCompleted(turnCtx);
}
}
async function entrypoint(ctx: any) {
const openai = new OpenAI();
const vectorStore = new PineconeVectorStore(openai, pineconeIndex);
const agent = new VoiceRAGAgent(vectorStore, 0.72);
const session = new AgentSession({
stt: new DeepgramSTT(),
llm: new OpenAILLM({ model: "gpt-4o" }),
tts: new OpenAITTS(),
});
await session.start({
agent,
room: ctx.room,
roomInputOptions: new RoomInputOptions(),
});
}

The complete agent wires together four components: Deepgram STT converts speech to text, the on_user_turn_completed hook triggers retrieval and context injection, the OpenAI LLM generates a grounded response, and OpenAI TTS converts it back to speech. Pre-warming the vector store connection and embedding client before the first call is critical — cold connections can blow your entire latency budget on the first turn.
What you learned
- Voice RAG operates under a strict 400ms retrieval budget. Pre-warm connections and use async embedding clients.
- The on_user_turn_completed hook is the integration point for injecting retrieved context between STT and LLM generation.
- A clean VectorStore abstraction lets you swap between pgvector, Pinecone, or any backend without changing agent code.
- Relevance filtering with a minimum score threshold prevents low-quality results from polluting the prompt.
- Hybrid search (BM25 + semantic with RRF) handles both exact term matches and paraphrased queries.
- AgentSession wires together STT, LLM, TTS, and your RAG agent into a complete voice pipeline.
Next up
In the next chapter, you will learn how to use MCP (Model Context Protocol) to connect your agent to external APIs and tool servers for structured data access.