Chapter 125m

RAG architecture for voice


Your voice agent can hold a conversation, but it only knows what the LLM was trained on. Ask it about your company's return policy or a patient's medication history and it will hallucinate or refuse. Retrieval-augmented generation fixes this by fetching relevant documents at query time and injecting them into the prompt. The catch: voice is not chat. A chat user waits a few seconds for a response without blinking. A voice user hears dead air. Every millisecond you spend retrieving context is a millisecond of silence.


What you'll build

By the end of this chapter you will have a complete voice RAG agent that retrieves documents from a vector store, filters by relevance, and responds with grounded answers — all within a 400ms retrieval budget. You will understand exactly when RAG is the right pattern and when tools or static instructions are better.

Why voice RAG is different

In a text chatbot, RAG adds maybe two seconds of latency. The user sees a spinner and waits. In a voice agent, the user is on the other end of a live audio stream. The full response cycle — STT transcription, retrieval, LLM generation, TTS synthesis — must feel conversational. That means under two seconds total, and retrieval gets at most 400ms of that.

1. STT transcription. The user finishes speaking and Deepgram (or your STT provider) produces a transcript. This takes 100-300ms depending on utterance length and model.

2. Retrieval window (your 400ms budget). Embed the query, search the vector database, filter results, and inject context into the prompt. This is where your RAG pipeline lives.

3. LLM generation. The model generates a response using the augmented prompt. Streaming output means TTS can start before generation finishes.

4. TTS synthesis. Text-to-speech converts the response to audio and streams it back to the user.

The 400ms rule

Your entire retrieval pipeline — embedding the query, searching the vector store, filtering results — must complete within 400ms. Exceed this and the user hears an awkward pause. Pre-warm your database connections and embedding client before the first call arrives. Cold connections to pgvector or Pinecone can add 500ms on their own.
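To make that concrete, here is a minimal warm-up sketch. `StubStore` is a hypothetical stand-in for your real vector store client, and the 50ms sleep simulates a cold connection's setup cost:

```python
import asyncio
import time

class StubStore:
    """Stand-in for a real vector store client (pgvector, Pinecone, etc.)."""
    async def search(self, query: str, top_k: int = 1):
        await asyncio.sleep(0.05)  # simulated cold-start cost
        return []

async def prewarm(store) -> float:
    # Issue a throwaway query at startup so connection setup, TLS
    # handshakes, and lazy client initialization happen before the
    # first caller arrives. Returns the warm-up time in milliseconds.
    t0 = time.monotonic()
    await store.search("warmup", top_k=1)
    return (time.monotonic() - t0) * 1000

if __name__ == "__main__":
    ms = asyncio.run(prewarm(StubStore()))
    print(f"warm-up took {ms:.0f}ms")
```

Call this once in your worker's startup path, not per call, so the first caller pays none of the cold-start cost.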

When to use RAG vs alternatives

RAG is not always the right tool. Before building a retrieval pipeline, check whether a simpler approach solves your problem:

Approach | Best for | Example
Static instructions | Small, stable knowledge (under ~4K tokens) | Business hours, greeting scripts, basic FAQ
Tool calls | Structured, real-time data | Account balances, order status, booking availability
RAG | Large, semi-structured knowledge bases | Policy documents, product catalogs, internal wikis
RAG + tool calls | Complex queries needing both retrieval and action | "Find the return policy and then initiate my refund"

Start simple

If your knowledge fits in the system prompt, put it there. RAG adds latency and complexity. Only reach for it when your knowledge base exceeds what the context window can hold or when the information changes frequently enough that static instructions become stale.
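As a rough check, you can estimate whether your knowledge fits in the prompt; the ~1.3 tokens-per-word factor below is a heuristic assumption, not a real tokenizer:

```python
def fits_in_system_prompt(knowledge: str, limit_tokens: int = 4000) -> bool:
    # Rough English heuristic: about 1.3 tokens per whitespace-separated
    # word. Swap in a real tokenizer (e.g. tiktoken) if you need accuracy.
    approx_tokens = int(len(knowledge.split()) * 1.3)
    return approx_tokens <= limit_tokens

policy = "Returns are accepted within 30 days with a receipt. " * 40
print(fits_in_system_prompt(policy))  # a short policy fits comfortably
```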

Build the retrieval hook

LiveKit Agents provides lifecycle hooks that intercept the conversation flow. The on_user_turn_completed hook fires after speech-to-text produces a transcript and before the LLM generates a response. This is your retrieval window.

rag_agent.py (Python)
from livekit.agents import Agent, AgentSession

class RAGAgent(Agent):
  def __init__(self, vector_store):
      super().__init__(
          instructions=(
              "You are a knowledge assistant. Answer questions using the "
              "context provided. If the context does not contain the answer, "
              "say you do not have that information."
          ),
      )
      self.vector_store = vector_store

  async def on_user_turn_completed(self, turn_ctx):
      query = turn_ctx.user_message
      results = await self.vector_store.search(query, top_k=3)
      context = "\n".join([r.text for r in results])
      turn_ctx.add_system_message(f"Relevant context:\n{context}")
      await Agent.default.on_user_turn_completed(self, turn_ctx)
rag_agent.ts (TypeScript)
import { Agent, type AgentSession, type TurnContext } from "@livekit/agents";

class RAGAgent extends Agent {
  private vectorStore: VectorStore;

  constructor(vectorStore: VectorStore) {
    super({
      instructions:
        "You are a knowledge assistant. Answer questions using the " +
        "context provided. If the context does not contain the answer, " +
        "say you do not have that information.",
    });
    this.vectorStore = vectorStore;
  }

  async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
    const query = turnCtx.userMessage;
    const results = await this.vectorStore.search(query, 3);
    const context = results.map((r) => r.text).join("\n");
    turnCtx.addSystemMessage(`Relevant context:\n${context}`);
    await super.onUserTurnCompleted(turnCtx);
  }
}
What's happening

The on_user_turn_completed hook intercepts the conversation between the user's speech and the LLM's response. By calling turn_ctx.add_system_message, you inject retrieved documents into the prompt without modifying the conversation history the user sees. The call to Agent.default.on_user_turn_completed (Python) or super.onUserTurnCompleted (TypeScript) ensures the default behavior — sending the augmented prompt to the LLM — still executes.

Vector store abstraction

Your agent should not care which vector database sits behind the scenes. Define a clean interface and swap implementations freely:

vector_store.py (Python)
from dataclasses import dataclass
from abc import ABC, abstractmethod
from openai import AsyncOpenAI

@dataclass
class SearchResult:
  text: str
  source: str
  score: float

class VectorStore(ABC):
  @abstractmethod
  async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
      ...

class PgVectorStore(VectorStore):
  def __init__(self, conn, openai_client: AsyncOpenAI):
      self.conn = conn
      self.openai = openai_client

  async def _embed(self, text: str) -> list[float]:
      response = await self.openai.embeddings.create(
          model="text-embedding-3-small",
          input=text,
      )
      return response.data[0].embedding

  async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
      embedding = await self._embed(query)
      cur = self.conn.cursor()
      # psycopg adapts the Python list to an array; the ::vector cast
      # converts it to pgvector's type so the <=> operator applies.
      cur.execute(
          "SELECT content, source, 1 - (embedding <=> %s::vector) AS score "
          "FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
          (embedding, embedding, top_k),
      )
      return [
          SearchResult(text=row[0], source=row[1], score=row[2])
          for row in cur.fetchall()
      ]
vector_store.ts (TypeScript)
import OpenAI from "openai";

interface SearchResult {
  text: string;
  source: string;
  score: number;
}

interface VectorStore {
  search(query: string, topK?: number): Promise<SearchResult[]>;
}

class PineconeVectorStore implements VectorStore {
  private openai: OpenAI;
  private index: any;

  constructor(openai: OpenAI, index: any) {
    this.openai = openai;
    this.index = index;
  }

  private async embed(text: string): Promise<number[]> {
    const response = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: text,
    });
    return response.data[0].embedding;
  }

  async search(query: string, topK = 3): Promise<SearchResult[]> {
    const embedding = await this.embed(query);
    const results = await this.index.query({
      vector: embedding,
      topK,
      includeMetadata: true,
    });
    return results.matches.map((m: any) => ({
      text: m.metadata.content,
      source: m.metadata.source,
      score: m.score,
    }));
  }
}

Voice-optimized chunk sizes

When ingesting documents for a voice RAG agent, aim for chunks between 200 and 800 tokens. Smaller chunks give precise retrieval but may lack context. Larger chunks provide more context but risk including irrelevant information. For voice, prefer semantic chunking (splitting at paragraph or section boundaries) over fixed-size chunking — it produces more coherent results that the LLM can summarize conversationally.
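A minimal sketch of this approach, splitting on blank lines and greedily merging paragraphs under a token cap (the tokens-per-word estimate is an assumption; use a real tokenizer in production):

```python
def semantic_chunks(text: str, max_tokens: int = 800) -> list[str]:
    """Split at paragraph boundaries, then merge consecutive paragraphs
    while the merged chunk stays under max_tokens."""
    def approx_tokens(s: str) -> int:
        return int(len(s.split()) * 1.3)  # rough estimate, not a tokenizer

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if approx_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Because splits only ever happen at paragraph boundaries, each chunk reads as a coherent passage the LLM can summarize aloud.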

Relevance filtering and hybrid search

Not every retrieval result is useful. Low-scoring results add noise to the prompt and confuse the model. Filter by a minimum similarity threshold, and consider combining keyword and semantic search for better coverage.

filtered_rag.py (Python)
class FilteredRAGAgent(Agent):
  def __init__(self, vector_store, min_score: float = 0.7):
      super().__init__(
          instructions="You are a knowledge assistant. Use provided context to answer questions.",
      )
      self.vector_store = vector_store
      self.min_score = min_score

  async def on_user_turn_completed(self, turn_ctx):
      query = turn_ctx.user_message
      results = await self.vector_store.search(query, top_k=5)

      # Filter out low-relevance results
      relevant = [r for r in results if r.score >= self.min_score]

      if relevant:
          context = "\n---\n".join(
              [f"[Source: {r.source}]\n{r.text}" for r in relevant]
          )
          turn_ctx.add_system_message(f"Relevant context:\n{context}")
      else:
          turn_ctx.add_system_message(
              "No relevant documents found. Answer based on general knowledge "
              "and let the user know you could not find a specific source."
          )

      await Agent.default.on_user_turn_completed(self, turn_ctx)
filtered_rag.ts (TypeScript)
class FilteredRAGAgent extends Agent {
  private vectorStore: VectorStore;
  private minScore: number;

  constructor(vectorStore: VectorStore, minScore = 0.7) {
    super({
      instructions:
        "You are a knowledge assistant. Use provided context to answer questions.",
    });
    this.vectorStore = vectorStore;
    this.minScore = minScore;
  }

  async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
    const query = turnCtx.userMessage;
    const results = await this.vectorStore.search(query, 5);

    const relevant = results.filter((r) => r.score >= this.minScore);

    if (relevant.length > 0) {
      const context = relevant
        .map((r) => `[Source: ${r.source}]\n${r.text}`)
        .join("\n---\n");
      turnCtx.addSystemMessage(`Relevant context:\n${context}`);
    } else {
      turnCtx.addSystemMessage(
        "No relevant documents found. Answer based on general knowledge " +
        "and let the user know you could not find a specific source."
      );
    }

    await super.onUserTurnCompleted(turnCtx);
  }
}

Hybrid search: keyword + semantic

Pure vector search can miss exact matches. If a user asks about "SKU-4829," semantic search may return results about products in general. Keyword search (BM25) handles exact matches but fails at paraphrasing. Hybrid search combines both using reciprocal rank fusion (RRF), which merges ranked lists by summing 1/(k + rank) scores — no training or tuning required.

hybrid_search.py (Python)
from rank_bm25 import BM25Okapi
import re

class HybridVectorStore(PgVectorStore):
  def __init__(self, conn, openai_client, documents: list[dict]):
      # Subclass PgVectorStore to reuse its connection and _embed helper.
      super().__init__(conn, openai_client)
      # Each document dict must carry the same "id", "content", and "source"
      # as its database row so RRF can deduplicate across result lists.
      tokenized = [re.findall(r"\w+", doc["content"].lower()) for doc in documents]
      self.bm25 = BM25Okapi(tokenized)
      self.documents = documents

  async def search(self, query: str, top_k: int = 3) -> list[SearchResult]:
      # Semantic search
      embedding = await self._embed(query)
      cur = self.conn.cursor()
      # The ::vector cast converts the adapted array to pgvector's type.
      cur.execute(
          "SELECT id, content, source, 1 - (embedding <=> %s::vector) AS score "
          "FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
          (embedding, embedding, top_k * 3),
      )
      semantic_results = [
          {"id": row[0], "content": row[1], "source": row[2]}
          for row in cur.fetchall()
      ]

      # Keyword search
      tokens = re.findall(r"\w+", query.lower())
      bm25_scores = self.bm25.get_scores(tokens)
      top_indices = bm25_scores.argsort()[-top_k * 3:][::-1]
      keyword_results = [
          # Use the document's own "id" (matching its database row) so
          # fusion can merge duplicates found by both searches.
          self.documents[i]
          for i in top_indices if bm25_scores[i] > 0
      ]

      # Reciprocal rank fusion
      fused = self._rrf([semantic_results, keyword_results])
      return [
          SearchResult(text=r["content"], source=r["source"], score=r["rrf_score"])
          for r in fused[:top_k]
      ]

  def _rrf(self, result_lists, k=60):
      scores, docs = {}, {}
      for results in result_lists:
          for rank, doc in enumerate(results):
              doc_id = doc["id"]
              scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
              docs[doc_id] = doc
      sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)
      return [{**docs[did], "rrf_score": scores[did]} for did in sorted_ids]

For even higher quality, add a cross-encoder reranker after fusion. Retrieve 10-20 candidates with fast hybrid search, then use a reranker like Cohere's rerank-v3.5 to select the top 3. This adds 50-150ms, so monitor whether it fits within your latency budget for voice.

Voice latency constraint

The entire retrieval pipeline — embedding, search, fusion, and optional reranking — must complete within your 400ms budget. If hybrid plus reranking exceeds this, drop the reranker for real-time voice and reserve it for text-based follow-ups.
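One defensive pattern is a hard deadline around retrieval: if the store blows the budget, return an empty result and let the agent answer from the system prompt. This is a sketch assuming any store object with an async `search` method; `FakeStore` is a hypothetical test double:

```python
import asyncio

async def search_with_budget(store, query: str, top_k: int = 3,
                             budget_ms: float = 400.0) -> list:
    """Run retrieval under a hard deadline so a slow backend produces a
    degraded answer instead of dead air."""
    try:
        return await asyncio.wait_for(
            store.search(query, top_k=top_k),
            timeout=budget_ms / 1000,
        )
    except asyncio.TimeoutError:
        return []  # caller falls back to the system prompt alone

class FakeStore:
    """Test double with a configurable delay (seconds)."""
    def __init__(self, delay: float):
        self.delay = delay

    async def search(self, query: str, top_k: int = 3):
        await asyncio.sleep(self.delay)
        return ["doc"]

if __name__ == "__main__":
    print(asyncio.run(search_with_budget(FakeStore(0.01), "q")))  # ['doc']
    print(asyncio.run(search_with_budget(FakeStore(1.0), "q")))   # []
```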

Complete voice RAG agent

Putting it all together — here is a full working example with AgentSession, STT, LLM, and TTS:

main.py (Python)
from livekit.agents import AgentSession, RoomInputOptions, Agent
from livekit.plugins import deepgram, openai as oai
from openai import AsyncOpenAI

class VoiceRAGAgent(Agent):
  def __init__(self, vector_store, min_score: float = 0.72):
      super().__init__(
          instructions=(
              "You are a knowledge assistant. Answer questions using the "
              "context provided. If the context does not contain the answer, "
              "say you do not have that information. Cite your sources naturally — "
              "'According to our return policy...' not '[Source 1]'."
          ),
      )
      self.vector_store = vector_store
      self.min_score = min_score

  async def on_user_turn_completed(self, turn_ctx):
      query = turn_ctx.user_message
      results = await self.vector_store.search(query, top_k=5)
      relevant = [r for r in results if r.score >= self.min_score]

      if relevant:
          context = "\n---\n".join(
              [f"[Source: {r.source}]\n{r.text}" for r in relevant]
          )
          turn_ctx.add_system_message(f"Relevant context:\n{context}")
      else:
          turn_ctx.add_system_message(
              "No relevant documents found for this query. "
              "Tell the user you do not have specific information on that topic."
          )

      await Agent.default.on_user_turn_completed(self, turn_ctx)

async def entrypoint(ctx):
  # Pre-warm connections before the first call; `db_connection` is assumed
  # to be created once at worker startup.
  openai_client = AsyncOpenAI()
  vector_store = PgVectorStore(conn=db_connection, openai_client=openai_client)

  agent = VoiceRAGAgent(vector_store=vector_store, min_score=0.72)

  session = AgentSession(
      stt=deepgram.STT(),
      llm=oai.LLM(model="gpt-4o"),
      tts=oai.TTS(),
  )

  await session.start(
      agent=agent,
      room=ctx.room,
      room_input_options=RoomInputOptions(),
  )
main.ts (TypeScript)
import { AgentSession, RoomInputOptions, Agent, type TurnContext } from "@livekit/agents";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { OpenAILLM, OpenAITTS } from "@livekit/agents-plugin-openai";
import OpenAI from "openai";

class VoiceRAGAgent extends Agent {
  private vectorStore: VectorStore;
  private minScore: number;

  constructor(vectorStore: VectorStore, minScore = 0.72) {
    super({
      instructions:
        "You are a knowledge assistant. Answer questions using the " +
        "context provided. If the context does not contain the answer, " +
        "say you do not have that information. Cite your sources naturally — " +
        "'According to our return policy...' not '[Source 1]'.",
    });
    this.vectorStore = vectorStore;
    this.minScore = minScore;
  }

  async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
    const query = turnCtx.userMessage;
    const results = await this.vectorStore.search(query, 5);
    const relevant = results.filter((r) => r.score >= this.minScore);

    if (relevant.length > 0) {
      const context = relevant
        .map((r) => `[Source: ${r.source}]\n${r.text}`)
        .join("\n---\n");
      turnCtx.addSystemMessage(`Relevant context:\n${context}`);
    } else {
      turnCtx.addSystemMessage(
        "No relevant documents found for this query. " +
        "Tell the user you do not have specific information on that topic."
      );
    }

    await super.onUserTurnCompleted(turnCtx);
  }
}

async function entrypoint(ctx: any) {
  // Pre-warm clients before the first call; `pineconeIndex` is assumed
  // to be created once at worker startup.
  const openai = new OpenAI();
  const vectorStore = new PineconeVectorStore(openai, pineconeIndex);

  const agent = new VoiceRAGAgent(vectorStore, 0.72);

  const session = new AgentSession({
    stt: new DeepgramSTT(),
    llm: new OpenAILLM({ model: "gpt-4o" }),
    tts: new OpenAITTS(),
  });

  await session.start({
    agent,
    room: ctx.room,
    roomInputOptions: new RoomInputOptions(),
  });
}
What's happening

The complete agent wires together four components: Deepgram STT converts speech to text, the on_user_turn_completed hook triggers retrieval and context injection, the OpenAI LLM generates a grounded response, and OpenAI TTS converts it back to speech. Pre-warming the vector store connection and embedding client before the first call is critical — cold connections can blow your entire latency budget on the first turn.

Test your knowledge


Why is 'on_user_turn_completed' the right hook for injecting retrieval context rather than during speech recognition or after LLM generation?

What you learned

  • Voice RAG operates under a strict 400ms retrieval budget. Pre-warm connections and use async embedding clients.
  • The on_user_turn_completed hook is the integration point for injecting retrieved context between STT and LLM generation.
  • A clean VectorStore abstraction lets you swap between pgvector, Pinecone, or any backend without changing agent code.
  • Relevance filtering with a minimum score threshold prevents low-quality results from polluting the prompt.
  • Hybrid search (BM25 + semantic with RRF) handles both exact term matches and paraphrased queries.
  • AgentSession wires together STT, LLM, TTS, and your RAG agent into a complete voice pipeline.

Next up

In the next chapter, you will learn how to use MCP (Model Context Protocol) to connect your agent to external APIs and tool servers for structured data access.

Concepts covered: latency budget, on_user_turn_completed, context injection, RAG vs tools vs instructions