Chapter 3

Understanding AgentSession

In the previous chapter, you built a working agent with default model configuration. In this chapter, you will take explicit control of the voice pipeline by choosing your STT, LLM, and TTS models, setting up voice activity detection, and learning about the events that let you react to everything happening in a conversation.

What you will learn: The AgentSession lifecycle, how to configure each model in the pipeline, how VAD works, and the key events you can listen for.

What you will build: An upgraded dental receptionist with Deepgram STT, GPT-4o-mini, Cartesia TTS, and Silero VAD — all explicitly configured.

The AgentSession lifecycle

Every conversation your agent has follows a predictable lifecycle. Understanding it helps you know when to configure things, when to react, and when cleanup happens.

1. Created

A user connects to a LiveKit Room. LiveKit sees that the room needs an agent and dispatches a new session. Your @server.rtc_session function is called with a fresh AgentSession object. At this point, the session exists but the voice pipeline is not running.

2. start() called

You call session.start() with your Agent configuration, models, and room reference. This is where you wire everything together — STT, LLM, TTS, VAD, and the agent's instructions. The pipeline begins: audio capture starts, STT begins transcribing, and the agent is ready to process speech.

3. Running

The session is fully active. Audio flows bidirectionally over WebRTC. The user speaks, STT transcribes, the LLM generates responses, TTS synthesizes speech, and the user hears the reply. Events fire as the conversation progresses. This is the steady state for the entire call.

4. Closed

The user disconnects, or you explicitly end the session. The audio pipeline shuts down, resources are released, and your entrypoint function returns. Each session is independent — closing one has no effect on others.

What's happening

The lifecycle is deliberately simple. You configure everything in a single session.start() call, the framework runs the pipeline, and cleanup is automatic. There is no manual teardown, no resource management, no state machines to maintain. Focus on your agent's logic, not infrastructure plumbing.

Configuring models explicitly

In Chapter 2, you called session.start() without specifying models. The framework used sensible defaults. Now let us take explicit control. Each stage of the pipeline — STT, LLM, TTS — is configured with a plugin.

agent.py
from livekit.agents import AgentServer, rtc_session, Agent, AgentSession
from livekit.plugins import openai, silero, deepgram, cartesia

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a friendly receptionist at Bright Smile Dental clinic. "
              "Greet callers warmly and help them with appointment inquiries. "
              "Keep your responses short and conversational — one to two sentences at most."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(model="nova-3"),
      llm=openai.LLM(model="gpt-4o-mini"),
      tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
  )


if __name__ == "__main__":
  server.run()

Three new parameters: stt, llm, and tts. Each one instantiates a plugin with model-specific configuration. Let us examine each.

Speech-to-Text: Deepgram Nova-3

agent.py (excerpt)
stt=deepgram.STT(model="nova-3")

Deepgram Nova-3 is a streaming STT model optimized for real-time transcription. "Streaming" means it produces partial transcripts as audio arrives — it does not wait for the user to finish speaking. This is critical for low latency: the LLM can begin processing before the user completes their sentence.

You can also configure language, keywords, and other Deepgram-specific options:

agent.py (excerpt)
stt=deepgram.STT(
  model="nova-3",
  language="en",
  keywords=[("Bright Smile Dental", 3.0), ("Dr. Chen", 2.0)],
)

The keywords parameter boosts recognition of domain-specific terms. The number is a weight — higher means stronger boost. This is useful for proper nouns that STT models often misrecognize.

Keyword boosting for domain terms

Dental terminology, doctor names, and clinic names are frequently misrecognized by general-purpose STT. Adding them as boosted keywords dramatically improves accuracy. Add any terms your callers will say that are specific to your business.

Large Language Model: GPT-4o-mini

agent.py (excerpt)
llm=openai.LLM(model="gpt-4o-mini")

GPT-4o-mini is a fast, capable model well-suited for voice agents. It balances quality and speed — the first token arrives quickly, which is critical when every millisecond counts. For production dental receptionists, this model handles appointment inquiries, FAQ responses, and conversational flow with ease.

You can swap in other models: gpt-4o for more complex reasoning, or use different providers entirely — anthropic.LLM(model="claude-sonnet-4-20250514") if you prefer Claude.
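Swapping models or providers is a one-line change to the llm argument. A hedged sketch of the alternatives mentioned above; verify the exact model names and plugin constructors against each provider's current documentation:

```python
# Alternative LLM choices. Each line is a drop-in replacement for the
# llm= argument in session.start(). Model names may change over time;
# check the plugin docs for your installed version.
from livekit.plugins import openai, anthropic

llm = openai.LLM(model="gpt-4o")  # stronger reasoning, higher latency and cost
# llm = anthropic.LLM(model="claude-sonnet-4-20250514")  # Claude via the Anthropic plugin
```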

Text-to-Speech: Cartesia

agent.py (excerpt)
tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22")

Cartesia provides low-latency streaming TTS with natural-sounding voices. The voice parameter is a voice ID from the Cartesia voice library. Each voice has distinct characteristics — warmth, pace, pitch, accent.

Finding voice IDs

Browse available voices at play.cartesia.ai. Each voice has an ID you can copy. For a dental receptionist, choose a warm, professional voice. Try several — the voice is one of the most impactful choices you will make for your agent's personality.

You can also configure speech speed and emotion:

agent.py (excerpt)
tts=cartesia.TTS(
  voice="79a125e8-cd45-4c13-8a67-188112f4dd22",
  speed="normal",
  emotion=["positivity:high", "curiosity:medium"],
)

Voice Activity Detection (VAD)

Voice Activity Detection determines when the user is speaking and when they have stopped. It is the foundation of turn-taking — the agent needs to know when to listen and when to respond.

agent.py (excerpt)
from livekit.plugins import silero

# In your entrypoint:
await session.start(
  agent=Agent(instructions="..."),
  room=session.room,
  stt=deepgram.STT(model="nova-3"),
  llm=openai.LLM(model="gpt-4o-mini"),
  tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
  vad=silero.VAD.load(),
)

Silero VAD is a lightweight neural network that classifies audio frames as speech or non-speech. It runs locally alongside your agent — no network call required.
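VAD.load() also accepts tuning parameters. A sketch using options the Silero plugin commonly exposes; treat the exact parameter names and values here as assumptions to verify against the plugin documentation for your installed version:

```python
from livekit.plugins import silero

# Tuning knobs for speech detection (names per the Silero plugin;
# verify them in your installed version before relying on this):
vad = silero.VAD.load(
    min_speech_duration=0.05,    # ignore speech blips shorter than 50 ms
    min_silence_duration=0.55,   # require 550 ms of silence before ending a speech segment
    activation_threshold=0.5,    # per-frame speech probability cutoff
)
```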

What's happening

Without VAD, the system would not know when the user finishes a thought. It would either interrupt constantly (responding to every pause) or wait too long (waiting for an arbitrary timeout). VAD gives the pipeline the ability to detect natural turn boundaries: "the user has stopped speaking, it is time to respond."

VAD alone is not enough for natural conversation

Silero VAD detects speech vs silence, but silence does not always mean the user is done. Someone pausing to think, or taking a breath mid-sentence, produces silence that VAD detects. In Chapter 9 (Turn Detection Deep Dive), you will learn about STT-based endpointing and multilingual turn detection that handle these nuances. For now, VAD with default settings works well for basic conversations.

Turn detection

Turn detection builds on top of VAD to decide when the user has actually finished their turn and the agent should respond. The default turn detection uses VAD silence duration — if the user is silent for a threshold (typically around 300-500ms), the agent begins its response.
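The silence-threshold logic can be sketched in a few lines. This is a simplified illustration of VAD-based endpointing, not LiveKit's actual implementation:

```python
def find_turn_end(frames, frame_ms=30, threshold_ms=500):
    """Scan per-frame VAD decisions (True = speech) and return the index
    of the frame where accumulated trailing silence first crosses the
    endpointing threshold, or None if the user is still mid-turn."""
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            silence_ms = 0  # any speech resets the silence counter
        else:
            silence_ms += frame_ms
            if silence_ms >= threshold_ms:
                return i
    return None
```

This also shows why silence-only endpointing misfires: a long mid-sentence pause crosses the threshold just as readily as a finished turn does.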

agent.py (excerpt)
from livekit.agents import AgentServer, rtc_session, Agent, AgentSession, MultimodalModel

# For more sophisticated turn detection using the LLM itself:
await session.start(
  agent=Agent(instructions="..."),
  room=session.room,
  turn_detection=MultimodalModel(),
)

MultimodalModel uses the LLM to help determine when the user is done speaking. It considers context, not just silence, which reduces false turn detections. You will explore this in depth in Chapter 9.

Key events

The AgentSession emits events throughout the conversation. These are your hooks for monitoring, logging, and reacting to what happens during a call.

agent.py
from livekit.agents import AgentServer, rtc_session, Agent, AgentSession
from livekit.plugins import openai, silero, deepgram, cartesia

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  @session.on("agent_state_changed")
  def on_agent_state(state: str):
      print(f"Agent state: {state}")  # "listening", "thinking", "speaking"

  @session.on("user_state_changed")
  def on_user_state(state: str):
      print(f"User state: {state}")  # "speaking", "listening"

  @session.on("user_input_transcribed")
  def on_transcript(transcript):
      print(f"User said: {transcript.text}")

  @session.on("conversation_item_added")
  def on_item(item):
      print(f"New conversation item: {item}")

  await session.start(
      agent=Agent(
          instructions=(
              "You are a friendly receptionist at Bright Smile Dental clinic. "
              "Greet callers warmly and help them with appointment inquiries. "
              "Keep your responses short and conversational — one to two sentences at most."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(model="nova-3"),
      llm=openai.LLM(model="gpt-4o-mini"),
      tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
      vad=silero.VAD.load(),
  )


if __name__ == "__main__":
  server.run()

Let us walk through each event:

1. agent_state_changed

Fires when the agent transitions between states: listening (waiting for user input), thinking (LLM is generating a response), and speaking (TTS is playing audio). Use this for UI updates — showing a thinking indicator, for example — or for logging conversation flow.
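One practical use is measuring response latency. The helper below tracks the time between the agent entering "thinking" and leaving it; it is plain Python you could call from an agent_state_changed handler (the injectable clock exists only to make the class testable):

```python
import time

class ThinkingTimer:
    """Tracks how long the agent spends in the "thinking" state,
    i.e. roughly the gap between end-of-user-speech and speech output."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._thinking_since = None
        self.last_thinking_seconds = None

    def on_state(self, state: str) -> None:
        if state == "thinking":
            self._thinking_since = self._clock()
        elif self._thinking_since is not None:
            # left "thinking" (e.g. entered "speaking"): record the duration
            self.last_thinking_seconds = self._clock() - self._thinking_since
            self._thinking_since = None
```

Inside your handler, call `timer.on_state(state)`; `last_thinking_seconds` then approximates how long callers waited for a reply.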

2. user_state_changed

Fires when the user starts or stops speaking. The states are speaking and listening. Combined with agent_state_changed, you can track the full conversational rhythm.

3. user_input_transcribed

Fires when the STT produces a transcript of the user's speech. The transcript object includes the text, whether it is a partial or final transcript, and timing information. Use this for logging what callers say.
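Because partial transcripts stream in continuously while the user speaks, handlers often filter for finals before logging. A minimal sketch; the `text` and `is_final` attribute names are assumptions about the event payload, so adapt them to the real object:

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """Stand-in for the transcript event payload. The real object's
    attribute names (text, is_final) are assumed here, not confirmed."""
    text: str
    is_final: bool

def log_final_only(transcript, sink):
    """Append only final transcripts to sink, skipping streaming partials."""
    if transcript.is_final:
        sink.append(transcript.text)
```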

4. conversation_item_added

Fires when a new item is added to the conversation history — user messages, agent responses, and tool calls. This gives you a complete record of the conversation as it unfolds.
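A common use is persisting a call transcript. The formatter below is plain Python that works on (role, text) pairs collected from the handler; that shape is an assumption, so adapt it to whatever fields the real conversation item exposes:

```python
def render_transcript(items):
    """Format (role, text) pairs gathered from conversation_item_added
    into a plain-text call transcript, one line per item."""
    return "\n".join(f"{role}: {text}" for role, text in items)
```

In practice you would append to the list inside the handler and write the rendered transcript to storage when the session closes.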

Events are optional

You do not need to register any event handlers for the agent to work. The pipeline runs independently. Events are for observation and side effects — logging, analytics, UI updates, or triggering external actions. Add them as you need them.

The complete upgraded agent

Here is your dental receptionist with all the configuration from this chapter combined:

agent.py
from livekit.agents import AgentServer, rtc_session, Agent, AgentSession
from livekit.plugins import openai, silero, deepgram, cartesia

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  @session.on("user_input_transcribed")
  def on_transcript(transcript):
      print(f"Caller said: {transcript.text}")

  await session.start(
      agent=Agent(
          instructions=(
              "You are a friendly receptionist at Bright Smile Dental clinic. "
              "Greet callers warmly and help them with appointment inquiries. "
              "Keep your responses short and conversational — one to two sentences at most."
          ),
      ),
      room=session.room,
      stt=deepgram.STT(
          model="nova-3",
          language="en",
          keywords=[("Bright Smile Dental", 3.0)],
      ),
      llm=openai.LLM(model="gpt-4o-mini"),
      tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
      vad=silero.VAD.load(),
  )


if __name__ == "__main__":
  server.run()

Test the difference

Run your updated agent:

terminal
python agent.py dev

Open the Playground, connect, and try these tests:

Try saying: "Hello, I'm calling about an appointment."

Listen to the voice. This is the Cartesia voice you configured. It should sound different from the default TTS you heard in Chapter 2 — more natural, with a distinct character.

Try saying: "I was referred by Dr. Chen at Bright Smile Dental."

Watch the terminal output. The user_input_transcribed event handler prints the transcript. Check whether "Bright Smile Dental" and "Dr. Chen" are transcribed correctly — keyword boosting should help.

Try saying: "Actually, never mind." (while the agent is speaking)

This tests interruption handling. With VAD active, the agent should detect that you started speaking and stop its current response. The agent_state_changed event will show the transition from speaking back to listening.

What's happening

By explicitly configuring each model, you have full control over the quality, latency, and cost of your voice pipeline. You can swap Deepgram for another STT provider, change LLM models, or try different TTS voices — all by changing a single line. The AgentSession abstraction keeps the pipeline consistent regardless of which providers you choose.

Test your knowledge

Why is Silero VAD alone insufficient for natural conversation turn-taking?

What changed from Chapter 2

Your agent went from implicit defaults to explicit configuration:

  • STT: Deepgram Nova-3 with keyword boosting, replacing the framework default
  • LLM: GPT-4o-mini, now explicitly chosen
  • TTS: Cartesia with a specific voice, replacing the default
  • VAD: Silero VAD added for voice activity detection
  • Events: transcript logging added to observe the pipeline

The agent's behavior is the same — greet callers and help with appointments — but the quality of each stage is now under your control. In the next chapter, you will write proper instructions that transform this basic receptionist into a polished, professional agent.

Looking ahead

In the next chapter, you will learn how voice prompting differs from text prompting and write a comprehensive set of instructions for the dental receptionist. You will also add an initial greeting using on_enter() so the agent speaks first when someone calls.

Concepts covered
Session lifecycle, Silero VAD, Turn detection, Model config, Events, LiveKit Inference