Chapter 1

Turn detection fundamentals

In human conversation, we instinctively know when someone has finished speaking and it is our turn to respond. Voice AI agents have no such instinct. They rely on turn detection — a system of algorithms that decides when a user has stopped talking and expects a reply. Get it wrong and your agent either interrupts users mid-sentence or sits in awkward silence.

What you'll learn

  • What turn detection is and why it is critical for voice AI quality
  • How Voice Activity Detection (VAD) identifies speech in an audio stream
  • How endpointing determines when a speaker has finished their turn
  • The core tradeoff between responsiveness and interruption avoidance
  • How to configure TurnDetectionOptions in LiveKit Agents

The turn detection problem

When two humans talk, turn-taking feels effortless. Research shows that average turn gaps in conversation are around 200 milliseconds — faster than human reaction time. We start planning our response while the other person is still speaking, using linguistic and prosodic cues to predict when they will finish.

Voice AI agents do not have this luxury. They must solve two problems sequentially:

  1. Is the user speaking right now? This is Voice Activity Detection (VAD).
  2. Has the user finished their turn? This is endpointing.

What's happening

Turn detection is the combination of VAD and endpointing that together determine turn boundaries — the points in a conversation where control passes from the user to the agent and back again. Every voice AI system needs turn detection, and its quality directly determines how natural a conversation feels.

Voice Activity Detection (VAD)

VAD is a binary classifier that analyzes audio frames and decides: speech or not speech. LiveKit Agents uses Silero VAD, a neural network model that runs locally and processes audio in real time.

VAD answers a simple question on every audio frame: "Is someone talking?" It outputs a probability between 0 and 1, and when that probability crosses a threshold, the system transitions between speech and silence states.
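The threshold-crossing behavior can be sketched in plain Python. This is a hypothetical illustration of the idea, not the Silero model or the LiveKit plugin's internals; real VADs typically use two thresholds (hysteresis) so the state does not flicker when the probability hovers near a single cutoff.

```python
def track_speech(probabilities, activation=0.5, deactivation=0.35):
    """Map per-frame speech probabilities to speech/silence events.

    Uses two thresholds (hysteresis): speech starts when the probability
    rises above `activation` and ends when it falls below `deactivation`.
    """
    speaking = False
    events = []
    for i, p in enumerate(probabilities):
        if not speaking and p >= activation:
            speaking = True
            events.append(("SPEECH_STARTED", i))
        elif speaking and p <= deactivation:
            speaking = False
            events.append(("SPEECH_FINISHED", i))
    return events

# Probability rises at frame 2, decays through frames 5-7
events = track_speech([0.1, 0.2, 0.8, 0.9, 0.7, 0.4, 0.2, 0.1])
# → [("SPEECH_STARTED", 2), ("SPEECH_FINISHED", 6)]
```

Note how frame 5 (probability 0.4) does not end the speech state: it is below the activation threshold but still above the deactivation threshold, so the hysteresis keeps the state stable.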

agent.py (Python)
from livekit.plugins.silero import VAD

# Load VAD with default settings
vad = VAD.load()

# VAD processes audio frames and emits events:
# - SPEECH_STARTED: user began speaking
# - SPEECH_FINISHED: user stopped speaking

VAD is not transcription

VAD does not understand what the user is saying — it only detects that someone is speaking. It cannot distinguish between "hello" and a cough. That is the job of the Speech-to-Text (STT) engine, which runs after VAD determines that speech is present.

Endpointing

VAD tells you when speech stops, but silence does not always mean a turn is over. People pause mid-sentence to think, breathe, or gather their thoughts. Endpointing is the logic that decides whether a silence represents a finished turn or a mid-utterance pause.

LiveKit Agents provides TurnDetectionOptions to control endpointing behavior:

agent.py (Python)
from livekit.agents import AgentSession, TurnDetectionOptions

session = AgentSession(
    turn_detection=TurnDetectionOptions(
        # Minimum silence before considering the turn complete
        min_endpointing_delay=0.5,
        # Maximum silence — turn is always complete after this
        max_endpointing_delay=1.0,
    ),
)
agent.ts (TypeScript)
import { AgentSession, TurnDetectionOptions } from "@livekit/agents";

const session = new AgentSession({
  turnDetection: {
    // Minimum silence before considering the turn complete
    minEndpointingDelay: 0.5,
    // Maximum silence — turn is always complete after this
    maxEndpointingDelay: 1.0,
  },
});

The endpointing delay is the silence duration the system waits after VAD reports speech has ended before it commits to a turn boundary. Shorter delays make the agent more responsive but risk cutting off users. Longer delays feel more patient but introduce latency.
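The min/max interplay can be sketched as a simple decision function. This is a hypothetical illustration of the logic described above, not LiveKit internals; in particular, the `likely_more_speech` signal is an assumed stand-in for whatever extra evidence a system might use (such as a model predicting that the user will continue).

```python
def turn_is_complete(silence_s, min_delay=0.5, max_delay=1.0,
                     likely_more_speech=False):
    """Decide whether a silence marks a turn boundary.

    - Below min_delay: never complete (could be a breath or pause).
    - At or above max_delay: always complete, whatever other signals say.
    - In between: complete unless some signal argues for waiting longer.
    """
    if silence_s < min_delay:
        return False
    if silence_s >= max_delay:
        return True
    return not likely_more_speech

turn_is_complete(0.3)                           # False: under the minimum delay
turn_is_complete(0.7)                           # True
turn_is_complete(0.7, likely_more_speech=True)  # False: wait a little longer
turn_is_complete(1.2, likely_more_speech=True)  # True: maximum delay reached
```

The two delays thus bound the decision: the minimum guards against reacting to every micro-pause, while the maximum caps how long the agent can keep waiting.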

The responsiveness tradeoff

Turn detection is fundamentally a tradeoff between two failure modes:

Too aggressive (delays too short)

  • Symptom: agent interrupts users mid-sentence
  • User experience: frustrating, feels like the agent is not listening
  • Metric: high interruption rate

Too passive (delays too long)

  • Symptom: long pauses before the agent responds
  • User experience: awkward, feels like the agent is slow
  • Metric: high average response time
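Both failure-mode metrics are easy to compute from conversation logs. The record format below is hypothetical, assuming you log whether each turn was interrupted and how long the agent took to respond:

```python
def turn_metrics(turns):
    """Compute interruption rate and average response time.

    Each record: {"interrupted": bool, "response_time_s": float}.
    """
    n = len(turns)
    interruption_rate = sum(t["interrupted"] for t in turns) / n
    avg_response_time = sum(t["response_time_s"] for t in turns) / n
    return interruption_rate, avg_response_time

turns = [
    {"interrupted": True,  "response_time_s": 0.4},
    {"interrupted": False, "response_time_s": 0.6},
    {"interrupted": False, "response_time_s": 0.5},
    {"interrupted": True,  "response_time_s": 0.3},
]
rate, avg = turn_metrics(turns)  # rate = 0.5, avg = 0.45
```

Tracking both numbers as you adjust the delays makes the tradeoff concrete: shortening the delays lowers average response time but pushes the interruption rate up, and vice versa.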

There is no perfect setting

The right turn detection configuration depends on your use case. A fast-paced game companion needs short delays. A medical intake agent should be patient and never interrupt. There is no universal default — you must tune for your specific application.

Here is how different use cases map to endpointing delays:

profiles.py (Python)
from livekit.agents import TurnDetectionOptions

# Fast-paced: gaming, quick Q&A
fast_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.3,
    max_endpointing_delay=0.6,
)

# Balanced: general customer service
balanced_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.5,
    max_endpointing_delay=1.0,
)

# Patient: medical intake, elderly users, complex questions
patient_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.8,
    max_endpointing_delay=1.5,
)
profiles.ts (TypeScript)
// Fast-paced: gaming, quick Q&A
const fastTurnDetection = {
  minEndpointingDelay: 0.3,
  maxEndpointingDelay: 0.6,
};

// Balanced: general customer service
const balancedTurnDetection = {
  minEndpointingDelay: 0.5,
  maxEndpointingDelay: 1.0,
};

// Patient: medical intake, elderly users, complex questions
const patientTurnDetection = {
  minEndpointingDelay: 0.8,
  maxEndpointingDelay: 1.5,
};

The turn detection pipeline

To summarize, here is how the pieces fit together in the LiveKit Agents pipeline:

  1. Audio frames arrive from the user's microphone via WebRTC.
  2. VAD (Silero) analyzes each frame and detects speech start/stop events.
  3. When VAD reports speech has ended, the endpointing timer starts.
  4. If the silence duration exceeds the endpointing delay, a turn boundary is declared.
  5. The agent's pipeline (STT, LLM, TTS) begins processing the user's turn.

Throughout this course, you will learn to tune every part of this pipeline — from VAD sensitivity to endpointing delays to adaptive interruption handling — so your agent can hold conversations that feel natural.

Test your knowledge

What is the fundamental difference between VAD and endpointing in the turn detection pipeline?

What you learned

  • Turn detection combines VAD (detecting speech) and endpointing (detecting turn completion) to identify turn boundaries
  • Silero VAD runs locally and classifies audio frames as speech or silence in real time
  • Endpointing uses silence duration to decide when a user has finished speaking
  • TurnDetectionOptions controls endpointing with min_endpointing_delay and max_endpointing_delay
  • The core tradeoff is responsiveness (short delays) vs. patience (long delays), and the right balance depends on your use case

Next up

In the next chapter, you will dive deep into Silero VAD configuration — tuning sensitivity, padding, prefix buffers, and buffered speech limits to get VAD working precisely for your application.

Concepts covered
VAD · Endpointing · Turn boundaries