Turn detection fundamentals
In human conversation, we instinctively know when someone has finished speaking and it is our turn to respond. Voice AI agents have no such instinct. They rely on turn detection — a system of algorithms that decides when a user has stopped talking and expects a reply. Get it wrong and your agent either interrupts users mid-sentence or sits in awkward silence.
What you'll learn
- What turn detection is and why it is critical for voice AI quality
- How Voice Activity Detection (VAD) identifies speech in an audio stream
- How endpointing determines when a speaker has finished their turn
- The core tradeoff between responsiveness and interruption avoidance
- How to configure `TurnDetectionOptions` in LiveKit Agents
The turn detection problem
When two humans talk, turn-taking feels effortless. Research shows that average turn gaps in conversation are around 200 milliseconds — faster than human reaction time. We start planning our response while the other person is still speaking, using linguistic and prosodic cues to predict when they will finish.
Voice AI agents do not have this luxury. They must solve two problems sequentially:
- Is the user speaking right now? This is Voice Activity Detection (VAD).
- Has the user finished their turn? This is endpointing.
Turn detection is the combination of VAD and endpointing that together determine turn boundaries — the points in a conversation where control passes from the user to the agent and back again. Every voice AI system needs turn detection, and its quality directly determines how natural a conversation feels.
Voice Activity Detection (VAD)
VAD is a binary classifier that analyzes audio frames and decides: speech or not speech. LiveKit Agents uses Silero VAD, a neural network model that runs locally and processes audio in real time.
VAD answers a simple question on every audio frame: "Is someone talking?" It outputs a probability between 0 and 1, and when that probability crosses a threshold, the system transitions between speech and silence states.
```python
from livekit.plugins.silero import VAD

# Load VAD with default settings
vad = VAD.load()

# VAD processes audio frames and emits events:
# - SPEECH_STARTED: user began speaking
# - SPEECH_FINISHED: user stopped speaking
```

VAD is not transcription
VAD does not understand what the user is saying — it only detects that someone is speaking. It cannot distinguish between "hello" and a cough. That is the job of the Speech-to-Text (STT) engine, which runs after VAD determines that speech is present.
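The threshold mechanism described above can be sketched in a few lines. This is a simplified illustration, not Silero VAD's actual implementation — a real VAD also applies padding and hysteresis so brief dips below the threshold do not end the speech segment:

```python
# Minimal sketch: turn per-frame speech probabilities into
# SPEECH_STARTED / SPEECH_FINISHED events with a single threshold.

def vad_events(probabilities, threshold=0.5):
    """Yield (frame_index, event) pairs from per-frame speech probabilities."""
    speaking = False
    for i, p in enumerate(probabilities):
        if not speaking and p >= threshold:
            speaking = True
            yield (i, "SPEECH_STARTED")
        elif speaking and p < threshold:
            speaking = False
            yield (i, "SPEECH_FINISHED")

# One burst of speech surrounded by silence:
events = list(vad_events([0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.05]))
# events == [(2, "SPEECH_STARTED"), (5, "SPEECH_FINISHED")]
```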
Endpointing
VAD tells you when speech stops, but silence does not always mean a turn is over. People pause mid-sentence to think, breathe, or gather their thoughts. Endpointing is the logic that decides whether a silence represents a finished turn or a mid-utterance pause.
LiveKit Agents provides `TurnDetectionOptions` to control endpointing behavior:
```python
from livekit.agents import AgentSession, TurnDetectionOptions

session = AgentSession(
    turn_detection=TurnDetectionOptions(
        # Minimum silence before considering the turn complete
        min_endpointing_delay=0.5,
        # Maximum silence — turn is always complete after this
        max_endpointing_delay=1.0,
    ),
)
```

```typescript
import { AgentSession, TurnDetectionOptions } from "@livekit/agents";

const session = new AgentSession({
  turnDetection: {
    // Minimum silence before considering the turn complete
    minEndpointingDelay: 0.5,
    // Maximum silence — turn is always complete after this
    maxEndpointingDelay: 1.0,
  },
});
```

The endpointing delay is the silence duration the system waits after VAD reports speech has ended before it commits to a turn boundary. Shorter delays make the agent more responsive but risk cutting off users. Longer delays feel more patient but introduce latency.
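One way to think about the min/max pair is as bounds on an adaptive wait: when the system is confident the utterance is complete, it commits after the minimum delay; when it suspects a mid-sentence pause, it waits up to the maximum. This is a toy sketch of that idea, not LiveKit's actual algorithm:

```python
def endpointing_delay(completion_confidence, min_delay=0.5, max_delay=1.0):
    """Interpolate between max_delay (low confidence the turn is over)
    and min_delay (high confidence), clamping confidence to [0, 1]."""
    confidence = min(max(completion_confidence, 0.0), 1.0)
    return max_delay - confidence * (max_delay - min_delay)

endpointing_delay(1.0)  # -> 0.5: clearly finished, respond quickly
endpointing_delay(0.0)  # -> 1.0: likely mid-sentence, wait longer
```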
The responsiveness tradeoff
Turn detection is fundamentally a tradeoff between two failure modes:
| Setting | Too aggressive | Too passive |
|---|---|---|
| Symptom | Agent interrupts users mid-sentence | Long pauses before agent responds |
| User experience | Frustrating, feels like the agent is not listening | Awkward, feels like the agent is slow |
| Metric | High interruption rate | High average response time |
There is no perfect setting
The right turn detection configuration depends on your use case. A fast-paced game companion needs short delays. A medical intake agent should be patient and never interrupt. There is no universal default — you must tune for your specific application.
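The two metrics in the table can be computed from simple conversation logs. This sketch assumes a hypothetical per-turn log format with an interruption flag and a response time:

```python
# Assumed log format: (user_was_interrupted, agent_response_time_seconds)
turns = [
    (False, 0.6),
    (True,  0.4),
    (False, 0.9),
    (False, 0.7),
]

interruption_rate = sum(1 for interrupted, _ in turns if interrupted) / len(turns)
avg_response_time = sum(t for _, t in turns) / len(turns)

print(f"interruption rate: {interruption_rate:.0%}")   # 25%
print(f"avg response time: {avg_response_time:.2f}s")  # 0.65s
```

Tracking both numbers over time tells you which direction to move your endpointing delays: a rising interruption rate suggests lengthening them, a rising response time suggests shortening them.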
Here is how different use cases map to endpointing delays:
```python
from livekit.agents import TurnDetectionOptions

# Fast-paced: gaming, quick Q&A
fast_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.3,
    max_endpointing_delay=0.6,
)

# Balanced: general customer service
balanced_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.5,
    max_endpointing_delay=1.0,
)

# Patient: medical intake, elderly users, complex questions
patient_turn_detection = TurnDetectionOptions(
    min_endpointing_delay=0.8,
    max_endpointing_delay=1.5,
)
```

```typescript
// Fast-paced: gaming, quick Q&A
const fastTurnDetection = {
  minEndpointingDelay: 0.3,
  maxEndpointingDelay: 0.6,
};

// Balanced: general customer service
const balancedTurnDetection = {
  minEndpointingDelay: 0.5,
  maxEndpointingDelay: 1.0,
};

// Patient: medical intake, elderly users, complex questions
const patientTurnDetection = {
  minEndpointingDelay: 0.8,
  maxEndpointingDelay: 1.5,
};
```

The turn detection pipeline
To summarize, here is how the pieces fit together in the LiveKit Agents pipeline:
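A configuration sketch that combines the pieces from this chapter — VAD detecting speech, endpointing committing turn boundaries. It reuses the parameter names from the examples above, plus a `vad` argument that I'm assuming the session accepts for supplying the Silero model:

```python
from livekit.agents import AgentSession, TurnDetectionOptions
from livekit.plugins.silero import VAD

# 1. VAD classifies each audio frame as speech or silence
# 2. STT transcribes the frames VAD flags as speech
# 3. Endpointing waits out the min-to-max silence window
# 4. The agent responds once the turn boundary is committed
session = AgentSession(
    vad=VAD.load(),
    turn_detection=TurnDetectionOptions(
        min_endpointing_delay=0.5,
        max_endpointing_delay=1.0,
    ),
)
```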
Throughout this course, you will learn to tune every part of this pipeline — from VAD sensitivity to endpointing delays to adaptive interruption handling — so your agent can hold conversations that feel natural.
Test your knowledge
What is the fundamental difference between VAD and endpointing in the turn detection pipeline?
What you learned
- Turn detection combines VAD (detecting speech) and endpointing (detecting turn completion) to identify turn boundaries
- Silero VAD runs locally and classifies audio frames as speech or silence in real time
- Endpointing uses silence duration to decide when a user has finished speaking
- `TurnDetectionOptions` controls endpointing with `min_endpointing_delay` and `max_endpointing_delay`
- The core tradeoff is responsiveness (short delays) vs. patience (long delays), and the right balance depends on your use case
Next up
In the next chapter, you will dive deep into Silero VAD configuration — tuning sensitivity, padding, prefix buffers, and buffered speech limits to get VAD working precisely for your application.