Chapter 9

Turn detection deep dive

In this chapter, you will learn the three approaches to detecting when a speaker has finished their turn, how to configure endpointing delays, how to handle interruptions gracefully, and how to enable backchanneling so that filler words like "mmhmm" do not derail the conversation. By the end, your dental receptionist will feel patient-friendly — tolerant of pauses, resilient to false interruptions, and natural in conversation flow.

Concepts: TurnHandlingOptions · Endpointing · Adaptive interruptions · False interruptions · Backchanneling

Why turn detection is hard

Human speech is messy. People pause mid-sentence to think. They say "um" and "uh" without intending to yield the floor. Background noise — a dog barking, a door closing — can sound like speech onset. And sometimes a person genuinely wants to interrupt to correct something the agent said.

A voice AI agent must distinguish between all of these cases in real time, with no ability to look ahead. Get it wrong in one direction and the agent cuts people off. Get it wrong in the other direction and there are awkward silences after every sentence. Turn detection is the difference between an agent that feels conversational and one that feels robotic.

Three approaches to turn detection

LiveKit Agents provides three turn detection strategies, each with different tradeoffs.

VAD-only relies on Voice Activity Detection — specifically Silero VAD — to detect when the user stops speaking. When the VAD model reports silence for a configured duration, the system triggers an end-of-turn. This is fast and language-agnostic, but it cannot distinguish between a mid-sentence pause and a genuine turn boundary.

STT endpointing lets the speech-to-text provider decide when the user has finished. STT models have linguistic context — they can tell that "I'd like to book an appointment for..." is not a complete utterance. This is smarter than pure VAD but adds the latency of the STT model's own endpointing logic.

Multilingual turn detector is a purpose-built model that analyzes both audio features and partial transcriptions to predict turn boundaries. It understands conversational patterns across languages and is the most accurate option for natural-sounding conversations.
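To make the VAD-only tradeoff concrete, here is a toy silence-timer sketch in plain Python (an illustration of the idea, not LiveKit's implementation): end-of-turn fires as soon as reported silence exceeds a fixed threshold, with no linguistic context at all.

```python
# Toy illustration of VAD-only endpointing: end-of-turn fires once the
# VAD has reported silence for longer than a fixed threshold. This is a
# simplified sketch, not LiveKit's actual implementation.

class VadEndpointer:
    def __init__(self, silence_threshold: float = 0.5):
        self.silence_threshold = silence_threshold  # seconds of silence required
        self.silence_elapsed = 0.0

    def on_frame(self, is_speech: bool, frame_duration: float) -> bool:
        """Feed one VAD frame; returns True when end-of-turn triggers."""
        if is_speech:
            self.silence_elapsed = 0.0  # any speech resets the timer
            return False
        self.silence_elapsed += frame_duration
        return self.silence_elapsed >= self.silence_threshold

ep = VadEndpointer(silence_threshold=0.5)
frames = [True] * 10 + [False] * 20  # 0.3s of speech, then 0.6s of silence (30ms frames)
results = [ep.on_frame(f, 0.03) for f in frames]
print(results.index(True))  # frame index where end-of-turn fires, after ~0.5s of silence
```

Notice what is missing: nothing in this logic can tell that "I'd like to book an appointment for..." is unfinished. Any sufficiently long pause triggers end-of-turn, which is exactly the limitation the smarter strategies address.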

Start with the multilingual model

For most voice AI applications, the multilingual turn detector provides the best balance of accuracy and latency. It is what we will use for the dental receptionist.

Configuring turn handling

Turn detection is configured through TurnHandlingOptions on the Agent class. Here is the basic setup using the multilingual model:

agent.py
from livekit.agents import Agent, TurnHandlingOptions, MultilingualModel

agent = Agent(
    instructions="You are a dental clinic receptionist...",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
)

This replaces the default VAD-only detection with the multilingual model. But the real power comes from the additional parameters that control timing and interruption behavior.

Endpointing: when to trigger end-of-turn

Endpointing controls how long the system waits after detecting potential speech completion before committing to an end-of-turn decision. Two parameters matter:

  • min_endpointing_delay — the minimum silence duration before an end-of-turn can fire. Even if the model is confident the user is done, it waits at least this long. This prevents the agent from jumping in during natural breathing pauses.
  • max_endpointing_delay — the maximum silence duration before an end-of-turn fires regardless of model confidence. This is the safety net that prevents indefinite waiting.

agent.py
from livekit.agents import Agent, TurnHandlingOptions, MultilingualModel

agent = Agent(
    instructions="You are a dental clinic receptionist...",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
        min_endpointing_delay=0.5,
        max_endpointing_delay=1.5,
    ),
)

For a dental receptionist, slightly longer delays are appropriate. Patients calling a dental office are often nervous, thinking about their symptoms, or checking their calendar. A half-second minimum delay gives them room to breathe without the agent jumping in prematurely.

What's happening

The gap between min_endpointing_delay and max_endpointing_delay defines a window where the turn detection model's confidence determines the exact trigger point. With 0.5 and 1.5, a clearly finished sentence ("I'd like Tuesday at 3.") fires near 0.5 seconds, while a trailing thought ("I was thinking maybe... Tuesday?") gets the full 1.5 seconds of patience.
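One way to picture that window is as an interpolation: higher end-of-turn confidence pulls the trigger point toward min_endpointing_delay, while lower confidence pushes it toward max_endpointing_delay. The formula below is a conceptual model for building intuition, not LiveKit's actual internal logic.

```python
# Conceptual model of the endpointing window: higher end-of-turn
# confidence shortens the wait toward min_endpointing_delay, lower
# confidence stretches it toward max_endpointing_delay.
# This is an illustrative formula, not LiveKit's actual implementation.

def endpointing_delay(confidence: float,
                      min_delay: float = 0.5,
                      max_delay: float = 1.5) -> float:
    confidence = max(0.0, min(1.0, confidence))  # clamp to [0, 1]
    return max_delay - confidence * (max_delay - min_delay)

print(endpointing_delay(0.95))  # "I'd like Tuesday at 3." -> waits close to 0.5s
print(endpointing_delay(0.10))  # "maybe... Tuesday?"      -> waits close to 1.5s
```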

Interruption handling

When a user speaks while the agent is talking, that is an interruption. The question is: did the user genuinely want to interrupt, or was it a cough, a background noise, or an accidental "um"?

LiveKit provides two interruption modes:

VAD mode treats any detected voice activity as a genuine interruption. The agent stops speaking immediately. This is responsive but aggressive — a cough in the background will cut the agent off mid-sentence.

Adaptive mode uses the turn detection model to assess whether the interruption is intentional. Brief noise or filler words are ignored. Only sustained, intentional speech causes the agent to yield. This feels dramatically more natural.

agent.py
from livekit.agents import Agent, TurnHandlingOptions, MultilingualModel

agent = Agent(
    instructions="You are a dental clinic receptionist...",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
        min_endpointing_delay=0.5,
        max_endpointing_delay=1.5,
        interruption_mode="adaptive",
    ),
)

Adaptive mode requires the multilingual model

Adaptive interruption mode only works with the multilingual turn detector. If you are using VAD-only turn detection, you are limited to VAD interruption mode.

False interruption recovery

Even with adaptive mode, false interruptions happen. A patient might say "mmhmm" while the agent is reading back appointment details, causing the agent to stop. Two parameters handle this gracefully:

false_interruption_timeout sets how long the system waits after an interruption before deciding whether the user actually wants to speak. If the user goes silent within this window, the interruption is classified as false.

resume_false_interruption controls whether the agent picks up where it left off after a false interruption. When enabled, the agent resumes its interrupted speech rather than starting a new response.

agent.py
from livekit.agents import Agent, TurnHandlingOptions, MultilingualModel

agent = Agent(
    instructions="You are a dental clinic receptionist...",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
        min_endpointing_delay=0.5,
        max_endpointing_delay=1.5,
        interruption_mode="adaptive",
        false_interruption_timeout=0.6,
        resume_false_interruption=True,
    ),
)

With this configuration, if a patient says "uh huh" while the agent is confirming their booking, the agent pauses briefly, determines the patient was not actually taking the floor, and continues the confirmation from where it stopped. Without this, the agent would either restart its entire response or generate a confused follow-up.

Backchanneling

Backchanneling is the conversational habit of making small sounds — "mmhmm", "uh huh", "yeah", "ok" — to signal that you are listening without intending to take the floor. In human conversation, these sounds are essential. In voice AI, they are historically a source of chaos, because the system interprets them as a new turn.

The multilingual turn detector recognizes backchannel signals and does not treat them as turn attempts. This means a patient can say "mmhmm" while the agent describes available appointment slots, and the agent will continue uninterrupted.

No additional configuration is needed — backchanneling awareness is built into the multilingual model. It is one of the key advantages over VAD-only detection, which has no way to distinguish "mmhmm" from "Actually, wait."
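Conceptually, the distinction the model learns could be caricatured as a lookup on the partial transcript. The keyword version below is only a toy: the real multilingual turn detector learns backchannel patterns from audio and transcript features across languages rather than matching a fixed word list.

```python
# Toy sketch of backchannel detection via keyword matching. The real
# multilingual turn detector learns this from audio and transcript
# features; this lookup-table version only illustrates the idea.

BACKCHANNELS = {"mmhmm", "mm-hmm", "uh huh", "uh-huh", "yeah", "ok", "okay", "right"}

def is_backchannel(partial_transcript: str) -> bool:
    """True if the utterance looks like a listening signal, not a turn attempt."""
    return partial_transcript.strip().lower().rstrip(".!,") in BACKCHANNELS

print(is_backchannel("Mmhmm"))           # listening signal: agent keeps talking
print(is_backchannel("Actually, wait"))  # genuine interruption: agent yields
```

The word-list approach breaks down quickly in practice ("yeah" can also be a real answer to a yes/no question), which is why a learned model that also hears prosody does so much better.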

The complete dental receptionist turn handling config

Bringing it all together, here is the patient-friendly turn handling configuration for the dental receptionist:

agent.py
from livekit.agents import Agent, TurnHandlingOptions, MultilingualModel

class DentalReceptionist(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a friendly receptionist at Bright Smile Dental Clinic.
            Help patients check availability and book appointments.
            Be patient — callers may be nervous or distracted.""",
            turn_handling=TurnHandlingOptions(
                turn_detection=MultilingualModel(),
                min_endpointing_delay=0.5,
                max_endpointing_delay=1.5,
                interruption_mode="adaptive",
                false_interruption_timeout=0.6,
                resume_false_interruption=True,
            ),
        )

    async def on_enter(self):
        await self.session.generate_reply(
            instructions="Greet the caller warmly and ask how you can help."
        )

This configuration says: use the multilingual turn detector, wait at least half a second before deciding the patient is done speaking, allow up to 1.5 seconds for trailing thoughts, use adaptive interruption mode so background noise does not cut off the agent, wait 0.6 seconds to classify ambiguous interruptions, and resume speech if the interruption was false.

Test it

Connect to your agent in Playground and try the following:

Test endpointing patience. Say "I'd like to book an appointment for..." and pause for a full second before continuing with "...Tuesday." The agent should wait for you rather than jumping in during the pause.

Test adaptive interruptions. Let the agent start speaking, then cough or make a brief noise. With adaptive mode, the agent should continue talking. Now try genuinely interrupting with "Actually, wait" — the agent should stop and yield to you.

Test false interruption recovery. Let the agent begin reading back appointment details. Say "mmhmm" or "ok" partway through. The agent should briefly pause, then resume its confirmation from where it left off.

Test backchanneling. While the agent lists available time slots, periodically say "uh huh." The agent should continue listing slots without treating each acknowledgment as a new turn.

Tuning is iterative

The delay values above are a starting point. Every application has different conversational dynamics. A fast-paced sales agent might use a min_endpointing_delay of 0.3. A medical intake agent handling elderly patients might go to 0.8. Test with real users and adjust.
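As a starting point for that tuning, the contrast might look like the presets below. The values are illustrative only (drawn from the rough figures above, not LiveKit recommendations), expressed as plain keyword arguments you could splat into TurnHandlingOptions.

```python
# Illustrative endpointing presets. Values are starting points for tuning,
# not LiveKit recommendations. Each dict holds keyword arguments intended
# for TurnHandlingOptions.

PRESETS = {
    "fast_sales": {
        "min_endpointing_delay": 0.3,  # jump in quickly on clear completions
        "max_endpointing_delay": 1.0,
    },
    "dental_receptionist": {
        "min_endpointing_delay": 0.5,  # room for nervous or distracted callers
        "max_endpointing_delay": 1.5,
    },
    "medical_intake": {
        "min_endpointing_delay": 0.8,  # extra patience for elderly patients
        "max_endpointing_delay": 2.0,
    },
}

# Usage sketch:
# TurnHandlingOptions(turn_detection=MultilingualModel(), **PRESETS["medical_intake"])
print(PRESETS["medical_intake"]["min_endpointing_delay"])
```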

The difference between a well-tuned turn detection configuration and the defaults is immediately obvious. In the next chapter, you will learn how to manage the conversation context that accumulates across all of these turns — reading it, injecting into it, and trimming it when conversations run long.

Concepts covered
TurnHandlingOptions · Endpointing · Adaptive interruptions · False interruptions · Backchanneling