Chapter 3

Multilingual turn detection

Not every conversation happens in English. Japanese speakers pause longer between phrases. Spanish speakers overlap more freely. A turn detector trained on English pause patterns will misfire constantly when your users speak Mandarin or Arabic. LiveKit provides a multilingual turn detection model that adapts to the rhythmic and prosodic patterns of different languages.

What you'll learn

  • Why language-specific pause patterns break single-language turn detectors
  • How to enable and configure the multilingual turn detection model
  • How to set language hints for known-language scenarios
  • How to handle cross-lingual conversations where users switch languages mid-session

The problem with universal pause thresholds

English conversational pauses average around 200 milliseconds. Japanese speakers regularly pause 500 milliseconds or more between clauses without yielding their turn. Finnish speakers are comfortable with silences that would feel awkward in Italian. A single endpointing threshold cannot serve all languages.
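To make the failure mode concrete, here is a toy sketch. The threshold values are illustrative assumptions, not numbers from LiveKit's model; the point is only that one cutoff cannot fit every language.

```python
# Toy illustration: a single fixed pause threshold vs. per-language thresholds.
# All numbers here are illustrative, not taken from any real turn detector.

FIXED_THRESHOLD_S = 0.3  # a cutoff tuned for English-style turn-taking

# Hypothetical per-language thresholds reflecting typical pause tolerance
LANGUAGE_THRESHOLDS_S = {
    "en": 0.3,
    "ja": 0.7,   # longer clause-internal pauses are normal
    "es": 0.25,  # faster turn-taking
    "fi": 0.9,   # comfortable with long silences
}

def is_turn_end(pause_s: float, threshold_s: float = FIXED_THRESHOLD_S) -> bool:
    """Return True if a silence of pause_s seconds is treated as a turn boundary."""
    return pause_s >= threshold_s

# A 0.5 s mid-sentence pause from a Japanese speaker:
print(is_turn_end(0.5))                               # fixed cutoff: True, agent interrupts
print(is_turn_end(0.5, LANGUAGE_THRESHOLDS_S["ja"]))  # per-language: False, agent waits
```

The same half-second of silence is a turn boundary under the English-tuned cutoff and a mid-turn pause under the Japanese one, which is exactly the misfire described above.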

What's happening

The multilingual turn detection model is trained on conversation data from multiple languages. Instead of relying on a fixed silence threshold, it uses acoustic and temporal features that account for language-specific turn-taking patterns. This means it can tolerate longer pauses in Japanese while remaining responsive in Spanish.
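A rough intuition for "acoustic and temporal features" can be sketched as a scoring function: normalize the pause by what is typical for the language, then add a prosodic cue such as utterance-final falling pitch. The scoring rule and every number below are assumptions for illustration only, not the model's actual internals.

```python
# Toy sketch of language-aware endpointing: score a silence instead of
# comparing it to a fixed cutoff. Illustrative numbers only.

TYPICAL_PAUSE_S = {"en": 0.2, "ja": 0.5, "es": 0.15}

def turn_boundary_score(pause_s: float, falling_pitch: bool, language: str) -> float:
    """Higher score means the silence looks more like a turn boundary."""
    # Temporal feature: how long is this pause relative to the language norm?
    relative_pause = pause_s / TYPICAL_PAUSE_S[language]
    # Acoustic feature: falling pitch at the end of an utterance is a
    # turn-yielding cue in many languages.
    prosody_bonus = 0.5 if falling_pitch else 0.0
    return relative_pause + prosody_bonus

def is_turn_boundary(pause_s: float, falling_pitch: bool, language: str) -> bool:
    return turn_boundary_score(pause_s, falling_pitch, language) >= 2.0

# The same 0.4 s pause reads very differently per language:
print(is_turn_boundary(0.4, falling_pitch=False, language="en"))  # True
print(is_turn_boundary(0.4, falling_pitch=False, language="ja"))  # False
```

Because the pause is judged relative to the language, the detector can tolerate long Japanese pauses while staying responsive for Spanish, as described above.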

Enabling the multilingual model

You enable the multilingual model by configuring your AgentSession with the appropriate turn detection settings. It runs alongside the default Silero VAD rather than replacing it: VAD still detects speech and silence, while the multilingual model decides what the silence means.

agent.py (Python)
from livekit.agents import AgentSession, TurnDetectionOptions

session = AgentSession(
  turn_detection=TurnDetectionOptions(
      multilingual_model=True,
      # Optional: hint the expected language
      language="ja",
  ),
)
agent.ts (TypeScript)
import { AgentSession } from "@livekit/agents";

const session = new AgentSession({
  turnDetection: {
    multilingualModel: true,
    // Optional: hint the expected language
    language: "ja",
  },
});

Language hint is optional

If you know the user's language ahead of time (from their locale, account settings, or an IVR menu selection), pass it as a hint. If you do not, the model will attempt to adapt based on the acoustic features of the conversation itself.
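If the hint comes from a locale string, it usually needs trimming first. This is a small sketch under the assumption that user metadata carries a BCP 47 locale such as "ja-JP"; the helper name is hypothetical.

```python
from typing import Optional

def language_hint_from_locale(locale: Optional[str]) -> Optional[str]:
    """Map a BCP 47 locale like 'ja-JP' to a bare language code, or None."""
    if not locale:
        return None  # unknown language: omit the hint and let the model adapt
    return locale.split("-")[0].lower()

print(language_hint_from_locale("ja-JP"))  # ja
print(language_hint_from_locale(None))     # None
```

Pass the result as the language option only when it is not None; a None result corresponds to the adaptive, no-hint configuration.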

Language-specific tuning

Even with the multilingual model enabled, you may want to adjust endpointing delays per language. Here are recommended starting points based on conversational research.

language_profiles.py (Python)
from livekit.agents import TurnDetectionOptions

# English — standard timing
english_config = TurnDetectionOptions(
  multilingual_model=True,
  language="en",
  min_endpointing_delay=0.5,
  max_endpointing_delay=1.0,
)

# Japanese — longer natural pauses
japanese_config = TurnDetectionOptions(
  multilingual_model=True,
  language="ja",
  min_endpointing_delay=0.8,
  max_endpointing_delay=1.5,
)

# Spanish — faster turn-taking
spanish_config = TurnDetectionOptions(
  multilingual_model=True,
  language="es",
  min_endpointing_delay=0.3,
  max_endpointing_delay=0.8,
)

# Arabic — moderate pauses, different conversational pacing
arabic_config = TurnDetectionOptions(
  multilingual_model=True,
  language="ar",
  min_endpointing_delay=0.6,
  max_endpointing_delay=1.2,
)
language_profiles.ts (TypeScript)
// English — standard timing
const englishConfig = {
  multilingualModel: true,
  language: "en",
  minEndpointingDelay: 0.5,
  maxEndpointingDelay: 1.0,
};

// Japanese — longer natural pauses
const japaneseConfig = {
  multilingualModel: true,
  language: "ja",
  minEndpointingDelay: 0.8,
  maxEndpointingDelay: 1.5,
};

// Spanish — faster turn-taking
const spanishConfig = {
  multilingualModel: true,
  language: "es",
  minEndpointingDelay: 0.3,
  maxEndpointingDelay: 0.8,
};

// Arabic — moderate pauses
const arabicConfig = {
  multilingualModel: true,
  language: "ar",
  minEndpointingDelay: 0.6,
  maxEndpointingDelay: 1.2,
};

Cross-lingual conversations

In some applications, users switch languages mid-conversation — a bilingual support agent, for example, where a user starts in English and switches to Spanish. The multilingual model handles this more gracefully than a fixed-language detector.

cross_lingual.py (Python)
from livekit.agents import AgentSession, TurnDetectionOptions

# No language hint — let the model adapt dynamically
session = AgentSession(
  turn_detection=TurnDetectionOptions(
      multilingual_model=True,
  ),
)

Cross-lingual detection has limits

The multilingual model adapts over time, but rapid language switching within a single sentence can still confuse it. If your use case involves heavy code-switching (mixing two languages in one utterance), you may need to combine the multilingual model with longer endpointing delays to compensate.
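As a rough mitigation, you can widen both endpointing delays before building the session options. The scaling factor below is an assumption for illustration, not a documented recommendation, and the helper name is hypothetical.

```python
def widen_delays(min_delay_s: float, max_delay_s: float,
                 factor: float = 1.5) -> tuple[float, float]:
    """Scale both endpointing delays up to tolerate code-switching hesitations."""
    return (round(min_delay_s * factor, 2), round(max_delay_s * factor, 2))

# Starting from the English profile shown earlier in this chapter:
min_d, max_d = widen_delays(0.5, 1.0)
print(min_d, max_d)  # 0.75 1.5
```

The widened values then go into min_endpointing_delay and max_endpointing_delay as in the earlier profiles; the trade-off is slightly higher response latency in exchange for fewer false turn boundaries.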

Combining multilingual detection with VAD tuning

The multilingual model works alongside Silero VAD. You can still tune VAD parameters independently. The multilingual model primarily affects endpointing decisions, while VAD handles the raw speech/silence classification.

1. Silero VAD detects speech onset and offset in the audio stream.
2. The multilingual model analyzes the acoustic context when silence is detected.
3. Based on language-specific patterns, the model decides if the silence is a mid-turn pause or a turn boundary.
4. If a turn boundary is detected, the agent pipeline begins processing.
combined.py (Python)
from livekit.plugins.silero import VAD
from livekit.agents import AgentSession, TurnDetectionOptions

# Tune VAD for noisy environment
vad = VAD.load(
  min_speaking_duration=0.3,
  padding_duration=0.4,
)

# Enable multilingual turn detection
session = AgentSession(
  vad=vad,
  turn_detection=TurnDetectionOptions(
      multilingual_model=True,
      language="ja",
      min_endpointing_delay=0.8,
      max_endpointing_delay=1.5,
  ),
)
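The four pipeline steps above can be sketched as a single decision function. The classifier here is a stand-in for the multilingual model, and the delay semantics (never end before the minimum, always end at the maximum, consult the model in between) are an assumption consistent with the min/max endpointing options used in this chapter.

```python
from typing import Callable

def decide(pause_s: float,
           min_delay_s: float,
           max_delay_s: float,
           model_says_boundary: Callable[[float], bool]) -> str:
    """Decide what to do once VAD reports a silence of pause_s seconds."""
    if pause_s < min_delay_s:
        return "wait"      # never end a turn before the minimum delay
    if pause_s >= max_delay_s:
        return "respond"   # always end the turn once the maximum delay elapses
    # In between, defer to the language-aware model
    return "respond" if model_says_boundary(pause_s) else "wait"

# Stand-in classifier: treat pauses over 1.0 s as turn boundaries
clf = lambda p: p > 1.0
print(decide(0.3, 0.8, 1.5, clf))  # wait (below min delay)
print(decide(1.2, 0.8, 1.5, clf))  # respond (model says boundary)
print(decide(2.0, 0.8, 1.5, clf))  # respond (max delay reached)
```

This separation mirrors the division of labor described above: VAD supplies the raw silence measurement, and the language-aware component decides what that silence means.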


What you learned

  • Different languages have fundamentally different pause patterns and turn-taking conventions
  • The multilingual model adapts endpointing decisions based on language-specific acoustic features
  • Language hints improve accuracy when you know the user's language in advance
  • For cross-lingual conversations, omit the language hint and let the model adapt dynamically
  • The multilingual model works alongside Silero VAD — they handle different aspects of turn detection

Next up

In the next chapter, you will learn to fine-tune endpointing configuration — the delays that determine exactly how long the agent waits after speech ends before responding.

Concepts covered
Multilingual model · Language detection · Cross-lingual