Multilingual turn detection
Not every conversation happens in English. Japanese speakers pause longer between phrases. Spanish speakers overlap more freely. A turn detector trained on English pause patterns will misfire constantly when your users speak Mandarin or Arabic. LiveKit provides a multilingual turn detection model that adapts to the rhythmic and prosodic patterns of different languages.
What you'll learn
- Why language-specific pause patterns break single-language turn detectors
- How to enable and configure the multilingual turn detection model
- How to set language hints for known-language scenarios
- How to handle cross-lingual conversations where users switch languages mid-session
The problem with universal pause thresholds
English conversational pauses average around 200 milliseconds. Japanese speakers regularly pause 500 milliseconds or more between clauses without yielding their turn. Finnish speakers are comfortable with silences that would feel awkward in Italian. A single endpointing threshold cannot serve all languages.
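To see the failure mode concretely, here is a small illustrative sketch. The pause figures are the rough averages cited above, not measured data, and the 0.35-second threshold is an arbitrary example value: a detector with one fixed threshold ends the turn whenever silence exceeds it, so a typical Japanese mid-clause pause already looks like a completed turn.

```python
# Rough average mid-turn pause lengths in seconds, per the figures above.
# These are illustrative numbers, not measurements.
TYPICAL_PAUSE = {"en": 0.2, "ja": 0.5}

FIXED_THRESHOLD = 0.35  # one universal endpointing threshold, in seconds

def cut_off_too_early(language: str) -> bool:
    """True if a typical mid-turn pause in this language exceeds the
    threshold, meaning the detector would end the turn prematurely."""
    return TYPICAL_PAUSE[language] > FIXED_THRESHOLD

print(cut_off_too_early("en"))  # False: English pauses fit under the threshold
print(cut_off_too_early("ja"))  # True: a normal Japanese pause trips it
```

Raising the threshold to accommodate Japanese would make the agent feel sluggish for English speakers, which is exactly why the model conditions on language instead of a single constant.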
The multilingual turn detection model is trained on conversation data from multiple languages. Instead of relying on a fixed silence threshold, it uses acoustic and temporal features that account for language-specific turn-taking patterns. This means it can tolerate longer pauses in Japanese while remaining responsive in Spanish.
Enabling the multilingual model
You enable the multilingual model by passing the appropriate turn detection settings to your AgentSession. It complements the default Silero VAD rather than replacing it: VAD still classifies speech versus silence, while the multilingual model informs endpointing.
```python
from livekit.agents import AgentSession, TurnDetectionOptions

session = AgentSession(
    turn_detection=TurnDetectionOptions(
        multilingual_model=True,
        # Optional: hint the expected language
        language="ja",
    ),
)
```

```typescript
import { AgentSession } from "@livekit/agents";

const session = new AgentSession({
  turnDetection: {
    multilingualModel: true,
    // Optional: hint the expected language
    language: "ja",
  },
});
```

Language hint is optional
If you know the user's language ahead of time (from their locale, account settings, or an IVR menu selection), pass it as a hint. If you do not, the model will attempt to adapt based on the acoustic features of the conversation itself.
Language-specific tuning
Even with the multilingual model enabled, you may want to adjust endpointing delays per language. Here are recommended starting points based on conversational research.
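If you serve several locales from one codebase, the starting points below can be kept in a single lookup table keyed by language code. A hypothetical sketch (the `delays_for` helper and the English fallback are this example's own choices, not library API):

```python
# (min_endpointing_delay, max_endpointing_delay) in seconds, mirroring the
# per-language starting points recommended in this section.
ENDPOINTING_DELAYS = {
    "en": (0.5, 1.0),  # standard timing
    "ja": (0.8, 1.5),  # longer natural pauses
    "es": (0.3, 0.8),  # faster turn-taking
    "ar": (0.6, 1.2),  # moderate pauses
}

def delays_for(language: str) -> tuple[float, float]:
    """Look up endpointing delays, falling back to English timing."""
    return ENDPOINTING_DELAYS.get(language, ENDPOINTING_DELAYS["en"])

print(delays_for("ja"))  # (0.8, 1.5)
print(delays_for("fi"))  # (0.5, 1.0), unlisted languages fall back to English
```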
```python
from livekit.agents import TurnDetectionOptions

# English — standard timing
english_config = TurnDetectionOptions(
    multilingual_model=True,
    language="en",
    min_endpointing_delay=0.5,
    max_endpointing_delay=1.0,
)

# Japanese — longer natural pauses
japanese_config = TurnDetectionOptions(
    multilingual_model=True,
    language="ja",
    min_endpointing_delay=0.8,
    max_endpointing_delay=1.5,
)

# Spanish — faster turn-taking
spanish_config = TurnDetectionOptions(
    multilingual_model=True,
    language="es",
    min_endpointing_delay=0.3,
    max_endpointing_delay=0.8,
)

# Arabic — moderate pauses, different conversational pacing
arabic_config = TurnDetectionOptions(
    multilingual_model=True,
    language="ar",
    min_endpointing_delay=0.6,
    max_endpointing_delay=1.2,
)
```

```typescript
// English — standard timing
const englishConfig = {
  multilingualModel: true,
  language: "en",
  minEndpointingDelay: 0.5,
  maxEndpointingDelay: 1.0,
};

// Japanese — longer natural pauses
const japaneseConfig = {
  multilingualModel: true,
  language: "ja",
  minEndpointingDelay: 0.8,
  maxEndpointingDelay: 1.5,
};

// Spanish — faster turn-taking
const spanishConfig = {
  multilingualModel: true,
  language: "es",
  minEndpointingDelay: 0.3,
  maxEndpointingDelay: 0.8,
};

// Arabic — moderate pauses
const arabicConfig = {
  multilingualModel: true,
  language: "ar",
  minEndpointingDelay: 0.6,
  maxEndpointingDelay: 1.2,
};
```

Cross-lingual conversations
In some applications, users switch languages mid-conversation — a bilingual support agent, for example, where a user starts in English and switches to Spanish. The multilingual model handles this more gracefully than a fixed-language detector.
```python
from livekit.agents import AgentSession, TurnDetectionOptions

# Do not set a language hint — let the model adapt
session = AgentSession(
    turn_detection=TurnDetectionOptions(
        multilingual_model=True,
        # No language hint — model adapts dynamically
    ),
)
```

Cross-lingual detection has limits
The multilingual model adapts over time, but rapid language switching within a single sentence can still confuse it. If your use case involves heavy code-switching (mixing two languages in one utterance), you may need to combine the multilingual model with longer endpointing delays to compensate.
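As a sketch of that compensation, using the same TurnDetectionOptions settings shown earlier in this chapter (the specific delay values here are illustrative, not tested recommendations):

```python
from livekit.agents import AgentSession, TurnDetectionOptions

# Heavy code-switching: omit the language hint and stretch both delays
# past the single-language presets, so a pause around a mid-utterance
# language switch is less likely to end the turn early.
session = AgentSession(
    turn_detection=TurnDetectionOptions(
        multilingual_model=True,
        # No language hint; the model adapts as the speaker switches
        min_endpointing_delay=1.0,  # illustrative: above any preset floor
        max_endpointing_delay=2.0,  # illustrative: generous ceiling
    ),
)
```

The cost is added latency on every turn, so reserve this kind of padding for sessions where code-switching is actually expected.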
Combining multilingual detection with VAD tuning
The multilingual model works alongside Silero VAD. You can still tune VAD parameters independently. The multilingual model primarily affects endpointing decisions, while VAD handles the raw speech/silence classification.
```python
from livekit.plugins.silero import VAD
from livekit.agents import AgentSession, TurnDetectionOptions

# Tune VAD for a noisy environment
vad = VAD.load(
    min_speaking_duration=0.3,
    padding_duration=0.4,
)

# Enable multilingual turn detection
session = AgentSession(
    vad=vad,
    turn_detection=TurnDetectionOptions(
        multilingual_model=True,
        language="ja",
        min_endpointing_delay=0.8,
        max_endpointing_delay=1.5,
    ),
)
```

Test your knowledge
Question 1 of 2
Why would a single fixed endpointing threshold cause problems for a Japanese-speaking user?
What you learned
- Different languages have fundamentally different pause patterns and turn-taking conventions
- The multilingual model adapts endpointing decisions based on language-specific acoustic features
- Language hints improve accuracy when you know the user's language in advance
- For cross-lingual conversations, omit the language hint and let the model adapt dynamically
- The multilingual model works alongside Silero VAD — they handle different aspects of turn detection
Next up
In the next chapter, you will learn to fine-tune endpointing configuration — the delays that determine exactly how long the agent waits after speech ends before responding.