VAD configuration
Silero VAD is the neural network that decides whether someone is speaking on every audio frame. Out of the box it works well, but real-world environments — noisy call centers, whispering users, speakerphone echo — demand tuning. This chapter gives you full control over every VAD parameter so your agent hears exactly what it should.
What you'll learn
- How Silero VAD processes audio frames and transitions between speech and silence states
- What each configuration parameter controls and how it affects detection behavior
- How to tune VAD for noisy environments, quiet speakers, and long-form dictation
- Practical tuning profiles for common scenarios
Loading and configuring Silero VAD
Silero VAD ships as a LiveKit plugin. You load it once and pass it into your AgentSession. Every parameter has a sensible default, but understanding each one lets you adapt to any scenario.
```python
from livekit.plugins.silero import VAD
from livekit.agents import AgentSession

vad = VAD.load(
    min_speaking_duration=0.2,
    padding_duration=0.3,
    prefix_padding_duration=0.5,
    max_buffered_speech=30.0,
)

session = AgentSession(vad=vad)
```

```typescript
import { VAD } from "@livekit/plugins-silero";
import { AgentSession } from "@livekit/agents";

const vad = VAD.load({
  minSpeakingDuration: 0.2,
  paddingDuration: 0.3,
  prefixPaddingDuration: 0.5,
  maxBufferedSpeech: 30.0,
});

const session = new AgentSession({ vad });
```

Silero VAD outputs a probability between 0 and 1 on each audio frame. When the probability exceeds the activation threshold, the system transitions to a "speaking" state. These parameters control how that transition behaves — how long speech must last to count, how much silence to tolerate within a turn, and how much audio to keep from before speech was detected.
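The threshold-and-state mechanics can be sketched as a loop over per-frame probabilities. This is an illustration of the concept, not the plugin's internals; the 0.5 activation threshold and the event names are assumed values for the sketch.

```python
# Illustrative sketch: turning per-frame speech probabilities into
# state transitions. The 0.5 activation threshold is an assumption,
# not a value taken from the plugin's defaults.
def detect_transitions(probabilities, activation_threshold=0.5):
    speaking = False
    events = []
    for i, p in enumerate(probabilities):
        if not speaking and p > activation_threshold:
            speaking = True
            events.append(("SPEECH_STARTED", i))
        elif speaking and p <= activation_threshold:
            speaking = False
            events.append(("SPEECH_FINISHED", i))
    return events

# A run of high-probability frames produces one start/finish pair.
print(detect_transitions([0.1, 0.2, 0.9, 0.95, 0.8, 0.2, 0.1]))
# → [('SPEECH_STARTED', 2), ('SPEECH_FINISHED', 5)]
```

The parameters below refine this raw transition logic: they debounce the start, bridge short silences, and preserve audio from before the trigger.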
Parameter deep dive
Each parameter shapes a different aspect of the detection behavior. Here is what they do and when you should change them.
min_speaking_duration
The minimum duration of speech (in seconds) before VAD fires a SPEECH_STARTED event. Short bursts of sound below this threshold are ignored. Increase this value to filter out coughs, clicks, and brief noise. Decrease it if your users speak in short bursts and the agent misses the start of their utterances.
padding_duration
After VAD detects silence, it waits this long before firing SPEECH_FINISHED. This bridges short pauses within a sentence — a user thinking mid-thought should not trigger a turn boundary. Longer padding means fewer false endpoints but slower reaction time.
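These two timers can be illustrated together as onset/offset debouncing over a stream of frame flags. A sketch using this chapter's parameter names with an assumed 10 ms frame size — not the plugin's actual implementation:

```python
# Sketch of debouncing: speech must persist for min_speaking_duration
# before SPEECH_STARTED fires, and silence must persist for
# padding_duration before SPEECH_FINISHED fires. The 10 ms frame size
# and boolean frame flags are illustrative assumptions.
def debounce(frames, frame_s=0.01, min_speaking_duration=0.2, padding_duration=0.3):
    events = []
    speaking = False
    run = 0  # consecutive frames in the opposite state
    for t, is_speech in enumerate(frames):
        if not speaking:
            run = run + 1 if is_speech else 0
            if run * frame_s >= min_speaking_duration:
                speaking, run = True, 0
                events.append(("SPEECH_STARTED", round(t * frame_s, 2)))
        else:
            run = run + 1 if not is_speech else 0
            if run * frame_s >= padding_duration:
                speaking, run = False, 0
                events.append(("SPEECH_FINISHED", round(t * frame_s, 2)))
    return events

# A 0.1 s cough (10 speech frames) never reaches 0.2 s, so no event fires.
print(debounce([True] * 10 + [False] * 40))
# → []
```

A 0.3 s utterance, by contrast, crosses the 0.2 s onset threshold and fires a start event, then a finish event once 0.3 s of silence accumulates.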
prefix_padding_duration
How much audio before the speech start to include in the captured segment. This ensures the STT engine receives the full onset of speech, including plosive consonants and quiet beginnings that precede the VAD trigger. If your transcripts are missing the first word or syllable, increase this value.
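One common way to implement prefix padding is a ring buffer of recent frames that gets prepended to the captured segment when speech starts. A sketch under assumed 10 ms frames — not the plugin's actual implementation:

```python
from collections import deque

# Keep the most recent prefix_padding_duration seconds of audio so the
# onset of speech is not lost. The 10 ms frame size is an assumption.
prefix_padding_duration = 0.5
frame_s = 0.01
ring = deque(maxlen=int(prefix_padding_duration / frame_s))  # 50 frames

def on_frame(frame, speech_started, segment):
    if speech_started:
        if not segment:           # first frame of the utterance:
            segment.extend(ring)  # prepend the buffered prefix
        segment.append(frame)
    else:
        ring.append(frame)        # still silence: keep filling the ring

segment = []
for i in range(100):              # 1 s of pre-speech frames
    on_frame(i, speech_started=False, segment=segment)
on_frame(100, speech_started=True, segment=segment)
print(len(segment))  # → 51: 50 prefix frames plus the triggering frame
```

Because the ring has a fixed `maxlen`, memory for the prefix is bounded no matter how long the preceding silence lasts.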
max_buffered_speech
The maximum duration of continuous speech (in seconds) the VAD will buffer. This acts as a safety valve for long monologues, preventing unbounded memory consumption. For most conversational agents 30 seconds is generous. For dictation or storytelling use cases you may need to increase it.
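To see why the cap matters, a quick back-of-the-envelope calculation helps. The 16 kHz, 16-bit mono format below is an assumption; the actual sample format depends on your audio pipeline:

```python
# Rough memory cost of the speech buffer, assuming 16 kHz sample rate
# and 2 bytes per sample (16-bit mono). These are illustrative values.
sample_rate = 16_000
bytes_per_sample = 2

def buffer_bytes(max_buffered_speech: float) -> int:
    return int(max_buffered_speech * sample_rate * bytes_per_sample)

print(buffer_bytes(30.0))   # → 960000  (~0.96 MB, conversational default)
print(buffer_bytes(120.0))  # → 3840000 (~3.84 MB, dictation profile)
```

Even the dictation-sized buffer is only a few megabytes per participant, but without a cap a runaway stream would grow without bound.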
Padding is not endpointing
VAD padding controls how long VAD itself waits before declaring speech finished. Endpointing delay (configured in TurnDetectionOptions) is a separate timer that runs after VAD declares silence. These two timers stack — total silence before agent response is padding_duration plus the endpointing delay.
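The stacking is plain addition. For example, with this chapter's padding_duration of 0.3 s and a hypothetical 0.5 s endpointing delay:

```python
# Total silence before the agent responds is the sum of both timers.
# The 0.5 s endpointing delay is a hypothetical example value.
padding_duration = 0.3   # VAD waits this long before SPEECH_FINISHED
endpointing_delay = 0.5  # separate timer that runs after VAD declares silence
total = padding_duration + endpointing_delay
print(f"{total:.1f}s of silence before the agent replies")  # → 0.8s
```

If your agent feels sluggish, check both timers before blaming either one in isolation.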
Tuning profiles for common scenarios
Different environments demand different settings. Here are three proven starting points.
```python
from livekit.plugins.silero import VAD

# Noisy call center — aggressive filtering
noisy_vad = VAD.load(
    min_speaking_duration=0.3,
    padding_duration=0.4,
    prefix_padding_duration=0.6,
    max_buffered_speech=30.0,
)

# Quiet environment — sensitive detection
quiet_vad = VAD.load(
    min_speaking_duration=0.1,
    padding_duration=0.2,
    prefix_padding_duration=0.4,
    max_buffered_speech=30.0,
)

# Long-form dictation — patient, large buffer
dictation_vad = VAD.load(
    min_speaking_duration=0.2,
    padding_duration=0.6,
    prefix_padding_duration=0.5,
    max_buffered_speech=120.0,
)
```

```typescript
import { VAD } from "@livekit/plugins-silero";

// Noisy call center — aggressive filtering
const noisyVad = VAD.load({
  minSpeakingDuration: 0.3,
  paddingDuration: 0.4,
  prefixPaddingDuration: 0.6,
  maxBufferedSpeech: 30.0,
});

// Quiet environment — sensitive detection
const quietVad = VAD.load({
  minSpeakingDuration: 0.1,
  paddingDuration: 0.2,
  prefixPaddingDuration: 0.4,
  maxBufferedSpeech: 30.0,
});

// Long-form dictation — patient, large buffer
const dictationVad = VAD.load({
  minSpeakingDuration: 0.2,
  paddingDuration: 0.6,
  prefixPaddingDuration: 0.5,
  maxBufferedSpeech: 120.0,
});
```

Start with defaults, then tune
The default Silero VAD settings work for most conversational agents. Only change parameters when you observe a specific problem — missing speech onsets, false triggers from background noise, or truncated long utterances. Tune one parameter at a time and test with real audio.
Debugging VAD behavior
When VAD is not behaving as expected, listen to what it hears. You can subscribe to VAD events to log transitions and measure timing.
```python
import logging

logger = logging.getLogger("vad-debug")

@session.on("vad_speech_started")
def on_speech_started():
    logger.info("Speech started")

@session.on("vad_speech_finished")
def on_speech_finished():
    logger.info("Speech finished")
```

If you see rapid toggling between started and finished events, increase min_speaking_duration and padding_duration. If speech events fire too late, reduce min_speaking_duration and check your prefix_padding_duration.
What you learned
- Silero VAD is a neural network that classifies audio frames as speech or silence in real time
- min_speaking_duration filters short noise bursts by requiring speech to persist before triggering
- padding_duration bridges mid-sentence pauses so they do not split a single utterance into multiple events
- prefix_padding_duration captures audio before the speech trigger to preserve word onsets for STT
- max_buffered_speech caps memory usage during long monologues
- VAD padding and endpointing delay are separate timers that stack together
Next up
Next, you will explore multilingual turn detection — how different languages have different pause patterns and how the multilingual model adapts detection accordingly.