Chapter 2

VAD configuration

Silero VAD is the neural network that decides whether someone is speaking on every audio frame. Out of the box it works well, but real-world environments — noisy call centers, whispering users, speakerphone echo — demand tuning. This chapter gives you full control over every VAD parameter so your agent hears exactly what it should.

What you'll learn

  • How Silero VAD processes audio frames and transitions between speech and silence states
  • What each configuration parameter controls and how it affects detection behavior
  • How to tune VAD for noisy environments, quiet speakers, and long-form dictation
  • Practical tuning profiles for common scenarios

Loading and configuring Silero VAD

Silero VAD ships as a LiveKit plugin. You load it once and pass it into your AgentSession. Every parameter has a sensible default, but understanding each one lets you adapt to any scenario.

agent.py
from livekit.plugins.silero import VAD
from livekit.agents import AgentSession

vad = VAD.load(
    min_speaking_duration=0.2,
    padding_duration=0.3,
    prefix_padding_duration=0.5,
    max_buffered_speech=30.0,
)

session = AgentSession(vad=vad)

agent.ts
import { VAD } from "@livekit/plugins-silero";
import { AgentSession } from "@livekit/agents";

const vad = VAD.load({
  minSpeakingDuration: 0.2,
  paddingDuration: 0.3,
  prefixPaddingDuration: 0.5,
  maxBufferedSpeech: 30.0,
});

const session = new AgentSession({ vad });

What's happening

Silero VAD outputs a probability between 0 and 1 on each audio frame. When the probability exceeds the activation threshold, the system transitions to a "speaking" state. These parameters control how that transition behaves — how long speech must last to count, how much silence to tolerate within a turn, and how much audio to keep before speech was detected.
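The transition logic described above can be sketched as a tiny state machine. The frame size, activation threshold, and event tuples below are illustrative assumptions for the sketch, not the plugin's internals:

```python
# Minimal sketch of the speech/silence state machine, assuming ~32 ms
# frames and a 0.5 activation threshold (both illustrative values).

FRAME_SEC = 0.032           # assumed duration of one audio frame
ACTIVATION_THRESHOLD = 0.5  # probability above which a frame counts as speech

def run_vad(probabilities, min_speaking_duration=0.2, padding_duration=0.3):
    """Yield ("SPEECH_STARTED", t) / ("SPEECH_FINISHED", t) events."""
    speaking = False
    speech_run = 0.0   # consecutive speech seen while in the silent state
    silence_run = 0.0  # consecutive silence seen while in the speaking state
    for i, p in enumerate(probabilities):
        t = i * FRAME_SEC
        is_speech = p > ACTIVATION_THRESHOLD
        if not speaking:
            speech_run = speech_run + FRAME_SEC if is_speech else 0.0
            if speech_run >= min_speaking_duration:
                speaking = True
                silence_run = 0.0
                yield ("SPEECH_STARTED", t)
        else:
            silence_run = 0.0 if is_speech else silence_run + FRAME_SEC
            if silence_run >= padding_duration:
                speaking = False
                speech_run = 0.0
                yield ("SPEECH_FINISHED", t)

# A 3-frame burst (~0.1 s) is ignored; sustained speech fires both events.
probs = [0.9] * 3 + [0.1] * 20 + [0.9] * 30 + [0.1] * 20
events = list(run_vad(probs))
```

Note how the short burst never triggers SPEECH_STARTED: the speech run resets to zero as soon as a silent frame arrives, which is exactly the filtering behavior min_speaking_duration provides.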

Parameter deep dive

Each parameter shapes a different aspect of the detection behavior. Here is what they do and when you should change them.

min_speaking_duration

The minimum duration of speech (in seconds) before VAD fires a SPEECH_STARTED event. Short bursts of sound below this threshold are ignored. Increase this value to filter out coughs, clicks, and brief noise. Decrease it if your users speak in short bursts and the agent misses the start of their utterances.

padding_duration

After VAD detects silence, it waits this long before firing SPEECH_FINISHED. This bridges short pauses within a sentence — a user thinking mid-thought should not trigger a turn boundary. Longer padding means fewer false endpoints but slower reaction time.

prefix_padding_duration

How much audio before the speech start to include in the captured segment. This ensures the STT engine receives the full onset of speech, including plosive consonants and quiet beginnings that precede the VAD trigger. If your transcripts are missing the first word or syllable, increase this value.
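One way to picture prefix padding is a small ring buffer of recent frames that gets prepended to the captured segment once speech starts. A minimal sketch, assuming a 16 kHz sample rate and 512-sample frames (both illustrative, not the plugin's internals):

```python
from collections import deque

SAMPLE_RATE = 16000   # assumed input rate
FRAME_SAMPLES = 512   # assumed frame size (~32 ms at 16 kHz)

def make_prefix_buffer(prefix_padding_duration=0.5):
    # Keep just enough recent frames to cover the prefix window;
    # deque(maxlen=...) silently drops the oldest frame on overflow.
    max_frames = int(prefix_padding_duration * SAMPLE_RATE / FRAME_SAMPLES)
    return deque(maxlen=max_frames)

buf = make_prefix_buffer(0.5)
for frame_id in range(100):  # stream 100 frames of "audio"
    buf.append(frame_id)

# When speech starts, the captured segment begins with these buffered
# frames, so the STT engine hears the onset that preceded the trigger.
prefix_frames = list(buf)
```

Increasing prefix_padding_duration simply enlarges this window, at the cost of a little extra audio sent to STT per utterance.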

max_buffered_speech

The maximum duration of continuous speech (in seconds) the VAD will buffer. This acts as a safety valve for long monologues, preventing unbounded memory consumption. For most conversational agents 30 seconds is generous. For dictation or storytelling use cases you may need to increase it.
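A back-of-envelope calculation shows what the cap actually costs. Assuming 16 kHz, 16-bit mono PCM (the plugin's real internal format may differ):

```python
SAMPLE_RATE = 16000   # Hz, assumed VAD input rate
BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono (assumed)

def buffer_bytes(max_buffered_speech):
    """Worst-case bytes held for one continuous speech segment."""
    return int(max_buffered_speech * SAMPLE_RATE * BYTES_PER_SAMPLE)

conversational = buffer_bytes(30.0)   # ~0.96 MB per segment
dictation = buffer_bytes(120.0)       # ~3.84 MB per segment
```

Even the generous dictation setting stays under a few megabytes per active speaker, so raising this limit is usually safe; the cap exists to bound the pathological case, not typical usage.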

Padding is not endpointing

VAD padding controls how long VAD itself waits before declaring speech finished. Endpointing delay (configured in TurnDetectionOptions) is a separate timer that runs after VAD declares silence. These two timers stack — total silence before agent response is padding_duration plus the endpointing delay.
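A quick worked example of how the two timers stack, using a hypothetical 0.5 s endpointing delay:

```python
padding_duration = 0.3    # VAD's own silence padding
endpointing_delay = 0.5   # hypothetical endpointing value, configured separately

# The timers run in sequence: VAD waits out its padding, declares silence,
# and only then does the endpointing timer start. Total silence before the
# agent responds is their sum.
total_silence = padding_duration + endpointing_delay  # 0.8 s
```

If your agent feels sluggish, check both numbers before tuning either one in isolation.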

Tuning profiles for common scenarios

Different environments demand different settings. Here are three proven starting points.

profiles.py
from livekit.plugins.silero import VAD

# Noisy call center — aggressive filtering
noisy_vad = VAD.load(
    min_speaking_duration=0.3,
    padding_duration=0.4,
    prefix_padding_duration=0.6,
    max_buffered_speech=30.0,
)

# Quiet environment — sensitive detection
quiet_vad = VAD.load(
    min_speaking_duration=0.1,
    padding_duration=0.2,
    prefix_padding_duration=0.4,
    max_buffered_speech=30.0,
)

# Long-form dictation — patient, large buffer
dictation_vad = VAD.load(
    min_speaking_duration=0.2,
    padding_duration=0.6,
    prefix_padding_duration=0.5,
    max_buffered_speech=120.0,
)

profiles.ts
import { VAD } from "@livekit/plugins-silero";

// Noisy call center — aggressive filtering
const noisyVad = VAD.load({
  minSpeakingDuration: 0.3,
  paddingDuration: 0.4,
  prefixPaddingDuration: 0.6,
  maxBufferedSpeech: 30.0,
});

// Quiet environment — sensitive detection
const quietVad = VAD.load({
  minSpeakingDuration: 0.1,
  paddingDuration: 0.2,
  prefixPaddingDuration: 0.4,
  maxBufferedSpeech: 30.0,
});

// Long-form dictation — patient, large buffer
const dictationVad = VAD.load({
  minSpeakingDuration: 0.2,
  paddingDuration: 0.6,
  prefixPaddingDuration: 0.5,
  maxBufferedSpeech: 120.0,
});

Start with defaults, then tune

The default Silero VAD settings work for most conversational agents. Only change parameters when you observe a specific problem — missing speech onsets, false triggers from background noise, or truncated long utterances. Tune one parameter at a time and test with real audio.

Debugging VAD behavior

When VAD is not behaving as expected, listen to what it hears. You can subscribe to VAD events to log transitions and measure timing.

debug_vad.py
import logging

logger = logging.getLogger("vad-debug")

@session.on("vad_speech_started")
def on_speech_started():
    logger.info("Speech started")

@session.on("vad_speech_finished")
def on_speech_finished():
    logger.info("Speech finished")

If you see rapid toggling between started and finished events, increase min_speaking_duration and padding_duration. If speech events fire too late, reduce min_speaking_duration and check your prefix_padding_duration.
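To quantify toggling rather than eyeballing the logs, you can timestamp each event and count suspiciously short gaps. A hypothetical helper (the class and the 0.5 s threshold are illustrative, not part of the SDK) that you could call from handlers like the ones above:

```python
import time

class VadEventMonitor:
    """Track VAD transitions and flag suspiciously rapid toggling."""

    def __init__(self, min_gap=0.5):
        self.min_gap = min_gap        # flag events closer together than this (s)
        self.last_event_at = None
        self.rapid_toggles = 0

    def record(self, name, now=None):
        # Pass `now` explicitly in tests; default to a monotonic clock live.
        now = time.monotonic() if now is None else now
        if self.last_event_at is not None and now - self.last_event_at < self.min_gap:
            self.rapid_toggles += 1
        self.last_event_at = now

monitor = VadEventMonitor(min_gap=0.5)
# Feed timestamps by hand to illustrate: 0.1 s between events is toggling.
for t in (0.0, 0.1, 0.2, 2.0):
    monitor.record("speech_event", now=t)
```

A climbing rapid_toggles count during real calls is a strong signal to raise min_speaking_duration and padding_duration before touching anything else.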

What you learned

  • Silero VAD is a neural network that classifies audio frames as speech or silence in real time
  • min_speaking_duration filters short noise bursts by requiring speech to persist before triggering
  • padding_duration bridges mid-sentence pauses so they do not split a single utterance into multiple events
  • prefix_padding_duration captures audio before the speech trigger to preserve word onsets for STT
  • max_buffered_speech caps memory usage during long monologues
  • VAD padding and endpointing delay are separate timers that stack together

Next up

Next, you will explore multilingual turn detection — how different languages have different pause patterns and how the multilingual model adapts detection accordingly.

Concepts covered

  • Silero VAD
  • Sensitivity
  • Padding
  • Prefix padding