Chapter 7 · 15m

A/B testing & quality metrics

You have learned how to configure VAD, endpointing, and interruption handling. But how do you know which configuration is actually better? Intuition is not enough — you need data. This chapter shows you how to A/B test turn detection configurations and measure conversation quality with concrete metrics.

What you'll learn

  • Key metrics for measuring turn detection quality
  • How to set up A/B tests for different configurations
  • How to analyze results and pick the winning configuration
  • A simple comparison framework you can run in CI

Defining quality metrics

Before you can compare configurations, you need to define what "better" means. Here are the metrics that matter most for turn detection:

| Metric | What it measures | How to compute |
| --- | --- | --- |
| Interruption rate | How often the agent gets cut off mid-sentence | Agent speech events that stop before completion / total agent speech events |
| False interruption rate | How often the agent stops for non-meaningful input | Agent stops where user input was < 3 words / total agent stops |
| Response latency | Time from user finishing to agent starting | Timestamp of first agent audio − timestamp of VAD end-of-speech |
| Turn overlap | How much user and agent speech overlap | Duration of simultaneous audio / total conversation duration |
| Completion rate | How often the agent finishes its full response | Completed responses / total responses started |
What's happening

No single metric tells the full story. A low interruption rate is good, but not if it comes with high response latency (the agent waits too long to speak). The goal is to find the configuration that optimizes across all metrics — low interruptions, low latency, minimal overlap, high completion.
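All five metrics in the table reduce to a ratio, so they are easy to compute from logged counts. A minimal sketch (the counts below are illustrative, not from any real session):

```python
def rate(numerator: float, denominator: float) -> float:
    """Safe ratio used by all five metrics; avoids division by zero."""
    return numerator / denominator if denominator else 0.0

# Illustrative counts from one batch of sessions
interrupted, agent_turns = 9, 50   # agent turns cut off vs. total agent turns
short_stops, total_stops = 2, 9    # stops on < 3-word input vs. all stops

print(rate(interrupted, agent_turns))             # interruption rate: 0.18
print(round(rate(short_stops, total_stops), 2))   # false interruption rate: 0.22
```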

Setting up an A/B test

The simplest approach is to create two agent configurations and randomly assign sessions to each:

agent.py

```python
import random

from livekit.agents import AgentSession, TurnDetectionOptions

# Configuration A: conservative (fewer interruptions, higher latency)
CONFIG_A = TurnDetectionOptions(
    min_endpointing_delay=0.7,
    max_endpointing_delay=1.5,
    false_interruption_timeout=0.5,
    interruption_mode="adaptive",
)

# Configuration B: responsive (more interruptions, lower latency)
CONFIG_B = TurnDetectionOptions(
    min_endpointing_delay=0.3,
    max_endpointing_delay=0.8,
    false_interruption_timeout=0.2,
    interruption_mode="adaptive",
)

def create_session() -> AgentSession:
    config = random.choice([CONFIG_A, CONFIG_B])
    variant = "A" if config is CONFIG_A else "B"

    session = AgentSession(
        turn_detection=config,
    )
    # Tag the session for later analysis
    session.userdata["ab_variant"] = variant
    return session
```
agent.ts

```typescript
import { AgentSession } from "@livekit/agents";

// Configuration A: conservative (fewer interruptions, higher latency)
const CONFIG_A = {
  minEndpointingDelay: 0.7,
  maxEndpointingDelay: 1.5,
  falseInterruptionTimeout: 0.5,
  interruptionMode: "adaptive" as const,
};

// Configuration B: responsive (more interruptions, lower latency)
const CONFIG_B = {
  minEndpointingDelay: 0.3,
  maxEndpointingDelay: 0.8,
  falseInterruptionTimeout: 0.2,
  interruptionMode: "adaptive" as const,
};

function createSession(): AgentSession {
  const config = Math.random() > 0.5 ? CONFIG_A : CONFIG_B;
  const variant = config === CONFIG_A ? "A" : "B";

  const session = new AgentSession({
    turnDetection: config,
  });
  // Tag the session for later analysis
  session.userdata.abVariant = variant;
  return session;
}
```
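Random per-session assignment works for a first test, but if the same user can start multiple sessions, they may bounce between variants and experience inconsistent behavior. A common fix is sticky assignment: hash a stable user identifier into a bucket. A sketch (not part of the LiveKit API; `assign_variant` is a hypothetical helper):

```python
import hashlib

def assign_variant(user_id: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a stable user ID to a variant bucket."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# The same user always lands in the same bucket across sessions
assert assign_variant("user-42") == assign_variant("user-42")
print(assign_variant("user-42") in ("A", "B"))  # True
```

Because the hash is uniform, buckets stay roughly balanced across many users without any shared state between agent workers.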

Collecting metrics

Log metrics during each conversation so you can compare variants after the test:

metrics.py

```python
class TurnMetrics:
    def __init__(self, variant: str):
        self.variant = variant
        self.interruptions = 0
        self.false_interruptions = 0
        self.response_latencies: list[float] = []
        self.responses_started = 0
        self.responses_completed = 0

    def record_interruption(self, user_words: int):
        self.interruptions += 1
        # Inputs under 3 words (e.g. "uh huh") count as false interruptions
        if user_words < 3:
            self.false_interruptions += 1

    def record_response(self, latency: float, completed: bool):
        self.response_latencies.append(latency)
        self.responses_started += 1
        if completed:
            self.responses_completed += 1

    def summary(self) -> dict:
        return {
            "variant": self.variant,
            "interruption_rate": self.interruptions / max(self.responses_started, 1),
            "false_interruption_rate": self.false_interruptions / max(self.interruptions, 1),
            "avg_response_latency": sum(self.response_latencies) / max(len(self.response_latencies), 1),
            "completion_rate": self.responses_completed / max(self.responses_started, 1),
        }
```
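At the end of each session you can write the summary as one JSON line, then aggregate per variant offline. A sketch of that aggregation step (the log format mirrors `summary()` above; the field subset shown is illustrative):

```python
import json
from collections import defaultdict

def aggregate(lines: list[str]) -> dict[str, dict]:
    """Average each numeric metric per variant from JSON-line summaries."""
    sums: dict[str, defaultdict] = {}
    counts: dict[str, int] = {}
    for line in lines:
        record = json.loads(line)
        variant = record.pop("variant")
        sums.setdefault(variant, defaultdict(float))
        counts[variant] = counts.get(variant, 0) + 1
        for key, value in record.items():
            sums[variant][key] += value
    return {
        v: {k: total / counts[v] for k, total in metrics.items()}
        for v, metrics in sums.items()
    }

logs = [
    '{"variant": "A", "interruption_rate": 0.05, "completion_rate": 0.95}',
    '{"variant": "A", "interruption_rate": 0.07, "completion_rate": 0.93}',
    '{"variant": "B", "interruption_rate": 0.18, "completion_rate": 0.78}',
]
result = aggregate(logs)
print(round(result["A"]["interruption_rate"], 2))  # 0.06
```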

Analyzing results

After running enough sessions (aim for at least 100 per variant), compare the metrics:

| Metric | Config A (conservative) | Config B (responsive) | Better |
| --- | --- | --- | --- |
| Interruption rate | 5% | 18% | A |
| False interruption rate | 2% | 12% | A |
| Avg response latency | 850 ms | 420 ms | B |
| Completion rate | 95% | 78% | A |
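Raw percentages can differ by chance, which is why the 100-sessions-per-variant minimum matters. Before declaring a winner on any one metric, it helps to check statistical significance; a minimal two-proportion z-test using only the standard library (a sketch, assuming 100 sessions per variant with the completion counts from the table):

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for the difference between two observed proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 95/100 completed responses for A vs. 78/100 for B
z = two_proportion_z(95, 100, 78, 100)
print(abs(z) > 1.96)  # True -> significant at the 5% level
```

With |z| ≈ 3.5 the completion-rate gap is well beyond the 1.96 threshold, so it is very unlikely to be noise at this sample size.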

Pick based on your use case

If your agent gives long, detailed answers (like explaining medical procedures), Config A is better — fewer interruptions means the user hears the full answer. If your agent handles quick, transactional queries (like checking order status), Config B is better — the faster response time matters more than completion rate.
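One way to make "pick based on your use case" concrete is a weighted score over the per-variant averages, where metrics you want low get negative weights. A sketch (the weights below are illustrative for a long-form agent, not recommendations):

```python
def score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum; negatively weighted metrics count against a variant."""
    return sum(weights[name] * metrics[name] for name in weights)

# Rates where lower is better get negative weights; latency is in seconds.
weights = {"completion_rate": 1.0, "interruption_rate": -1.0,
           "avg_response_latency": -0.5}

config_a = {"completion_rate": 0.95, "interruption_rate": 0.05,
            "avg_response_latency": 0.85}
config_b = {"completion_rate": 0.78, "interruption_rate": 0.18,
            "avg_response_latency": 0.42}

# Completion is weighted heavily, so the conservative config wins here
print(score(config_a, weights) > score(config_b, weights))  # True
```

Flipping the weights toward latency would favor Config B, which is exactly the transactional-query trade-off described above.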

Course summary

Over this course, you have built a deep understanding of how turn detection shapes conversation quality:

  • VAD detects when speech starts and stops, with tunable sensitivity and padding
  • Endpointing determines when a turn is truly over, balancing responsiveness against patience
  • Adaptive interruption handling uses the LLM to distinguish real interruptions from noise and backchannels
  • Backchanneling keeps conversations flowing naturally through acknowledgments
  • A/B testing gives you data to make informed configuration decisions

The default settings work well for most applications, but the difference between a good voice agent and a great one often comes down to these details. Use the metrics and testing framework from this chapter to find the configuration that makes your agent feel most natural to your specific users.

Reference

See the Turn detection docs for the complete API reference and default values for all parameters.

Test your knowledge

Why is it insufficient to optimize for a single metric like interruption rate when evaluating turn detection configurations?

What you learned

  • Five key metrics for measuring turn detection quality: interruption rate, false interruption rate, response latency, turn overlap, and completion rate
  • How to set up A/B tests by randomly assigning sessions to different configurations
  • How to collect and compare metrics across variants
  • Configuration choice depends on your use case — optimize for the metrics that matter most to your users
Concepts covered
A/B testing · Quality metrics · Conversation analysis