Chapter 1

Testing strategy for voice AI

Traditional software tests check exact outputs: given input X, expect output Y. Voice AI agents are non-deterministic -- the same prompt can produce different words every time. This means you need a fundamentally different testing approach. In this chapter, you will learn the test pyramid for voice AI and why behavioral testing is the foundation of agent quality.


What you'll learn

  • Why voice AI testing differs from traditional software testing
  • The three layers of the voice AI test pyramid
  • What behavioral testing means and how it works
  • How LiveKit's testing framework fits into your workflow

Why voice AI testing is different

When you test a REST API, you send a request and check the response matches a schema. When you test a voice AI agent, you send "I'd like to book an appointment" and the agent might say "Sure, what day works for you?" or "I'd be happy to help with that! What type of appointment are you looking for?" or any of a hundred other valid responses.

This non-determinism breaks traditional assertion-based testing. You cannot write assert response == "Sure, what day works for you?" because that will fail most of the time even when the agent is working perfectly.

There are three categories of things that can go wrong with a voice AI agent:

  • Functional failures -- a tool does not get called, or gets called with wrong arguments
  • Behavioral failures -- the agent says something inappropriate, ignores instructions, or goes off-script
  • Quality regressions -- the agent still works but responses are slower, less helpful, or less natural

Each category needs a different testing approach, which leads to the voice AI test pyramid.
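The first category, functional failures, is the one you can pin down mechanically: capture the tool call and assert on its arguments. A minimal sketch of that idea, with a hypothetical `handle_request` dispatch function standing in for the agent's tool-selection step (none of these names are LiveKit APIs):

```python
from unittest.mock import MagicMock

def handle_request(day: str, tools: dict) -> None:
    # Stand-in for the agent's dispatch step: a real agent would let the
    # LLM choose the tool; here we call it directly to keep the check focused.
    tools["book_appointment"](day=day)

def test_tool_called_with_correct_args():
    book = MagicMock()
    handle_request("tuesday", {"book_appointment": book})
    # Wrong arguments (or no call at all) raise AssertionError here.
    book.assert_called_once_with(day="tuesday")

test_tool_called_with_correct_args()
```

The same pattern scales up: record every tool invocation during a simulated conversation, then assert on the sequence and arguments rather than on the agent's wording.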

What's happening

Think of voice AI testing like grading an essay rather than grading a math test. In math, there is one right answer. In an essay, you evaluate whether the student addressed the prompt, supported their argument, and wrote clearly -- not whether they used the exact words you expected.

The voice AI test pyramid

The test pyramid for voice AI has three layers, from fastest and most deterministic at the bottom to slowest and most holistic at the top.

Layer 1: Unit tests (tools and functions)

At the base of the pyramid, you test the deterministic parts of your agent: tool functions, data transformations, and utility logic. These are traditional unit tests.

tests/test_tools.py (Python)
import pytest
from datetime import date, time
from receptionist.tools import format_appointment_time, validate_phone_number

def test_format_appointment_time():
  result = format_appointment_time(date(2026, 3, 30), time(14, 0))
  assert result == "Monday, March 30 at 2:00 PM"

def test_validate_phone_number_valid():
  assert validate_phone_number("+15551234567") is True

def test_validate_phone_number_invalid():
  assert validate_phone_number("not-a-number") is False

tests/tools.test.ts (TypeScript)
import { describe, test, expect } from "vitest";
import { formatAppointmentTime, validatePhoneNumber } from "../src/tools";

describe("tool functions", () => {
  test("formats appointment time correctly", () => {
    const result = formatAppointmentTime(new Date("2026-03-30"), "14:00");
    expect(result).toBe("Monday, March 30 at 2:00 PM");
  });

  test("validates correct phone number", () => {
    expect(validatePhoneNumber("+15551234567")).toBe(true);
  });

  test("rejects invalid phone number", () => {
    expect(validatePhoneNumber("not-a-number")).toBe(false);
  });
});

Unit tests run in milliseconds, require no LLM calls, and catch bugs in your tool logic before they reach the agent.
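For reference, here is one possible implementation of the two functions those tests exercise. This is a sketch, not the course's actual code, and it assumes `validate_phone_number` accepts only E.164-style US numbers (`+1` plus ten digits):

```python
import re
from datetime import date, time

def format_appointment_time(d: date, t: time) -> str:
    # Builds strings like "Monday, March 30 at 2:00 PM". lstrip("0") drops
    # the leading zero from %I portably (%-I is not available on Windows).
    hour = t.strftime("%I:%M %p").lstrip("0")
    return f"{d.strftime('%A')}, {d.strftime('%B')} {d.day} at {hour}"

def validate_phone_number(value: str) -> bool:
    # Assumption: only +1-prefixed, 10-digit US numbers are considered valid.
    return re.fullmatch(r"\+1\d{10}", value) is not None
```

Because these functions are pure, the unit tests above exercise them completely without any agent, model, or network in the loop.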

Layer 2: Behavioral tests (conversation scenarios)

The middle layer tests the agent's behavior in simulated conversations. You define what the agent should do given certain inputs, not the exact words it should say.

tests/test_behavior.py (Python)
from livekit.agents.testing import AgentTest
from receptionist.agent import DentalReceptionist

async def test_greeting():
  test = AgentTest(DentalReceptionist())
  response = await test.say("Hi, I'd like to book an appointment")
  assert await test.judge(
      response,
      "Agent should greet the caller and ask about the type of appointment"
  )

tests/behavior.test.ts (TypeScript)
import { AgentTest } from "@livekit/agents/testing";
import { DentalReceptionist } from "../src/agent";

test("greeting", async () => {
  const agentTest = new AgentTest(new DentalReceptionist());
  const response = await agentTest.say("Hi, I'd like to book an appointment");
  expect(
    await agentTest.judge(response, "Agent should greet and ask about appointment type")
  ).toBe(true);
});

Behavioral tests use an LLM to judge whether the agent's response meets the criteria. They are slower than unit tests (seconds, not milliseconds) but catch the most important class of bugs: the agent not doing what it is supposed to do.
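Under the hood, a judge is just a second model call: build a yes/no prompt from the response and the criterion, ask a model, parse the verdict. A minimal sketch of that mechanism, with the model stubbed out (`call_llm` is a hypothetical stand-in, not a LiveKit API):

```python
def build_judge_prompt(response: str, criterion: str) -> str:
    # Frame the check as a binary grading question for the judge model.
    return (
        "You are grading a voice agent's reply.\n"
        f"Criterion: {criterion}\n"
        f"Reply: {response}\n"
        "Does the reply satisfy the criterion? Answer YES or NO."
    )

def judge(response: str, criterion: str, call_llm) -> bool:
    verdict = call_llm(build_judge_prompt(response, criterion))
    return verdict.strip().upper().startswith("YES")

# Stubbed model call for illustration; a real judge would hit an LLM API.
assert judge("Sure, what day works for you?",
             "Agent should ask about scheduling",
             lambda prompt: "YES")
```

The judge itself is non-deterministic, but in practice a well-phrased criterion is far more stable than an exact-string assertion.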

Layer 3: Evaluation (quality metrics)

At the top of the pyramid, evaluation suites measure the overall quality of your agent across many dimensions: response relevance, tone, helpfulness, and conversation flow. These run against large test sets and produce scores rather than pass/fail results.

Evaluation is not a gate

Evaluation suites are not typically used to block deployments. They produce scores and trends that help you track quality over time. You use them to answer "is this version better or worse than the last one?" rather than "does this specific test pass?"
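A sketch of what such a suite computes: score every case on several dimensions, then report averages rather than pass/fail. The dimension names and the scoring stub here are illustrative assumptions:

```python
from statistics import mean

def run_eval(cases, score_fn):
    # score_fn maps one case to {dimension: score in 0.0..1.0}.
    results = [score_fn(case) for case in cases]
    dims = results[0].keys()
    # Aggregate each dimension across the whole test set.
    return {dim: round(mean(r[dim] for r in results), 2) for dim in dims}

# Stubbed scorer; a real one would run the agent and an LLM grader per case.
fake_scores = lambda case: {"relevance": 0.9, "tone": 0.8}
summary = run_eval(["reschedule call", "billing question"], fake_scores)
```

Comparing `summary` between two agent versions is what answers "better or worse than last time" without any single test gating the release.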

Behavioral testing explained

Behavioral testing is the most important layer for voice AI agents. Here is how it works:

1. Define a scenario. Describe what the user says or does. For example: "User calls and asks to reschedule an appointment."

2. Run the conversation. Use the testing framework to simulate the user's input and capture the agent's response.

3. Judge the behavior. Instead of checking exact words, describe what the agent should have done: "Agent should confirm the existing appointment details before asking about a new time." An LLM evaluates whether the response meets this criterion.

4. Assert the judgment. The judge returns a boolean or score. You assert on that result just like any other test.

This approach lets you test intent and behavior without being brittle about exact wording. The agent can rephrase, add pleasantries, or vary its sentence structure -- as long as it does the right thing, the test passes.
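The four steps condense naturally into a small data-driven loop. Everything below is stubbed for illustration: in a real suite, the reply function would be your agent and the judge would be an LLM call.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    user_input: str   # step 1: what the user says
    criterion: str    # step 3: what the agent should have done

SCENARIOS = [
    Scenario("I need to reschedule my appointment",
             "Agent should confirm the existing appointment details first"),
]

def run_scenarios(agent_reply, judge) -> None:
    for s in SCENARIOS:
        reply = agent_reply(s.user_input)               # step 2: run it
        assert judge(reply, s.criterion), s.criterion   # step 4: assert

# Stubs: a canned agent reply and a keyword-matching "judge".
run_scenarios(
    lambda text: "Let me pull up your current appointment first.",
    lambda reply, criterion: "appointment" in reply,
)
```

Keeping scenarios as data means adding coverage is a one-line change, and the same runner works for every scenario.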

LiveKit's testing framework

LiveKit provides a built-in testing framework designed for voice AI agents. The core class is AgentTest, which lets you simulate conversations without a real audio connection or LiveKit room.

tests/test_receptionist.py (Python)
from livekit.agents.testing import AgentTest
from receptionist.agent import DentalReceptionist

async def test_full_booking_flow():
  test = AgentTest(DentalReceptionist())

  # Simulate a multi-turn conversation
  r1 = await test.say("Hi, I need to book a cleaning")
  assert await test.judge(r1, "Agent should acknowledge and ask about preferred date/time")

  r2 = await test.say("Next Tuesday at 2pm")
  assert await test.judge(r2, "Agent should check availability or confirm the time slot")

  r3 = await test.say("Yes, that works")
  assert await test.judge(r3, "Agent should confirm the booking details and provide a summary")

tests/receptionist.test.ts (TypeScript)
import { AgentTest } from "@livekit/agents/testing";
import { DentalReceptionist } from "../src/agent";

test("full booking flow", async () => {
  const agentTest = new AgentTest(new DentalReceptionist());

  const r1 = await agentTest.say("Hi, I need to book a cleaning");
  expect(
    await agentTest.judge(r1, "Agent should acknowledge and ask about preferred date/time")
  ).toBe(true);

  const r2 = await agentTest.say("Next Tuesday at 2pm");
  expect(
    await agentTest.judge(r2, "Agent should check availability or confirm the time slot")
  ).toBe(true);

  const r3 = await agentTest.say("Yes, that works");
  expect(
    await agentTest.judge(r3, "Agent should confirm the booking details and provide a summary")
  ).toBe(true);
});

No room or audio required

AgentTest runs your agent in an isolated environment. There is no LiveKit room, no audio stream, and no network connection. The agent receives text input and produces text output, which makes tests fast and repeatable.
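One practical note: the tests in this chapter are coroutines, so they need an async-aware runner; pytest with the pytest-asyncio plugin is a common choice. The bare mechanism looks like this, using only the standard library and a canned reply in place of AgentTest:

```python
import asyncio

async def test_greeting():
    # Stand-in for `await test.say(...)`: a coroutine yielding a canned reply.
    response = await asyncio.sleep(0, result="Hi! What type of appointment do you need?")
    assert "appointment" in response

# A test runner like pytest-asyncio does this for each async test function.
asyncio.run(test_greeting())
```

With pytest-asyncio you would instead mark the test (or set `asyncio_mode = "auto"` in your pytest configuration) and let the runner handle the event loop.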


What you learned

  • Voice AI testing requires a different approach because agent responses are non-deterministic
  • The test pyramid has three layers: unit tests for tools, behavioral tests for conversation scenarios, and evaluation for quality metrics
  • Behavioral testing judges whether the agent did the right thing, not whether it said the exact right words
  • LiveKit's AgentTest class lets you simulate conversations and make assertions without a room or audio

Next up

You have the strategy. In the next chapter, you will dive deep into behavioral tests -- writing comprehensive test suites that cover greetings, error handling, edge cases, and multi-turn conversations.

Concepts covered: Test pyramid · Behavioral tests · Evaluation