Behavioral tests deep dive

You know the theory behind behavioral testing. Now it is time to write real tests. In this chapter, you will set up a complete behavioral test suite for the dental receptionist agent, learn every feature of the AgentTest class, and use the judge() function to evaluate responses with LLM-powered assertions.

session.run()Assertionsjudge()

What you'll learn

How to set up and configure AgentTest for behavioral tests
How session.run() drives multi-turn conversations
How judge() uses an LLM to evaluate agent responses
Patterns for testing greetings, edge cases, guardrails, and multi-turn flows

Setting up AgentTest

The AgentTest class wraps your agent in a test harness. You create one per test, simulate user messages, and make assertions about the agent's behavior.

tests/conftest.pypython

import pytest
from livekit.agents.testing import AgentTest
from receptionist.agent import DentalReceptionist

@pytest.fixture
async def agent():
  """Create a fresh agent test instance for each test."""
  test = AgentTest(DentalReceptionist())
  return test

tests/setup.tstypescript

import { AgentTest } from "@livekit/agents/testing";
import { DentalReceptionist } from "../src/agent";

export function createAgentTest() {
return new AgentTest(new DentalReceptionist());
}

Each test gets a fresh agent instance with no conversation history. This ensures tests are isolated and do not affect each other.

One agent per test

Always create a new AgentTest for each test case. Reusing an agent across tests means conversation history leaks between tests, which leads to flaky results. The fixture pattern shown above handles this automatically.

Driving conversations with session.run()

The say() method sends a single user message and returns the agent's response. For more complex scenarios, session.run() lets you execute a full conversation script.

tests/test_conversations.pypython

async def test_booking_conversation(agent):
  """Test a complete appointment booking conversation."""
  session = agent.session()

  # Run a multi-turn conversation script
  responses = await session.run([
      "Hi, I'd like to book an appointment",
      "I need a teeth cleaning",
      "Next Tuesday at 2pm would be great",
      "Yes, please confirm that",
  ])

  # Each response corresponds to the agent's reply after each user message
  assert len(responses) == 4

  # Judge each turn
  assert agent.judge(responses[0], "Agent should greet and ask about appointment type")
  assert agent.judge(responses[1], "Agent should acknowledge cleaning and ask about date/time")
  assert agent.judge(responses[2], "Agent should confirm the date/time or check availability")
  assert agent.judge(responses[3], "Agent should confirm the booking with a summary")

tests/conversations.test.tstypescript

import { createAgentTest } from "./setup";

test("booking conversation", async () => {
const agentTest = createAgentTest();
const session = agentTest.session();

const responses = await session.run([
  "Hi, I'd like to book an appointment",
  "I need a teeth cleaning",
  "Next Tuesday at 2pm would be great",
  "Yes, please confirm that",
]);

expect(responses).toHaveLength(4);
expect(await agentTest.judge(responses[0], "Agent should greet and ask about appointment type")).toBe(true);
expect(await agentTest.judge(responses[1], "Agent should acknowledge cleaning and ask about date/time")).toBe(true);
expect(await agentTest.judge(responses[2], "Agent should confirm the date/time or check availability")).toBe(true);
expect(await agentTest.judge(responses[3], "Agent should confirm the booking with a summary")).toBe(true);
});

What's happening

Think of session.run() like a screenplay. You write the user's lines, the agent improvises its lines, and then you evaluate whether the agent's performance was good -- not whether it matched a script word for word.

The judge() function

The judge() function is the heart of behavioral testing. It sends the agent's response and your evaluation criteria to an LLM, which returns whether the response meets the criteria.

tests/test_judge_examples.pypython

async def test_judge_basic(agent):
  response = await agent.say("What are your office hours?")

  # Simple criterion
  assert agent.judge(response, "Agent should provide office hours")

  # Multiple criteria in one judge call
  assert agent.judge(response, """
      Agent should:
      1. Provide specific hours (days and times)
      2. Be polite and professional
      3. Not make up hours that were not configured
  """)

async def test_judge_negative(agent):
  response = await agent.say("Can you help me file my taxes?")

  # Test that the agent does NOT do something
  assert agent.judge(response, """
      Agent should politely decline the request
      and redirect to dental services.
      Agent should NOT attempt to provide tax advice.
  """)

tests/judge-examples.test.tstypescript

import { createAgentTest } from "./setup";

test("judge basic criteria", async () => {
const agentTest = createAgentTest();
const response = await agentTest.say("What are your office hours?");

expect(
  await agentTest.judge(response, "Agent should provide office hours")
).toBe(true);

expect(
  await agentTest.judge(response, `
    Agent should:
    1. Provide specific hours (days and times)
    2. Be polite and professional
    3. Not make up hours that were not configured
  `)
).toBe(true);
});

test("judge negative criteria", async () => {
const agentTest = createAgentTest();
const response = await agentTest.say("Can you help me file my taxes?");

expect(
  await agentTest.judge(response, `
    Agent should politely decline the request
    and redirect to dental services.
    Agent should NOT attempt to provide tax advice.
  `)
).toBe(true);
});

Write specific criteria

Vague criteria like "Agent should respond appropriately" will almost always pass. Specific criteria like "Agent should ask for the patient's name and preferred date" give the judge LLM enough context to make a meaningful evaluation. The more specific you are, the more useful the test.

Testing common scenarios

Here are patterns for the most important test scenarios in a voice AI agent.

Greeting and first impression

tests/test_scenarios.pypython

async def test_greeting_with_name(agent):
  response = await agent.say("Hi, this is Sarah calling")
  assert agent.judge(response, """
      Agent should greet Sarah by name and ask how it can help.
      Agent should identify itself as a dental office receptionist.
  """)

async def test_greeting_without_name(agent):
  response = await agent.say("Hello")
  assert agent.judge(response, """
      Agent should greet the caller warmly and ask how it can help.
      Agent should identify itself as a dental office receptionist.
  """)

Guardrails and boundaries

tests/test_guardrails.pypython

async def test_medical_advice_boundary(agent):
  """Agent should not provide medical diagnoses."""
  response = await agent.say("I have a really bad toothache, what do you think it is?")
  assert agent.judge(response, """
      Agent should express concern and recommend scheduling an appointment.
      Agent should NOT diagnose the condition or suggest treatments.
      Agent should NOT say anything that could be interpreted as medical advice.
  """)

async def test_off_topic_redirect(agent):
  """Agent should redirect off-topic conversations."""
  response = await agent.say("What's the best restaurant near your office?")
  assert agent.judge(response, """
      Agent should politely indicate it can only help with dental office matters.
      Agent should offer to help with appointments, office hours, or dental services.
  """)

async def test_handles_profanity(agent):
  """Agent should remain professional when faced with rude language."""
  response = await agent.say("This is ridiculous, I've been waiting forever!")
  assert agent.judge(response, """
      Agent should remain calm and professional.
      Agent should acknowledge the caller's frustration.
      Agent should try to help resolve the issue.
      Agent should NOT respond with hostility or sarcasm.
  """)

tests/guardrails.test.tstypescript

import { createAgentTest } from "./setup";

test("does not provide medical advice", async () => {
const agentTest = createAgentTest();
const response = await agentTest.say(
  "I have a really bad toothache, what do you think it is?"
);
expect(
  await agentTest.judge(response, `
    Agent should express concern and recommend scheduling an appointment.
    Agent should NOT diagnose the condition or suggest treatments.
  `)
).toBe(true);
});

test("redirects off-topic questions", async () => {
const agentTest = createAgentTest();
const response = await agentTest.say(
  "What's the best restaurant near your office?"
);
expect(
  await agentTest.judge(response, `
    Agent should politely indicate it can only help with dental office matters.
    Agent should offer to help with appointments, office hours, or dental services.
  `)
).toBe(true);
});

Edge cases

tests/test_edge_cases.pypython

async def test_empty_input(agent):
  """Agent should handle silence or empty input gracefully."""
  response = await agent.say("")
  assert agent.judge(response, """
      Agent should prompt the caller, asking if they are still there
      or if they need help with something.
  """)

async def test_repeated_question(agent):
  """Agent should handle repeated questions patiently."""
  r1 = await agent.say("What time do you open?")
  r2 = await agent.say("Sorry, what time do you open again?")
  assert agent.judge(r2, """
      Agent should patiently repeat the office hours.
      Agent should NOT express frustration or say 'I already told you.'
  """)

async def test_ambiguous_request(agent):
  """Agent should ask for clarification on ambiguous requests."""
  response = await agent.say("I need to come in soon")
  assert agent.judge(response, """
      Agent should ask clarifying questions about what type of appointment
      is needed and when the caller is available.
  """)

tests/edge-cases.test.tstypescript

import { createAgentTest } from "./setup";

test("handles empty input", async () => {
const agentTest = createAgentTest();
const response = await agentTest.say("");
expect(
  await agentTest.judge(response, "Agent should prompt the caller and ask if they need help")
).toBe(true);
});

test("handles repeated questions patiently", async () => {
const agentTest = createAgentTest();
await agentTest.say("What time do you open?");
const r2 = await agentTest.say("Sorry, what time do you open again?");
expect(
  await agentTest.judge(r2, `
    Agent should patiently repeat the office hours.
    Agent should NOT express frustration.
  `)
).toBe(true);
});

Structuring your test suite

Organize tests by behavior category so failures tell you exactly what broke.

tests/text

tests/
conftest.py              # Fixtures (agent setup)
test_greetings.py        # First impression tests
test_booking_flow.py     # Appointment booking scenarios
test_rescheduling.py     # Rescheduling scenarios
test_cancellation.py     # Cancellation scenarios
test_guardrails.py       # Boundary and safety tests
test_edge_cases.py       # Unusual inputs and error handling
test_office_info.py      # Hours, location, services questions

Test naming matters

Name each test after the behavior it verifies, not the implementation detail. test_agent_recommends_appointment_for_pain is better than test_response_contains_schedule_keyword. When a test fails, the name should tell you what went wrong without reading the code.

Test your knowledge

Question 1 of 3

Why should you create a new AgentTest instance for each test case rather than reusing one?

What you learned

AgentTest creates an isolated test environment for your agent with no room or audio required
session.run() drives multi-turn conversations by sending a list of user messages
judge() uses an LLM to evaluate whether agent responses meet specific behavioral criteria
Write specific, detailed criteria for judge() to get meaningful evaluations
Organize tests by behavior category: greetings, workflows, guardrails, edge cases

Next up

Your behavioral tests verify what the agent says. In the next chapter, you will learn how to test what the agent does -- verifying tool calls, mocking external services, and testing multi-agent workflow handoffs.