Evaluation framework
Behavioral tests answer "did the agent do the right thing?" Evaluation answers a different question: "how good is the agent overall?" In this chapter, you will build an evaluation framework with custom metrics, scoring rubrics, and benchmark tracking that measures your dental receptionist's quality across every dimension that matters.
What you'll learn
- How to define evaluation metrics for voice AI agents
- How to build scoring rubrics that produce consistent, meaningful scores
- How to run evaluation suites across large conversation sets
- How to track benchmarks over time and detect quality trends
Metrics that matter
Not every metric matters equally. For a dental receptionist, these five dimensions capture the full picture of agent quality:
- Response relevance -- Does the agent's reply address what the caller actually asked?
- Tool accuracy -- Does the agent call the right tools with correct arguments?
- Conversation flow -- Does the conversation progress naturally, or does the agent ask redundant questions or loop?
- Tone and professionalism -- Is the agent warm, professional, and appropriately empathetic?
- Task completion -- Does the agent successfully complete the caller's request end to end?
Think of these metrics like a restaurant review. Food quality (relevance), correct order (tool accuracy), pacing of courses (flow), service attitude (tone), and whether you left satisfied (task completion). Each dimension can be good or bad independently.
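If you want a single headline number alongside the per-metric scores, one option is a weighted average that reflects how much each dimension matters to your practice. This is a minimal sketch; the weights below are illustrative assumptions, not values prescribed by this chapter:

```python
# Hypothetical per-metric weights -- tune these to your own priorities.
WEIGHTS = {
    "response_relevance": 0.25,
    "tool_accuracy": 0.25,
    "conversation_flow": 0.15,
    "tone": 0.15,
    "task_completion": 0.20,
}

def weighted_overall(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-metric scores (1-5) into one weighted headline score."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight

scores = {
    "response_relevance": 4,
    "tool_accuracy": 5,
    "conversation_flow": 3,
    "tone": 4,
    "task_completion": 5,
}
print(weighted_overall(scores))
```

A plain mean treats a tone slip the same as a failed booking; weighting lets task completion and tool accuracy dominate the headline number.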
Building scoring rubrics
A rubric turns subjective quality into a repeatable score. For each metric, define what every score from 1 to 5 looks like, so that two evaluators (human or LLM) land on the same number for the same conversation.
```python
RUBRICS = {
    "response_relevance": {
        "description": "Does the response address the caller's question or request?",
        "scale": {
            1: "Response is completely off-topic or ignores the caller's request",
            2: "Response partially addresses the request but misses key details",
            3: "Response addresses the request but could be more specific or helpful",
            4: "Response clearly addresses the request with relevant details",
            5: "Response perfectly addresses the request with all relevant context",
        },
    },
    "tool_accuracy": {
        "description": "Were the correct tools called with accurate arguments?",
        "scale": {
            1: "Wrong tools called or critical arguments missing/incorrect",
            2: "Right tools called but with significant argument errors",
            3: "Right tools called with mostly correct arguments",
            4: "Right tools called with correct arguments, minor issues only",
            5: "Perfect tool selection and argument extraction",
        },
    },
    "conversation_flow": {
        "description": "Does the conversation progress naturally toward the goal?",
        "scale": {
            1: "Agent loops, repeats questions, or conversation goes nowhere",
            2: "Conversation progresses but with unnecessary steps or confusion",
            3: "Conversation flows reasonably but could be more efficient",
            4: "Conversation flows naturally with clear progression",
            5: "Conversation flows perfectly, concise and efficient",
        },
    },
    "tone": {
        "description": "Is the agent professional, warm, and appropriately empathetic?",
        "scale": {
            1: "Agent is rude, cold, or inappropriate",
            2: "Agent is functional but robotic or impersonal",
            3: "Agent is polite but generic, lacks warmth",
            4: "Agent is warm and professional, appropriate tone throughout",
            5: "Agent is exceptionally warm, empathetic, and professional",
        },
    },
    "task_completion": {
        "description": "Did the agent complete the caller's request successfully?",
        "scale": {
            1: "Task not completed, caller's need unmet",
            2: "Task partially completed, significant steps missing",
            3: "Task completed but with unnecessary friction or missing details",
            4: "Task completed successfully with minor improvement opportunities",
            5: "Task completed perfectly, caller fully satisfied",
        },
    },
}
```

```typescript
export const RUBRICS = {
  response_relevance: {
    description: "Does the response address the caller's question or request?",
    scale: {
      1: "Response is completely off-topic or ignores the caller's request",
      2: "Response partially addresses the request but misses key details",
      3: "Response addresses the request but could be more specific or helpful",
      4: "Response clearly addresses the request with relevant details",
      5: "Response perfectly addresses the request with all relevant context",
    },
  },
  tool_accuracy: {
    description: "Were the correct tools called with accurate arguments?",
    scale: {
      1: "Wrong tools called or critical arguments missing/incorrect",
      2: "Right tools called but with significant argument errors",
      3: "Right tools called with mostly correct arguments",
      4: "Right tools called with correct arguments, minor issues only",
      5: "Perfect tool selection and argument extraction",
    },
  },
  conversation_flow: {
    description: "Does the conversation progress naturally toward the goal?",
    scale: {
      1: "Agent loops, repeats questions, or conversation goes nowhere",
      2: "Conversation progresses but with unnecessary steps or confusion",
      3: "Conversation flows reasonably but could be more efficient",
      4: "Conversation flows naturally with clear progression",
      5: "Conversation flows perfectly, concise and efficient",
    },
  },
  tone: {
    description: "Is the agent professional, warm, and appropriately empathetic?",
    scale: {
      1: "Agent is rude, cold, or inappropriate",
      2: "Agent is functional but robotic or impersonal",
      3: "Agent is polite but generic, lacks warmth",
      4: "Agent is warm and professional, appropriate tone throughout",
      5: "Agent is exceptionally warm, empathetic, and professional",
    },
  },
  task_completion: {
    description: "Did the agent complete the caller's request successfully?",
    scale: {
      1: "Task not completed, caller's need unmet",
      2: "Task partially completed, significant steps missing",
      3: "Task completed but with unnecessary friction or missing details",
      4: "Task completed successfully with minor improvement opportunities",
      5: "Task completed perfectly, caller fully satisfied",
    },
  },
} as const;
```

Building the evaluation runner
The evaluation runner takes a set of test conversations, runs them through the agent, and scores each one against the rubrics.
```python
from livekit.agents.testing import AgentTest

from evaluation.rubrics import RUBRICS

# Define test scenarios as conversation scripts
TEST_SCENARIOS = [
    {
        "name": "simple_booking",
        "messages": [
            "Hi, I'd like to book a teeth cleaning",
            "Next Monday at 9am",
            "Yes, that works",
        ],
        "expected_tools": ["check_availability", "book_appointment"],
    },
    {
        "name": "reschedule",
        "messages": [
            "I need to reschedule my appointment",
            "I'm Sarah Johnson, appointment is on Friday",
            "Can we move it to next Wednesday at 2pm?",
            "Yes, confirm that please",
        ],
        "expected_tools": ["lookup_patient", "find_appointment", "reschedule_appointment"],
    },
    {
        "name": "office_hours_inquiry",
        "messages": [
            "What are your office hours?",
        ],
        "expected_tools": [],
    },
]

async def evaluate_agent(agent_class, scenarios=TEST_SCENARIOS):
    """Run all scenarios and score them against rubrics."""
    results = []
    for scenario in scenarios:
        test = AgentTest(agent_class())
        session = test.session()

        # Run the conversation
        await session.run(scenario["messages"])

        # Score each rubric dimension
        scores = {}
        for metric_name, rubric in RUBRICS.items():
            score = await test.evaluate(
                conversation=session.history(),
                rubric=rubric["description"],
                scale=rubric["scale"],
            )
            scores[metric_name] = score

        results.append({
            "scenario": scenario["name"],
            "scores": scores,
            "average": sum(scores.values()) / len(scores),
        })
    return results

async def print_evaluation_report(results):
    """Print a formatted evaluation report."""
    print("\n=== Agent Evaluation Report ===\n")
    overall_scores = {metric: [] for metric in RUBRICS}
    for result in results:
        print(f"Scenario: {result['scenario']}")
        for metric, score in result["scores"].items():
            print(f"  {metric}: {score}/5")
            overall_scores[metric].append(score)
        print(f"  Average: {result['average']:.1f}/5")
        print()
    print("Overall Averages:")
    for metric, scores in overall_scores.items():
        avg = sum(scores) / len(scores)
        print(f"  {metric}: {avg:.1f}/5")
```

```typescript
import { AgentTest } from "@livekit/agents/testing";

import { RUBRICS } from "./rubrics";

interface Scenario {
  name: string;
  messages: string[];
  expectedTools: string[];
}

const TEST_SCENARIOS: Scenario[] = [
  {
    name: "simple_booking",
    messages: [
      "Hi, I'd like to book a teeth cleaning",
      "Next Monday at 9am",
      "Yes, that works",
    ],
    expectedTools: ["check_availability", "book_appointment"],
  },
  {
    name: "reschedule",
    messages: [
      "I need to reschedule my appointment",
      "I'm Sarah Johnson, appointment is on Friday",
      "Can we move it to next Wednesday at 2pm?",
      "Yes, confirm that please",
    ],
    expectedTools: ["lookup_patient", "find_appointment", "reschedule_appointment"],
  },
  {
    name: "office_hours_inquiry",
    messages: ["What are your office hours?"],
    expectedTools: [],
  },
];

interface EvalResult {
  scenario: string;
  scores: Record<string, number>;
  average: number;
}

export async function evaluateAgent(
  AgentClass: new () => any,
  scenarios = TEST_SCENARIOS
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const scenario of scenarios) {
    const agentTest = new AgentTest(new AgentClass());
    const session = agentTest.session();

    // Run the conversation
    await session.run(scenario.messages);

    // Score each rubric dimension
    const scores: Record<string, number> = {};
    for (const [metricName, rubric] of Object.entries(RUBRICS)) {
      const score = await agentTest.evaluate({
        conversation: session.history(),
        rubric: rubric.description,
        scale: rubric.scale,
      });
      scores[metricName] = score;
    }

    const avg =
      Object.values(scores).reduce((a, b) => a + b, 0) / Object.values(scores).length;
    results.push({ scenario: scenario.name, scores, average: avg });
  }
  return results;
}
```

Running evaluations
1. Define your scenarios. Write conversation scripts that represent real caller interactions. Include happy paths, error cases, and edge cases. Aim for 20-50 scenarios for a thorough evaluation.
2. Run the evaluation suite. Execute the runner against your agent. Each scenario produces scores across all rubric dimensions.
3. Review the report. Look for patterns. If tone scores are consistently low, your system prompt needs work. If tool accuracy is low, your function descriptions may be unclear.
4. Save the results. Store results with a timestamp and agent version so you can track trends over time.
```python
import asyncio
import json
from datetime import datetime

from receptionist.agent import DentalReceptionist
from evaluation.runner import evaluate_agent, print_evaluation_report

async def main():
    results = await evaluate_agent(DentalReceptionist)
    await print_evaluation_report(results)

    # Save results for benchmarking
    report = {
        "timestamp": datetime.now().isoformat(),
        "agent_version": "1.2.0",
        "results": results,
    }
    path = f"evaluation/reports/{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    print(f"\nReport saved to {path}")

asyncio.run(main())
```

Run evaluations on every meaningful change
Prompt changes, model upgrades, and tool modifications all affect agent quality. Run the evaluation suite before and after each change to see the impact. Even a one-word change to the system prompt can shift scores.
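A lightweight way to see that impact is to compare the headline average of the report saved before a change against the one saved after. This sketch assumes the report shape produced by the script above; the sample numbers are made up:

```python
def overall_average(report: dict) -> float:
    """Mean of the per-scenario averages in a saved evaluation report."""
    results = report["results"]
    return sum(r["average"] for r in results) / len(results)

# Hypothetical reports saved before and after a prompt change.
before = {"agent_version": "1.1.0", "results": [
    {"scenario": "simple_booking", "scores": {}, "average": 4.0},
    {"scenario": "reschedule", "scores": {}, "average": 4.4},
]}
after = {"agent_version": "1.2.0", "results": [
    {"scenario": "simple_booking", "scores": {}, "average": 4.5},
    {"scenario": "reschedule", "scores": {}, "average": 4.5},
]}

print(f"Change: {overall_average(after) - overall_average(before):+.1f}")
```

The headline delta is only a summary; per-scenario, per-metric comparison (shown below) is what catches a regression hidden behind an improved average.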
Tracking benchmarks
Store evaluation results over time to spot trends and catch regressions.
```python
import json
from pathlib import Path

def load_benchmark_history(reports_dir="evaluation/reports"):
    """Load all saved evaluation reports, oldest first."""
    reports = []
    for filepath in sorted(Path(reports_dir).glob("*.json")):
        with open(filepath) as f:
            reports.append(json.load(f))
    return reports

def compare_to_baseline(current_results, baseline_results, threshold=0.5):
    """Compare current results to a baseline and flag regressions."""
    regressions = []
    for current in current_results:
        baseline = next(
            (b for b in baseline_results if b["scenario"] == current["scenario"]),
            None,
        )
        if not baseline:
            continue
        for metric in current["scores"]:
            current_score = current["scores"][metric]
            baseline_score = baseline["scores"][metric]
            diff = current_score - baseline_score
            if diff < -threshold:
                regressions.append({
                    "scenario": current["scenario"],
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": current_score,
                    "change": diff,
                })
    return regressions
```

```typescript
import { readdir, readFile } from "fs/promises";
import { join } from "path";

interface Report {
  timestamp: string;
  agent_version: string;
  results: { scenario: string; scores: Record<string, number>; average: number }[];
}

export async function loadBenchmarkHistory(
  reportsDir = "evaluation/reports"
): Promise<Report[]> {
  const files = await readdir(reportsDir);
  const jsonFiles = files.filter((f) => f.endsWith(".json")).sort();
  const reports: Report[] = [];
  for (const file of jsonFiles) {
    const content = await readFile(join(reportsDir, file), "utf-8");
    reports.push(JSON.parse(content));
  }
  return reports;
}

interface Regression {
  scenario: string;
  metric: string;
  baseline: number;
  current: number;
  change: number;
}

export function compareToBaseline(
  current: Report["results"],
  baseline: Report["results"],
  threshold = 0.5
): Regression[] {
  const regressions: Regression[] = [];
  for (const curr of current) {
    const base = baseline.find((b) => b.scenario === curr.scenario);
    if (!base) continue;
    for (const metric of Object.keys(curr.scores)) {
      const diff = curr.scores[metric] - base.scores[metric];
      if (diff < -threshold) {
        regressions.push({
          scenario: curr.scenario,
          metric,
          baseline: base.scores[metric],
          current: curr.scores[metric],
          change: diff,
        });
      }
    }
  }
  return regressions;
}
```

Set realistic thresholds
LLM-based scoring has natural variance. A score drop of 0.2 points may just be noise. Start with a regression threshold of 0.5 points on a 5-point scale and adjust based on your observed variance. Running the same evaluation three times and averaging helps reduce noise.
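That averaging can be sketched as a small helper: run the suite several times, then average each metric per scenario before comparing against the baseline. This assumes the result shape produced by `evaluate_agent`; the runs below are hypothetical:

```python
def average_runs(runs: list[list[dict]]) -> list[dict]:
    """Average per-metric scores across repeated evaluation runs of the same scenarios.

    Each run is the list of per-scenario result dicts returned by evaluate_agent,
    in the same scenario order.
    """
    averaged = []
    for per_scenario in zip(*runs):  # group the same scenario across runs
        scores = {
            metric: sum(r["scores"][metric] for r in per_scenario) / len(per_scenario)
            for metric in per_scenario[0]["scores"]
        }
        averaged.append({
            "scenario": per_scenario[0]["scenario"],
            "scores": scores,
            "average": sum(scores.values()) / len(scores),
        })
    return averaged
```

Feeding the averaged results into `compare_to_baseline` makes the regression check far less sensitive to single-run scoring noise.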
Test your knowledge
Why do scoring rubrics define what each score level (1 through 5) looks like rather than just providing a single description per metric?
What you learned
- Five key metrics for voice AI evaluation: response relevance, tool accuracy, conversation flow, tone, and task completion
- Scoring rubrics define what each score level means, making evaluations consistent and meaningful
- The evaluation runner scores conversations against rubrics and produces structured reports
- Benchmark tracking lets you compare agent versions over time and detect quality trends
Next up
Evaluation tells you how good your agent is right now. In the next chapter, you will build regression test suites that automatically detect when a new version of your agent gets worse.