Production evaluation
Tests catch problems before deployment. But what about problems that only appear with real callers -- unexpected accents, noisy backgrounds, questions nobody anticipated? In this final chapter, you will set up live evaluation of production conversations, A/B testing for agent configurations, and quality gates that alert you when performance degrades.
What you'll learn
- How to evaluate production conversations in real time
- How to A/B test different agent configurations
- How to set up quality gates that flag degraded performance
- How the complete testing and evaluation strategy fits together
Live evaluation of production conversations
Production evaluation samples real conversations and scores them against your rubrics. Unlike pre-deployment tests that use simulated inputs, this evaluates the agent's performance with real callers.
```python
import json
import random
from datetime import datetime

from livekit.agents.testing import ConversationEvaluator
from evaluation.rubrics import RUBRICS


class LiveEvaluator:
    def __init__(self, sample_rate=0.1):
        """Evaluate a sample of production conversations.

        Args:
            sample_rate: Fraction of conversations to evaluate (0.1 = 10%)
        """
        self.sample_rate = sample_rate
        self.evaluator = ConversationEvaluator(rubrics=RUBRICS)

    async def on_conversation_end(self, conversation_id, history):
        """Called when a production conversation ends."""
        if random.random() > self.sample_rate:
            return  # Skip this conversation

        # Score the conversation
        scores = await self.evaluator.evaluate(history)

        # Store the results
        result = {
            "conversation_id": conversation_id,
            "timestamp": datetime.now().isoformat(),
            "scores": scores,
            "average": sum(scores.values()) / len(scores),
            "turn_count": len([m for m in history if m["role"] == "user"]),
        }
        await self.store_result(result)
        await self.check_quality_gates(result)

    async def store_result(self, result):
        """Store evaluation result for trend analysis."""
        with open("monitoring/results.jsonl", "a") as f:
            f.write(json.dumps(result) + "\n")

    async def check_quality_gates(self, result):
        """Check if the conversation meets quality thresholds."""
        alerts = []
        if result["average"] < 3.0:
            alerts.append(f"Low overall quality: {result['average']:.1f}/5")
        for metric, score in result["scores"].items():
            if score < 2.0:
                alerts.append(f"Critical: {metric} scored {score}/5")
        if alerts:
            await self.send_alert(result["conversation_id"], alerts)

    async def send_alert(self, conversation_id, alerts):
        """Send alert for quality issues."""
        print(f"QUALITY ALERT for conversation {conversation_id}:")
        for alert in alerts:
            print(f"  - {alert}")
```

```typescript
import { ConversationEvaluator } from "@livekit/agents/testing";
import { RUBRICS } from "../evaluation/rubrics";
import { appendFile } from "fs/promises";

interface EvalResult {
  conversationId: string;
  timestamp: string;
  scores: Record<string, number>;
  average: number;
  turnCount: number;
}

export class LiveEvaluator {
  private evaluator: ConversationEvaluator;
  private sampleRate: number;

  constructor(sampleRate = 0.1) {
    this.sampleRate = sampleRate;
    this.evaluator = new ConversationEvaluator({ rubrics: RUBRICS });
  }

  async onConversationEnd(conversationId: string, history: any[]) {
    if (Math.random() > this.sampleRate) return;

    const scores = await this.evaluator.evaluate(history);
    const avg =
      Object.values(scores).reduce((a, b) => a + b, 0) /
      Object.values(scores).length;

    const result: EvalResult = {
      conversationId,
      timestamp: new Date().toISOString(),
      scores,
      average: avg,
      turnCount: history.filter((m) => m.role === "user").length,
    };
    await this.storeResult(result);
    await this.checkQualityGates(result);
  }

  private async storeResult(result: EvalResult) {
    await appendFile("monitoring/results.jsonl", JSON.stringify(result) + "\n");
  }

  private async checkQualityGates(result: EvalResult) {
    const alerts: string[] = [];
    if (result.average < 3.0) {
      alerts.push(`Low overall quality: ${result.average.toFixed(1)}/5`);
    }
    for (const [metric, score] of Object.entries(result.scores)) {
      if (score < 2.0) {
        alerts.push(`Critical: ${metric} scored ${score}/5`);
      }
    }
    if (alerts.length > 0) {
      console.log(`QUALITY ALERT for conversation ${result.conversationId}:`);
      alerts.forEach((a) => console.log(`  - ${a}`));
    }
  }
}
```

Sample rate matters
Evaluating every conversation is expensive. Start with a 10% sample rate and adjust based on your traffic volume and budget. For critical launch periods, temporarily increase to 50% or 100% until you are confident in agent quality.
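To pick a concrete rate, it helps to work backwards from an evaluation budget. Here is a minimal sketch; the helper name and parameters are illustrative, not part of any LiveKit API:

```python
def sample_rate_for_budget(daily_conversations, max_evals_per_day,
                           floor=0.01, ceiling=1.0):
    """Largest sample rate that stays within a daily evaluation budget.

    The floor keeps some conversations flowing into evaluation even at
    very high traffic volumes; the ceiling caps the rate at 100%.
    """
    rate = max_evals_per_day / max(daily_conversations, 1)
    return min(max(rate, floor), ceiling)
```

For example, 1,000 calls per day with a budget of 100 scored conversations yields a 10% sample rate, while very low traffic pushes the rate up to 100%.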
Wiring the evaluator into your agent
Connect the live evaluator to your agent's conversation lifecycle.
```python
from monitoring.live_eval import LiveEvaluator

evaluator = LiveEvaluator(sample_rate=0.1)


class DentalReceptionist:
    async def on_session_end(self, session):
        """Called when a caller hangs up or the session ends."""
        await evaluator.on_conversation_end(
            conversation_id=session.id,
            history=session.conversation_history(),
        )
```

```typescript
import { LiveEvaluator } from "../monitoring/live-eval";

const evaluator = new LiveEvaluator(0.1);

class DentalReceptionist {
  async onSessionEnd(session: any) {
    await evaluator.onConversationEnd(
      session.id,
      session.conversationHistory()
    );
  }
}
```

A/B testing agent configurations
A/B testing lets you compare two agent configurations side by side with real traffic. This is how you validate that a prompt change or model upgrade actually improves quality.
```python
import hashlib

from receptionist.agent import DentalReceptionist


class ABTestRouter:
    def __init__(self, variant_a_config, variant_b_config, traffic_split=0.5):
        """Route callers to one of two agent configurations.

        Args:
            variant_a_config: Config for the control group
            variant_b_config: Config for the experiment group
            traffic_split: Fraction of traffic sent to variant B
        """
        self.variant_a_config = variant_a_config
        self.variant_b_config = variant_b_config
        self.traffic_split = traffic_split

    def get_agent(self, session_id):
        """Return an agent instance for this session."""
        # Deterministic assignment based on session ID ensures the same
        # caller gets the same variant if they call back. A stable hash is
        # used because Python's built-in hash() is randomized per process.
        bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
        if bucket < self.traffic_split * 100:
            variant = "B"
            config = self.variant_b_config
        else:
            variant = "A"
            config = self.variant_a_config
        agent = DentalReceptionist(config=config)
        agent.ab_variant = variant  # Tag for evaluation
        return agent


# Example usage
router = ABTestRouter(
    variant_a_config={
        "model": "gpt-4o",
        "system_prompt": "You are a friendly dental receptionist...",
    },
    variant_b_config={
        "model": "gpt-4o",
        "system_prompt": "You are a warm and professional dental office assistant...",
    },
    traffic_split=0.2,  # Send 20% of traffic to variant B
)
```

```typescript
import { DentalReceptionist } from "../src/agent";

interface AgentConfig {
  model: string;
  systemPrompt: string;
}

export class ABTestRouter {
  constructor(
    private variantAConfig: AgentConfig,
    private variantBConfig: AgentConfig,
    private trafficSplit = 0.5
  ) {}

  getAgent(sessionId: string) {
    // Deterministic assignment: the same session ID always maps to the
    // same bucket, so repeat callers see a consistent variant
    const hash = this.hashCode(sessionId);
    const isVariantB = hash % 100 < this.trafficSplit * 100;
    const config = isVariantB ? this.variantBConfig : this.variantAConfig;
    const variant = isVariantB ? "B" : "A";
    const agent = new DentalReceptionist(config);
    (agent as any).abVariant = variant;
    return agent;
  }

  private hashCode(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = (hash << 5) - hash + str.charCodeAt(i);
      hash |= 0;
    }
    return Math.abs(hash);
  }
}

// Example usage
const router = new ABTestRouter(
  { model: "gpt-4o", systemPrompt: "You are a friendly dental receptionist..." },
  { model: "gpt-4o", systemPrompt: "You are a warm and professional dental office assistant..." },
  0.2
);
```

A/B testing voice AI is like A/B testing a website, but the metric is conversation quality instead of click-through rate. You split traffic between two configurations, evaluate both with the same rubrics, and compare scores to see which performs better.
Analyzing A/B test results
```python
import json
from collections import defaultdict


def analyze_ab_results(results_path="monitoring/results.jsonl"):
    """Compare quality scores between A/B variants.

    Assumes each stored result is tagged with the agent's ab_variant.
    """
    variant_scores = defaultdict(lambda: defaultdict(list))
    with open(results_path) as f:
        for line in f:
            result = json.loads(line)
            variant = result.get("ab_variant", "A")
            for metric, score in result["scores"].items():
                variant_scores[variant][metric].append(score)

    print("A/B Test Results")
    print("=" * 50)
    for variant in sorted(variant_scores.keys()):
        # Each metric has one score per conversation, so the first
        # metric's score count equals the conversation count
        scores_list = list(variant_scores[variant].values())[0]
        print(f"\nVariant {variant} ({len(scores_list)} conversations):")
        for metric, scores in variant_scores[variant].items():
            avg = sum(scores) / len(scores)
            print(f"  {metric}: {avg:.2f}/5")
```

Start with small traffic splits
When testing a new configuration, start by sending only 10-20% of traffic to the new variant. If scores are comparable or better after 50-100 conversations, gradually increase the split. If scores drop, roll back immediately.
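Whether "comparable or better" is real signal or just noise can be sanity-checked with a simple significance test. A rough sketch using Welch's t-statistic over per-conversation average scores (as a rule of thumb, |t| above roughly 2 suggests a genuine difference; this is an illustration, not a substitute for a proper statistical analysis):

```python
from statistics import mean, variance


def welch_t(scores_a, scores_b):
    """Welch's t-statistic comparing two variants' per-conversation scores.

    Positive values mean variant B scored higher on average; magnitudes
    above ~2 are unlikely to be pure noise at moderate sample sizes.
    """
    na, nb = len(scores_a), len(scores_b)
    se = (variance(scores_a) / na + variance(scores_b) / nb) ** 0.5
    return (mean(scores_b) - mean(scores_a)) / se
```

Feed it the `average` field from each variant's evaluation results; if the statistic stays small after 50-100 conversations, the variants are likely equivalent.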
Quality gates
Quality gates are automated thresholds that trigger alerts or actions when agent quality drops below acceptable levels.
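The gate check needs the evaluation results from the most recent window as input. A minimal sketch that reads them back from the results.jsonl log written by the live evaluator (the function name is illustrative):

```python
import json
from datetime import datetime, timedelta


def load_recent_results(path="monitoring/results.jsonl", window_minutes=60):
    """Load evaluation results whose timestamp falls inside the window."""
    cutoff = datetime.now() - timedelta(minutes=window_minutes)
    recent = []
    with open(path) as f:
        for line in f:
            result = json.loads(line)
            # Timestamps were written with datetime.isoformat()
            if datetime.fromisoformat(result["timestamp"]) >= cutoff:
                recent.append(result)
    return recent
```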
```python
class QualityGateMonitor:
    def __init__(self, window_minutes=60, min_conversations=10):
        self.window_minutes = window_minutes
        self.min_conversations = min_conversations
        # Define quality thresholds
        self.gates = {
            "overall_average": {"min": 3.5, "severity": "critical"},
            "task_completion": {"min": 3.0, "severity": "critical"},
            "tone": {"min": 3.0, "severity": "warning"},
            "tool_accuracy": {"min": 3.5, "severity": "critical"},
            "response_relevance": {"min": 3.5, "severity": "warning"},
        }

    def check_gates(self, recent_results):
        """Check quality gates against recent evaluation results."""
        if len(recent_results) < self.min_conversations:
            return []  # Not enough data

        violations = []

        # Check overall average
        overall_avg = sum(r["average"] for r in recent_results) / len(recent_results)
        gate = self.gates["overall_average"]
        if overall_avg < gate["min"]:
            violations.append({
                "gate": "overall_average",
                "threshold": gate["min"],
                "actual": overall_avg,
                "severity": gate["severity"],
            })

        # Check individual metrics
        for metric in ["task_completion", "tone", "tool_accuracy", "response_relevance"]:
            scores = [r["scores"][metric] for r in recent_results if metric in r["scores"]]
            if not scores:
                continue
            avg = sum(scores) / len(scores)
            gate = self.gates.get(metric)
            if gate and avg < gate["min"]:
                violations.append({
                    "gate": metric,
                    "threshold": gate["min"],
                    "actual": avg,
                    "severity": gate["severity"],
                })
        return violations
```

```typescript
interface QualityGate {
  min: number;
  severity: "warning" | "critical";
}

interface Violation {
  gate: string;
  threshold: number;
  actual: number;
  severity: string;
}

export class QualityGateMonitor {
  private gates: Record<string, QualityGate> = {
    overall_average: { min: 3.5, severity: "critical" },
    task_completion: { min: 3.0, severity: "critical" },
    tone: { min: 3.0, severity: "warning" },
    tool_accuracy: { min: 3.5, severity: "critical" },
    response_relevance: { min: 3.5, severity: "warning" },
  };

  constructor(
    private windowMinutes = 60,
    private minConversations = 10
  ) {}

  checkGates(recentResults: any[]): Violation[] {
    if (recentResults.length < this.minConversations) return [];

    const violations: Violation[] = [];
    const overallAvg =
      recentResults.reduce((sum, r) => sum + r.average, 0) / recentResults.length;
    if (overallAvg < this.gates.overall_average.min) {
      violations.push({
        gate: "overall_average",
        threshold: this.gates.overall_average.min,
        actual: overallAvg,
        severity: this.gates.overall_average.severity,
      });
    }
    for (const metric of ["task_completion", "tone", "tool_accuracy", "response_relevance"]) {
      const scores = recentResults
        .filter((r) => r.scores[metric] !== undefined)
        .map((r) => r.scores[metric]);
      if (scores.length === 0) continue;
      const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
      const gate = this.gates[metric];
      if (gate && avg < gate.min) {
        violations.push({
          gate: metric,
          threshold: gate.min,
          actual: avg,
          severity: gate.severity,
        });
      }
    }
    return violations;
  }
}
```

Critical gates should trigger immediate action
When a critical quality gate is violated, your team should be notified immediately through Slack, PagerDuty, or your alerting system of choice. A sustained drop in task completion means callers are not getting their appointments booked -- that is a business problem, not just a quality problem.
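As one concrete route, a Slack incoming webhook only needs a JSON payload with a `text` field. A hedged sketch (the webhook URL is a placeholder you generate in Slack; the helper names are illustrative):

```python
import json
import urllib.request


def format_alert(violations):
    """Build a human-readable alert message from gate violations."""
    lines = ["Quality gate violations:"]
    for v in violations:
        lines.append(
            f"[{v['severity'].upper()}] {v['gate']}: "
            f"{v['actual']:.2f} (threshold {v['threshold']})"
        )
    return "\n".join(lines)


def send_slack_alert(webhook_url, violations):
    """POST the alert to a Slack incoming webhook."""
    payload = json.dumps({"text": format_alert(violations)}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Wire this into the monitor by calling `send_slack_alert` whenever `check_gates` returns any violation with critical severity.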
The complete testing and evaluation strategy
Here is how all the pieces from this course fit together:
Development: Write tests as you build
Unit tests for tool functions. Behavioral tests for conversation scenarios. Golden tests for critical guardrails.
Pre-commit: Run fast tests locally
Unit tests and golden tests run in seconds. Catch obvious breaks before pushing.
CI pipeline: Automated test suite
GitHub Actions runs the full test suite on every push. Behavioral tests, golden tests, and regression checks gate pull requests.
Pre-deployment: Regression evaluation
Compare the new version against the baseline. Flag any quality drops above the threshold. Update the baseline only for intentional changes.
Production: Live monitoring
Sample and evaluate real conversations. A/B test new configurations with small traffic splits. Quality gates alert on degradation.
What you learned
- Live evaluation samples production conversations and scores them against rubrics to catch real-world issues
- A/B testing lets you compare agent configurations with real traffic using deterministic session routing
- Quality gates define thresholds that trigger alerts when agent performance drops
- The complete strategy spans development through production: unit tests, behavioral tests, golden tests, regression checks, CI/CD, live evaluation, A/B testing, and quality gates
Course summary
In this course, you built a comprehensive testing and evaluation system for voice AI agents:
- Testing strategy -- The voice AI test pyramid with unit, behavioral, and evaluation layers
- Behavioral tests -- `AgentTest`, `session.run()`, and `judge()` for conversation testing
- Tool testing -- `mock_tools()`, tool assertions, and workflow validation
- Evaluation framework -- Metrics, rubrics, scoring, and benchmark tracking
- Regression testing -- Baselines, golden tests, and automated regression detection
- CI/CD integration -- GitHub Actions pipelines with gated deployments
- Production evaluation -- Live monitoring, A/B testing, and quality gates
You now have every layer of confidence needed to ship voice AI agents that work reliably, improve over time, and never silently regress.