Chapter 7 · 15m

Production evaluation

Tests catch problems before deployment. But what about problems that only appear with real callers -- unexpected accents, noisy backgrounds, questions nobody anticipated? In this final chapter, you will set up live evaluation of production conversations, A/B testing for agent configurations, and quality gates that alert you when performance degrades.

Live monitoring · A/B testing · Quality gates

What you'll learn

  • How to evaluate production conversations in real time
  • How to A/B test different agent configurations
  • How to set up quality gates that flag degraded performance
  • How the complete testing and evaluation strategy fits together

Live evaluation of production conversations

Production evaluation samples real conversations and scores them against your rubrics. Unlike pre-deployment tests that use simulated inputs, this evaluates the agent's performance with real callers.

monitoring/live_eval.py (Python)
import json
import random
from datetime import datetime

from livekit.agents.testing import ConversationEvaluator
from evaluation.rubrics import RUBRICS

class LiveEvaluator:
  def __init__(self, sample_rate=0.1):
      """Evaluate a sample of production conversations.

      Args:
          sample_rate: Fraction of conversations to evaluate (0.1 = 10%)
      """
      self.sample_rate = sample_rate
      self.evaluator = ConversationEvaluator(rubrics=RUBRICS)

  async def on_conversation_end(self, conversation_id, history):
      """Called when a production conversation ends."""
      if random.random() > self.sample_rate:
          return  # Skip this conversation

      # Score the conversation
      scores = await self.evaluator.evaluate(history)

      # Store the results
      result = {
          "conversation_id": conversation_id,
          "timestamp": datetime.now().isoformat(),
          "scores": scores,
          "average": sum(scores.values()) / len(scores),
          "turn_count": len([m for m in history if m["role"] == "user"]),
      }

      await self.store_result(result)
      await self.check_quality_gates(result)

  async def store_result(self, result):
      """Store evaluation result for trend analysis."""
      # A blocking append is fine at sampled volumes; switch to an
      # async or queued writer if evaluation volume grows
      with open("monitoring/results.jsonl", "a") as f:
          f.write(json.dumps(result) + "\n")

  async def check_quality_gates(self, result):
      """Check if the conversation meets quality thresholds."""
      alerts = []

      if result["average"] < 3.0:
          alerts.append(f"Low overall quality: {result['average']:.1f}/5")

      for metric, score in result["scores"].items():
          if score < 2.0:
              alerts.append(f"Critical: {metric} scored {score}/5")

      if alerts:
          await self.send_alert(result["conversation_id"], alerts)

  async def send_alert(self, conversation_id, alerts):
      """Send alert for quality issues."""
      print(f"QUALITY ALERT for conversation {conversation_id}:")
      for alert in alerts:
          print(f"  - {alert}")
monitoring/live-eval.ts (TypeScript)
import { ConversationEvaluator } from "@livekit/agents/testing";
import { RUBRICS } from "../evaluation/rubrics";
import { appendFile } from "fs/promises";

interface EvalResult {
  conversationId: string;
  timestamp: string;
  scores: Record<string, number>;
  average: number;
  turnCount: number;
}

export class LiveEvaluator {
  private evaluator: ConversationEvaluator;
  private sampleRate: number;

  constructor(sampleRate = 0.1) {
    this.sampleRate = sampleRate;
    this.evaluator = new ConversationEvaluator({ rubrics: RUBRICS });
  }

  async onConversationEnd(conversationId: string, history: any[]) {
    if (Math.random() > this.sampleRate) return;

    const scores = await this.evaluator.evaluate(history);
    const avg =
      Object.values(scores).reduce((a, b) => a + b, 0) /
      Object.values(scores).length;

    const result: EvalResult = {
      conversationId,
      timestamp: new Date().toISOString(),
      scores,
      average: avg,
      turnCount: history.filter((m) => m.role === "user").length,
    };

    await this.storeResult(result);
    await this.checkQualityGates(result);
  }

  private async storeResult(result: EvalResult) {
    await appendFile("monitoring/results.jsonl", JSON.stringify(result) + "\n");
  }

  private async checkQualityGates(result: EvalResult) {
    const alerts: string[] = [];

    if (result.average < 3.0) {
      alerts.push(`Low overall quality: ${result.average.toFixed(1)}/5`);
    }

    for (const [metric, score] of Object.entries(result.scores)) {
      if (score < 2.0) {
        alerts.push(`Critical: ${metric} scored ${score}/5`);
      }
    }

    if (alerts.length > 0) {
      console.log(`QUALITY ALERT for conversation ${result.conversationId}:`);
      alerts.forEach((a) => console.log(`  - ${a}`));
    }
  }
}

Sample rate matters

Evaluating every conversation is expensive. Start with a 10% sample rate and adjust based on your traffic volume and budget. For critical launch periods, temporarily increase to 50% or 100% until you are confident in agent quality.
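To pick a sample rate, it helps to estimate the evaluation volume it implies. A minimal sketch -- the call volume and per-evaluation cost figures below are made-up assumptions, not figures from this course:

```python
def eval_budget(calls_per_day: int, sample_rate: float, cost_per_eval: float):
    """Estimate daily evaluated conversations and spend for a given sample rate."""
    evaluated = round(calls_per_day * sample_rate)
    return evaluated, evaluated * cost_per_eval

# Hypothetical numbers: 2,000 calls/day, $0.05 per LLM-judged evaluation
evaluated, spend = eval_budget(2000, 0.10, 0.05)
print(f"{evaluated} evaluations/day, ~${spend:.2f}/day")  # 200 evaluations/day, ~$10.00/day
```

Running the same arithmetic at a 100% sample rate makes the cost of launch-period monitoring explicit before you turn it on.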

Wiring the evaluator into your agent

Connect the live evaluator to your agent's conversation lifecycle.

receptionist/agent.py (Python)
from monitoring.live_eval import LiveEvaluator

evaluator = LiveEvaluator(sample_rate=0.1)

class DentalReceptionist:
  async def on_session_end(self, session):
      """Called when a caller hangs up or the session ends."""
      await evaluator.on_conversation_end(
          conversation_id=session.id,
          history=session.conversation_history(),
      )
src/agent.ts (TypeScript)
import { LiveEvaluator } from "../monitoring/live-eval";

const evaluator = new LiveEvaluator(0.1);

class DentalReceptionist {
  async onSessionEnd(session: any) {
    await evaluator.onConversationEnd(
      session.id,
      session.conversationHistory()
    );
  }
}

A/B testing agent configurations

A/B testing lets you compare two agent configurations side by side with real traffic. This is how you validate that a prompt change or model upgrade actually improves quality.

monitoring/ab_test.py (Python)
import hashlib

from receptionist.agent import DentalReceptionist

class ABTestRouter:
  def __init__(self, variant_a_config, variant_b_config, traffic_split=0.5):
      """Route callers to one of two agent configurations.

      Args:
          variant_a_config: Config for the control group
          variant_b_config: Config for the experiment group
          traffic_split: Fraction of traffic sent to variant B
      """
      self.variant_a_config = variant_a_config
      self.variant_b_config = variant_b_config
      self.traffic_split = traffic_split

  def get_agent(self, session_id):
      """Return an agent instance for this session."""
      # Deterministic assignment based on session ID ensures the same
      # caller gets the same variant if they call back. A stable hash is
      # required here -- Python's built-in hash() is salted per process,
      # so it would reshuffle assignments on every restart.
      digest = hashlib.sha256(session_id.encode()).hexdigest()
      is_variant_b = int(digest, 16) % 100 < (self.traffic_split * 100)

      if is_variant_b:
          variant = "B"
          config = self.variant_b_config
      else:
          variant = "A"
          config = self.variant_a_config

      agent = DentalReceptionist(config=config)
      agent.ab_variant = variant  # Tag for evaluation
      return agent

# Example usage
router = ABTestRouter(
  variant_a_config={
      "model": "gpt-4o",
      "system_prompt": "You are a friendly dental receptionist...",
  },
  variant_b_config={
      "model": "gpt-4o",
      "system_prompt": "You are a warm and professional dental office assistant...",
  },
  traffic_split=0.2,  # Send 20% of traffic to variant B
)
monitoring/ab-test.ts (TypeScript)
import { DentalReceptionist } from "../src/agent";

interface AgentConfig {
  model: string;
  systemPrompt: string;
}

export class ABTestRouter {
  constructor(
    private variantAConfig: AgentConfig,
    private variantBConfig: AgentConfig,
    private trafficSplit = 0.5
  ) {}

  getAgent(sessionId: string) {
    const hash = this.hashCode(sessionId);
    const isVariantB = hash % 100 < this.trafficSplit * 100;

    const config = isVariantB ? this.variantBConfig : this.variantAConfig;
    const variant = isVariantB ? "B" : "A";

    const agent = new DentalReceptionist(config);
    (agent as any).abVariant = variant;
    return agent;
  }

  private hashCode(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = (hash << 5) - hash + str.charCodeAt(i);
      hash |= 0;
    }
    return Math.abs(hash);
  }
}

// Example usage
const router = new ABTestRouter(
  { model: "gpt-4o", systemPrompt: "You are a friendly dental receptionist..." },
  { model: "gpt-4o", systemPrompt: "You are a warm and professional dental office assistant..." },
  0.2
);
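The deterministic-assignment claim is easy to check empirically. The sketch below is self-contained: it reimplements the stable SHA-256 bucketing inline (rather than importing the router) and simulates 10,000 sessions to confirm a 0.2 split actually lands near 20%:

```python
import hashlib

def assign_variant(session_id: str, traffic_split: float) -> str:
    """Stable bucketing: the same session ID always maps to the same variant."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "B" if bucket < traffic_split * 100 else "A"

# Simulate 10,000 sessions at a 20% split
assignments = [assign_variant(f"session-{i}", 0.2) for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"Variant B share: {share_b:.1%}")  # close to 20%

# Determinism: re-assigning the same ID never changes the variant
assert assign_variant("session-42", 0.2) == assign_variant("session-42", 0.2)
```

The same simulation is a cheap sanity check before you route real callers through a new split value.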
What's happening

A/B testing voice AI is like A/B testing a website, but the metric is conversation quality instead of click-through rate. You split traffic between two configurations, evaluate both with the same rubrics, and compare scores to see which performs better.

Analyzing A/B test results

monitoring/ab_analysis.py (Python)
import json
from collections import defaultdict

def analyze_ab_results(results_path="monitoring/results.jsonl"):
  """Compare quality scores between A/B variants."""
  variant_scores = defaultdict(lambda: defaultdict(list))

  with open(results_path) as f:
      for line in f:
          result = json.loads(line)
          # Assumes the evaluator records the agent's ab_variant tag;
          # untagged results are treated as the control group
          variant = result.get("ab_variant", "A")
          for metric, score in result["scores"].items():
              variant_scores[variant][metric].append(score)

  print("A/B Test Results")
  print("=" * 50)

  for variant in sorted(variant_scores):
      metrics = variant_scores[variant]
      # Each conversation contributes one score per metric, so the
      # longest score list gives the conversation count
      count = max(len(scores) for scores in metrics.values())
      print(f"\nVariant {variant} ({count} conversations):")
      for metric, scores in metrics.items():
          avg = sum(scores) / len(scores)
          print(f"  {metric}: {avg:.2f}/5")
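Average scores alone can mislead with small samples -- a 0.3-point gap over 20 conversations may be noise. One stdlib-only way to check this, offered here as a sketch rather than part of the course's tooling, is a permutation test on the observed difference in means:

```python
import random

def permutation_test(scores_a, scores_b, iterations=10_000, seed=0):
    """Estimate how often a mean difference this large appears by chance."""
    rng = random.Random(seed)
    observed = abs(sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a))
    pooled = list(scores_a) + list(scores_b)
    hits = 0
    for _ in range(iterations):
        # Shuffle the pooled scores and re-split into two groups
        rng.shuffle(pooled)
        a, b = pooled[: len(scores_a)], pooled[len(scores_a):]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / iterations

# Hypothetical task_completion scores from the two variants
p = permutation_test([3, 4, 3, 4, 3, 3, 4, 3], [4, 4, 5, 4, 4, 5, 4, 4])
print(f"p = {p:.3f}")  # a small p suggests the gap is unlikely to be chance
```

A low p-value supports increasing the traffic split; a high one means keep collecting conversations before deciding.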

Start with small traffic splits

When testing a new configuration, start by sending only 10-20% of traffic to the new variant. If scores are comparable or better after 50-100 conversations, gradually increase the split. If scores drop, roll back immediately.

Quality gates

Quality gates are automated thresholds that trigger alerts or actions when agent quality drops below acceptable levels.

monitoring/quality_gates.py (Python)
class QualityGateMonitor:
  def __init__(self, window_minutes=60, min_conversations=10):
      self.window_minutes = window_minutes
      self.min_conversations = min_conversations

      # Define quality thresholds
      self.gates = {
          "overall_average": {"min": 3.5, "severity": "critical"},
          "task_completion": {"min": 3.0, "severity": "critical"},
          "tone": {"min": 3.0, "severity": "warning"},
          "tool_accuracy": {"min": 3.5, "severity": "critical"},
          "response_relevance": {"min": 3.5, "severity": "warning"},
      }

  def check_gates(self, recent_results):
      """Check quality gates against recent evaluation results."""
      if len(recent_results) < self.min_conversations:
          return []  # Not enough data

      violations = []

      # Check overall average
      overall_avg = sum(r["average"] for r in recent_results) / len(recent_results)
      gate = self.gates["overall_average"]
      if overall_avg < gate["min"]:
          violations.append({
              "gate": "overall_average",
              "threshold": gate["min"],
              "actual": overall_avg,
              "severity": gate["severity"],
          })

      # Check individual metrics
      for metric in ["task_completion", "tone", "tool_accuracy", "response_relevance"]:
          scores = [r["scores"][metric] for r in recent_results if metric in r["scores"]]
          if not scores:
              continue

          avg = sum(scores) / len(scores)
          gate = self.gates.get(metric)
          if gate and avg < gate["min"]:
              violations.append({
                  "gate": metric,
                  "threshold": gate["min"],
                  "actual": avg,
                  "severity": gate["severity"],
              })

      return violations
monitoring/quality-gates.ts (TypeScript)
interface QualityGate {
  min: number;
  severity: "warning" | "critical";
}

interface Violation {
  gate: string;
  threshold: number;
  actual: number;
  severity: string;
}

export class QualityGateMonitor {
  private gates: Record<string, QualityGate> = {
    overall_average: { min: 3.5, severity: "critical" },
    task_completion: { min: 3.0, severity: "critical" },
    tone: { min: 3.0, severity: "warning" },
    tool_accuracy: { min: 3.5, severity: "critical" },
    response_relevance: { min: 3.5, severity: "warning" },
  };

  constructor(
    private windowMinutes = 60,
    private minConversations = 10
  ) {}

  checkGates(recentResults: any[]): Violation[] {
    if (recentResults.length < this.minConversations) return [];

    const violations: Violation[] = [];

    const overallAvg =
      recentResults.reduce((sum, r) => sum + r.average, 0) / recentResults.length;

    if (overallAvg < this.gates.overall_average.min) {
      violations.push({
        gate: "overall_average",
        threshold: this.gates.overall_average.min,
        actual: overallAvg,
        severity: this.gates.overall_average.severity,
      });
    }

    for (const metric of ["task_completion", "tone", "tool_accuracy", "response_relevance"]) {
      const scores = recentResults
        .filter((r) => r.scores[metric] !== undefined)
        .map((r) => r.scores[metric]);

      if (scores.length === 0) continue;

      const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
      const gate = this.gates[metric];

      if (gate && avg < gate.min) {
        violations.push({
          gate: metric,
          threshold: gate.min,
          actual: avg,
          severity: gate.severity,
        });
      }
    }

    return violations;
  }
}

Critical gates should trigger immediate action

When a critical quality gate is violated, your team should be notified immediately through Slack, PagerDuty, or your alerting system of choice. A sustained drop in task completion means callers are not getting their appointments booked -- that is a business problem, not just a quality problem.
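As a concrete example, the alert hook could post to a Slack incoming webhook using only the standard library. This is a sketch, not part of the course's codebase: the webhook URL is a placeholder, and routing critical severities to PagerDuty would follow the same HTTP-POST pattern.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_alert(conversation_id: str, alerts: list) -> str:
    """Build the alert message in the same shape send_alert prints."""
    lines = [f"QUALITY ALERT for conversation {conversation_id}:"]
    lines += [f"  - {a}" for a in alerts]
    return "\n".join(lines)

def send_slack_alert(conversation_id: str, alerts: list) -> None:
    """Post a quality alert to a Slack channel via an incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": format_alert(conversation_id, alerts)}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

Keeping message formatting separate from delivery makes the formatter easy to unit test without touching the network.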

The complete testing and evaluation strategy

Here is how all the pieces from this course fit together:

1. Development: Write tests as you build -- Unit tests for tool functions. Behavioral tests for conversation scenarios. Golden tests for critical guardrails.

2. Pre-commit: Run fast tests locally -- Unit tests and golden tests run in seconds. Catch obvious breaks before pushing.

3. CI pipeline: Automated test suite -- GitHub Actions runs the full test suite on every push. Behavioral tests, golden tests, and regression checks gate pull requests.

4. Pre-deployment: Regression evaluation -- Compare the new version against the baseline. Flag any quality drops above the threshold. Update the baseline only for intentional changes.

5. Production: Live monitoring -- Sample and evaluate real conversations. A/B test new configurations with small traffic splits. Quality gates alert on degradation.


What you learned

  • Live evaluation samples production conversations and scores them against rubrics to catch real-world issues
  • A/B testing lets you compare agent configurations with real traffic using deterministic session routing
  • Quality gates define thresholds that trigger alerts when agent performance drops
  • The complete strategy spans development through production: unit tests, behavioral tests, golden tests, regression checks, CI/CD, live evaluation, A/B testing, and quality gates

Course summary

In this course, you built a comprehensive testing and evaluation system for voice AI agents:

  1. Testing strategy -- The voice AI test pyramid with unit, behavioral, and evaluation layers
  2. Behavioral tests -- AgentTest, session.run(), and judge() for conversation testing
  3. Tool testing -- mock_tools(), tool assertions, and workflow validation
  4. Evaluation framework -- Metrics, rubrics, scoring, and benchmark tracking
  5. Regression testing -- Baselines, golden tests, and automated regression detection
  6. CI/CD integration -- GitHub Actions pipelines with gated deployments
  7. Production evaluation -- Live monitoring, A/B testing, and quality gates

You now have every layer of confidence needed to ship voice AI agents that work reliably, improve over time, and never silently regress.

Concepts covered
Live monitoring · A/B testing · Quality gates