Alerting & incident response
Alerting for voice AI systems
Your dashboards show what happened. Alerts tell you what is happening right now and demand action. A voice agent that silently fails for twenty minutes before anyone notices is twenty minutes of callers hearing dead air. In this chapter, you will set up alerts for agent failures, high latency, and error rates, integrate them with PagerDuty and Slack, and build incident response runbooks so your team knows exactly what to do when the pager fires.
What you'll learn
- How to define alert rules for agent health, latency, and error rates
- How to integrate alerts with PagerDuty and Slack
- How to write incident response runbooks that reduce mean time to resolution
- How to avoid alert fatigue with proper severity tiers
Defining alert rules
Voice agents have failure modes that traditional web services do not. A 500ms spike in API latency is annoying for a website but devastating for a real-time conversation. Your alert thresholds must reflect conversational tolerances.
```yaml
groups:
  - name: voice-agent-health
    rules:
      - alert: AgentHighErrorRate
        expr: |
          rate(agent_requests_total{status="error"}[5m])
            / rate(agent_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent error rate above 5%"
          runbook_url: "https://wiki.example.com/runbooks/agent-error-rate"

      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(agent_response_seconds_bucket[5m])
          ) > 1.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Agent p95 latency above 1.5s"
          runbook_url: "https://wiki.example.com/runbooks/agent-latency"

      - alert: STTFailureRate
        expr: |
          rate(stt_transcription_errors_total[5m])
            / rate(stt_transcription_requests_total[5m]) > 0.02
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "STT failure rate above 2%"

      - alert: NoActiveAgents
        expr: agent_workers_active == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No active agent workers detected"
```
Each rule has a `for` duration that keeps transient spikes from firing alerts: the expression must hold continuously for that long before the alert moves from pending to firing. The `severity` label controls routing: critical pages on-call engineers, while warning sends to Slack. The `runbook_url` annotation links directly to the response procedure.
Integrating PagerDuty and Slack
Alert rules fire, but someone needs to receive them. Configure Alertmanager to route critical alerts to PagerDuty and warnings to a Slack channel.
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/pagerduty-key
        description: '{{ .CommonAnnotations.summary }}'
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          firing: '{{ .Alerts.Firing | len }}'
  - name: slack-warnings
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#voice-agent-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
          {{ end }}
```
Avoid alert fatigue
Route only genuinely urgent alerts to PagerDuty. If your on-call engineer gets paged for a warning-level latency bump at 3 AM, they will start ignoring pages. Reserve critical severity for situations where callers are actively affected.
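Severity-based routing only prevents fatigue if the route tree matches the way you expect. Here is a minimal Python sketch of Alertmanager's first-match route walk (simplified: it ignores `continue`, regex matchers, and config inheritance):

```python
def pick_receiver(labels: dict, route: dict) -> str:
    """Walk an Alertmanager-style route tree: the first child route whose
    match labels are a subset of the alert's labels wins (recursing into
    it); otherwise the current route's receiver handles the alert."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(labels, child)
    return route["receiver"]


# The same tree as the Alertmanager config above
route_tree = {
    "receiver": "slack-warnings",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

pick_receiver({"severity": "critical"}, route_tree)  # "pagerduty-critical"
pick_receiver({"severity": "info"}, route_tree)      # "slack-warnings" (root default)
```

An alert whose severity matches no child falls through to the root receiver, which is why the root should always point at a low-noise channel.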
Emitting custom metrics from your agent
Your Prometheus rules need metrics to query. Here is how to emit custom counters and histograms from a LiveKit agent, first in Python and then in Node.js.
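One piece of plumbing first: the metrics server these examples start on port 9090 must appear in Prometheus' scrape configuration, or the alert rules will never see any data. A minimal sketch, assuming the agent host and job name shown here are placeholders for your own:

```yaml
scrape_configs:
  - job_name: voice-agent
    scrape_interval: 15s
    static_configs:
      - targets: ["agent-host:9090"]
```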
```python
import time

from prometheus_client import Counter, Histogram, start_http_server
from livekit.agents import Agent

REQUEST_COUNT = Counter(
    "agent_requests_total",
    "Total agent requests",
    ["status"],
)

RESPONSE_LATENCY = Histogram(
    "agent_response_seconds",
    "Time from user speech end to first agent audio",
    buckets=[0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0],
)

# Expose /metrics on port 9090 for Prometheus to scrape
start_http_server(9090)


class MonitoredAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant.",
        )

    async def on_user_turn_completed(self, turn_ctx):
        start = time.monotonic()
        try:
            # Defer to the base class for the default turn handling
            await super().on_user_turn_completed(turn_ctx)
            REQUEST_COUNT.labels(status="success").inc()
        except Exception:
            REQUEST_COUNT.labels(status="error").inc()
            raise
        finally:
            RESPONSE_LATENCY.observe(time.monotonic() - start)
```
The same instrumentation in Node.js uses prom-client. The override hook mirrors the Python example; check your SDK version for the exact hook name, signature, and where the `TurnContext` type is exported from.

```typescript
import { Agent } from "@livekit/agents";
import { Counter, Histogram, Registry, collectDefaultMetrics } from "prom-client";

const register = new Registry();
collectDefaultMetrics({ register });

const requestCount = new Counter({
  name: "agent_requests_total",
  help: "Total agent requests",
  labelNames: ["status"] as const,
  registers: [register],
});

const responseLatency = new Histogram({
  name: "agent_response_seconds",
  help: "Time from user speech end to first agent audio",
  buckets: [0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0],
  registers: [register],
});

class MonitoredAgent extends Agent {
  constructor() {
    super({ instructions: "You are a helpful voice assistant." });
  }

  override async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
    const start = performance.now();
    try {
      await super.onUserTurnCompleted(turnCtx);
      requestCount.labels({ status: "success" }).inc();
    } catch (err) {
      requestCount.labels({ status: "error" }).inc();
      throw err;
    } finally {
      // Observe in seconds to match the Python histogram
      responseLatency.observe((performance.now() - start) / 1000);
    }
  }
}
```
Writing incident response runbooks
An alert without a runbook is a question without an answer. Every alert rule should link to a runbook that tells the responder exactly what to check and what to do.
Start with the symptom
Describe what the alert means in plain language. "Agent error rate above 5% means more than 1 in 20 caller interactions are failing."
List diagnostic commands
Provide exact commands to run. Check agent logs, query Prometheus for the current error rate, verify upstream dependencies (STT, LLM, TTS provider status pages).
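For the "query Prometheus for the current error rate" step, a runbook can even link to a ready-made script. A sketch using only the standard library, assuming Prometheus is reachable at localhost:9090 and the metric names from the rules above (the `job` label value is a placeholder):

```python
import json
import urllib.parse
import urllib.request


def error_rate_query(job: str = "voice-agent", window: str = "5m") -> str:
    """Build the same ratio the AgentHighErrorRate rule evaluates."""
    return (
        f'rate(agent_requests_total{{job="{job}",status="error"}}[{window}])'
        f' / rate(agent_requests_total{{job="{job}"}}[{window}])'
    )


def current_error_rate(prom_url: str = "http://localhost:9090") -> float:
    """Hit Prometheus' instant-query API and return the live error rate."""
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode(
        {"query": error_rate_query()}
    )
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    results = body["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0
```

During an incident, nobody wants to hand-type PromQL; a one-liner the responder can paste is worth the five minutes it takes to write.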
Define escalation paths
If the responder cannot resolve the issue within 10 minutes, who do they escalate to? Include names, roles, and contact methods.
Document common fixes
List the top three causes you have seen before and their resolutions. "If the LLM provider returns 429, check rate limits and switch to the fallback model."
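A common fix like the 429 fallback can be captured in code as well as in prose. A hedged sketch, where `primary`, `fallback`, and `RateLimitedError` are hypothetical stand-ins for your real LLM client, not an actual SDK:

```python
import asyncio


class RateLimitedError(Exception):
    """Stand-in for an LLM client's HTTP 429 error."""


async def complete_with_fallback(prompt, primary, fallback, retries: int = 2):
    """Try the primary model with brief backoff; on persistent 429s,
    switch to the fallback model. `primary` and `fallback` are any
    async callables, standing in for real client calls."""
    for attempt in range(retries):
        try:
            return await primary(prompt)
        except RateLimitedError:
            # short exponential backoff before retrying the primary
            await asyncio.sleep(0.1 * 2 ** attempt)
    return await fallback(prompt)
```

Encoding the fix this way also means the runbook entry becomes "confirm the fallback kicked in" rather than "perform the failover by hand at 3 AM".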
Runbooks are living documents
After every incident, update the runbook with what you learned. A stale runbook is almost worse than no runbook because it gives false confidence. Add a "Last reviewed" date at the top of every runbook and review them monthly.
Good alerting is a three-legged stool: the right thresholds so you catch real problems, the right routing so the right person sees them, and the right runbooks so that person can act. Missing any leg means incidents take longer than they should.
Test your knowledge
Why do alert rules include a 'for' duration (e.g., 'for: 2m') rather than firing immediately when a threshold is crossed?