Chapter 6 · 20m

Alerting & incident response

Alerting for voice AI systems

Your dashboards show what happened. Alerts tell you what is happening right now and demand action. A voice agent that silently fails for twenty minutes before anyone notices is twenty minutes of callers hearing dead air. In this chapter, you will set up alerts for agent failures, high latency, and error rates, integrate them with PagerDuty and Slack, and build incident response runbooks so your team knows exactly what to do when the pager fires.

Alerts · PagerDuty · Runbooks

What you'll learn

  • How to define alert rules for agent health, latency, and error rates
  • How to integrate alerts with PagerDuty and Slack
  • How to write incident response runbooks that reduce mean time to resolution
  • How to avoid alert fatigue with proper severity tiers

Defining alert rules

Voice agents have failure modes that traditional web services do not. A 500ms spike in API latency is annoying for a website but devastating for a real-time conversation. Your alert thresholds must reflect conversational tolerances.

alerts/voice-agent-rules.yml
groups:
- name: voice-agent-health
  rules:
    - alert: AgentHighErrorRate
      expr: |
        rate(agent_requests_total{status="error"}[5m])
        / rate(agent_requests_total[5m]) > 0.05
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Agent error rate above 5%"
        runbook_url: "https://wiki.example.com/runbooks/agent-error-rate"

    - alert: AgentHighLatency
      expr: |
        histogram_quantile(0.95,
          rate(agent_response_seconds_bucket[5m])
        ) > 1.5
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "Agent p95 latency above 1.5s"
        runbook_url: "https://wiki.example.com/runbooks/agent-latency"

    - alert: STTFailureRate
      expr: |
        rate(stt_transcription_errors_total[5m])
        / rate(stt_transcription_requests_total[5m]) > 0.02
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "STT failure rate above 2%"

    - alert: NoActiveAgents
      expr: agent_workers_active == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "No active agent workers detected"

What's happening

Each rule has a for duration that prevents transient spikes from firing alerts: the expression must stay above the threshold for the entire window before the alert fires. The severity label controls routing: critical pages the on-call engineer, while warning goes to Slack. The runbook_url annotation links the alert directly to its response procedure.
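To see why the for duration matters, here is a minimal Python sketch of the same debouncing logic, assuming one sample per 60-second evaluation interval (should_fire and its inputs are illustrative, not part of Prometheus):

```python
def should_fire(samples, threshold=0.05, for_minutes=2):
    """Fire only when every sample in the trailing `for_minutes`
    window breaches the threshold, mimicking the `for:` clause."""
    if len(samples) < for_minutes:
        return False
    return all(s > threshold for s in samples[-for_minutes:])

# A one-minute transient spike does not page anyone...
assert should_fire([0.01, 0.09, 0.01]) is False
# ...but a sustained breach does.
assert should_fire([0.01, 0.09, 0.08]) is True
```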

Integrating PagerDuty and Slack

Alert rules fire, but someone needs to receive them. Configure Alertmanager to route critical alerts to PagerDuty and warnings to a Slack channel.

alertmanager/config.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: slack-warnings

receivers:
- name: pagerduty-critical
  pagerduty_configs:
    - service_key_file: /etc/alertmanager/pagerduty-key
      description: '{{ .CommonAnnotations.summary }}'
      details:
        runbook: '{{ .CommonAnnotations.runbook_url }}'
        firing: '{{ .Alerts.Firing | len }}'

- name: slack-warnings
  slack_configs:
    - api_url_file: /etc/alertmanager/slack-webhook
      channel: '#voice-agent-alerts'
      title: '{{ .CommonAnnotations.summary }}'
      text: >-
        {{ range .Alerts }}
        *{{ .Labels.alertname }}* - {{ .Annotations.summary }}
        {{ end }}
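Alertmanager's routing tree is first-match-wins, falling back to the top-level receiver. A small Python sketch of that behavior (route_alert is a hypothetical helper mirroring the config above, not an Alertmanager API):

```python
def route_alert(alert, default_receiver="slack-warnings"):
    """Walk the routes in order and return the first receiver whose
    match labels are all present on the alert."""
    routes = [
        ({"severity": "critical"}, "pagerduty-critical"),
        ({"severity": "warning"}, "slack-warnings"),
    ]
    for match, receiver in routes:
        if all(alert.get(k) == v for k, v in match.items()):
            return receiver
    return default_receiver

assert route_alert({"alertname": "NoActiveAgents", "severity": "critical"}) == "pagerduty-critical"
assert route_alert({"alertname": "AgentHighLatency", "severity": "warning"}) == "slack-warnings"
```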

Avoid alert fatigue

Route only genuinely urgent alerts to PagerDuty. If your on-call engineer gets paged for a warning-level latency bump at 3 AM, they will start ignoring pages. Reserve critical severity for situations where callers are actively affected.

Emitting custom metrics from your agent

Your Prometheus rules need metrics to query. Here is how to emit custom counters and histograms from a LiveKit agent.

agent.py
import time

from livekit.agents import Agent
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter(
    "agent_requests_total",
    "Total agent requests",
    ["status"],
)
RESPONSE_LATENCY = Histogram(
    "agent_response_seconds",
    "Time from user speech end to first agent audio",
    buckets=[0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0],
)

# Expose /metrics on port 9090 for Prometheus to scrape
start_http_server(9090)


class MonitoredAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant.",
        )

    async def on_user_turn_completed(self, turn_ctx, new_message):
        start = time.monotonic()
        try:
            await super().on_user_turn_completed(turn_ctx, new_message)
            REQUEST_COUNT.labels(status="success").inc()
        except Exception:
            REQUEST_COUNT.labels(status="error").inc()
            raise
        finally:
            RESPONSE_LATENCY.observe(time.monotonic() - start)
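Once the agent above is running, you can sanity-check the exporter by fetching the metrics endpoint (e.g. GET http://localhost:9090/metrics) and reading the counters out of the text format. A stdlib-only sketch, where parse_counter is a hypothetical helper and the sample payload shows the shape the agent would emit:

```python
def parse_counter(metrics_text, name, labels=None):
    """Pull one counter value out of a Prometheus text-format scrape.
    `labels` is a dict like {"status": "error"}."""
    labels = labels or {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(name):
            continue  # skip HELP/TYPE comments and other metrics
        metric, _, value = line.rpartition(" ")
        if all(f'{k}="{v}"' in metric for k, v in labels.items()):
            return float(value)
    return None

sample = """\
# HELP agent_requests_total Total agent requests
# TYPE agent_requests_total counter
agent_requests_total{status="success"} 412.0
agent_requests_total{status="error"} 9.0
"""
assert parse_counter(sample, "agent_requests_total", {"status": "error"}) == 9.0
```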
agent.ts
import { Agent } from "@livekit/agents";
import { Counter, Histogram, Registry, collectDefaultMetrics } from "prom-client";

// Derive the hook's context type from the base class rather than
// importing it, since the exported name varies across SDK versions.
type TurnContext = Parameters<Agent["onUserTurnCompleted"]>[0];

const register = new Registry();
collectDefaultMetrics({ register });

const requestCount = new Counter({
  name: "agent_requests_total",
  help: "Total agent requests",
  labelNames: ["status"] as const,
  registers: [register],
});

const responseLatency = new Histogram({
  name: "agent_response_seconds",
  help: "Time from user speech end to first agent audio",
  buckets: [0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0],
  registers: [register],
});

class MonitoredAgent extends Agent {
  constructor() {
    super({ instructions: "You are a helpful voice assistant." });
  }

  override async onUserTurnCompleted(turnCtx: TurnContext): Promise<void> {
    const start = performance.now();
    try {
      await super.onUserTurnCompleted(turnCtx);
      requestCount.labels({ status: "success" }).inc();
    } catch (err) {
      requestCount.labels({ status: "error" }).inc();
      throw err;
    } finally {
      // performance.now() is in ms; the histogram buckets are in seconds
      responseLatency.observe((performance.now() - start) / 1000);
    }
  }
}

Writing incident response runbooks

An alert without a runbook is a question without an answer. Every alert rule should link to a runbook that tells the responder exactly what to check and what to do.

1. Start with the symptom

Describe what the alert means in plain language. "Agent error rate above 5% means more than 1 in 20 caller interactions are failing."

2. List diagnostic commands

Provide exact commands to run. Check agent logs, query Prometheus for the current error rate, and verify upstream dependencies (STT, LLM, and TTS provider status pages).

3. Define escalation paths

If the responder cannot resolve the issue within 10 minutes, who do they escalate to? Include names, roles, and contact methods.

4. Document common fixes

List the top three causes you have seen before and their resolutions. "If the LLM provider returns 429, check rate limits and switch to the fallback model."
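For the diagnostic step, the runbook can embed a ready-made query against the Prometheus HTTP API so the responder does not have to reconstruct the alert expression under pressure. A sketch assuming Prometheus is reachable at a placeholder address (PROM_URL, error_rate_query, and fetch_error_rate are illustrative names):

```python
import json
from urllib import parse, request

PROM_URL = "http://prometheus.internal:9090"  # hypothetical address

def error_rate_query():
    """The same expression the AgentHighErrorRate rule evaluates."""
    return (
        'rate(agent_requests_total{status="error"}[5m])'
        " / rate(agent_requests_total[5m])"
    )

def current_value(api_response):
    """Extract the scalar from a /api/v1/query instant-vector response."""
    results = api_response["data"]["result"]
    return float(results[0]["value"][1]) if results else None

def fetch_error_rate():
    url = f"{PROM_URL}/api/v1/query?" + parse.urlencode({"query": error_rate_query()})
    with request.urlopen(url, timeout=5) as resp:
        return current_value(json.load(resp))

# Shape of a real instant-query response, stubbed here for illustration:
stub = {"status": "success",
        "data": {"resultType": "vector",
                 "result": [{"metric": {}, "value": [1700000000, "0.0714"]}]}}
assert current_value(stub) == 0.0714
```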

Runbooks are living documents

After every incident, update the runbook with what you learned. A stale runbook is almost worse than no runbook because it gives false confidence. Add a "Last reviewed" date at the top of every runbook and review them monthly.

What's happening

Good alerting is a three-legged stool: the right thresholds so you catch real problems, the right routing so the right person sees them, and the right runbooks so that person can act. Missing any leg means incidents take longer than they should.

Test your knowledge


Why do alert rules include a 'for' duration (e.g., 'for: 2m') rather than firing immediately when a threshold is crossed?

Concepts covered
Alerts · PagerDuty · Runbooks