Auto-scaling voice agent workers
Voice traffic is spiky. A dental clinic gets a flood of calls Monday morning, a pizza chain peaks Friday evening, and a crisis hotline surges unpredictably. If you provision for peak, you waste money at idle. If you provision for average, callers get dropped during spikes. Auto-scaling solves this by adding and removing agent workers based on real-time demand.
What you'll learn
- How to configure auto-scaling for agent workers on LiveKit Cloud
- How to set up Kubernetes HPA for self-hosted deployments
- How to define scaling policies based on concurrent sessions
- How to plan capacity so scaling has headroom to work
Scaling on LiveKit Cloud
LiveKit Cloud handles infrastructure scaling automatically. Your job is to configure the agent-level scaling behavior: how many sessions each worker handles, when to spawn new workers, and the upper bound.
```yaml
agent:
  name: dental-receptionist
  scaling:
    min_workers: 2
    max_workers: 50
    sessions_per_worker: 5
    scale_up_threshold: 0.8    # Add workers when 80% of capacity is in use
    scale_down_threshold: 0.3  # Remove workers when below 30% utilization
    scale_down_delay: 300s     # Wait 5 min before scaling down
    warmup_time: 30s           # Time for a new worker to become ready
```
The scale_down_delay is critical for voice workloads. Scaling down too aggressively means you pay the cold-start penalty again when the next burst arrives. A five-minute delay absorbs typical traffic fluctuations without keeping idle workers running for hours.
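To build intuition for these thresholds, here is a small sketch of how the example config translates into concrete trigger points. The constants mirror the values above; the `workers_needed` helper is illustrative, not part of any SDK.

```python
import math

# Values from the example scaling config above.
MIN_WORKERS = 2
MAX_WORKERS = 50
SESSIONS_PER_WORKER = 5
SCALE_UP_THRESHOLD = 0.8

def workers_needed(active_sessions: int) -> int:
    """Smallest worker count keeping utilization below the scale-up threshold."""
    w = MIN_WORKERS
    while w < MAX_WORKERS and active_sessions / (w * SESSIONS_PER_WORKER) >= SCALE_UP_THRESHOLD:
        w += 1
    return w

# With 2 workers (10 session slots), the 8th concurrent session hits 80%
# utilization, so a 3rd worker is requested well before the pool is full.
print(workers_needed(7))  # 2
print(workers_needed(8))  # 3
```

Note that the scale-up request fires with two free slots remaining, which is what buys time for the 30-second warmup before callers would otherwise queue.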
Self-hosted scaling with Kubernetes HPA
For self-hosted deployments, use a Kubernetes Horizontal Pod Autoscaler with a custom metric: active sessions per worker.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-hpa
  namespace: livekit
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
  metrics:
    - type: Pods
      pods:
        metric:
          name: agent_active_sessions
        target:
          type: AverageValue
          averageValue: "4"
```
Why custom metrics over CPU?
CPU utilization is a poor scaling signal for voice agents. An agent waiting for an LLM response uses almost no CPU but is fully occupied. Scaling on active sessions reflects actual capacity consumption.
Exposing custom metrics for the HPA
The HPA needs a metric to query. Expose the active session count from your agent worker using a Prometheus gauge, then use a Prometheus adapter to make it available to Kubernetes.
```python
from prometheus_client import Gauge, start_http_server

from livekit.agents import Agent, AgentServer, AgentSession

ACTIVE_SESSIONS = Gauge(
    "agent_active_sessions",
    "Number of active voice sessions on this worker",
)

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    ACTIVE_SESSIONS.inc()
    try:
        await session.start(
            agent=Agent(instructions="You are a helpful assistant."),
            room=session.room,
        )
        # Wait for the session to end
        await session.wait()
    finally:
        ACTIVE_SESSIONS.dec()

if __name__ == "__main__":
    # Expose metrics for Prometheus scraping
    start_http_server(9090)
    server.run()
```

```typescript
import { Agent, AgentServer, AgentSession } from "@livekit/agents";
import { Gauge, Registry, collectDefaultMetrics } from "prom-client";

const register = new Registry();
collectDefaultMetrics({ register });

const activeSessions = new Gauge({
  name: "agent_active_sessions",
  help: "Number of active voice sessions on this worker",
  registers: [register],
});

const server = new AgentServer();

server.rtcSession(async (session: AgentSession) => {
  activeSessions.inc();
  try {
    await session.start({
      agent: new Agent({ instructions: "You are a helpful assistant." }),
      room: session.room,
    });
    await session.wait();
  } finally {
    activeSessions.dec();
  }
});

server.run();
```
Capacity planning
Auto-scaling reacts to demand, but it cannot create resources that do not exist. You need to plan capacity so the cluster has headroom for scaling to work.
Profile a single worker
Measure memory, CPU, and network usage per concurrent session. A typical voice agent uses 50-100 MB of memory per session and negligible CPU while waiting for LLM responses.
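The profiling numbers feed directly into a back-of-the-envelope node capacity estimate. The sketch below assumes a hypothetical 8 GB node and illustrative overhead figures (the 1 GB system reserve and 300 MB per-worker baseline are assumptions to plug in your own measurements for, not measured values):

```python
# Rough capacity check: sessions that fit on one node, memory-bound.
NODE_MEMORY_MB = 8192        # hypothetical 8 GB node
SYSTEM_OVERHEAD_MB = 1024    # OS, kubelet, runtime baseline (assumed)
WORKER_BASE_MB = 300         # per-worker process footprint (assumed)
SESSION_MB = 100             # worst case from the 50-100 MB range above
SESSIONS_PER_WORKER = 5

usable = NODE_MEMORY_MB - SYSTEM_OVERHEAD_MB                     # 7168 MB
per_worker = WORKER_BASE_MB + SESSIONS_PER_WORKER * SESSION_MB   # 800 MB
workers_per_node = usable // per_worker
sessions_per_node = workers_per_node * SESSIONS_PER_WORKER

print(workers_per_node, sessions_per_node)  # 8 40
```

Run the same arithmetic with your measured per-session numbers before sizing the node pool.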
Set sessions per worker conservatively
Start with 3-5 sessions per worker and load test. Increase only after confirming latency remains stable. An overloaded worker hurts every session on it, not just the newest one.
Reserve buffer capacity
Set your node pool autoscaler to maintain at least 20% idle capacity. New pods cannot start if there are no nodes to schedule them on. Node provisioning takes 2-4 minutes, which is an eternity for a caller on hold.
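The buffer requirement is easy to turn into a node count. A minimal sketch, assuming you already know peak concurrent sessions and sessions per node (the function name is illustrative):

```python
import math

def nodes_for_peak(peak_sessions: int, sessions_per_node: int, buffer: float = 0.20) -> int:
    """Nodes needed to serve the peak while keeping `buffer` fractional headroom."""
    required = peak_sessions / sessions_per_node
    return math.ceil(required * (1 + buffer))

# 200 concurrent sessions at peak, 40 sessions per node, 20% headroom:
print(nodes_for_peak(200, 40))  # 6
```

Setting the node pool minimum to this value means a burst can be absorbed by scheduling pods onto warm nodes instead of waiting 2-4 minutes for provisioning.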
Test your scaling
Use load testing tools to simulate traffic ramps. Verify that new workers come online before existing workers hit capacity, and that scale-down does not interrupt active sessions.
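A traffic ramp can be simulated with plain asyncio. In this sketch, `simulated_call` is a placeholder for whatever actually drives your SIP or WebRTC load tool; nothing here is a LiveKit API.

```python
import asyncio

async def simulated_call(call_id: int, duration_s: float) -> None:
    # Placeholder: a real test would dial the agent and hold the call open.
    await asyncio.sleep(duration_s)

async def ramp(total_calls: int, ramp_s: float, call_s: float) -> int:
    """Start total_calls spread over ramp_s seconds, each held for call_s seconds."""
    interval = ramp_s / total_calls
    tasks = []
    for i in range(total_calls):
        tasks.append(asyncio.create_task(simulated_call(i, call_s)))
        await asyncio.sleep(interval)  # spread call starts across the ramp
    await asyncio.gather(*tasks)
    return len(tasks)

if __name__ == "__main__":
    # 50 calls over 10 s, each held for 30 s: concurrency climbs toward 50,
    # which should push the autoscaler past its scale-up threshold.
    asyncio.run(ramp(50, 10.0, 30.0))
```

While the ramp runs, watch the `agent_active_sessions` metric and worker count side by side: workers should be added before any worker reaches its session cap.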
Graceful shutdown matters
When scaling down, the agent worker must finish active sessions before terminating. Configure a terminationGracePeriodSeconds of at least 600 seconds (10 minutes) in your Kubernetes deployment to avoid cutting off callers mid-conversation.
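As a sketch, the relevant part of the Deployment spec might look like the fragment below. `terminationGracePeriodSeconds` is the standard Kubernetes field; the image name is a placeholder, and your worker is assumed to handle SIGTERM by draining rather than exiting immediately.

```yaml
spec:
  template:
    spec:
      # Give active calls up to 10 minutes to finish after the pod is told to stop.
      terminationGracePeriodSeconds: 600
      containers:
        - name: voice-agent
          # On SIGTERM the worker should stop accepting new sessions,
          # finish active ones, then exit.
          image: your-registry/voice-agent:latest
```

The grace period is an upper bound, not a delay: a pod whose sessions end early exits as soon as the process does.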
Auto-scaling turns a fixed-cost infrastructure into a variable-cost one that tracks demand. The key is choosing the right metric (active sessions, not CPU), setting conservative thresholds (scale up eagerly, scale down cautiously), and ensuring the underlying cluster has room to grow.