Chapter 325m

Monitoring and Security Hardening

You cannot operate what you cannot observe, and you cannot trust what you have not locked down. This chapter covers both halves: setting up Prometheus and Grafana to monitor your LiveKit deployment, then hardening it with network policies, TLS everywhere, and API key rotation.

What you'll learn

  • How to scrape LiveKit's built-in Prometheus metrics
  • The key metrics that indicate deployment health: rooms, participants, packet loss, bandwidth
  • How to build Grafana dashboards and alerting rules
  • Security hardening: network policies, TLS on every path, API key management, CORS

LiveKit's Prometheus endpoint

LiveKit exposes Prometheus-compatible metrics at /metrics on its HTTP port (default 7880). No additional configuration is needed.

terminal (bash)
# Verify metrics are exposed
curl -s http://localhost:7880/metrics | head -20

# Key metrics you will see:
# livekit_room_count           -- current active rooms
# livekit_participant_count    -- current connected participants
# livekit_packet_loss_ratio    -- media quality indicator
# livekit_track_count          -- audio/video tracks being forwarded
# process_cpu_seconds_total    -- server CPU usage
# process_resident_memory_bytes -- server memory usage
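The exposition format above is plain text and easy to script against: each sample is a `<name>{<labels>} <value>` line, and comment lines start with `#`. As a sketch (the sample metrics follow the curl output above; real scrapes may also carry timestamps, which this ignores):

```python
def read_metric(exposition: str, name: str):
    """Return the first sample value for `name`, or None if absent."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # Match both bare names and names carrying a {label} set.
        if metric == name or metric.startswith(name + "{"):
            return float(value)
    return None

sample = """\
# HELP livekit_room_count current active rooms
livekit_room_count 12
livekit_participant_count 87
livekit_packet_loss_ratio 0.013
"""
print(read_metric(sample, "livekit_packet_loss_ratio"))  # prints 0.013
```

This is handy for quick smoke tests in CI before Prometheus is wired up; in production, let Prometheus do the scraping.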

Configuring Prometheus

On Kubernetes, use a ServiceMonitor for automatic discovery as pods scale up and down. For non-Kubernetes deployments, use static scrape targets.

service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: livekit-monitor
  namespace: livekit
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: livekit-server
  endpoints:
    - port: http
      path: /metrics
      interval: 10s

prometheus.yml (static targets)
scrape_configs:
- job_name: 'livekit'
  static_configs:
    - targets:
        - 'livekit-node-1:7880'
        - 'livekit-node-2:7880'
  metrics_path: /metrics
  scrape_interval: 10s

Key metrics to watch

Not all metrics matter equally. Focus on these in order of operational importance.

1. Packet loss ratio

livekit_packet_loss_ratio is the single most important quality metric. Above 5%, audio quality degrades noticeably. Above 10%, conversations become difficult. This metric rises before CPU maxes out and before participants complain -- it is your earliest warning.

2. Room and participant counts

livekit_room_count and livekit_participant_count show current load. Track trends over time for capacity planning and anomaly detection.

3. CPU and memory per node

rate(process_cpu_seconds_total[5m]) and process_resident_memory_bytes track resource usage. LiveKit is CPU-bound for packet forwarding -- when CPU approaches saturation, packet loss increases.

4. Track count and bandwidth

livekit_track_count by kind (audio/video) and bandwidth metrics show how much media the server is handling. A sudden spike in tracks without a corresponding increase in rooms can indicate a misbehaving client.

5. Agent worker health

If running agents, monitor agent process CPU, memory, and job completion rates. Crashed agents stop processing rooms -- participants wait in silence.

Packet loss is your canary

If you can only watch one metric, watch packet loss. Set alerts at 5% (warning) and 10% (critical) -- the thresholds at which quality visibly degrades.
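If you want this check outside of Alertmanager -- say, in a health endpoint or a cron job -- the Prometheus HTTP API makes it a few lines. A sketch, assuming Prometheus is reachable at localhost:9090 and the metric name used throughout this chapter:

```python
import json
import urllib.parse
import urllib.request

WARNING, CRITICAL = 0.05, 0.10  # thresholds from this chapter

def classify_loss(ratio):
    """Map a packet loss ratio onto the chapter's alert severities."""
    if ratio > CRITICAL:
        return "critical"
    if ratio > WARNING:
        return "warning"
    return "ok"

def query_packet_loss(prom_url="http://localhost:9090"):
    """Instant-query the cluster-wide average packet loss ratio."""
    params = urllib.parse.urlencode({"query": "avg(livekit_packet_loss_ratio)"})
    with urllib.request.urlopen(f"{prom_url}/api/v1/query?{params}",
                                timeout=5) as resp:
        body = json.load(resp)
    # Result shape: data.result[0].value == [<timestamp>, "<value as string>"]
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```

Call `classify_loss(query_packet_loss())` and page or log accordingly.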

Grafana dashboards

Deploy Grafana and connect it to Prometheus. Build dashboards for the metrics above.

terminal (bash)
# Deploy Grafana on Kubernetes
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword=your-admin-password \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus:9090 \
  --set datasources."datasources\.yaml".datasources[0].isDefault=true

Here is a dashboard JSON to import as a starting point. It covers the essential panels: rooms, participants, packet loss, CPU by node, memory by node, and tracks.

livekit-dashboard.json
{
"dashboard": {
  "title": "LiveKit Overview",
  "panels": [
    {
      "title": "Active Rooms",
      "type": "stat",
      "targets": [{ "expr": "sum(livekit_room_count)" }]
    },
    {
      "title": "Active Participants",
      "type": "stat",
      "targets": [{ "expr": "sum(livekit_participant_count)" }]
    },
    {
      "title": "Packet Loss Ratio",
      "type": "timeseries",
      "targets": [{ "expr": "avg(livekit_packet_loss_ratio)" }]
    },
    {
      "title": "CPU Usage by Node",
      "type": "timeseries",
      "targets": [{ "expr": "rate(process_cpu_seconds_total{job='livekit'}[5m])" }]
    },
    {
      "title": "Memory Usage by Node",
      "type": "timeseries",
      "targets": [{ "expr": "process_resident_memory_bytes{job='livekit'}" }]
    },
    {
      "title": "Active Tracks by Kind",
      "type": "timeseries",
      "targets": [{ "expr": "sum(livekit_track_count) by (kind)" }]
    },
    {
      "title": "Agent Worker CPU",
      "type": "timeseries",
      "targets": [{ "expr": "rate(process_cpu_seconds_total{job='livekit-agent'}[5m])" }]
    }
  ]
}
}

Alerting rules

Configure Prometheus alerting rules to notify you before problems affect users.

livekit-alerts.yaml
groups:
- name: livekit
  rules:
    - alert: HighPacketLoss
      expr: avg(livekit_packet_loss_ratio) > 0.05
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "LiveKit packet loss above 5%"
        description: "Average packet loss is {{ $value | humanizePercentage }} for 2 minutes."

    - alert: CriticalPacketLoss
      expr: avg(livekit_packet_loss_ratio) > 0.10
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "LiveKit packet loss above 10% -- conversations are degraded"

    - alert: HighCPUUsage
      expr: rate(process_cpu_seconds_total{job="livekit"}[5m]) > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "LiveKit node CPU above 85%"

    - alert: NodeDown
      expr: up{job="livekit"} == 0
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "LiveKit node unreachable"

    - alert: AgentWorkerDown
      expr: up{job="livekit-agent"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Agent worker unreachable -- rooms may not be processed"

    - alert: LongRunningRooms
      expr: histogram_quantile(0.99, rate(livekit_room_duration_seconds_bucket[5m])) > 3600
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Rooms open longer than an hour -- possible stuck sessions"

Connect alerts to your on-call system

Prometheus alerts fire silently on their own. Connect Alertmanager to Slack, PagerDuty, OpsGenie, or email so the right person is notified when something breaks.
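A minimal Alertmanager configuration wiring the severity labels above to Slack might look like the sketch below. The webhook URLs and channel names are placeholders -- substitute your own:

```yaml
# alertmanager.yml -- sketch: route by the severity label set in the rules above.
route:
  receiver: slack-warnings          # default for everything, including warnings
  routes:
    - match:
        severity: critical
      receiver: slack-critical      # criticals go to the on-call channel
receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
        channel: "#livekit-alerts"
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
        channel: "#livekit-oncall"
```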

Network policies

By default, Kubernetes allows all pod-to-pod communication. Lock down LiveKit so only authorized services can reach it.

network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: livekit-server-policy
  namespace: livekit
spec:
  podSelector:
    matchLabels:
      app: livekit-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Signaling from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 7880
          protocol: TCP
    # Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 7880
          protocol: TCP
    # Media traffic from external clients
    - ports:
        - port: 50000
          endPort: 60000
          protocol: UDP
        - port: 7881
          protocol: TCP
        - port: 5349
          protocol: TCP
  egress:
    # Redis
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: redis
      ports:
        - port: 6379
          protocol: TCP
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # STUN for external IP detection
    - ports:
        - port: 3478
          protocol: UDP

Network policies require a supporting CNI

Network policies are only enforced if your cluster uses a CNI that supports them -- Calico, Cilium, or Weave. The default kubenet CNI does not enforce them. Verify your CNI before relying on these rules.
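One quick way to check enforcement: apply a deny-all policy in a throwaway namespace and confirm a test pod there loses connectivity. If traffic still flows, your CNI is not enforcing NetworkPolicy. A sketch (the namespace name is an assumption):

```yaml
# deny-all.yaml -- selects every pod in the namespace and allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: netpol-test   # assumption: a scratch namespace for the check
spec:
  podSelector: {}          # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
```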

TLS on every path

Every connection path should be encrypted. Here is the checklist.

1. Client to LiveKit signaling

TLS via your Ingress controller. Clients connect to wss:// (WebSocket Secure). Never expose ws:// in production.

2. Client to LiveKit media

WebRTC media is encrypted by default with DTLS-SRTP. No configuration needed -- this is a protocol requirement.

3. LiveKit to Redis

Enable TLS if Redis is not on the same private network: redis.use_tls: true in config.yaml.
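In config.yaml, the toggle sits in the redis block alongside the address. A sketch (the hostname is a placeholder):

```yaml
# config.yaml -- Redis connection with TLS enabled
redis:
  address: redis.example.com:6379   # placeholder host
  use_tls: true
```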

4. TURN over TLS

Port 5349 serves TURN over TLS. Clients behind restrictive firewalls rely on this.

What's happening

WebRTC's DTLS-SRTP encryption means the media path is always encrypted between the client and LiveKit server, even over UDP. Captured packets cannot be decoded without session keys. This is a significant security advantage -- you do not need to add media encryption yourself.

API key rotation

LiveKit uses API key/secret pairs for authentication. Every token issued to clients or agents is signed with a secret. Compromising a secret lets an attacker generate valid tokens and join any room.
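Under the hood these tokens are JWTs signed with HMAC-SHA256; the iss claim carries the API key so the server knows which secret to verify against. The sketch below builds one from scratch to show the mechanics -- the exact claim layout is an assumption here, and in practice you should use the LiveKit server SDKs:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_token(api_key: str, api_secret: str, identity: str, room: str,
               ttl_seconds: int = 3600) -> str:
    """Build an HS256-signed JWT from an API key/secret pair."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {
        "iss": api_key,            # tells the server which secret verifies this
        "sub": identity,           # participant identity
        "nbf": now,
        "exp": now + ttl_seconds,  # short TTLs make key rotation easier
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = (b64url(json.dumps(header).encode()) + "." +
                     b64url(json.dumps(claims).encode()))
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)
```

Anyone holding `api_secret` can mint a token for any identity and any room -- which is exactly why the rotation steps below matter.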

1. Generate strong keys

Use cryptographically random values. Never use human-readable strings.

2. Store in a secrets manager

Kubernetes Secrets, Sealed Secrets, or HashiCorp Vault. Never commit secrets to Git.

3. Rotate without downtime

LiveKit supports multiple key pairs simultaneously. Add the new key, deploy, migrate your token server, then remove the old key.

config.yaml (during rotation)
# Both keys active simultaneously during rotation
keys:
  old-api-key: old-api-secret    # Still valid, being phased out
  new-api-key: new-api-secret    # New key, update token server to use this

# After all tokens signed with old key have expired, remove it:
# keys:
#   new-api-key: new-api-secret

terminal (bash)
# Generate a strong API key and secret
API_KEY=$(python3 -c "import secrets; print(secrets.token_urlsafe(15))")
API_SECRET=$(python3 -c "import secrets; print(secrets.token_urlsafe(30))")

# Create a Kubernetes secret
kubectl -n livekit create secret generic livekit-keys \
  --from-literal=api-key="$API_KEY" \
  --from-literal=api-secret="$API_SECRET"

echo "API Key: $API_KEY"
echo "Store the secret securely -- it will not be displayed again."

Token expiration helps rotation

LiveKit tokens include an expiration time. When you rotate keys, existing tokens signed with the old key continue to work until they expire. Set reasonable TTLs (1-2 hours) so old-key tokens age out naturally.

CORS configuration

If your frontend connects to LiveKit from a browser, CORS headers must allow the connection. Configure at the reverse proxy level.

nginx.conf
server {
  listen 443 ssl;
  server_name livekit.example.com;

  add_header Access-Control-Allow-Origin "https://app.example.com" always;
  add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
  add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;

  if ($request_method = OPTIONS) {
      return 204;
  }

  location / {
      proxy_pass http://127.0.0.1:7880;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
      proxy_set_header Host $host;
      proxy_read_timeout 86400s;
  }
}

Never use wildcard CORS in production

Setting Access-Control-Allow-Origin: * lets any website connect to your LiveKit server. Always specify exact allowed origins.
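A cheap guard worth adding to a deployment smoke test: fetch your proxy's response headers (with curl or any HTTP client) and refuse to ship if the origin is a wildcard. The helper below is a sketch; `headers` stands in for whatever header mapping your client returns:

```python
def cors_origin_is_safe(headers: dict, allowed_origin: str) -> bool:
    """True only for an exact origin match -- '*' or a missing header fails."""
    origin = headers.get("Access-Control-Allow-Origin", "")
    return origin == allowed_origin and origin != "*"

# A wildcard is rejected even if it technically "allows" your app:
print(cors_origin_is_safe({"Access-Control-Allow-Origin": "*"},
                          "https://app.example.com"))  # prints False
```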

Test your knowledge

Why is packet loss ratio the most critical metric for a self-hosted LiveKit deployment?

What you learned

  • LiveKit exposes Prometheus metrics at /metrics on port 7880 -- use ServiceMonitor for auto-discovery on Kubernetes
  • Packet loss ratio is the most critical metric; set alerts at 5% and 10% thresholds
  • Grafana dashboards should cover rooms, participants, packet loss, CPU, memory, tracks, and agent health
  • Kubernetes network policies restrict access to LiveKit pods -- require a supporting CNI
  • TLS covers signaling (Ingress), Redis, and TURN; media is encrypted by WebRTC's DTLS-SRTP
  • API keys should be cryptographically random, stored in secrets managers, and rotated with multi-key support

Next up

In the next chapter, you will learn how to perform rolling upgrades with zero downtime, back up your deployment, and handle disaster recovery.

Concepts covered
Prometheus metrics · Grafana dashboards · Alerting rules · Network policies · TLS everywhere · API key rotation