Monitoring and Security Hardening
You cannot operate what you cannot observe, and you cannot trust what you have not locked down. This chapter covers both halves: setting up Prometheus and Grafana to monitor your LiveKit deployment, then hardening it with network policies, TLS everywhere, and API key rotation.
What you'll learn
- How to scrape LiveKit's built-in Prometheus metrics
- The key metrics that indicate deployment health: rooms, participants, packet loss, bandwidth
- How to build Grafana dashboards and alerting rules
- Security hardening: network policies, TLS on every path, API key management, CORS
LiveKit's Prometheus endpoint
LiveKit exposes Prometheus-compatible metrics at /metrics on its HTTP port (default 7880). No additional configuration is needed.
# Verify metrics are exposed
curl -s http://localhost:7880/metrics | head -20
# Key metrics you will see:
# livekit_room_count -- current active rooms
# livekit_participant_count -- current connected participants
# livekit_packet_loss_ratio -- media quality indicator
# livekit_track_count -- audio/video tracks being forwarded
# process_cpu_seconds_total -- server CPU usage
# process_resident_memory_bytes -- server memory usage

Configuring Prometheus
On Kubernetes, use a ServiceMonitor for automatic discovery as pods scale up and down. For non-Kubernetes deployments, use static scrape targets.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: livekit-monitor
  namespace: livekit
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: livekit-server
  endpoints:
    - port: http
      path: /metrics
      interval: 10s

For non-Kubernetes deployments, add a static scrape job to prometheus.yml instead:

scrape_configs:
  - job_name: 'livekit'
    static_configs:
      - targets:
          - 'livekit-node-1:7880'
          - 'livekit-node-2:7880'
    metrics_path: /metrics
    scrape_interval: 10s

Key metrics to watch
Not all metrics matter equally. Focus on these in order of operational importance.
Packet loss ratio
livekit_packet_loss_ratio is the single most important quality metric. Above 5%, audio quality degrades noticeably. Above 10%, conversations become difficult. This metric rises before CPU maxes out and before participants complain -- it is your earliest warning.
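The thresholds can be expressed as a tiny helper, useful for mirroring the same cutoffs in client-side health checks or smoke tests. This is an illustrative sketch; the function name is our own, and the 5%/10% cutoffs match the alerting rules later in this chapter:

```python
def packet_loss_severity(ratio: float) -> str:
    """Map a packet loss ratio (0.0-1.0) to an alert severity.

    Thresholds mirror the Prometheus alerting rules in this chapter:
    above 10% conversations degrade (critical), above 5% quality
    noticeably suffers (warning).
    """
    if ratio > 0.10:
        return "critical"
    if ratio > 0.05:
        return "warning"
    return "ok"
```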
Room and participant counts
livekit_room_count and livekit_participant_count show current load. Track trends over time for capacity planning and anomaly detection.
CPU and memory per node
rate(process_cpu_seconds_total[5m]) and process_resident_memory_bytes track resource usage. LiveKit is CPU-bound for packet forwarding -- when CPU approaches saturation, packet loss increases.
Track count and bandwidth
livekit_track_count by kind (audio/video) and bandwidth metrics show how much media the server is handling. A sudden spike in tracks without a corresponding increase in rooms can indicate a misbehaving client.
Agent worker health
If running agents, monitor agent process CPU, memory, and job completion rates. Crashed agents stop processing rooms -- participants wait in silence.
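The dashboard and alerting rules later in this chapter query a livekit-agent Prometheus job, so the agent workers need their own scrape config. A static-target sketch, assuming the workers expose a metrics endpoint (hostnames and the port are placeholders for your deployment):

```yaml
scrape_configs:
  - job_name: 'livekit-agent'
    static_configs:
      - targets:
          - 'agent-worker-1:8080'   # placeholder host:port of the agent metrics endpoint
          - 'agent-worker-2:8080'
```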
Packet loss is your canary
If you can only watch one metric, watch packet loss. Set alerts at 5% (warning) and 10% (critical) thresholds.
Grafana dashboards
Deploy Grafana and connect it to Prometheus. Build dashboards for the metrics above.
# Deploy Grafana on Kubernetes
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword=your-admin-password \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus:9090 \
  --set datasources."datasources\.yaml".datasources[0].isDefault=true

Here is a dashboard JSON to import as a starting point. It covers the essential panels: rooms, participants, packet loss, CPU by node, memory by node, and tracks.
{
  "dashboard": {
    "title": "LiveKit Overview",
    "panels": [
      {
        "title": "Active Rooms",
        "type": "stat",
        "targets": [{ "expr": "sum(livekit_room_count)" }]
      },
      {
        "title": "Active Participants",
        "type": "stat",
        "targets": [{ "expr": "sum(livekit_participant_count)" }]
      },
      {
        "title": "Packet Loss Ratio",
        "type": "timeseries",
        "targets": [{ "expr": "avg(livekit_packet_loss_ratio)" }]
      },
      {
        "title": "CPU Usage by Node",
        "type": "timeseries",
        "targets": [{ "expr": "rate(process_cpu_seconds_total{job='livekit'}[5m])" }]
      },
      {
        "title": "Memory Usage by Node",
        "type": "timeseries",
        "targets": [{ "expr": "process_resident_memory_bytes{job='livekit'}" }]
      },
      {
        "title": "Active Tracks by Kind",
        "type": "timeseries",
        "targets": [{ "expr": "sum(livekit_track_count) by (kind)" }]
      },
      {
        "title": "Agent Worker CPU",
        "type": "timeseries",
        "targets": [{ "expr": "rate(process_cpu_seconds_total{job='livekit-agent'}[5m])" }]
      }
    ]
  }
}

Alerting rules
Configure Prometheus alerting rules to notify you before problems affect users.
groups:
  - name: livekit
    rules:
      - alert: HighPacketLoss
        expr: avg(livekit_packet_loss_ratio) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LiveKit packet loss above 5%"
          description: "Average packet loss is {{ $value | humanizePercentage }} for 2 minutes."
      - alert: CriticalPacketLoss
        expr: avg(livekit_packet_loss_ratio) > 0.10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LiveKit packet loss above 10% -- conversations are degraded"
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total{job="livekit"}[5m]) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LiveKit node CPU above 85%"
      - alert: NodeDown
        expr: up{job="livekit"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "LiveKit node unreachable"
      - alert: AgentWorkerDown
        expr: up{job="livekit-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent worker unreachable -- rooms may not be processed"
      - alert: LongRunningRooms
        expr: histogram_quantile(0.99, rate(livekit_room_duration_seconds_bucket[5m])) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rooms lasting unusually long -- possible stuck sessions"

Connect alerts to your on-call system
Prometheus alerts fire silently on their own. Connect Alertmanager to Slack, PagerDuty, OpsGenie, or email so the right person is notified when something breaks.
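A minimal Alertmanager routing sketch, assuming a Slack webhook for warnings and PagerDuty for critical alerts (the webhook URL, channel name, and integration key are placeholders):

```yaml
route:
  receiver: slack-oncall
  group_by: [alertname]
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook URL
        channel: "#livekit-alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: your-pagerduty-integration-key   # placeholder
```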
Network policies
By default, Kubernetes allows all pod-to-pod communication. Lock down LiveKit so only authorized services can reach it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: livekit-server-policy
  namespace: livekit
spec:
  podSelector:
    matchLabels:
      app: livekit-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Signaling from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 7880
          protocol: TCP
    # Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 7880
          protocol: TCP
    # Media traffic from external clients
    - ports:
        - port: 50000
          endPort: 60000
          protocol: UDP
        - port: 7881
          protocol: TCP
        - port: 5349
          protocol: TCP
  egress:
    # Redis
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: redis
      ports:
        - port: 6379
          protocol: TCP
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # STUN for external IP detection
    - ports:
        - port: 3478
          protocol: UDP

Network policies require a supporting CNI
Network policies are only enforced if your cluster uses a CNI that supports them -- Calico, Cilium, or Weave. The default kubenet CNI does not enforce them. Verify your CNI before relying on these rules.
TLS on every path
Every connection path should be encrypted. Here is the checklist.
Client to LiveKit signaling
TLS via your Ingress controller. Clients connect to wss:// (WebSocket Secure). Never expose ws:// in production.
Client to LiveKit media
WebRTC media is encrypted by default with DTLS-SRTP. No configuration needed -- this is a protocol requirement.
LiveKit to Redis
Enable TLS if Redis is not on the same private network: redis.use_tls: true in config.yaml.
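In config.yaml that looks like the following fragment (the address is a placeholder for your Redis host):

```yaml
redis:
  address: redis.internal:6379   # placeholder host
  use_tls: true
```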
TURN over TLS
Port 5349 serves TURN over TLS. Clients behind restrictive firewalls rely on this.
WebRTC's DTLS-SRTP encryption means the media path is always encrypted between the client and LiveKit server, even over UDP. Captured packets cannot be decoded without session keys. This is a significant security advantage -- you do not need to add media encryption yourself.
API key rotation
LiveKit uses API key/secret pairs for authentication. Every token issued to clients or agents is signed with a secret. Compromising a secret lets an attacker generate valid tokens and join any room.
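To see why the secret is so sensitive, it helps to look at how a token is built. A LiveKit access token is a JWT signed with the API secret using HS256. Below is a stdlib-only sketch of the signing step; in production you would use the official LiveKit server SDKs, and the exact grant fields shown here are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_token(api_key: str, api_secret: str, identity: str,
               room: str, ttl: int = 3600) -> str:
    """Sign a LiveKit-style JWT. Anyone holding api_secret can mint these --
    which is why a leaked secret lets an attacker join any room."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,       # identifies which key pair signed the token
        "sub": identity,      # participant identity
        "nbf": now,
        "exp": now + ttl,     # short TTLs make key rotation easier
        "video": {"roomJoin": True, "room": room},  # illustrative grant shape
    }
    signing_input = (
        b64url(json.dumps(header).encode())
        + "."
        + b64url(json.dumps(payload).encode())
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)
```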
Generate strong keys
Use cryptographically random values. Never use human-readable strings.
Store in a secrets manager
Kubernetes Secrets, Sealed Secrets, or HashiCorp Vault. Never commit secrets to Git.
Rotate without downtime
LiveKit supports multiple key pairs simultaneously. Add the new key, deploy, migrate your token server, then remove the old key.
# Both keys active simultaneously during rotation
keys:
  old-api-key: old-api-secret   # Still valid, being phased out
  new-api-key: new-api-secret   # New key, update token server to use this

# After all tokens signed with the old key have expired, remove it:
# keys:
#   new-api-key: new-api-secret

# Generate a strong API key and secret
API_KEY=$(python3 -c "import secrets; print(secrets.token_urlsafe(15))")
API_SECRET=$(python3 -c "import secrets; print(secrets.token_urlsafe(30))")
# Create a Kubernetes secret
kubectl -n livekit create secret generic livekit-keys \
--from-literal=api-key="$API_KEY" \
--from-literal=api-secret="$API_SECRET"
echo "API Key: $API_KEY"
echo "Store the secret securely -- it will not be displayed again."

Token expiration helps rotation
LiveKit tokens include an expiration time. When you rotate keys, existing tokens signed with the old key continue to work until they expire. Set reasonable TTLs (1-2 hours) so old-key tokens age out naturally.
CORS configuration
If your frontend connects to LiveKit from a browser, CORS headers must allow the connection. Configure at the reverse proxy level.
server {
    listen 443 ssl;
    server_name livekit.example.com;

    add_header Access-Control-Allow-Origin "https://app.example.com" always;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
    add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;

    if ($request_method = OPTIONS) {
        return 204;
    }

    location / {
        proxy_pass http://127.0.0.1:7880;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400s;
    }
}

Never use wildcard CORS in production
Setting Access-Control-Allow-Origin: * lets any website connect to your LiveKit server. Always specify exact allowed origins.
What you learned
- LiveKit exposes Prometheus metrics at /metrics on port 7880 -- use ServiceMonitor for auto-discovery on Kubernetes
- Packet loss ratio is the most critical metric; set alerts at 5% and 10% thresholds
- Grafana dashboards should cover rooms, participants, packet loss, CPU, memory, tracks, and agent health
- Kubernetes network policies restrict access to LiveKit pods -- require a supporting CNI
- TLS covers signaling (Ingress), Redis, and TURN; media is encrypted by WebRTC's DTLS-SRTP
- API keys should be cryptographically random, stored in secrets managers, and rotated with multi-key support
Next up
In the next chapter, you will learn how to perform rolling upgrades with zero downtime, back up your deployment, and handle disaster recovery.