Upgrades, Backup, and Day-2 Operations
Deploying LiveKit is the beginning, not the end. This chapter covers the operational work that keeps a self-hosted deployment reliable: rolling upgrades with zero downtime, version compatibility, backup and recovery, capacity planning, and load testing.
What you'll learn
- How to perform rolling upgrades on Kubernetes with zero downtime
- Version compatibility rules: server, SDK, and agent framework
- What to back up and what not to back up
- Disaster recovery procedures for common failure scenarios
- Capacity planning and load testing with livekit-cli
Rolling upgrades with zero downtime
Kubernetes rolling updates replace pods one at a time. For LiveKit, this means active rooms on other nodes continue uninterrupted while one node is upgraded. The key mechanism is LiveKit's room draining -- when a pod receives a termination signal, it stops accepting new rooms and waits for existing rooms to close before shutting down.
```yaml
# Rolling update strategy
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

# Pod disruption budget -- always keep at least N-1 pods running
podDisruptionBudget:
  enabled: true
  minAvailable: 1

# Long grace period -- allows active rooms to finish
terminationGracePeriodSeconds: 18000  # 5 hours
```
Check current version and review changelog
Read the release notes for every version between your current and target. Look for breaking changes, config format changes, and Redis schema changes.
Test in staging first
Upgrade your staging environment. Run connectivity tests with livekit-cli and verify metrics in Grafana are stable.
Upgrade production
Update the chart version or image tag in your values and run helm upgrade. Watch pods cycle.
Monitor the rollout
Verify new pods are healthy. Check packet loss, CPU, and room counts in Grafana during and after the rollout.
```shell
# Check current version
kubectl -n livekit get pods -o jsonpath='{.items[0].spec.containers[0].image}'
helm -n livekit list

# Check available versions
helm search repo livekit/livekit-server --versions | head -10

# Upgrade
helm upgrade livekit livekit/livekit-server \
  -f values.yaml \
  --namespace livekit \
  --version 1.8.0

# Monitor the rollout
kubectl -n livekit rollout status deployment/livekit-server
kubectl -n livekit get pods -w
```
During a rolling upgrade, participants on the pod being replaced receive a disconnect event, and the LiveKit client SDK automatically reconnects to a healthy node -- typically within 1-3 seconds. Rooms on other nodes are completely unaffected. The long terminationGracePeriodSeconds gives active rooms time to finish naturally rather than being forcibly terminated.
Version compatibility
Not all versions of LiveKit server, client SDKs, and the agent framework are compatible. Follow these rules.
Patch versions (1.7.x to 1.7.y): Safe for rolling upgrades. No breaking changes. Deploy without a maintenance window.
Minor versions (1.7.x to 1.8.x): Usually safe for rolling upgrades, but always read the changelog. Config options may be added or deprecated.
Major versions (1.x to 2.x): May include Redis schema migrations, protocol changes, or config format changes that require all nodes to run the same version. Schedule a maintenance window.
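The three rules above can be sketched as a small helper that classifies a planned upgrade. The function and its name are illustrative, and it assumes plain x.y.z version strings (pre-release suffixes like `-rc1` are not handled):

```shell
#!/bin/bash
# Classify a planned upgrade as patch, minor, or major
# from two plain x.y.z version strings.
classify_upgrade() {
  local from_major from_minor to_major to_minor
  IFS=. read -r from_major from_minor _ <<< "$1"
  IFS=. read -r to_major to_minor _ <<< "$2"
  if [ "$from_major" != "$to_major" ]; then
    echo "major"   # maintenance window; back up Redis first
  elif [ "$from_minor" != "$to_minor" ]; then
    echo "minor"   # usually safe; read the changelog
  else
    echo "patch"   # safe for a rolling upgrade
  fi
}

classify_upgrade 1.7.2 1.7.5   # patch
classify_upgrade 1.7.2 1.8.0   # minor
classify_upgrade 1.8.3 2.0.0   # major
```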
Rollback does not undo Redis migrations
If a new version migrated the Redis schema, rolling back the LiveKit binary may not be compatible with the new schema. Always back up Redis before major upgrades so you can restore both the binary and the data to a consistent state.
```shell
# Rollback to the previous Helm revision if the upgrade causes problems
helm -n livekit history livekit
helm -n livekit rollback livekit

# Or rollback to a specific revision
helm -n livekit rollback livekit 3

# Verify the rollback
kubectl -n livekit get pods -o jsonpath='{.items[0].spec.containers[0].image}'
```
Backup strategy: what to back up
LiveKit itself is stateless. All room state lives in Redis. All configuration lives in files. This makes backup straightforward.
| Component | Back up? | Why |
|---|---|---|
| Redis data | Yes | Room-to-node mappings, participant state, routing data |
| config.yaml / Helm values | Yes | Server settings, API keys, TURN config |
| TLS certificates | Yes | Avoid re-provisioning delays during recovery |
| API keys and secrets | Yes | Required to restore service and token validation |
| Media streams | No | Real-time media is ephemeral -- cannot be serialized or restored |
| LiveKit binary / container | No | Pull from registry during recovery |
Active calls cannot be restored
If LiveKit restarts, in-progress calls are lost. Participants must reconnect and rejoin rooms. Backup protects your ability to resume accepting new calls quickly, not to preserve individual sessions.
Redis backup
Configure Redis for both RDB snapshots and AOF logging.
```
# RDB snapshots -- periodic point-in-time backups
save 900 1
save 300 10
save 60 10000

# AOF -- append-only file for write durability
appendonly yes
appendfsync everysec

# Memory policy
maxmemory-policy noeviction
dbfilename dump.rdb
dir /data/redis
```
```shell
#!/bin/bash
# Automated backup for self-hosted LiveKit
set -euo pipefail

BACKUP_DIR="/backups/livekit/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"

# Redis RDB snapshot -- BGSAVE is asynchronous, so wait until
# LASTSAVE changes before copying the dump file
LAST_SAVE=$(redis-cli -h redis-master LASTSAVE)
redis-cli -h redis-master BGSAVE
while [ "$(redis-cli -h redis-master LASTSAVE)" = "$LAST_SAVE" ]; do
  sleep 1
done
cp /data/redis/dump.rdb "$BACKUP_DIR/redis-dump.rdb"

# Configuration
helm -n livekit get values livekit > "$BACKUP_DIR/helm-values.yaml"

# TLS certificates (optional -- skip silently if not present)
cp -r /etc/letsencrypt/live/ "$BACKUP_DIR/tls-certs/" 2>/dev/null || true

# Upload to remote storage (encrypted)
aws s3 sync "$BACKUP_DIR" "s3://your-backups/livekit/$(date +%Y-%m-%d)/" \
  --sse AES256

# Clean up local backups older than 30 days
find /backups/livekit -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
```
Encrypt backups at rest
Backups contain API keys and TLS private keys. Always encrypt before storing in remote storage.
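A cheap first check before relying on any backup: confirm the directory actually contains the artifacts the table above says to keep. The helper name is illustrative; the file names match what the backup script writes:

```shell
#!/bin/bash
# Verify a dated backup directory contains the expected artifacts.
# Prints what is missing and returns nonzero if anything is absent.
verify_backup() {
  local dir="$1" status=0
  for f in redis-dump.rdb helm-values.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f"
      status=1
    fi
  done
  if [ ! -d "$dir/tls-certs" ]; then
    echo "missing directory: tls-certs/"
    status=1
  fi
  return $status
}

# Usage: verify_backup /backups/livekit/2025-01-15 && echo "backup looks complete"
```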
Disaster recovery
Document these procedures and practice them regularly.
Redis data loss. Stop LiveKit, restore the RDB snapshot, start Redis, start LiveKit. Active rooms will need to be recreated; participants must reconnect.
Configuration corruption. Roll back to the last known-good config from Git. Redeploy with helm upgrade.
Full server failure. Provision new infrastructure, restore config from backup, point DNS to the new server, restore Redis if available. With automation, target recovery in under 30 minutes.
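The "Redis data loss" procedure can be sketched as follows, assuming backups laid out as dated YYYY-MM-DD directories (as the backup script creates). The stop/start commands depend on how you run Redis and LiveKit, so they appear as comments:

```shell
#!/bin/bash
# Pick the newest dated backup directory. Names like YYYY-MM-DD sort
# lexically in chronological order, so a plain sort works.
latest_backup() {
  ls -1d "$1"/*/ 2>/dev/null | sort | tail -n 1
}

SNAPSHOT="$(latest_backup /backups/livekit)redis-dump.rdb"

# Stop LiveKit first so nothing writes to Redis mid-restore, then:
#   systemctl stop redis
#   cp "$SNAPSHOT" /data/redis/dump.rdb
#   systemctl start redis
#   systemctl start livekit-server
echo "would restore from: $SNAPSHOT"
```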
Monthly: backup restore test
Restore a backup to a staging environment. Verify LiveKit starts, API keys work, and you can create rooms.
Quarterly: simulated node failure
Kill a LiveKit pod during low-traffic hours. Verify remaining nodes handle the load and new connections route correctly.
Annually: full disaster recovery drill
Rebuild from scratch using only backups and documented procedures. Measure how long it takes.
Capacity planning and load testing
Do not guess capacity -- measure it. Use livekit-cli to generate load and observe how your deployment responds.
```shell
# Create test rooms with simulated participants
livekit-cli load-test \
  --url wss://livekit.example.com \
  --api-key your-api-key \
  --api-secret your-api-secret \
  --room load-test \
  --publishers 10 \
  --subscribers 10 \
  --duration 5m

# Monitor during the test:
# - Grafana: packet loss, CPU, memory, room count
# - kubectl top pods -n livekit
# - Redis memory: redis-cli INFO memory
```
Scaling rules of thumb (4-core, 8 GB node):
| Signal | Action |
|---|---|
| CPU sustained above 70% | Add nodes or increase CPU per node |
| Packet loss above 3% under load | Server is overloaded -- scale horizontally |
| Redis memory above 50% of max | Increase Redis memory or add eviction monitoring |
| TURN sessions above 15% of total | Review firewall rules -- direct UDP may be blocked unnecessarily |
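The Redis row in the table can be turned into a scriptable check. This sketch parses `redis-cli INFO memory` output passed in as a string; `used_memory` and `maxmemory` are standard INFO fields, everything else is illustrative:

```shell
#!/bin/bash
# Report Redis memory usage as a percentage of maxmemory.
# INFO output uses "key:value" lines with CRLF line endings.
redis_memory_pct() {
  local info used max
  info=$(printf '%s' "$1" | tr -d '\r')
  used=$(printf '%s\n' "$info" | awk -F: '/^used_memory:/ {print $2}')
  max=$(printf '%s\n' "$info" | awk -F: '/^maxmemory:/ {print $2}')
  if [ -z "$max" ] || [ "$max" -eq 0 ]; then
    echo "maxmemory not set" >&2
    return 1
  fi
  echo $(( used * 100 / max ))
}

# Usage against a live instance:
#   pct=$(redis_memory_pct "$(redis-cli -h redis-master INFO memory)")
#   [ "$pct" -ge 50 ] && echo "Redis memory above 50% -- consider scaling"
```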
```shell
# Scale LiveKit horizontally on Kubernetes
kubectl -n livekit scale deployment livekit-server --replicas=4

# Verify all nodes registered
redis-cli -h redis-master keys "livekit:node:*"

# Scale agent workers independently
kubectl -n livekit scale deployment livekit-agent --replicas=6
```
Multi-region deployment
For global availability or disaster recovery, run LiveKit in multiple regions. Each region operates independently with its own Redis instance. Clients connect to the nearest region via geo-DNS or a global load balancer.
Rooms do not span regions
Each room lives entirely within one region. Multi-region gives you geographic redundancy and lower latency for users in different parts of the world, but a single room's participants all connect to the same regional cluster.
Course summary
Across this course, you built and hardened a complete self-hosted LiveKit deployment.
- Chapter 1 mapped the architecture and helped you decide whether self-hosting is right for your project
- Chapter 2 deployed LiveKit on Kubernetes with Helm, configured Redis and TURN, and set up agent workers
- Chapter 3 added Prometheus monitoring, Grafana dashboards, alerting rules, network policies, TLS, and API key rotation
- Chapter 4 covered rolling upgrades, backup, disaster recovery, capacity planning, and load testing
You now have the knowledge to run LiveKit infrastructure that is production-ready, observable, secure, and maintainable. The operational discipline -- monitoring, testing upgrades, practicing failover -- is what separates a deployment that works from one that works reliably.