Chapter 4

Upgrades, Backup, and Day-2 Operations

Deploying LiveKit is the beginning, not the end. This chapter covers the operational work that keeps a self-hosted deployment reliable: rolling upgrades with zero downtime, version compatibility, backup and recovery, capacity planning, and load testing.


What you'll learn

  • How to perform rolling upgrades on Kubernetes with zero downtime
  • Version compatibility rules: server, SDK, and agent framework
  • What to back up and what not to back up
  • Disaster recovery procedures for common failure scenarios
  • Capacity planning and load testing with livekit-cli

Rolling upgrades with zero downtime

Kubernetes rolling updates replace pods one at a time. For LiveKit, this means active rooms on other nodes continue uninterrupted while one node is upgraded. The key mechanism is LiveKit's room draining -- when a pod receives a termination signal, it stops accepting new rooms and waits for existing rooms to close before shutting down.

values.yaml
# Rolling update strategy
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

# Pod disruption budget -- always keep at least N-1 pods running
podDisruptionBudget:
  enabled: true
  minAvailable: 1

# Long grace period -- allows active rooms to finish
terminationGracePeriodSeconds: 18000  # 5 hours
1. Check current version and review changelog

Read the release notes for every version between your current and target. Look for breaking changes, config format changes, and Redis schema changes.

2. Test in staging first

Upgrade your staging environment. Run connectivity tests with livekit-cli and verify metrics in Grafana are stable.

3. Upgrade production

Update the chart version or image tag in your values and run helm upgrade. Watch the pods cycle.

4. Monitor the rollout

Verify new pods are healthy. Check packet loss, CPU, and room counts in Grafana during and after the rollout.

terminal
# Check current version
kubectl -n livekit get pods -o jsonpath='{.items[0].spec.containers[0].image}'
helm -n livekit list

# Check available versions
helm search repo livekit/livekit-server --versions | head -10

# Upgrade
helm upgrade livekit livekit/livekit-server \
-f values.yaml \
--namespace livekit \
--version 1.8.0

# Monitor the rollout
kubectl -n livekit rollout status deployment/livekit-server
kubectl -n livekit get pods -w

What's happening

During a rolling upgrade, participants on the pod being replaced receive a disconnect event and the LiveKit client SDK automatically reconnects to a healthy node -- typically within 1-3 seconds. Rooms on other nodes are completely unaffected. The long terminationGracePeriodSeconds gives active rooms time to finish naturally rather than being forcibly terminated.

Version compatibility

Not all versions of LiveKit server, client SDKs, and the agent framework are compatible. Follow these rules.

Patch versions (1.7.x to 1.7.y): Safe for rolling upgrades. No breaking changes. Deploy without a maintenance window.

Minor versions (1.7.x to 1.8.x): Usually safe for rolling upgrades, but always read the changelog. Config options may be added or deprecated.

Major versions (1.x to 2.x): May include Redis schema migrations, protocol changes, or config format changes that require all nodes to run the same version. Schedule a maintenance window.
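These rules follow standard semver conventions, so a pre-flight check can classify an upgrade automatically. The function below is an illustrative helper (not part of livekit-cli) and assumes plain MAJOR.MINOR.PATCH version strings:

```shell
#!/bin/bash
# Classify an upgrade by semver distance between current and target version.
classify_upgrade() {
  local cur="$1" tgt="$2"
  IFS=. read -r cmaj cmin _ <<< "$cur"
  IFS=. read -r tmaj tmin _ <<< "$tgt"
  if [ "$cmaj" != "$tmaj" ]; then
    echo "major: schedule a maintenance window, back up Redis first"
  elif [ "$cmin" != "$tmin" ]; then
    echo "minor: rolling upgrade usually safe, read the changelog"
  else
    echo "patch: safe for a rolling upgrade"
  fi
}

classify_upgrade 1.7.2 1.7.5   # patch
classify_upgrade 1.7.2 1.8.0   # minor
classify_upgrade 1.8.3 2.0.0   # major
```

Wiring this into your upgrade script turns the compatibility rules above into a gate rather than something the operator has to remember.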

Rollback does not undo Redis migrations

If a new version migrated the Redis schema, rolling back the LiveKit binary may not be compatible with the new schema. Always back up Redis before major upgrades so you can restore both the binary and the data to a consistent state.

terminal
# Rollback to previous Helm revision if upgrade causes problems
helm -n livekit history livekit
helm -n livekit rollback livekit

# Or rollback to a specific revision
helm -n livekit rollback livekit 3

# Verify rollback
kubectl -n livekit get pods -o jsonpath='{.items[0].spec.containers[0].image}'

Backup strategy: what to back up

LiveKit itself is stateless. All room state lives in Redis. All configuration lives in files. This makes backup straightforward.

Component | Back up? | Why
Redis data | Yes | Room-to-node mappings, participant state, routing data
config.yaml / Helm values | Yes | Server settings, API keys, TURN config
TLS certificates | Yes | Avoid re-provisioning delays during recovery
API keys and secrets | Yes | Required to restore service and token validation
Media streams | No | Real-time media is ephemeral -- cannot be serialized or restored
LiveKit binary / container | No | Pull from registry during recovery

Active calls cannot be restored

If LiveKit restarts, in-progress calls are lost. Participants must reconnect and rejoin rooms. Backup protects your ability to resume accepting new calls quickly, not to preserve individual sessions.

Redis backup

Configure Redis for both RDB snapshots and AOF logging.

redis.conf
# RDB snapshots -- periodic point-in-time backups
save 900 1
save 300 10
save 60 10000

# AOF -- append-only file for write durability
appendonly yes
appendfsync everysec

# Memory policy
maxmemory-policy noeviction

dbfilename dump.rdb
dir /data/redis

backup.sh
#!/bin/bash
# Automated backup for self-hosted LiveKit

BACKUP_DIR="/backups/livekit/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"

# Redis RDB snapshot -- wait for BGSAVE to complete instead of guessing
last=$(redis-cli -h redis-master LASTSAVE)
redis-cli -h redis-master BGSAVE
while [ "$(redis-cli -h redis-master LASTSAVE)" = "$last" ]; do
  sleep 1
done
cp /data/redis/dump.rdb "$BACKUP_DIR/redis-dump.rdb"

# Configuration
helm -n livekit get values livekit > "$BACKUP_DIR/helm-values.yaml"

# TLS certificates
cp -r /etc/letsencrypt/live/ "$BACKUP_DIR/tls-certs/" 2>/dev/null

# Upload to remote storage (encrypted)
aws s3 sync "$BACKUP_DIR" "s3://your-backups/livekit/$(date +%Y-%m-%d)/" \
--sse AES256

# Clean up local backups older than 30 days
find /backups/livekit -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
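
To run the script nightly, a crontab entry like the following works. The script path and the 03:15 schedule are assumptions -- pick a low-traffic hour for your deployment:

```shell
# m  h  dom mon dow  command
15   3  *   *   *    /usr/local/bin/livekit-backup.sh >> /var/log/livekit-backup.log 2>&1
```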

Encrypt backups at rest

Backups contain API keys and TLS private keys. Always encrypt before storing in remote storage.

Disaster recovery

Document these procedures and practice them regularly.

Redis data loss. Stop LiveKit, restore the RDB snapshot, start Redis, start LiveKit. Active rooms will need to be recreated; participants must reconnect.

Configuration corruption. Roll back to the last known-good config from Git. Redeploy with helm upgrade.

Full server failure. Provision new infrastructure, restore config from backup, point DNS to the new server, restore Redis if available. With automation, target recovery in under 30 minutes.

1. Monthly: backup restore test

Restore a backup to a staging environment. Verify LiveKit starts, API keys work, and you can create rooms.

2. Quarterly: simulated node failure

Kill a LiveKit pod during low-traffic hours. Verify remaining nodes handle the load and new connections route correctly.

3. Annually: full disaster recovery drill

Rebuild from scratch using only backups and documented procedures. Measure how long it takes.

Capacity planning and load testing

Do not guess capacity -- measure it. Use livekit-cli to generate load and observe how your deployment responds.

terminal
# Create test rooms with simulated participants
livekit-cli load-test \
--url wss://livekit.example.com \
--api-key your-api-key \
--api-secret your-api-secret \
--room load-test \
--publishers 10 \
--subscribers 10 \
--duration 5m

# Monitor during the test:
# - Grafana: packet loss, CPU, memory, room count
# - kubectl top pods -n livekit
# - Redis memory: redis-cli INFO memory

Scaling rules of thumb (4-core, 8 GB node):

Signal | Action
CPU sustained above 70% | Add nodes or increase CPU per node
Packet loss above 3% under load | Server is overloaded -- scale horizontally
Redis memory above 50% of max | Increase Redis memory or add eviction monitoring
TURN sessions above 15% of total | Review firewall rules -- direct UDP may be blocked unnecessarily

terminal
# Scale LiveKit horizontally on Kubernetes
kubectl -n livekit scale deployment livekit-server --replicas=4

# Verify all nodes registered
redis-cli -h redis-master keys "livekit:node:*"

# Scale agent workers independently
kubectl -n livekit scale deployment livekit-agent --replicas=6
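
The CPU rule of thumb above is easy to check mechanically. This sketch flags nodes exceeding the 70% threshold; the input here is a hardcoded sample in `kubectl top nodes` column format so the snippet runs anywhere -- in production you would pipe `kubectl top nodes --no-headers` into it instead:

```shell
#!/bin/bash
# Print any node whose CPU% (third column) exceeds the scaling threshold.
flag_overloaded() {
  awk -v limit=70 '{ pct = $3 + 0; if (pct > limit) print $1, "over", limit"%" }'
}

flag_overloaded <<'EOF'
node-a   3200m   82%   5Gi   61%
node-b   1900m   48%   4Gi   50%
node-c   2950m   74%   6Gi   70%
EOF
```

A check like this makes a good alerting rule or a pre-scale sanity test before load testing.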

Multi-region deployment

For global availability or disaster recovery, run LiveKit in multiple regions. Each region operates independently with its own Redis instance. Clients connect to the nearest region via geo-DNS or a global load balancer.

Rooms do not span regions

Each room lives entirely within one region. Multi-region gives you geographic redundancy and lower latency for users in different parts of the world, but a single room's participants all connect to the same regional cluster.
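
Since each room is pinned to one region, routing reduces to picking a regional endpoint before the client connects. The hostnames and region labels below are hypothetical; in production this choice is usually made by geo-DNS or a global load balancer rather than client-side code:

```shell
#!/bin/bash
# Map a region label to its LiveKit endpoint, with a default-region fallback.
region_url() {
  case "$1" in
    us-east)  echo "wss://us-east.livekit.example.com" ;;
    eu-west)  echo "wss://eu-west.livekit.example.com" ;;
    ap-south) echo "wss://ap-south.livekit.example.com" ;;
    *)        echo "wss://livekit.example.com" ;;
  esac
}

region_url eu-west   # -> wss://eu-west.livekit.example.com
```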


Course summary

Across this course, you built and hardened a complete self-hosted LiveKit deployment.

  • Chapter 1 mapped the architecture and helped you decide whether self-hosting is right for your project
  • Chapter 2 deployed LiveKit on Kubernetes with Helm, configured Redis and TURN, and set up agent workers
  • Chapter 3 added Prometheus monitoring, Grafana dashboards, alerting rules, network policies, TLS, and API key rotation
  • Chapter 4 covered rolling upgrades, backup, disaster recovery, capacity planning, and load testing

You now have the knowledge to run LiveKit infrastructure that is production-ready, observable, secure, and maintainable. The operational discipline -- monitoring, testing upgrades, practicing failover -- is what separates a deployment that works from one that works reliably.

Concepts covered
Rolling upgrades, backup procedures, recovery, maintenance windows