LiveKit Architecture
Why LiveKit's architecture matters for voice AI
The architecture behind the platform. WebRTC vs WebSockets, the SFU model, rooms/participants/tracks, the data plane, telephony/egress/ingress, E2EE, and the SDK ecosystem. No coding — pure understanding.
What You Build
No project — this is conceptual. You'll understand LiveKit's architecture deeply enough to explain it to your CTO.
Why WebRTC? The latency problem in voice AI
15m · Voice AI has a ~500ms latency budget. WebSocket pipelines burn 200-500ms on buffering alone. WebRTC uses UDP and RTP packets for ~10-30ms transport. We break down the difference.
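The budget math above can be sketched as simple arithmetic. The numbers are the illustrative figures from this lesson blurb, not measurements:

```python
# Illustrative latency-budget arithmetic; figures are the rough numbers
# quoted above, not benchmarks.
BUDGET_MS = 500                      # conversational latency budget

websocket_buffering_ms = (200, 500)  # TCP + server-side buffering overhead
webrtc_transport_ms = (10, 30)       # UDP/RTP transport overhead

def remaining_budget(transport_ms: int, budget: int = BUDGET_MS) -> int:
    """Milliseconds left for STT, LLM, and TTS after transport."""
    return budget - transport_ms

# Worst-case WebSocket buffering can consume the entire budget:
print(remaining_budget(websocket_buffering_ms[1]))  # → 0
# Worst-case WebRTC transport still leaves ~470ms for the model pipeline:
print(remaining_budget(webrtc_transport_ms[1]))     # → 470
```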
P2P vs MCU vs SFU: why LiveKit chose the SFU
15m · P2P doesn't scale past 3. MCU decodes and re-encodes. The SFU forwards packets without touching media — and LiveKit's Go + Pion implementation scales horizontally with Redis.
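Why the mesh breaks down is easiest to see in uplink counts. A rough sketch (toy functions, not LiveKit code):

```python
def p2p_uplinks(n: int) -> int:
    """Full mesh: every client encodes and uploads a stream to each peer."""
    return n - 1

def sfu_uplinks(n: int) -> int:
    """With an SFU, each client uploads exactly one stream; the server
    fans out copies to subscribers without decoding the media."""
    return 1

for n in (2, 3, 10, 50):
    print(f"{n} participants: P2P uplinks/client={p2p_uplinks(n)}, "
          f"SFU uplinks/client={sfu_uplinks(n)}")
```

At 50 participants a mesh client would be encoding 49 uplinks; behind an SFU it still encodes one.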
Rooms, Participants & Tracks: universal primitives
20m · Room = virtual space. Participant = anything that connects. Track = media stream. These three primitives model everything from voice AI to video conferencing to robotics.
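The three primitives can be sketched as a toy object model. This is an illustration of the concepts, not the LiveKit SDK's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str          # "audio" | "video" | "data"
    name: str

@dataclass
class Participant:
    identity: str      # a human, AI agent, phone caller, or robot alike
    tracks: list[Track] = field(default_factory=list)

    def publish(self, track: Track) -> None:
        self.tracks.append(track)

@dataclass
class Room:
    name: str
    participants: dict[str, Participant] = field(default_factory=dict)

    def join(self, p: Participant) -> None:
        self.participants[p.identity] = p

room = Room("support-call")
caller = Participant("caller-123")
agent = Participant("ai-agent")          # an agent is just another participant
caller.publish(Track("audio", "mic"))
room.join(caller)
room.join(agent)
print(len(room.participants))  # → 2
```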
AI agents as room participants
15m · The key insight: agents join rooms using the same SDK as humans. Subscribe to mic track, run STT→LLM→TTS, publish audio back. No special agent API needed.
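The subscribe → STT → LLM → TTS → publish loop can be sketched with stand-in functions. Every name here is a stub for illustration, not a real LiveKit call:

```python
# Stand-in pipeline stages; real STT/LLM/TTS providers would go here.
def stt(audio: bytes) -> str:
    return audio.decode()            # stub: pretend the audio is its transcript

def llm(prompt: str) -> str:
    return f"echo: {prompt}"         # stub model

def tts(text: str) -> bytes:
    return text.encode()             # stub synthesis

def on_mic_frame(frame: bytes) -> bytes:
    """The agent's whole job: transcribe the subscribed mic audio,
    generate a reply, and synthesize audio to publish back to the room."""
    transcript = stt(frame)
    reply = llm(transcript)
    return tts(reply)

print(on_mic_frame(b"hello"))  # → b'echo: hello'
```

The point of the lesson is that this loop runs inside an ordinary room participant; there is no separate agent-specific connection path.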
The data plane: streams, RPC & state sync
15m · Text streams, byte streams, RPC, participant attributes, and room metadata — all through one WebRTC connection. LiveKit is a full realtime data platform.
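A toy model of two of these primitives — RPC dispatch and participant attributes — to make the shapes concrete. Class and method names are illustrative only, not the SDK's API:

```python
class DataPlane:
    """Toy stand-in for data-plane features riding one WebRTC connection."""

    def __init__(self) -> None:
        self.rpc_handlers = {}   # method name -> handler
        self.attributes = {}     # per-participant key/value state

    def register_rpc(self, method: str, handler) -> None:
        self.rpc_handlers[method] = handler

    def call_rpc(self, method: str, payload):
        # In LiveKit, the call would travel to another participant and back.
        return self.rpc_handlers[method](payload)

    def set_attribute(self, key: str, value: str) -> None:
        # In LiveKit, attribute changes replicate to everyone in the room.
        self.attributes[key] = value

dp = DataPlane()
dp.register_rpc("get_status", lambda _: "ok")
dp.set_attribute("agent_state", "listening")
print(dp.call_rpc("get_status", None))  # → ok
```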
Extending the room: telephony, egress & ingress
15m · SIP turns phone callers into participants. Egress records to S3 or livestreams via RTMP. Ingress brings external streams in. All extend the room model.
Security, encryption & self-hosting
15m · End-to-end encryption for media and data. JWT access tokens with granular grants. Full self-hosting with the same APIs as LiveKit Cloud.
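A minimal sketch of what a grant-carrying access token looks like, built with only the standard library. The HS256/JWT mechanics are standard; the exact grant field names (`video`, `roomJoin`, `room`) follow LiveKit's documented shape but should be treated as illustrative — use the official server SDKs to mint real tokens:

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Minimal HS256 JWT carrying room-scoped grants (illustrative shape)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "iss": api_key,                  # which API key signed this token
        "sub": identity,                 # the participant's identity
        "exp": int(time.time()) + 3600,  # short-lived by design
        "video": {"roomJoin": True, "room": room},  # granular grants
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}."
        f"{b64url(json.dumps(payload).encode())}"
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(),
                   hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

token = make_token("APIkey", "secret", "caller-123", "support-call")
print(token.count("."))  # → 2
```

Because grants live inside a signed token, the server can authorize a join without a database lookup, and a token for one room grants nothing in any other.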
The universal SDK ecosystem
10m · 12+ SDKs spanning web, mobile, desktop, Unity, embedded, and server. Same API everywhere. Among the broadest realtime SDK ecosystems available.
What You Walk Away With
Deep understanding of why WebRTC beats WebSockets for voice AI, how the SFU scales, rooms/participants/tracks as universal primitives, the data plane, and why LiveKit is architecturally unique.