Chapter 3

Rooms, Participants & Tracks: universal primitives


In this chapter, you will learn the three core abstractions that model every interaction in LiveKit. These are not just internal implementation details — they are the mental model you will carry through every course, every project, and every production system you build. Once rooms, participants, and tracks click, the rest of LiveKit becomes a series of straightforward applications of the same three ideas.


Three primitives, infinite applications

Most real-time platforms present different APIs for different use cases. One API for video conferencing, another for live streaming, another for voice calls. LiveKit takes a fundamentally different approach: it provides three universal primitives that compose into any real-time application you can imagine.

A voice AI assistant? A room with two participants (a human and an agent), each publishing audio tracks.

A video conference? A room with many participants, each publishing audio and video tracks.

A live stream to 10,000 viewers? A room with one publishing participant and 10,000 subscribing participants.

A security camera monitoring system? A room with camera participants publishing video tracks and a dashboard participant subscribing to all of them.

The primitives are the same. The configuration changes. This is the elegance of LiveKit's design, and understanding it deeply is the single most valuable investment you can make before writing any code.

Rooms: the container

A Room is a virtual space where real-time communication happens. Every interaction in LiveKit occurs inside a room. A room has a name (a unique string identifier), a set of participants, and optional metadata.

Rooms are created on demand. When the first participant joins a room name that does not yet exist, the room is created. When the last participant leaves, the room is destroyed (unless configured otherwise). There is no separate "create room" step in most workflows — the room materializes when it is needed and vanishes when it is not.
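The on-demand lifecycle can be sketched as a toy model (plain Python for illustration only, not the LiveKit API; the `RoomRegistry` class and its methods are invented here):

```python
# Toy model of the on-demand room lifecycle (illustration only, not the
# real SDK): a room materializes on first join and is destroyed when the
# last participant leaves.

class RoomRegistry:
    def __init__(self):
        self._rooms = {}  # room name -> set of participant identities

    def join(self, room_name: str, identity: str) -> None:
        # Joining an unknown room name creates the room implicitly.
        self._rooms.setdefault(room_name, set()).add(identity)

    def leave(self, room_name: str, identity: str) -> None:
        participants = self._rooms.get(room_name, set())
        participants.discard(identity)
        # Last participant leaving destroys the room
        # (a real room would first wait out its empty timeout).
        if not participants:
            self._rooms.pop(room_name, None)

    def exists(self, room_name: str) -> bool:
        return room_name in self._rooms


registry = RoomRegistry()
registry.join("support-call-42", "alice")    # room created on first join
registry.join("support-call-42", "agent-1")
registry.leave("support-call-42", "alice")
print(registry.exists("support-call-42"))    # True: one participant remains
registry.leave("support-call-42", "agent-1")
print(registry.exists("support-call-42"))    # False: room destroyed
```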

| Room property | What it does |
| --- | --- |
| Name | Unique identifier. Used to connect participants to the same space. |
| Metadata | Arbitrary string (often JSON) attached to the room. Visible to all participants. Updatable at runtime. |
| Max participants | Optional limit on how many participants can join. |
| Empty timeout | How long the room persists after the last participant leaves. |
| Creation time | When the room was created. Useful for analytics and debugging. |

Rooms are cheap

Rooms are lightweight. Creating a room consumes minimal resources on the SFU — it is essentially a routing table entry. Do not hesitate to create rooms liberally. A voice AI system might create a new room for every conversation. A support center might create thousands of rooms per hour. The SFU handles this effortlessly.

Room metadata deserves special attention. It is a shared, mutable key-value space visible to every participant. Think of it as the room's "whiteboard" — a place to store state that belongs to the room itself rather than to any individual participant. In a voice AI context, room metadata might store the conversation topic, the language setting, or the current phase of a multi-step workflow.
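The "whiteboard" idea can be made concrete with a small sketch (plain Python, not the LiveKit API; the `update_phase` helper and the field names are invented for illustration):

```python
import json

# Toy illustration of room metadata as a shared JSON "whiteboard":
# state that belongs to the room itself, readable by every participant
# and updatable at runtime.

room_metadata = json.dumps(
    {"topic": "billing", "language": "es", "phase": "greeting"}
)

def update_phase(metadata: str, new_phase: str) -> str:
    # Decode, mutate one field, re-encode. In a real room, every
    # participant would be notified of the new metadata value.
    state = json.loads(metadata)
    state["phase"] = new_phase
    return json.dumps(state)

room_metadata = update_phase(room_metadata, "collect-account-number")
print(json.loads(room_metadata)["phase"])  # collect-account-number
```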

Participants: anything that connects

A Participant is any entity connected to a room. This is where LiveKit's model diverges sharply from traditional conferencing platforms, which assume participants are humans sitting in front of cameras. In LiveKit, a participant can be:

  • A human using a web browser
  • A human using a native mobile app
  • A human calling in from a landline phone via SIP
  • An AI agent running server-side
  • An IoT sensor publishing data
  • A robotic system sending and receiving control signals

Every participant has an identity (a unique string within the room), a name (human-readable, not necessarily unique), and metadata (arbitrary string, often JSON). Participants also carry attributes — a set of key-value pairs visible to all other participants and updatable at runtime.

ParticipantKind: declaring what you are

Every participant has a kind — a declaration of what type of entity it is. LiveKit defines three kinds:

| Kind | What it represents | Example |
| --- | --- | --- |
| STANDARD | A regular participant connecting via WebRTC SDKs | Human on a web browser, human on a mobile app, IoT device |
| SIP | A participant connected via the SIP/PSTN bridge | A phone caller dialing into the room from a landline or mobile number |
| AGENT | An AI agent connecting via the server-side Agents SDK | A voice assistant, a transcription bot, a vision analyzer |

What's happening

ParticipantKind is not about permissions or capabilities — it is about identity. An AGENT participant can do everything a STANDARD participant can do (publish tracks, subscribe to tracks, send data). The kind simply tells other participants what they are interacting with. A frontend application might render an agent's audio differently than a human's, display a bot icon instead of a webcam thumbnail, or show a "powered by AI" label. ParticipantKind makes this possible without guesswork.

This taxonomy is deceptively powerful. Consider what it means: a phone caller (SIP), a web user (STANDARD), and an AI agent (AGENT) are all just participants in the same room. They publish and subscribe to tracks using the same model. There is no "phone bridge API" separate from the "agent API" separate from the "conferencing API." It is one system, one set of primitives, three kinds of participants.
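A frontend's kind-driven rendering decision might look like this toy sketch (plain Python standing in for UI code; the enum values mirror the table above, but `tile_for` and its return strings are invented):

```python
from enum import Enum, auto

# Toy sketch of rendering driven by participant kind: kind describes
# identity, not capability, so the only thing that changes is the UI.

class ParticipantKind(Enum):
    STANDARD = auto()
    SIP = auto()
    AGENT = auto()

def tile_for(kind: ParticipantKind) -> str:
    # A frontend might choose a different UI treatment per kind.
    if kind is ParticipantKind.AGENT:
        return "bot icon + 'powered by AI' label"
    if kind is ParticipantKind.SIP:
        return "phone icon + caller number"
    return "webcam thumbnail"

print(tile_for(ParticipantKind.AGENT))  # bot icon + 'powered by AI' label
```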

Participant attributes

Each participant carries a set of attributes — string key-value pairs that are synchronized across the room. When any participant's attributes change, every other participant is notified.

Attributes are ideal for per-participant state that others need to observe:

  • An agent might set status: "thinking" while processing, then status: "speaking" when generating a response
  • A human participant might have role: "supervisor" or language: "es"
  • A SIP participant might carry caller_id: "+1-555-0123" or queue: "billing"

Unlike room metadata (which belongs to the room), participant attributes belong to the individual participant. This distinction matters: when a participant disconnects, their attributes vanish with them. Room metadata persists as long as the room exists.
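The lifetime difference can be sketched as a toy model (illustration only, not the LiveKit SDK; the `Room` class here is invented):

```python
# Toy model of the lifetime difference described above: participant
# attributes vanish on disconnect, room metadata persists for the life
# of the room.

class Room:
    def __init__(self, metadata: str = ""):
        self.metadata = metadata   # belongs to the room itself
        self.participants = {}     # identity -> attributes dict

    def connect(self, identity: str, attributes: dict) -> None:
        self.participants[identity] = dict(attributes)

    def disconnect(self, identity: str) -> None:
        # Attributes go away with the participant.
        self.participants.pop(identity, None)


room = Room(metadata='{"topic": "support"}')
room.connect("agent-1", {"status": "thinking"})
room.connect("caller", {"caller_id": "+1-555-0123", "queue": "billing"})
room.disconnect("agent-1")
print("agent-1" in room.participants)  # False: attributes are gone
print(room.metadata)                   # room metadata persists
```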

Tracks: the media itself

A Track is a single stream of media — one audio feed or one video feed. Tracks are the atomic unit of media in LiveKit. Everything else (rooms, participants, subscriptions) exists to organize how tracks flow between entities.

Every track has a source that describes where the media originates:

| Track source | Media type | What it represents |
| --- | --- | --- |
| MICROPHONE | Audio | The participant's primary audio input |
| CAMERA | Video | The participant's primary video input |
| SCREEN_SHARE | Video | The participant's screen capture |
| SCREEN_SHARE_AUDIO | Audio | Audio from the screen being shared |

Tracks also have a name (useful when a participant publishes multiple tracks of the same type) and metadata (arbitrary string, updatable at runtime).

Tracks are not limited to microphones and cameras

While the track sources listed above reflect common human participant scenarios, tracks can carry any audio or video content. An AI agent might publish synthesized speech as a MICROPHONE-sourced audio track. A camera monitoring system might publish RTSP camera feeds as CAMERA-sourced video tracks. The source is a hint, not a constraint.

Track quality and simulcast

For video tracks, LiveKit supports simulcast — publishing the same video at multiple quality layers simultaneously. The SFU then selects which layer to forward to each subscriber based on their bandwidth and preferences. This was discussed in Chapter 2; here, the key point is that simulcast is a track-level feature. Each video track can independently be simulcast or single-layer.
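The SFU's per-subscriber layer choice can be sketched as a toy function (illustration only; the layer labels and bitrates below are invented, not LiveKit defaults):

```python
# Toy sketch of simulcast layer selection: forward the highest quality
# layer whose bitrate fits within the subscriber's available bandwidth.

# (resolution label, approximate bitrate in kbps) -- invented values
LAYERS = [("180p", 150), ("360p", 500), ("720p", 1800)]

def select_layer(available_kbps: int) -> str:
    chosen = LAYERS[0][0]  # always fall back to the lowest layer
    for label, kbps in LAYERS:
        if kbps <= available_kbps:
            chosen = label
    return chosen

print(select_layer(2500))  # 720p
print(select_layer(600))   # 360p
print(select_layer(100))   # 180p (lowest layer as fallback)
```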

For audio tracks, quality is controlled by codec selection and bitrate. LiveKit supports Opus (the dominant codec for voice) and RED (redundant encoding for packet loss resilience). These are negotiated automatically — participants do not need to specify codecs manually.

Dynacast: publish only when someone is watching

Dynacast is LiveKit's bandwidth optimization that pauses video track publishing when no subscribers are receiving a particular quality layer. Without dynacast, a publisher sends all simulcast layers continuously even if no one is watching. With dynacast enabled, the SFU signals the publisher to stop sending layers that have zero subscribers, and resumes them instantly when a subscriber appears.

This matters in rooms with many participants. In a 20-person video call, most participants are in a thumbnail grid — they only need the low-quality simulcast layer. Dynacast ensures the publisher is not wasting bandwidth sending a high-quality 720p layer that no one has requested.

| Without dynacast | With dynacast |
| --- | --- |
| Publisher sends all simulcast layers continuously | Publisher only sends layers that have active subscribers |
| Wastes upstream bandwidth on unwatched layers | Upstream bandwidth scales with actual demand |
| Fine for small rooms (2-4 participants) | Essential for larger rooms and bandwidth-constrained networks |

Dynacast is enabled by default in LiveKit's client SDKs. It is transparent to application code — you do not need to manage it. The SFU and client negotiate layer activation automatically.
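The layer-pausing logic can be modeled in a few lines (a toy sketch, not the real SFU; the `DynacastController` class is invented for illustration):

```python
from collections import defaultdict

# Toy model of dynacast: the SFU tells the publisher to pause any
# simulcast layer with zero subscribers, and resumes it as soon as a
# subscriber appears.

class DynacastController:
    def __init__(self, layers):
        self.layers = list(layers)
        self.subscribers = defaultdict(set)  # layer -> subscriber ids

    def subscribe(self, layer: str, subscriber: str) -> None:
        self.subscribers[layer].add(subscriber)

    def unsubscribe(self, layer: str, subscriber: str) -> None:
        self.subscribers[layer].discard(subscriber)

    def active_layers(self) -> list:
        # Only layers with at least one subscriber are published.
        return [l for l in self.layers if self.subscribers[l]]


ctrl = DynacastController(["low", "medium", "high"])
ctrl.subscribe("low", "thumbnail-viewer")
print(ctrl.active_layers())   # ['low'] -- unwatched layers stay paused
ctrl.subscribe("high", "spotlight-viewer")
print(ctrl.active_layers())   # ['low', 'high']
```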

Dynacast and voice AI

For voice-only agents that do not publish video, dynacast is irrelevant. But if you build multimodal agents with video (screen sharing, avatar, vision), dynacast ensures your video tracks do not consume bandwidth when the subscriber has not requested them.

Video codecs

LiveKit supports multiple video codecs, negotiated automatically between publisher and subscriber. The choice of codec affects quality, compression efficiency, and hardware compatibility.

| Codec | Strengths | Typical use |
| --- | --- | --- |
| VP8 | Universal browser support, low CPU | Default for most WebRTC applications |
| H.264 | Hardware acceleration on mobile/desktop, widely supported | Mobile apps, native clients, recording |
| VP9 | Better compression than VP8 at the same quality | Bandwidth-constrained environments, screen sharing |
| AV1 | Best compression efficiency, emerging standard | High-resolution video with limited bandwidth |

LiveKit defaults to VP8 for maximum compatibility. You can request a preferred codec when publishing a video track, and the SFU will negotiate the best available option between publisher and subscriber capabilities. For screen sharing, VP9 or AV1 are preferred because their superior compression handles the sharp edges and text in screen content more efficiently.
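A preference-based negotiation like the one described can be sketched as a toy function (illustration only; the actual negotiation happens in SDP exchange, and this helper is invented):

```python
# Toy sketch of codec negotiation: pick the publisher's most-preferred
# codec that the subscriber also supports, falling back to VP8 for
# maximum compatibility.

def negotiate(publisher_prefs: list, subscriber_caps: set) -> str:
    for codec in publisher_prefs:
        if codec in subscriber_caps:
            return codec
    return "VP8"  # universal-support default

# Screen share: prefer the better-compressing codecs first.
print(negotiate(["AV1", "VP9", "VP8"], {"VP8", "VP9"}))   # VP9
# Older client that only speaks H.264 and VP8:
print(negotiate(["AV1", "VP9", "VP8"], {"H264", "VP8"}))  # VP8
```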

Audio codecs are simpler

For audio, Opus dominates. It is the required codec for WebRTC and handles everything from narrowband voice to full-band music. LiveKit also supports RED (redundant audio encoding) for packet loss resilience. For SIP telephony, G.711 (PCMU/PCMA) is used as a fallback when Opus is not supported by the trunk provider.

Publish/Subscribe: the flow model

Tracks flow between participants through a publish/subscribe model. This is the connective tissue that ties rooms, participants, and tracks together.

Publishing means making a track available to others in the room. When a participant publishes a track, the SFU registers that track and makes it available for subscription. Publishing does not mean the track is immediately sent to everyone — it means it can be received by those who want it.

Subscribing means requesting to receive a specific track from another participant. When participant A subscribes to participant B's audio track, the SFU begins forwarding B's audio packets to A.

1. A participant publishes a track. A human user turns on their microphone. Their client encodes the audio and sends it to the SFU with a publication message: "I am publishing an audio track from my microphone."

2. The SFU notifies other participants. The SFU sends an event to every other participant in the room: "Participant X has published a new audio track." Each participant's client now knows the track exists.

3. Other participants subscribe. By default, LiveKit auto-subscribes participants to all new tracks. The SFU begins forwarding the audio packets. Alternatively, a participant can use manual subscription mode and choose which tracks to receive.

4. Media flows. The SFU forwards RTP packets from the publisher to each subscriber. No transcoding. No mixing. Each subscriber receives the original packets and renders them independently.

The publish/subscribe model has a critical property: it decouples producers from consumers. The publisher does not need to know how many subscribers exist or who they are. The subscriber does not need to know how the publisher is generating the media. The SFU handles the routing.
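The decoupling can be demonstrated with a toy SFU (plain Python, not the real server; the `ToySFU` class and its methods are invented for illustration):

```python
# Toy SFU: publishing registers a track; subscribers -- auto-subscribed
# by default -- receive forwarded packets without the publisher ever
# knowing who they are.

class ToySFU:
    def __init__(self):
        self.tracks = {}         # track id -> publisher identity
        self.subscriptions = {}  # track id -> set of subscriber identities
        self.inboxes = {}        # subscriber -> list of received packets

    def publish(self, publisher: str, track_id: str, participants) -> None:
        self.tracks[track_id] = publisher
        # Auto-subscribe everyone else in the room (the default mode).
        self.subscriptions[track_id] = {
            p for p in participants if p != publisher
        }

    def forward(self, track_id: str, packet: str) -> None:
        # The SFU routes packets; no transcoding, no mixing.
        for sub in self.subscriptions.get(track_id, ()):
            self.inboxes.setdefault(sub, []).append(packet)


sfu = ToySFU()
sfu.publish("human", "mic-audio", ["human", "agent-1", "agent-2"])
sfu.forward("mic-audio", "rtp-packet-1")
print(sorted(sfu.inboxes))     # ['agent-1', 'agent-2']
print(sfu.inboxes["agent-1"])  # ['rtp-packet-1']
```

Note that the publisher appears nowhere in the forwarding step: adding a third agent would change only the subscription set, never the publishing side.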

Example: multi-agent room

A human publishes mic audio. The SFU forwards it to Agent 1 (voice assistant) and Agent 2 (transcription). Agent 1 publishes TTS audio back. The SFU forwards it to the human. Agent 2 subscribes to audio but publishes nothing — it writes transcriptions via text streams instead. All of this happens through standard publish/subscribe. No special routing logic.

What's happening

Auto-subscribe is the default for a reason: in most applications, every participant wants to hear every other participant. But manual subscription mode is powerful for specialized use cases. A transcription agent does not need to receive its own audio back. A monitoring dashboard might subscribe only to video tracks, ignoring audio entirely. A recording service might subscribe to everything for archival. The publish/subscribe model supports all of these patterns without any changes to the publishing side.

Modeling everything with three primitives

Here is the moment where these abstractions prove their worth. Consider how different real-time applications map onto the same three concepts:

| Application | Room | Participants | Tracks |
| --- | --- | --- | --- |
| Voice AI assistant | One room per conversation | Human (STANDARD) + Agent (AGENT) | Human publishes mic audio; agent publishes TTS audio |
| Contact center | One room per call | Caller (SIP) + Agent (AGENT) + Supervisor (STANDARD, optional) | Caller publishes phone audio; agent publishes TTS audio; supervisor subscribes to both |
| Video conference | One room per meeting | N humans (STANDARD) | Each publishes mic audio + camera video |
| Live stream | One room per broadcast | 1 broadcaster (STANDARD) + N viewers (STANDARD, subscribe-only) | Broadcaster publishes camera + mic; viewers subscribe, never publish |
| Security cameras | One room per building/zone | N cameras (STANDARD) + 1 dashboard (STANDARD) | Cameras publish video; dashboard subscribes to all |
| Robotics control | One room per robot | Robot (STANDARD) + Controller (STANDARD) | Robot publishes camera video; controller publishes command audio |

Every row uses the same three primitives. No special APIs. No mode switches. No "conferencing mode" vs "streaming mode." The room model is universal.

This is the 'aha' moment

If this table makes something click for you — if you suddenly see how a single set of abstractions can model such wildly different applications — then you have grasped the most important concept in LiveKit's architecture. Everything from here forward is applying these primitives in increasingly sophisticated ways.


What comes next

You now understand the static structure: rooms contain participants, participants publish and subscribe to tracks, the SFU routes tracks between them. But we have been treating participants as a homogeneous group. In the next chapter, we will zoom in on the most transformative kind of participant: the AI agent. You will see how an agent is not a special construct bolted onto the side of the platform — it is simply a participant that happens to run AI models instead of a camera and microphone.

Looking ahead

The rooms/participants/tracks model reappears in every course in this academy. In Course 0.2, you will see how these primitives map to voice AI pipelines. In Course 1.1, you will write code that publishes and subscribes to tracks. In Course 2.1, you will build agents that leverage publish/subscribe for multi-party conversations. The investment you have made understanding these abstractions will pay dividends throughout.

Concepts covered
Room · Participant · Track · ParticipantKind · Publish/Subscribe