Chapter 4

AI agents as room participants


In this chapter, you will learn the single most important architectural insight in LiveKit's approach to voice AI: agents are not a separate system — they are participants. They join rooms, publish tracks, subscribe to tracks, and use data channels using the same primitives as every human user. This design decision unlocks capabilities that bolt-on agent architectures cannot match.

Agent as participant · Same SDK · Multi-party · Vision · RPC

The key insight

Most voice AI platforms treat the AI as something fundamentally different from the users it serves. There is a "user-facing API" and a separate "agent API." The user connects through one system, the agent operates through another, and a bespoke integration layer bridges the two. When you want the agent to hear the user, you pipe audio from one system to the other. When you want the user to hear the agent, you pipe it back.

LiveKit rejects this architecture entirely.

In LiveKit, an AI agent is a participant. It joins a room the same way a human does — by presenting a valid access token and connecting via the SDK. Once connected, it is indistinguishable from any other participant at the protocol level. It can publish tracks, subscribe to tracks, send and receive data, update its attributes, and interact with every other participant in the room.

There is no "agent API." There is no separate infrastructure for AI. There is one platform, one set of primitives, and participants of different kinds using them in different ways.
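To make the symmetry concrete, here is a deliberately simplified pure-Python model — a toy, not the LiveKit SDK — in which a human and an agent are instances of the same `Participant` class, using the same publish and subscribe methods:

```python
from dataclasses import dataclass, field

# Toy model of participant symmetry -- illustrative only, not the real
# SDK. An "agent" is just a participant whose kind attribute differs.

@dataclass
class Participant:
    identity: str
    kind: str = "standard"          # "standard" for humans, "agent" for AI
    published: list = field(default_factory=list)
    subscriptions: list = field(default_factory=list)

    def publish_track(self, name: str) -> None:
        self.published.append(name)

    def subscribe(self, other: "Participant", track: str) -> None:
        if track in other.published:
            self.subscriptions.append((other.identity, track))

# Both entities use the exact same class and methods.
alice = Participant("alice")
agent = Participant("support-agent", kind="agent")

alice.publish_track("mic-audio")
agent.subscribe(alice, "mic-audio")   # the agent subscribes like any peer
agent.publish_track("tts-audio")
alice.subscribe(agent, "tts-audio")   # the human subscribes to the agent

print(agent.subscriptions)  # [('alice', 'mic-audio')]
```

The only difference between the two objects is the `kind` attribute — exactly the distinction LiveKit draws at the protocol level.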

What's happening

This is not a marketing simplification — it is a genuine architectural choice with deep consequences. When agents are participants, they inherit every capability the platform provides: NAT traversal, encryption, simulcast, multi-region routing, reconnection handling, presence events. No separate implementation needed. When the platform adds a new feature (text streams, byte streams, RPC), agents get it automatically because they use the same SDK.

How a voice AI agent works

With the rooms/participants/tracks model from Chapter 3, the entire flow of a voice AI agent becomes a straightforward application of publish and subscribe:

1. The agent joins the room

The agent process starts up (typically on a server) and connects to a LiveKit room. It identifies itself with ParticipantKind AGENT. Other participants in the room receive a notification: "An agent has joined."

2. The agent subscribes to the user's audio track

The human user's microphone audio is published as a track. The agent subscribes to it — exactly the way a second human in a conference call would. The SFU begins forwarding the user's audio packets to the agent.

3. Speech-to-text processes the incoming audio

The agent feeds the received audio packets into a speech-to-text model (Deepgram, OpenAI Whisper, or another provider). The STT model produces a text transcript of what the user said.

4. The LLM generates a response

The transcript is sent to a large language model along with conversation history and any system instructions. The LLM streams back a text response, token by token.

5. Text-to-speech synthesizes audio

As LLM tokens arrive, they are fed into a text-to-speech engine (Cartesia, ElevenLabs, or another provider). The TTS engine produces audio — the agent's "voice."

6. The agent publishes its audio track

The synthesized audio is published back into the room as the agent's audio track. The SFU forwards it to the human user. The user hears the agent's response through the same audio channel they would hear another human participant.

The voice AI data flow

Human publishes mic audio → SFU forwards to agent → STT transcribes to text → LLM generates response → TTS synthesizes audio → Agent publishes audio → SFU forwards to human

The entire loop uses standard room primitives. No custom bridges or proprietary protocols.
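The shape of this loop can be sketched in a few lines of Python. The `stt`, `llm`, and `tts` functions below are stand-in stubs for real providers (in production each stage streams incrementally rather than running turn-by-turn):

```python
# Sketch of one conversational turn in the voice pipeline described
# above. The stt/llm/tts functions are stubs, not real provider calls.

def stt(audio_frames: list[bytes]) -> str:
    # Stub: pretend we transcribed the user's audio.
    return "what is the weather"

def llm(transcript: str, history: list[str]) -> str:
    history.append(transcript)            # keep conversation context
    return f"You asked: {transcript}"

def tts(text: str) -> bytes:
    # Stub: pretend we synthesized speech audio for this text.
    return text.encode("utf-8")

def agent_turn(audio_frames: list[bytes], history: list[str]) -> bytes:
    """Subscribe-side audio in, publish-side audio out."""
    transcript = stt(audio_frames)         # step 3: speech-to-text
    reply_text = llm(transcript, history)  # step 4: LLM response
    return tts(reply_text)                 # step 5: text-to-speech

history: list[str] = []
reply_audio = agent_turn([b"\x00\x01"], history)
print(reply_audio)  # b'You asked: what is the weather'
```

Everything outside `agent_turn` — receiving the input frames and publishing the output audio — is ordinary track subscription and publication, handled by the platform.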

Notice what is absent from this flow: there is no custom audio bridge. No proprietary streaming protocol between the user and the agent. No "agent server" that is architecturally distinct from the "media server." The agent is a participant. Audio flows through the standard publish/subscribe model. The SFU routes it the same way it routes any other media.

The pipeline runs inside the agent

The STT, LLM, and TTS processing happens within the agent process itself, not on the LiveKit server. LiveKit provides the real-time transport — getting audio to and from the agent with minimal latency. What the agent does with that audio is entirely up to the developer. LiveKit's Agents SDK provides a framework for orchestrating these models, but the compute is yours.

No special agent API

This point deserves emphasis because it has practical consequences that compound over time.

When agents use the same SDK as humans, every platform capability is available to agents without additional integration work:

| Capability | How a human uses it | How an agent uses it |
| --- | --- | --- |
| Audio tracks | Publishes microphone audio | Publishes TTS-synthesized audio |
| Video tracks | Publishes camera video | Publishes generated video (avatars, visualizations) |
| Subscribing to tracks | Receives other participants' audio/video | Receives user's audio for STT, video for vision |
| Text streams | Sends chat messages | Streams LLM output, transcriptions |
| RPC | Triggers actions on other participants | Exposes callable methods (change language, start recording) |
| Participant attributes | Sets display name, role | Sets status ("thinking", "speaking"), current intent |
| Room metadata | Updates shared settings | Updates conversation state, workflow phase |

The symmetry is complete. There is no capability that a human participant has that an agent lacks, and vice versa. They are the same kind of entity, differentiated only by what runs behind their SDK connection — a browser with a webcam, or a server with AI models.

Multi-party: beyond one human, one agent

The participant model makes multi-party scenarios trivially composable. Because adding an agent is just adding a participant, you can put any combination of humans and agents in a single room.

Multiple agents, one human. A customer calls a support line. The primary voice agent handles the conversation. A second agent silently subscribes to the same audio track and produces a real-time transcript. A third agent monitors sentiment. All three are participants in the same room, all subscribing to the same user audio track. They do not interfere with each other because subscriptions are independent.

One agent, multiple humans. A group of students joins a room for an AI tutoring session. The agent subscribes to all human audio tracks and uses speaker identification to attribute statements to individuals. It publishes a single audio track back. Every student hears the same agent voice through standard track subscription.

Multiple agents, multiple humans. A conference room with four executives and two AI agents — one for real-time translation, one for note-taking. Six participants in one room. The SFU routes tracks between them based on each participant's subscription preferences. The translation agent subscribes to all human audio and publishes translated audio tracks. The note-taking agent subscribes to all audio but publishes nothing — it sends notes via text streams instead.
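These combinations fall out of independent subscriptions. A minimal pure-Python sketch (again a toy model, not the SDK) shows one human audio track fanned out to several agents at once:

```python
# Toy SFU fan-out: one published track, several independent
# subscribers. Adding another agent is just adding another
# subscription -- no routing reconfiguration.

class Track:
    def __init__(self, name: str):
        self.name = name
        self.subscribers: dict[str, list[bytes]] = {}

    def subscribe(self, identity: str) -> None:
        self.subscribers[identity] = []

    def publish_frame(self, frame: bytes) -> None:
        for queue in self.subscribers.values():
            queue.append(frame)  # forwarded to every subscriber

mic = Track("user-mic")
for agent_id in ("voice-agent", "transcriber", "sentiment-monitor"):
    mic.subscribe(agent_id)

mic.publish_frame(b"frame-1")
mic.publish_frame(b"frame-2")

# All three agents received both frames, independently.
print(len(mic.subscribers["transcriber"]))  # 2
```

Each subscriber has its own queue, so the voice agent, transcriber, and sentiment monitor never interfere with one another — which is the property the support-line scenario above relies on.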

Multi-party is not just a feature — it is a design consequence

In platforms where agents are separate infrastructure, multi-party requires explicit engineering for each combination. "Add a second agent" might mean deploying a second bridge, configuring new audio routing, and handling synchronization between agents manually. In LiveKit, it means connecting a second participant to the room. The SFU's publish/subscribe model handles the routing automatically.

Beyond audio: what agents can perceive and produce

Describing agents as "voice AI" understates what the participant model enables. Because agents can subscribe to any track and publish any track, their capabilities extend far beyond audio:

Vision

An agent can subscribe to a user's video track — their camera feed or screen share. This opens up visual understanding: the agent can analyze what the user is showing, provide feedback on a document being screen-shared, identify objects through a camera feed, or read text from a whiteboard.

The video frames arrive through the same track subscription mechanism as audio. The agent simply subscribes to the video track in addition to (or instead of) the audio track, then feeds frames to a vision model (GPT-4o, Claude, or a specialized vision system).

Text streams

Agents can send and receive text streams — ordered, reliable text delivered alongside media. An agent might stream its LLM output as text before converting it to speech, providing real-time captions. Or it might receive text input from a user who prefers typing to speaking.

Text streams also enable agent-to-agent communication within a room. A transcription agent might publish a text stream of the conversation transcript that a summarization agent subscribes to — all within the same room, no external message bus required.
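The chaining idea can be sketched with plain Python iterators — one "agent" producing a stream of transcript segments, another consuming it. This is a conceptual illustration only; the real text-stream API delivers segments over the room connection:

```python
# Toy agent-to-agent text streaming: a transcription agent yields
# transcript segments; a summarization agent consumes the stream.
from typing import Iterator

def transcription_agent(audio_chunks: list[str]) -> Iterator[str]:
    for chunk in audio_chunks:
        yield f"[transcript] {chunk}"

def summarization_agent(stream: Iterator[str]) -> str:
    segments = list(stream)
    return f"summary of {len(segments)} segments"

stream = transcription_agent(["hello", "goodbye"])
print(summarization_agent(stream))  # summary of 2 segments
```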

RPC (Remote Procedure Calls)

Agents can expose RPC methods — callable functions that other participants can invoke. A human participant's frontend might call an RPC method on the agent to change its language, adjust its personality, trigger a specific action, or query its current state.

This is bidirectional: agents can also call RPC methods on other participants. An agent might call a method on the user's frontend to display a form, show a map, or trigger a notification. The room is not just a media conduit — it is a full bidirectional communication channel.
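A toy registry captures the shape of this bidirectional pattern — the method names (`set_language`, `show_form`) are hypothetical examples, and the classes below are illustrative, not LiveKit's actual RPC API:

```python
# Toy RPC: each participant registers named methods; any peer can
# invoke them by name. Illustrates the bidirectional pattern only.
from typing import Callable

class RpcParticipant:
    def __init__(self, identity: str):
        self.identity = identity
        self._methods: dict[str, Callable[[str], str]] = {}

    def register_rpc_method(self, name: str, handler: Callable[[str], str]) -> None:
        self._methods[name] = handler

    def perform_rpc(self, target: "RpcParticipant", method: str, payload: str) -> str:
        return target._methods[method](payload)

agent = RpcParticipant("agent")
user = RpcParticipant("user")

agent.register_rpc_method("set_language", lambda lang: f"language set to {lang}")
user.register_rpc_method("show_form", lambda form: f"displaying {form}")

# Bidirectional: the user calls the agent, and the agent calls the user.
print(user.perform_rpc(agent, "set_language", "fr"))    # language set to fr
print(agent.perform_rpc(user, "show_form", "feedback")) # displaying feedback
```

Note the symmetry: both sides use the same `register_rpc_method` and `perform_rpc` operations, which is exactly the participant-level uniformity the chapter describes.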

Byte streams

Agents can send and receive byte streams — binary data of any kind. An agent might send a generated image, a PDF report, or a data file directly to a participant. Or it might receive an image from a user for analysis.
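Conceptually, a byte stream is a binary payload delivered as ordered chunks and reassembled on the receiving side. A minimal sketch of that idea (not the LiveKit byte-stream API):

```python
# Toy byte stream: split a binary payload into ordered chunks and
# reassemble it on the receiving side.

def send_bytes(payload: bytes, chunk_size: int = 4) -> list[bytes]:
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]

def receive_bytes(chunks: list[bytes]) -> bytes:
    return b"".join(chunks)

report = b"%PDF-...generated report"
chunks = send_bytes(report)
assert receive_bytes(chunks) == report  # round-trips intact
print(len(chunks))  # 6
```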

Why this architecture matters

The decision to make agents participants rather than a separate system creates advantages that compound as applications grow more sophisticated:

No integration tax. Every new LiveKit feature works for agents automatically. When LiveKit adds a new data primitive or transport optimization, agents benefit without code changes. The agent is not a second-class citizen consuming a compatibility API — it is a first-class participant on the same platform.

Uniform observability. Because agents are participants, they appear in the same monitoring dashboards, emit the same events, and produce the same metrics as human participants. You do not need separate tooling to debug agent behavior versus user behavior. A room's participant list tells you exactly who is connected, what tracks they have published, and what they are subscribed to — regardless of whether "who" is a person or an AI.

Composability over configuration. Instead of configuring a monolithic agent system, you compose small, focused agents as participants. Need transcription? Add a transcription agent. Need translation? Add a translation agent. Need moderation? Add a moderation agent. Each is an independent participant that can be deployed, scaled, and updated independently. The room model composes them without orchestration code.

Infrastructure reuse. The SFU's NAT traversal, encryption, reconnection handling, multi-region routing, and bandwidth adaptation all apply to agents. Building these capabilities into a separate agent infrastructure would be years of engineering. Making agents participants gives them all of it for free.

What's happening

The pattern here is what software architects call the "uniform interface" — a single set of abstractions that applies to all entities in the system. REST applied it to web resources (everything is a resource, accessed through a uniform set of methods). LiveKit applies it to real-time communication (everything is a participant in a room, interacting through tracks and data channels). The power of a uniform interface is that every tool, technique, and optimization built for one entity works for all entities.

Test your knowledge


Why is it architecturally significant that AI agents are participants rather than a separate system?

Looking ahead

In the next chapter, we will explore LiveKit's data plane — the text streams, byte streams, RPC, and state synchronization primitives that travel alongside media through the same WebRTC connection. These data capabilities are what turn an agent from a voice-only assistant into a rich, interactive participant that can share text, files, state, and actions with every other participant in the room.

Concepts covered
Agent as participant · Same SDK · Multi-party · Vision · RPC