Multimodal architecture & cost model
A voice-only agent listens and speaks. A multimodal agent can also see, read documents, display a visual persona, annotate screens, and draw on whiteboards. This chapter covers how LiveKit's Track-based architecture supports each modality, what each one costs in tokens, and how to manage multimodal context so your agent stays fast and affordable.
The Track-based multimodal pipeline
LiveKit treats every modality as a Track within a Room. Audio Tracks carry speech. Video Tracks carry camera feeds, screen shares, or avatar video. Data Tracks carry structured messages like whiteboard strokes, annotations, or document uploads. Your agent subscribes to the Tracks it needs and publishes the Tracks it produces.
| Modality | Track type | Direction | Example |
|---|---|---|---|
| Voice (listen) | Audio | User to agent | Microphone input |
| Voice (speak) | Audio | Agent to user | TTS output |
| Vision | Video | User to agent | Camera feed, screen share |
| Avatar | Video | Agent to user | Tavus avatar with lip sync |
| Whiteboard | Data | Bidirectional | Canvas strokes, annotations |
| Documents | Data | User to agent | PDF pages, images |
1. User publishes media. The user's browser publishes audio and video Tracks into the LiveKit Room. Audio carries their voice. Video carries their camera or screen share.
2. Agent subscribes to Tracks. The agent joins the Room and subscribes to the user's Tracks. The framework routes audio to STT and video frames to the multimodal LLM.
3. Multimodal LLM processes everything. The LLM receives the text transcript from STT alongside image frames from the video Track. It reasons across both modalities and generates a text response.
4. Agent publishes responses. The text response goes to TTS and is published as an audio Track. If an avatar is configured, a video Track with lip-synced video is also published.
Think of a multimodal agent as a video call participant who can see your camera, look at your screen, show you their face, and draw on a shared whiteboard — all while having a natural conversation. Each capability is a separate Track in the LiveKit Room, and the agent subscribes or publishes as needed.
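The routing logic this implies can be sketched as a toy model in plain Python. This is not the LiveKit API (the real SDK surfaces track kinds on subscription events); the `Track` dataclass, `TrackKind` enum, and pipeline-stage names here are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TrackKind(Enum):
    AUDIO = auto()
    VIDEO = auto()
    DATA = auto()


@dataclass
class Track:
    kind: TrackKind
    source: str  # e.g. "microphone", "camera", "screen_share", "whiteboard"


def route_track(track: Track) -> str:
    """Decide which pipeline stage consumes a newly subscribed track."""
    if track.kind is TrackKind.AUDIO:
        return "stt"            # speech goes to speech-to-text
    if track.kind is TrackKind.VIDEO:
        return "frame_sampler"  # frames are sampled for the multimodal LLM
    return "data_handler"       # strokes, annotations, document uploads
```

The point is the shape of the dispatch: every modality arrives as a track with a kind, and the agent decides per-kind which consumer gets it.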
Image token costs: the dominant expense
Image frames are expensive. A single camera frame can consume 1,000 tokens. A high-resolution screen share can burn through 2,000. In a multimodal conversation, images account for 80-95% of your token budget. Optimizing image handling has far greater impact than optimizing text prompts.
| Image resolution | Approximate tokens | Typical use case |
|---|---|---|
| 512 x 512 | 250-350 | Thumbnail, small diagram |
| 1024 x 768 | 750-1,000 | Camera frame, standard capture |
| 1920 x 1080 | 1,500-2,000 | Screen share, full HD |
| 3840 x 2160 | 4,000-5,000 | 4K screen share (avoid this) |
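Ranges like these can be approximated with a tile-based heuristic similar to what some vision APIs use: a flat base cost plus a per-512px-tile cost. The constants below are illustrative assumptions, and real provider formulas also downscale large images before tiling, so treat this as a ballpark tool, not a billing calculator:

```python
import math


def approx_image_tokens(width: int, height: int,
                        base: int = 85, per_tile: int = 170,
                        tile: int = 512) -> int:
    """Rough token estimate: a flat base cost plus a cost per 512x512 tile.

    Assumed constants; providers differ and may rescale the image first.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles
```

For example, `approx_image_tokens(512, 512)` gives 255 and `approx_image_tokens(1024, 768)` gives 765, both inside the ranges in the table above.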
Images dominate multimodal costs
A typical text message costs 20-50 tokens. A single image costs 750-2,000 tokens. If you send a frame every 3 seconds and keep 4 frames in context at 1,000 tokens per frame, that is 4,000 tokens of visual context in every LLM call — before any conversation history, document content, or system instructions.
Modality selection framework
Each modality adds value but also adds cost, latency, and failure modes. Use this decision framework:
| Modality | Add when | Skip when | Latency impact | Token cost |
|---|---|---|---|---|
| Vision (camera) | User needs to show physical objects | All content is digital/text | +1-3s per LLM call | 750-1,000/frame |
| Vision (screen share) | User needs guided help on screen | Sharing static screenshots works | +1-3s per LLM call | 1,500-2,000/frame |
| Avatar | Visual presence increases trust or engagement | Voice-only is sufficient | Minimal (separate pipeline) | None (bandwidth only) |
| Whiteboard | Concepts need visual explanation | Text descriptions suffice | Minimal (data channel) | ~1,000/snapshot |
| Documents | User has PDFs or images to discuss | All content is verbal | Minimal (on-demand tool) | Varies by length |
Start with voice. Add vision when the user needs to show you something. Add the avatar when visual presence matters. Each modality is an independent Track or data channel — they compose cleanly because they are architecturally separate. But each one you add increases complexity, cost, and potential failure points.
Session cost estimation
Here is a rough cost model for a 30-minute study session with GPT-4o:
| Component | Token estimate | Notes |
|---|---|---|
| System instructions | 500 | Fixed per call |
| Conversation history | 3,000 | Grows over session |
| Visual context (3 frames) | 3,000 | Per LLM call |
| Document content (tool calls) | 1,000 | On demand |
| Per LLM call total | ~7,500 | Input tokens |
| Calls in 30 minutes | ~60 | One per user turn |
| Total session input tokens | ~450,000 | Majority is images |
```python
# Estimate session cost based on modality configuration
def estimate_session_cost(
    duration_minutes: int = 30,
    turns_per_minute: float = 2.0,
    frames_in_context: int = 3,
    tokens_per_frame: int = 1000,
    system_tokens: int = 500,
    avg_history_tokens: int = 3000,
    input_cost_per_million: float = 2.50,  # GPT-4o input pricing
    output_cost_per_million: float = 10.00,  # GPT-4o output pricing
    avg_output_tokens: int = 150,
):
    total_turns = int(duration_minutes * turns_per_minute)
    visual_tokens = frames_in_context * tokens_per_frame
    input_per_turn = system_tokens + avg_history_tokens + visual_tokens
    total_input = input_per_turn * total_turns
    total_output = avg_output_tokens * total_turns
    input_cost = (total_input / 1_000_000) * input_cost_per_million
    output_cost = (total_output / 1_000_000) * output_cost_per_million
    return {
        "total_turns": total_turns,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "input_cost_usd": round(input_cost, 4),
        "output_cost_usd": round(output_cost, 4),
        "total_cost_usd": round(input_cost + output_cost, 4),
        "visual_token_share": f"{(visual_tokens / input_per_turn) * 100:.0f}%",
    }
```

```typescript
function estimateSessionCost({
  durationMinutes = 30,
  turnsPerMinute = 2.0,
  framesInContext = 3,
  tokensPerFrame = 1000,
  systemTokens = 500,
  avgHistoryTokens = 3000,
  inputCostPerMillion = 2.50,
  outputCostPerMillion = 10.00,
  avgOutputTokens = 150,
} = {}) {
  const totalTurns = Math.floor(durationMinutes * turnsPerMinute);
  const visualTokens = framesInContext * tokensPerFrame;
  const inputPerTurn = systemTokens + avgHistoryTokens + visualTokens;
  const totalInput = inputPerTurn * totalTurns;
  const totalOutput = avgOutputTokens * totalTurns;
  const inputCost = (totalInput / 1_000_000) * inputCostPerMillion;
  const outputCost = (totalOutput / 1_000_000) * outputCostPerMillion;
  return {
    totalTurns,
    totalInputTokens: totalInput,
    totalOutputTokens: totalOutput,
    inputCostUsd: +inputCost.toFixed(4),
    outputCostUsd: +outputCost.toFixed(4),
    totalCostUsd: +(inputCost + outputCost).toFixed(4),
    visualTokenShare: `${((visualTokens / inputPerTurn) * 100).toFixed(0)}%`,
  };
}
```

Context window management strategies
The context window is finite, and multimodal content fills it fast. Here are four strategies that matter most.
Strategy 1: limit frames in context
Cap how many image frames the LLM sees at once. Older frames are dropped as new ones arrive.
```python
from livekit.agents import RoomInputOptions, VideoFrameOptions

room_input_options = RoomInputOptions(
    video_enabled=True,
    video_frame_options=VideoFrameOptions(
        capture_interval=3.0,
        max_frames_in_context=3,  # Only keep 3 most recent frames
        max_width=1024,
        max_height=768,
    ),
)
```

```typescript
const roomInputOptions = {
  videoEnabled: true,
  videoFrameOptions: {
    captureInterval: 3.0,
    maxFramesInContext: 3,
    maxWidth: 1024,
    maxHeight: 768,
  },
};
```

Strategy 2: downscale images aggressively
Reducing image resolution has a direct, proportional impact on token cost.
| Use case | Recommended max_width | Token savings vs 1920px |
|---|---|---|
| General camera observation | 768 | ~60% |
| Reading printed text | 1024 | ~40% |
| Screen share with small text | 1920 | Baseline |
| Handwriting recognition | 1024 | ~40% |
Multimodal LLMs are surprisingly good at reading text and understanding layouts even at reduced resolution. A 1024px-wide image is readable for most printed text. Only screen shares with very small fonts genuinely need full HD. Test with your actual content to find the minimum resolution that works.
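The `max_width`/`max_height` caps translate into actual frame dimensions by scaling the frame to fit inside the box while preserving aspect ratio. A small SDK-independent sketch of that calculation:

```python
def fit_within(width: int, height: int,
               max_width: int, max_height: int) -> tuple[int, int]:
    """Scale (width, height) down to fit inside (max_width, max_height),
    preserving aspect ratio. Never upscales a smaller frame."""
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)
```

A 1920x1080 screen share capped at 1024x768 comes out at 1024x576, roughly a quarter of the original pixel count, while a 640x480 camera frame passes through unchanged.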
Strategy 3: capture only when needed
Instead of capturing frames on a fixed interval, capture only when the user speaks. Silent periods produce zero image tokens.
```python
from livekit.agents import Agent


class SmartCaptureAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a study partner with camera access. You only analyze
            visual content when the user asks you to look at something. Do not describe
            what you see unprompted — wait for the user to ask.""",
        )

    async def on_user_turn_completed(self, turn_ctx):
        # Frames are only captured when the user speaks,
        # so silent periods produce zero image tokens
        await super().on_user_turn_completed(turn_ctx)
```

Strategy 4: summarize and discard
For long sessions, periodically summarize visual context into text and discard the image frames.
```python
from livekit.agents import Agent, function_tool


class ContextEfficientAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a study partner. When you have analyzed a visual
            element (a page, diagram, or screen), create a concise text summary of
            what you saw. After summarizing, you no longer need the original image
            in context — your summary preserves the key information.""",
        )
        self._visual_summaries = []

    @function_tool
    async def summarize_visual(self, context, summary: str):
        """Store a text summary of something you just saw visually.
        Use this after analyzing an image to preserve the information
        without keeping the expensive image in context."""
        self._visual_summaries.append(summary)
        return f"Summary stored. You now have {len(self._visual_summaries)} visual summaries."

    @function_tool
    async def recall_summaries(self, context):
        """Retrieve all stored visual summaries from this session."""
        if not self._visual_summaries:
            return "No visual summaries stored yet."
        return "\n".join(f"[{i+1}] {s}" for i, s in enumerate(self._visual_summaries))
```

```typescript
import { Agent, functionTool } from "@livekit/agents";

class ContextEfficientAgent extends Agent {
  private visualSummaries: string[] = [];

  constructor() {
    super({
      instructions: `You are a study partner. When you have analyzed a visual
        element (a page, diagram, or screen), create a concise text summary of
        what you saw. After summarizing, you no longer need the original image
        in context — your summary preserves the key information.`,
    });
  }

  @functionTool({
    description: "Store a text summary of something you just saw visually.",
  })
  async summarizeVisual(context, { summary }) {
    this.visualSummaries.push(summary);
    return `Summary stored. You now have ${this.visualSummaries.length} visual summaries.`;
  }

  @functionTool({
    description: "Retrieve all stored visual summaries from this session.",
  })
  async recallSummaries(context) {
    if (this.visualSummaries.length === 0) return "No visual summaries stored yet.";
    return this.visualSummaries.map((s, i) => `[${i + 1}] ${s}`).join("\n");
  }
}
```

This pattern trades a one-time cost (analyzing the image and generating a summary) for ongoing savings (the summary replaces the image in all future context). For a 30-minute session where the agent looks at 20 different pages, the savings are substantial — 20 text summaries might cost 2,000 total tokens versus 20,000 tokens for keeping the images.
Track token usage in production
Add logging to track actual token consumption per session. The estimates above are rough — real usage depends on how often the user speaks, how many frames are captured, and how long the conversation history gets. Measure first, then optimize.
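One minimal way to do this is a session-level tracker that accumulates the usage numbers each LLM response reports. This is a sketch, not a LiveKit API: the class, its field names, and the assumption that you can attribute a per-turn image-token count are all illustrative:

```python
from dataclasses import dataclass


@dataclass
class SessionTokenTracker:
    """Accumulates token usage across a session to show where the budget goes."""
    input_tokens: int = 0
    output_tokens: int = 0
    image_tokens: int = 0
    turns: int = 0

    def record_turn(self, input_tokens: int, output_tokens: int,
                    image_tokens: int = 0) -> None:
        # Call once per LLM call with the usage figures the provider returns
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.image_tokens += image_tokens
        self.turns += 1

    def summary(self) -> dict:
        share = (self.image_tokens / self.input_tokens * 100) if self.input_tokens else 0.0
        return {
            "turns": self.turns,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "image_token_share": f"{share:.0f}%",
        }
```

Logging `summary()` at session end tells you whether images really dominate your spend the way the estimates predict, and by how much.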
What is next
In the next chapter, you will build your first vision-capable agent. You will connect a camera feed to a multimodal LLM, configure frame capture, and explore continuous observation versus on-demand analysis patterns.