Chapter 12

Multimodal architecture and cost model

A voice-only agent listens and speaks. A multimodal agent can also see, read documents, display a visual persona, annotate screens, and draw on whiteboards. This chapter covers how LiveKit's Track-based architecture supports each modality, what each one costs in tokens, and how to manage multimodal context so your agent stays fast and affordable.

Track types · Token costs · Modality selection · Context management

The Track-based multimodal pipeline

LiveKit treats every modality as a Track within a Room. Audio Tracks carry speech. Video Tracks carry camera feeds, screen shares, or avatar video. Data Tracks carry structured messages like whiteboard strokes, annotations, or document uploads. Your agent subscribes to the Tracks it needs and publishes the Tracks it produces.

| Modality | Track type | Direction | Example |
|---|---|---|---|
| Voice (listen) | Audio | User to agent | Microphone input |
| Voice (speak) | Audio | Agent to user | TTS output |
| Vision | Video | User to agent | Camera feed, screen share |
| Avatar | Video | Agent to user | Tavus avatar with lip sync |
| Whiteboard | Data | Bidirectional | Canvas strokes, annotations |
| Documents | Data | User to agent | PDF pages, images |

1. User publishes media

The user's browser publishes audio and video Tracks into the LiveKit Room. Audio carries their voice. Video carries their camera or screen share.

2. Agent subscribes to Tracks

The agent joins the Room and subscribes to the user's Tracks. The framework routes audio to STT and video frames to the multimodal LLM.

3. Multimodal LLM processes everything

The LLM receives the text transcript from STT alongside image frames from the video Track. It reasons across both modalities and generates a text response.

4. Agent publishes responses

The text response goes to TTS and is published as an audio Track. If an avatar is configured, a lip-synced avatar video Track is also published.

What's happening

Think of a multimodal agent as a video call participant who can see your camera, look at your screen, show you their face, and draw on a shared whiteboard — all while having a natural conversation. Each capability is a separate Track in the LiveKit Room, and the agent subscribes or publishes as needed.
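
As a concrete sketch, here is roughly how an agent process might wire this up with the LiveKit Agents Python SDK. The room_input_options usage follows the RoomInputOptions API shown later in this chapter; the specific STT/LLM/TTS plugins are placeholder assumptions, not requirements.

from livekit import agents
from livekit.agents import Agent, AgentSession, RoomInputOptions
from livekit.plugins import deepgram, openai  # assumed provider plugins

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # STT -> LLM -> TTS pipeline; the LLM must accept image input
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(),
    )

    await session.start(
        agent=Agent(instructions="You are a multimodal study partner."),
        room=ctx.room,
        # Subscribe to the user's video Track (camera or screen share)
        # in addition to their audio Track
        room_input_options=RoomInputOptions(video_enabled=True),
    )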

Image token costs: the dominant expense

Image frames are expensive. A single camera frame can consume 1,000 tokens. A high-resolution screen share can burn through 2,000. In a multimodal conversation, images account for 80-95% of your token budget. Optimizing image handling has far greater impact than optimizing text prompts.

| Image resolution | Approximate tokens | Typical use case |
|---|---|---|
| 512 x 512 | 250-350 | Thumbnail, small diagram |
| 1024 x 768 | 750-1,000 | Camera frame, standard capture |
| 1920 x 1080 | 1,500-2,000 | Screen share, full HD |
| 3840 x 2160 | 4,000-5,000 | 4K screen share (avoid this) |

Images dominate multimodal costs

A typical text message costs 20-50 tokens. A single image costs 750-2,000 tokens. If you send a frame every 3 seconds and keep 4 frames in context at 1,000 tokens per frame, that is 4,000 tokens of visual context in every LLM call — before any conversation history, document content, or system instructions.
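
To put that arithmetic in one place, here is a tiny sketch using the chapter's planning numbers (rough budgeting figures, not an exact provider formula):

frames_in_context = 4       # frames kept in the LLM context
tokens_per_frame = 1_000    # planning figure for a ~1024 x 768 frame
visual_tokens_per_call = frames_in_context * tokens_per_frame

# A typical text turn is 20-50 tokens, so one retained frame costs
# roughly 20-50x as much as the user's message itself.
print(visual_tokens_per_call)  # 4000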

Modality selection framework

Each modality adds value but also adds cost, latency, and failure modes. Use this decision framework:

| Modality | Add when | Skip when | Latency impact | Token cost |
|---|---|---|---|---|
| Vision (camera) | User needs to show physical objects | All content is digital/text | +1-3s per LLM call | 750-1,000/frame |
| Vision (screen share) | User needs guided help on screen | Sharing static screenshots works | +1-3s per LLM call | 1,500-2,000/frame |
| Avatar | Visual presence increases trust or engagement | Voice-only is sufficient | Minimal (separate pipeline) | None (bandwidth only) |
| Whiteboard | Concepts need visual explanation | Text descriptions suffice | Minimal (data channel) | ~1,000/snapshot |
| Documents | User has PDFs or images to discuss | All content is verbal | Minimal (on-demand tool) | Varies by length |

What's happening

Start with voice. Add vision when the user needs to show you something. Add the avatar when visual presence matters. Each modality is an independent Track or data channel — they compose cleanly because they are architecturally separate. But each one you add increases complexity, cost, and potential failure points.
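
One way to keep that composition explicit in code is a small configuration helper. This is an illustrative sketch (the ModalityPlan dataclass is hypothetical): only vision changes what the agent subscribes to, while the avatar and whiteboard run through separate publish and data paths.

from dataclasses import dataclass

from livekit.agents import RoomInputOptions

@dataclass
class ModalityPlan:
    # Hypothetical checklist of optional modalities for a session
    camera: bool = False        # user shows physical objects
    screen_share: bool = False  # guided help on the user's screen
    avatar: bool = False        # handled by a separate video publishing pipeline
    whiteboard: bool = False    # handled over the data channel

def build_room_input_options(plan: ModalityPlan) -> RoomInputOptions:
    # Only vision affects which Tracks the agent subscribes to
    return RoomInputOptions(video_enabled=plan.camera or plan.screen_share)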

Session cost estimation

Here is a rough cost model for a 30-minute study session with GPT-4o:

| Component | Token estimate | Notes |
|---|---|---|
| System instructions | 500 | Fixed per call |
| Conversation history | 3,000 | Grows over session |
| Visual context (3 frames) | 3,000 | Per LLM call |
| Document content (tool calls) | 1,000 | On demand |
| Per LLM call total | ~7,500 | Input tokens |
| Calls in 30 minutes | ~60 | One per user turn |
| Total session input tokens | ~450,000 | Majority is images |

# Estimate session cost based on modality configuration
def estimate_session_cost(
  duration_minutes: int = 30,
  turns_per_minute: float = 2.0,
  frames_in_context: int = 3,
  tokens_per_frame: int = 1000,
  system_tokens: int = 500,
  avg_history_tokens: int = 3000,
  input_cost_per_million: float = 2.50,   # GPT-4o input pricing
  output_cost_per_million: float = 10.00,  # GPT-4o output pricing
  avg_output_tokens: int = 150,
):
  total_turns = int(duration_minutes * turns_per_minute)
  visual_tokens = frames_in_context * tokens_per_frame
  input_per_turn = system_tokens + avg_history_tokens + visual_tokens
  total_input = input_per_turn * total_turns
  total_output = avg_output_tokens * total_turns

  input_cost = (total_input / 1_000_000) * input_cost_per_million
  output_cost = (total_output / 1_000_000) * output_cost_per_million

  return {
      "total_turns": total_turns,
      "total_input_tokens": total_input,
      "total_output_tokens": total_output,
      "input_cost_usd": round(input_cost, 4),
      "output_cost_usd": round(output_cost, 4),
      "total_cost_usd": round(input_cost + output_cost, 4),
      "visual_token_share": f"{(visual_tokens / input_per_turn) * 100:.0f}%",
  }
function estimateSessionCost({
  durationMinutes = 30,
  turnsPerMinute = 2.0,
  framesInContext = 3,
  tokensPerFrame = 1000,
  systemTokens = 500,
  avgHistoryTokens = 3000,
  inputCostPerMillion = 2.50,
  outputCostPerMillion = 10.00,
  avgOutputTokens = 150,
} = {}) {
  const totalTurns = Math.floor(durationMinutes * turnsPerMinute);
  const visualTokens = framesInContext * tokensPerFrame;
  const inputPerTurn = systemTokens + avgHistoryTokens + visualTokens;
  const totalInput = inputPerTurn * totalTurns;
  const totalOutput = avgOutputTokens * totalTurns;

  const inputCost = (totalInput / 1_000_000) * inputCostPerMillion;
  const outputCost = (totalOutput / 1_000_000) * outputCostPerMillion;

  return {
    totalTurns,
    totalInputTokens: totalInput,
    totalOutputTokens: totalOutput,
    inputCostUsd: +inputCost.toFixed(4),
    outputCostUsd: +outputCost.toFixed(4),
    totalCostUsd: +(inputCost + outputCost).toFixed(4),
    visualTokenShare: `${((visualTokens / inputPerTurn) * 100).toFixed(0)}%`,
  };
}
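
As a quick sanity check, here is the Python version called with its defaults (which omit the on-demand document tokens from the table above, so the total lands a bit below ~450,000) and again with a single retained frame:

baseline = estimate_session_cost()
lean = estimate_session_cost(frames_in_context=1)

print(baseline["total_input_tokens"], baseline["total_cost_usd"])  # 390000 1.065
print(lean["total_input_tokens"], lean["total_cost_usd"])          # 270000 0.765
# Dropping from 3 retained frames to 1 removes ~120,000 input tokens per session.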

Context window management strategies

The context window is finite, and multimodal content fills it fast. Here are four strategies that matter most.

Strategy 1: limit frames in context

Cap how many image frames the LLM sees at once. Older frames are dropped as new ones arrive.

from livekit.agents import RoomInputOptions, VideoFrameOptions

room_input_options = RoomInputOptions(
  video_enabled=True,
  video_frame_options=VideoFrameOptions(
      capture_interval=3.0,
      max_frames_in_context=3,  # Only keep 3 most recent frames
      max_width=1024,
      max_height=768,
  ),
)
const roomInputOptions = {
  videoEnabled: true,
  videoFrameOptions: {
    captureInterval: 3.0,
    maxFramesInContext: 3,
    maxWidth: 1024,
    maxHeight: 768,
  },
};

Strategy 2: downscale images aggressively

Reducing image resolution has a direct, proportional impact on token cost.

| Use case | Recommended max_width | Token savings vs 1920px |
|---|---|---|
| General camera observation | 768 | ~60% |
| Reading printed text | 1024 | ~40% |
| Screen share with small text | 1920 | Baseline |
| Handwriting recognition | 1024 | ~40% |

What's happening

Multimodal LLMs are surprisingly good at reading text and understanding layouts even at reduced resolution. A 1024px-wide image is readable for most printed text. Only screen shares with very small fonts genuinely need full HD. Test with your actual content to find the minimum resolution that works.
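
If you pre-process frames yourself rather than relying on the framework's max_width/max_height options, a minimal downscaling sketch with Pillow (an assumed dependency) looks like this:

from io import BytesIO

from PIL import Image

def downscale_frame(jpeg_bytes: bytes, max_width: int = 1024) -> bytes:
    """Shrink a captured frame to at most max_width pixels wide,
    preserving aspect ratio, before sending it to the LLM."""
    img = Image.open(BytesIO(jpeg_bytes))
    if img.width > max_width:
        new_height = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_height), Image.Resampling.LANCZOS)
    out = BytesIO()
    img.save(out, format="JPEG", quality=85)
    return out.getvalue()

At max_width=1024, a 1920 x 1080 capture becomes 1024 x 576, which is usually enough for printed text per the table above.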

Strategy 3: capture only when needed

Instead of capturing frames on a fixed interval, capture only when the user speaks. Silent periods produce zero image tokens.

from livekit.agents import Agent

class SmartCaptureAgent(Agent):
  def __init__(self):
      super().__init__(
          instructions="""You are a study partner with camera access. You only analyze
          visual content when the user asks you to look at something. Do not describe
          what you see unprompted — wait for the user to ask.""",
      )

  async def on_user_turn_completed(self, turn_ctx, new_message):
      # Frames are only captured when the user speaks,
      # so silent periods produce zero image tokens
      await super().on_user_turn_completed(turn_ctx, new_message)

Strategy 4: summarize and discard

For long sessions, periodically summarize visual context into text and discard the image frames.

from livekit.agents import Agent, function_tool

class ContextEfficientAgent(Agent):
  def __init__(self):
      super().__init__(
          instructions="""You are a study partner. When you have analyzed a visual
          element (a page, diagram, or screen), create a concise text summary of
          what you saw. After summarizing, you no longer need the original image
          in context — your summary preserves the key information.""",
      )
      self._visual_summaries = []

  @function_tool
  async def summarize_visual(self, context, summary: str):
      """Store a text summary of something you just saw visually.
      Use this after analyzing an image to preserve the information
      without keeping the expensive image in context."""
      self._visual_summaries.append(summary)
      return f"Summary stored. You now have {len(self._visual_summaries)} visual summaries."

  @function_tool
  async def recall_summaries(self, context):
      """Retrieve all stored visual summaries from this session."""
      if not self._visual_summaries:
          return "No visual summaries stored yet."
      return "\n\n".join(f"[{i+1}] {s}" for i, s in enumerate(self._visual_summaries))
import { Agent, functionTool } from "@livekit/agents";

class ContextEfficientAgent extends Agent {
  private visualSummaries: string[] = [];

  constructor() {
    super({
      instructions: `You are a study partner. When you have analyzed a visual
        element (a page, diagram, or screen), create a concise text summary of
        what you saw. After summarizing, you no longer need the original image
        in context — your summary preserves the key information.`,
    });
  }

  @functionTool({
    description: "Store a text summary of something you just saw visually.",
  })
  async summarizeVisual(context, { summary }) {
    this.visualSummaries.push(summary);
    return `Summary stored. You now have ${this.visualSummaries.length} visual summaries.`;
  }

  @functionTool({
    description: "Retrieve all stored visual summaries from this session.",
  })
  async recallSummaries(context) {
    if (this.visualSummaries.length === 0) return "No visual summaries stored yet.";
    return this.visualSummaries.map((s, i) => `[${i + 1}] ${s}`).join("\n\n");
  }
}
What's happening

This pattern trades a one-time cost (analyzing the image and generating a summary) for ongoing savings (the summary replaces the image in all future context). For a 30-minute session where the agent looks at 20 different pages, the savings are substantial — 20 text summaries might cost 2,000 total tokens versus 20,000 tokens for keeping the images.

Track token usage in production

Add logging to track actual token consumption per session. The estimates above are rough — real usage depends on how often the user speaks, how many frames are captured, and how long the conversation history gets. Measure first, then optimize.
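
One way to do this with the LiveKit Agents Python SDK is the built-in metrics events and UsageCollector. The sketch below assumes the AgentSession (session) and JobContext (ctx) from your entrypoint, and that your SDK version exposes MetricsCollectedEvent:

from livekit.agents import MetricsCollectedEvent, metrics

usage = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    # Log each batch of STT/LLM/TTS metrics and fold it into the running total
    metrics.log_metrics(ev.metrics)
    usage.collect(ev.metrics)

async def log_usage_summary():
    # Summarize prompt/completion token usage when the session shuts down
    summary = usage.get_summary()
    print(f"Session usage: {summary}")

ctx.add_shutdown_callback(log_usage_summary)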

What is next

In the next chapter, you will build your first vision-capable agent. You will connect a camera feed to a multimodal LLM, configure frame capture, and explore continuous observation versus on-demand analysis patterns.

Concepts covered
Track types · Token costs by resolution · Latency impact · Modality selection framework