Multimodal architecture & cost model
A voice-only agent listens and speaks. A multimodal agent can also see, read documents, display a visual persona, annotate screens, and draw on whiteboards. This chapter covers how LiveKit's Track-based architecture supports each modality, what each one costs in tokens, and how to manage multimodal context so your agent stays fast and affordable.
The Track-based multimodal pipeline
LiveKit treats every modality as a Track within a Room. Audio Tracks carry speech. Video Tracks carry camera feeds, screen shares, or avatar video. Data Tracks carry structured messages like whiteboard strokes, annotations, or document uploads. Your agent subscribes to the Tracks it needs and publishes the Tracks it produces.
| Modality | Track type | Direction | Example |
|---|---|---|---|
| Voice (listen) | Audio | User to agent | Microphone input |
| Voice (speak) | Audio | Agent to user | TTS output |
| Vision | Video | User to agent | Camera feed, screen share |
| Avatar | Video | Agent to user | Tavus avatar with lip sync |
| Whiteboard | Data | Bidirectional | Canvas strokes, annotations |
| Documents | Data | User to agent | PDF pages, images |
1. User publishes media. The user's browser publishes audio and video Tracks into the LiveKit Room. Audio carries their voice. Video carries their camera or screen share.
2. Agent subscribes to Tracks. The agent joins the Room and subscribes to the user's Tracks. The framework routes audio to STT and video frames to the multimodal LLM.
3. Multimodal LLM processes everything. The LLM receives the text transcript from STT alongside image frames from the video Track. It reasons across both modalities and generates a text response.
4. Agent publishes responses. The text response goes to TTS and is published as an audio Track. If an avatar is configured, a video Track with lip-synced video is also published.
Think of a multimodal agent as a video call participant who can see your camera, look at your screen, show you their face, and draw on a shared whiteboard — all while having a natural conversation. Each capability is a separate Track in the LiveKit Room, and the agent subscribes or publishes as needed.
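The routing logic this implies can be sketched as a toy model in plain Python. This is not the LiveKit API (the real SDK surfaces track kinds on subscription events); the `Track` dataclass, `TrackKind` enum, and pipeline-stage names here are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TrackKind(Enum):
    AUDIO = auto()
    VIDEO = auto()
    DATA = auto()


@dataclass
class Track:
    kind: TrackKind
    source: str  # e.g. "microphone", "camera", "screen_share", "whiteboard"


def route_track(track: Track) -> str:
    """Decide which pipeline stage consumes a newly subscribed track."""
    if track.kind is TrackKind.AUDIO:
        return "stt"            # speech goes to speech-to-text
    if track.kind is TrackKind.VIDEO:
        return "frame_sampler"  # frames are sampled for the multimodal LLM
    return "data_handler"       # strokes, annotations, document uploads
```

The point is the shape of the dispatch: every modality arrives as a track with a kind, and the agent decides per-kind which consumer gets it.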
Image token costs: the dominant expense
Image frames are expensive. A single camera frame can consume 1,000 tokens. A high-resolution screen share can burn through 2,000. In a multimodal conversation, images account for 80-95% of your token budget. Optimizing image handling has far greater impact than optimizing text prompts.
| Image resolution | Approximate tokens | Typical use case |
|---|---|---|
| 512 x 512 | 250-350 | Thumbnail, small diagram |
| 1024 x 768 | 750-1,000 | Camera frame, standard capture |
| 1920 x 1080 | 1,500-2,000 | Screen share, full HD |
| 3840 x 2160 | 4,000-5,000 | 4K screen share (avoid this) |
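Ranges like these can be approximated with a tile-based heuristic similar to what some vision APIs use: a flat base cost plus a per-512px-tile cost. The constants below are illustrative assumptions, and real provider formulas also downscale large images before tiling, so treat this as a ballpark tool, not a billing calculator:

```python
import math


def approx_image_tokens(width: int, height: int,
                        base: int = 85, per_tile: int = 170,
                        tile: int = 512) -> int:
    """Rough token estimate: a flat base cost plus a cost per 512x512 tile.

    Assumed constants; providers differ and may rescale the image first.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles
```

For example, `approx_image_tokens(512, 512)` gives 255 and `approx_image_tokens(1024, 768)` gives 765, both inside the ranges in the table above.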
Images dominate multimodal costs
A typical text message costs 20-50 tokens. A single image costs 750-2,000 tokens. If you send a frame every 3 seconds and keep 4 frames in context at 1,000 tokens per frame, that is 4,000 tokens of visual context in every LLM call — before any conversation history, document content, or system instructions.
Modality selection framework
Each modality adds value but also adds cost, latency, and failure modes. Use this decision framework:
| Modality | Add when | Skip when | Latency impact | Token cost |
|---|---|---|---|---|
| Vision (camera) | User needs to show physical objects | All content is digital/text | +1-3s per LLM call | 750-1,000/frame |
| Vision (screen share) | User needs guided help on screen | Sharing static screenshots works | +1-3s per LLM call | 1,500-2,000/frame |
| Avatar | Visual presence increases trust or engagement | Voice-only is sufficient | Minimal (separate pipeline) | None (bandwidth only) |
| Whiteboard | Concepts need visual explanation | Text descriptions suffice | Minimal (data channel) | ~1,000/snapshot |
| Documents | User has PDFs or images to discuss | All content is verbal | Minimal (on-demand tool) | Varies by length |
Start with voice. Add vision when the user needs to show you something. Add the avatar when visual presence matters. Each modality is an independent Track or data channel — they compose cleanly because they are architecturally separate. But each one you add increases complexity, cost, and potential failure points.
Session cost estimation
Here is a rough cost model for a 30-minute study session with GPT-4o:
| Component | Token estimate | Notes |
|---|---|---|
| System instructions | 500 | Fixed per call |
| Conversation history | 3,000 | Grows over session |
| Visual context (3 frames) | 3,000 | Per LLM call |
| Document content (tool calls) | 1,000 | On demand |
| Per LLM call total | ~7,500 | Input tokens |
| Calls in 30 minutes | ~60 | One per user turn |
| Total session input tokens | ~450,000 | Majority is images |
```python
# Estimate session cost based on modality configuration
def estimate_session_cost(
    duration_minutes: int = 30,
    turns_per_minute: float = 2.0,
    frames_in_context: int = 3,
    tokens_per_frame: int = 1000,
    system_tokens: int = 500,
    avg_history_tokens: int = 3000,
    input_cost_per_million: float = 2.50,  # GPT-4o input pricing
    output_cost_per_million: float = 10.00,  # GPT-4o output pricing
    avg_output_tokens: int = 150,
):
    total_turns = int(duration_minutes * turns_per_minute)
    visual_tokens = frames_in_context * tokens_per_frame
    input_per_turn = system_tokens + avg_history_tokens + visual_tokens
    total_input = input_per_turn * total_turns
    total_output = avg_output_tokens * total_turns
    input_cost = (total_input / 1_000_000) * input_cost_per_million
    output_cost = (total_output / 1_000_000) * output_cost_per_million
    return {
        "total_turns": total_turns,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "input_cost_usd": round(input_cost, 4),
        "output_cost_usd": round(output_cost, 4),
        "total_cost_usd": round(input_cost + output_cost, 4),
        "visual_token_share": f"{(visual_tokens / input_per_turn) * 100:.0f}%",
    }
```

```typescript
function estimateSessionCost({
  durationMinutes = 30,
  turnsPerMinute = 2.0,
  framesInContext = 3,
  tokensPerFrame = 1000,
  systemTokens = 500,
  avgHistoryTokens = 3000,
  inputCostPerMillion = 2.50,
  outputCostPerMillion = 10.00,
  avgOutputTokens = 150,
} = {}) {
  const totalTurns = Math.floor(durationMinutes * turnsPerMinute);
  const visualTokens = framesInContext * tokensPerFrame;
  const inputPerTurn = systemTokens + avgHistoryTokens + visualTokens;
  const totalInput = inputPerTurn * totalTurns;
  const totalOutput = avgOutputTokens * totalTurns;
  const inputCost = (totalInput / 1_000_000) * inputCostPerMillion;
  const outputCost = (totalOutput / 1_000_000) * outputCostPerMillion;
  return {
    totalTurns,
    totalInputTokens: totalInput,
    totalOutputTokens: totalOutput,
    inputCostUsd: +inputCost.toFixed(4),
    outputCostUsd: +outputCost.toFixed(4),
    totalCostUsd: +(inputCost + outputCost).toFixed(4),
    visualTokenShare: `${((visualTokens / inputPerTurn) * 100).toFixed(0)}%`,
  };
}
```

Context window management strategies
The context window is finite, and multimodal content fills it fast. Here are four strategies that matter most.
Strategy 1: limit frames in context
Cap how many image frames the LLM sees at once. Older frames are dropped as new ones arrive.
```python
from livekit.agents import RoomInputOptions, VideoFrameOptions

room_input_options = RoomInputOptions(
    video_enabled=True,
    video_frame_options=VideoFrameOptions(
        capture_interval=3.0,
        max_frames_in_context=3,  # Only keep 3 most recent frames
        max_width=1024,
        max_height=768,
    ),
)
```

```typescript
const roomInputOptions = {
  videoEnabled: true,
  videoFrameOptions: {
    captureInterval: 3.0,
    maxFramesInContext: 3,
    maxWidth: 1024,
    maxHeight: 768,
  },
};
```

Strategy 2: downscale images aggressively
Reducing image resolution has a direct, proportional impact on token cost.
| Use case | Recommended max_width | Token savings vs 1920px |
|---|---|---|
| General camera observation | 768 | ~60% |
| Reading printed text | 1024 | ~40% |
| Screen share with small text | 1920 | Baseline |
| Handwriting recognition | 1024 | ~40% |
Multimodal LLMs are surprisingly good at reading text and understanding layouts even at reduced resolution. A 1024px-wide image is readable for most printed text. Only screen shares with very small fonts genuinely need full HD. Test with your actual content to find the minimum resolution that works.
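The `max_width`/`max_height` caps translate into actual frame dimensions by scaling the frame to fit inside the box while preserving aspect ratio. A small SDK-independent sketch of that calculation:

```python
def fit_within(width: int, height: int,
               max_width: int, max_height: int) -> tuple[int, int]:
    """Scale (width, height) down to fit inside (max_width, max_height),
    preserving aspect ratio. Never upscales a smaller frame."""
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)
```

A 1920x1080 screen share capped at 1024x768 comes out at 1024x576, roughly a quarter of the original pixel count, while a 640x480 camera frame passes through unchanged.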
Strategy 3: capture only when needed
Instead of capturing frames on a fixed interval, capture only when the user speaks. Silent periods produce zero image tokens.
```python
from livekit.agents import Agent


class SmartCaptureAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a study partner with camera access. You only analyze
            visual content when the user asks you to look at something. Do not describe
            what you see unprompted — wait for the user to ask.""",
        )

    async def on_user_turn_completed(self, turn_ctx):
        # Frames are only captured when the user speaks,
        # so silent periods produce zero image tokens
        await super().on_user_turn_completed(turn_ctx)
```

Strategy 4: summarize and discard
For long sessions, periodically summarize visual context into text and discard the image frames.
```python
from livekit.agents import Agent, function_tool


class ContextEfficientAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a study partner. When you have analyzed a visual
            element (a page, diagram, or screen), create a concise text summary of
            what you saw. After summarizing, you no longer need the original image
            in context — your summary preserves the key information.""",
        )
        self._visual_summaries = []

    @function_tool
    async def summarize_visual(self, context, summary: str):
        """Store a text summary of something you just saw visually.
        Use this after analyzing an image to preserve the information
        without keeping the expensive image in context."""
        self._visual_summaries.append(summary)
        return f"Summary stored. You now have {len(self._visual_summaries)} visual summaries."

    @function_tool
    async def recall_summaries(self, context):
        """Retrieve all stored visual summaries from this session."""
        if not self._visual_summaries:
            return "No visual summaries stored yet."
        return "\n".join(f"[{i+1}] {s}" for i, s in enumerate(self._visual_summaries))
```

```typescript
import { Agent, functionTool } from "@livekit/agents";

class ContextEfficientAgent extends Agent {
  private visualSummaries: string[] = [];

  constructor() {
    super({
      instructions: `You are a study partner. When you have analyzed a visual
        element (a page, diagram, or screen), create a concise text summary of
        what you saw. After summarizing, you no longer need the original image
        in context — your summary preserves the key information.`,
    });
  }

  @functionTool({
    description: "Store a text summary of something you just saw visually.",
  })
  async summarizeVisual(context, { summary }) {
    this.visualSummaries.push(summary);
    return `Summary stored. You now have ${this.visualSummaries.length} visual summaries.`;
  }

  @functionTool({
    description: "Retrieve all stored visual summaries from this session.",
  })
  async recallSummaries(context) {
    if (this.visualSummaries.length === 0) return "No visual summaries stored yet.";
    return this.visualSummaries.map((s, i) => `[${i + 1}] ${s}`).join("\n");
  }
}
```

This pattern trades a one-time cost (analyzing the image and generating a summary) for ongoing savings (the summary replaces the image in all future context). For a 30-minute session where the agent looks at 20 different pages, the savings are substantial — 20 text summaries might cost 2,000 total tokens versus 20,000 tokens for keeping the images.
Track token usage in production
Add logging to track actual token consumption per session. The estimates above are rough — real usage depends on how often the user speaks, how many frames are captured, and how long the conversation history gets. Measure first, then optimize.
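One minimal way to do this is a session-level tracker that accumulates the usage numbers each LLM response reports. This is a sketch, not a LiveKit API: the class, its field names, and the assumption that you can attribute a per-turn image-token count are all illustrative:

```python
from dataclasses import dataclass


@dataclass
class SessionTokenTracker:
    """Accumulates token usage across a session to show where the budget goes."""
    input_tokens: int = 0
    output_tokens: int = 0
    image_tokens: int = 0
    turns: int = 0

    def record_turn(self, input_tokens: int, output_tokens: int,
                    image_tokens: int = 0) -> None:
        # Call once per LLM call with the usage figures the provider returns
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.image_tokens += image_tokens
        self.turns += 1

    def summary(self) -> dict:
        share = (self.image_tokens / self.input_tokens * 100) if self.input_tokens else 0.0
        return {
            "turns": self.turns,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "image_token_share": f"{share:.0f}%",
        }
```

Logging `summary()` at session end tells you whether images really dominate your spend the way the estimates predict, and by how much.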
What is next
In the next chapter, you will build your first vision-capable agent. You will connect a camera feed to a multimodal LLM, configure frame capture, and explore continuous observation versus on-demand analysis patterns.