Multimodal Agents
Add vision, avatars, and rich media to voice agents
Extend voice agents with vision, avatars, document processing, and screen sharing. Learn when each modality adds value, manage token costs and latency, and build a complete multimodal agent.
What You Build
Multimodal voice agent with vision, document Q&A, and avatar — with cost-aware modality selection.
Prerequisites
- Course 1.1
- Course 1.2
Multimodal architecture & cost model (25m)
How multimodal works in LiveKit (audio, video, and data tracks). The token cost model for images (250-5000 tokens per frame). When adding vision or avatars is worth the cost and latency — and when it is not.
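To make the 250-5000 tokens-per-frame range concrete, here is a rough session-cost sketch. The tile-based "high detail" heuristic and the specific constants are illustrative assumptions, not vendor pricing:

```python
# Rough per-frame and per-session token estimator for vision input.
# Assumes the 250-5000 tokens/frame range above; the ~512px tile
# heuristic for "high" detail is a common vision-model convention,
# assumed here for illustration.

def frame_tokens(width: int, height: int, detail: str = "low") -> int:
    """Approximate token cost of one frame."""
    if detail == "low":
        return 250  # flat floor for downscaled frames
    tiles_x = -(-width // 512)   # ceil division
    tiles_y = -(-height // 512)
    return min(250 + 170 * tiles_x * tiles_y, 5000)

def session_image_tokens(frames: int, width: int, height: int,
                         detail: str = "high") -> int:
    """Total image tokens for a session of `frames` captures."""
    return frames * frame_tokens(width, height, detail)
```

For example, one 720p frame every 5 seconds over a 10-minute call is 120 frames — at high detail that already runs to six figures of tokens, which is why continuous observation is rarely worth it.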
Vision agents (25m)
Add vision to voice agents: frame capture configuration, continuous observation vs on-demand analysis, screen reading patterns, and handling vision limitations (poor lighting, small text, hallucination).
Avatar integration (20m)
Give your agent a face with Tavus avatars. Understand the architecture (TTS → avatar lip-sync → video track), configure quality vs bandwidth, and render in the frontend.
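The quality-vs-bandwidth decision is a simple ladder: pick the highest avatar resolution whose bitrate fits the link. A sketch with ballpark H.264 bitrates — these figures are illustrative assumptions, not Tavus or LiveKit specifications:

```python
# Pick the best avatar video profile that fits the available bandwidth.
# (name, width, height, approx bits/sec), ordered best-first.
AVATAR_PROFILES = [
    ("720p", 1280, 720, 1_800_000),
    ("480p", 854, 480, 900_000),
    ("360p", 640, 360, 500_000),
]

def pick_avatar_profile(available_bps: int) -> tuple:
    """Highest profile that fits; fall back to the lowest otherwise."""
    for profile in AVATAR_PROFILES:
        if profile[3] <= available_bps:
            return profile
    return AVATAR_PROFILES[-1]
```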
Document processing & screen sharing (25m)
Handle document uploads via data channels, extract text from PDFs, use vision models for OCR, analyze screen shares, and overlay annotations with normalized coordinates.
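Normalized coordinates (0.0-1.0) keep annotations pinned to the right spot even when the viewer renders the shared screen at a different resolution than the capture. A minimal conversion sketch (function names are illustrative):

```python
# Convert between pixel and normalized (0.0-1.0) coordinates so an
# annotation drawn on a 1920x1080 capture lands correctly on a
# 1280x720 viewer.

def to_normalized(x_px: int, y_px: int, src_w: int, src_h: int) -> tuple[float, float]:
    """Pixel position on the capture → resolution-independent coords."""
    return x_px / src_w, y_px / src_h

def to_pixels(x_n: float, y_n: float, dst_w: int, dst_h: int) -> tuple[int, int]:
    """Resolution-independent coords → pixel position on the viewer."""
    return round(x_n * dst_w), round(y_n * dst_h)
```

For example, an annotation at (960, 540) on a 1920x1080 capture normalizes to (0.5, 0.5) and renders at (640, 360) on a 1280x720 viewer.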
Building a complete multimodal agent (25m)
Build a complete study partner agent combining voice, vision, documents, and drawing. Manage context windows (images are 80-95% of token cost), estimate session costs, and test each modality in isolation.
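Because images dominate the token bill, one simple context-management policy is to drop the oldest image messages first when over budget while keeping the text transcript intact. A sketch under an assumed message shape (`kind`/`tokens` keys are illustrative):

```python
# Trim a conversation back under a token budget by evicting the oldest
# image messages first; text turns are kept so the dialogue history
# stays coherent.

def trim_context(messages: list[dict], budget: int) -> list[dict]:
    """messages: [{'kind': 'text' | 'image', 'tokens': int, ...}], oldest first."""
    total = sum(m["tokens"] for m in messages)
    kept = list(messages)
    for m in messages:  # walk oldest → newest
        if total <= budget:
            break
        if m["kind"] == "image":
            kept.remove(m)
            total -= m["tokens"]
    return kept
```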
What You Walk Away With
Ability to add vision, avatars, and document processing to voice agents with clear understanding of cost/latency tradeoffs and when each modality is worth adding.