Courses/Track 4: Frontends & Multimodal
4.1 · Advanced · 2h 30m · 5 chapters

Multimodal Agents

Add vision, avatars, and rich media to voice agents

Extend voice agents with vision, avatars, document processing, and screen sharing. Learn when each modality adds value, manage token costs and latency, and build a complete multimodal agent.

What You Build

A multimodal voice agent with vision, document Q&A, and an avatar, plus cost-aware modality selection.

Prerequisites

  • Course 1.1
  • Course 1.2

Chapters

What You Walk Away With

The ability to add vision, avatars, and document processing to voice agents, with a clear understanding of the cost and latency tradeoffs and of when each modality is worth adding.