Gemini Live
Google's Gemini Live is the second major realtime model available in LiveKit. Where OpenAI's Realtime API focuses on audio-only speech-to-speech, Gemini Live is natively multimodal — it can process audio and video input simultaneously, opening up use cases that neither a pipeline agent nor OpenAI's Realtime API can match. In this chapter, you will implement a Gemini Live agent, configure its multimodal capabilities, and compare it directly with the OpenAI Realtime agent you built in Chapter 3.
What you'll learn
- How to set up a Gemini Live realtime agent in LiveKit
- Configuring voice, modalities, and multimodal input
- Sending video frames alongside audio for vision-aware conversations
- Key differences between Gemini Live and OpenAI Realtime
Setting up Gemini Live
The basic Gemini Live agent follows the same pattern as OpenAI Realtime — pass the realtime model as the llm parameter:
```python
from livekit.agents import Agent, AgentServer, AgentSession
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a helpful assistant powered by Gemini. "
                "Keep responses concise — two sentences maximum. "
                "Never use markdown or bullet points."
            ),
        ),
        room=session.room,
        llm=google_realtime.RealtimeModel(
            model="gemini-2.0-flash",
            voice="Puck",
        ),
    )


if __name__ == "__main__":
    server.run()
```

```typescript
import { Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a helpful assistant powered by Gemini. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Puck",
      }),
    });
  },
});
```

The structure is identical to the OpenAI Realtime agent from Chapter 3. The framework handles the differences between providers internally — you swap the import and the model configuration, and everything else stays the same.
Gemini 2.0 Flash is optimized for low latency and multimodal input. Unlike GPT-4o Realtime, which is built on a larger general-purpose model, Gemini Flash prioritizes speed. This makes it a strong choice when response time matters more than deep reasoning, and especially when you want to combine audio with visual context.
Configuring voice and behavior
Gemini Live offers its own set of voices and configuration options:
```python
from livekit.plugins.google import realtime as google_realtime

model = google_realtime.RealtimeModel(
    model="gemini-2.0-flash",
    voice="Puck",          # Voice: Puck, Charon, Kore, Fenrir, Aoede
    temperature=0.7,       # Controls response creativity
    modalities=["AUDIO"],  # Output modality
)
```

```typescript
import { RealtimeModel } from "@livekit/agents-plugins-google";

const model = new RealtimeModel({
  model: "gemini-2.0-flash",
  voice: "Puck",           // Voice: Puck, Charon, Kore, Fenrir, Aoede
  temperature: 0.7,        // Controls response creativity
  modalities: ["AUDIO"],   // Output modality
});
```

Available voices
| Voice | Character |
|---|---|
| Puck | Playful, energetic |
| Charon | Deep, measured |
| Kore | Warm, friendly |
| Fenrir | Strong, authoritative |
| Aoede | Melodic, expressive |
Different voices, different strengths
Gemini's voices are distinct from OpenAI's. Neither set is objectively better — they have different tonal qualities. Test both providers with your target audience to determine which voice best fits your application's personality.
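If you experiment with voices programmatically, a small helper that maps an application persona to a voice name keeps the choice in one place. This is an illustrative sketch: the persona labels are invented for this example, while the voice names are the five listed in the table above.

```python
# Map an application persona to a Gemini Live voice name.
# The persona keys are arbitrary labels chosen for this example;
# the values are the five Gemini Live voices from the table above.
PERSONA_TO_VOICE = {
    "playful": "Puck",
    "measured": "Charon",
    "friendly": "Kore",
    "authoritative": "Fenrir",
    "expressive": "Aoede",
}


def choose_voice(persona: str, default: str = "Puck") -> str:
    """Return the Gemini voice for a persona, falling back to a default."""
    return PERSONA_TO_VOICE.get(persona, default)


print(choose_voice("friendly"))       # Kore
print(choose_voice("unknown-label"))  # Puck (fallback)
```

The returned string can then be passed as the voice argument when constructing the RealtimeModel.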
Multimodal input: audio plus vision
The defining feature of Gemini Live is native multimodal input. The model can process the user's camera feed or screen share alongside their voice, enabling vision-aware conversations without bolting on a separate vision model.
```python
from livekit.agents import Agent, AgentServer, AgentSession
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a visual assistant. When the user shares their camera, "
                "describe what you see and answer questions about it. "
                "When no video is available, respond to audio only."
            ),
        ),
        room=session.room,
        llm=google_realtime.RealtimeModel(
            model="gemini-2.0-flash",
            voice="Kore",
            # Video frames from the user's track are sent automatically
            # when the model is configured for multimodal input
        ),
    )


if __name__ == "__main__":
    server.run()
```

```typescript
import { Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a visual assistant. When the user shares their camera, " +
          "describe what you see and answer questions about it. " +
          "When no video is available, respond to audio only.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Kore",
      }),
    });
  },
});
```

When a user publishes a video track to the LiveKit room, the framework captures frames and sends them to Gemini alongside the audio stream. The model processes both modalities together, so it can answer questions like "What color is the object I'm holding?" without needing a separate vision pipeline. This is fundamentally different from a pipeline approach, where you would need to capture frames, send them to a vision model, inject the description into the LLM context, and coordinate timing — all as separate steps.
Video increases token consumption
Each video frame sent to Gemini consumes tokens. At the default frame rate, a 10-minute conversation with video can consume significantly more tokens than audio alone. Monitor your usage closely during development and consider reducing the frame rate for cost-sensitive applications.
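To see why frame rate matters, a rough back-of-the-envelope estimate helps. The rates below (frames per second, tokens per frame) are placeholder assumptions, not published Gemini pricing; substitute figures from your own billing data.

```python
def estimate_video_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Rough token count for the video portion of a session."""
    return int(duration_s * fps * tokens_per_frame)


# Hypothetical rates: 1 frame per second, 250 tokens per frame.
# Both numbers are placeholders for illustration only.
ten_minutes = 10 * 60

full_rate = estimate_video_tokens(ten_minutes, 1.0, 250)
half_rate = estimate_video_tokens(ten_minutes, 0.5, 250)

print(full_rate)  # 150000 tokens at 1 fps
print(half_rate)  # 75000 tokens at 0.5 fps
```

Even with made-up rates, the relationship is clear: halving the frame rate halves the video token cost, which is why frame rate is the first knob to turn for cost-sensitive applications.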
Gemini Live vs OpenAI Realtime
Having built both agents, you can compare them concretely:
| Dimension | OpenAI Realtime | Gemini Live |
|---|---|---|
| Model | gpt-4o-realtime-preview | gemini-2.0-flash |
| Multimodal input | Audio only | Audio + video natively |
| Voices | 6 (alloy, echo, fable, onyx, nova, shimmer) | 5 (Puck, Charon, Kore, Fenrir, Aoede) |
| Latency profile | Consistent, slightly higher baseline | Generally faster for simple queries |
| Reasoning depth | Stronger for complex multi-step reasoning | Optimized for speed over depth |
| Tool calling | Mature, well-documented | Supported, rapidly improving |
| Pricing model | Per-minute audio pricing | Token-based pricing |
Choose Gemini Live when you need vision
If your agent needs to see the user's camera, a shared screen, or uploaded images during a live conversation, Gemini Live handles this natively. With OpenAI Realtime, you would need a separate vision pipeline running alongside the audio model.
Choose OpenAI Realtime when reasoning depth matters
For complex multi-turn conversations requiring careful reasoning, GPT-4o Realtime currently has an edge. If the agent needs to follow intricate instructions or handle nuanced edge cases, OpenAI's model tends to be more reliable.
Choose based on your existing infrastructure
If you already use Google Cloud and have Gemini API access, Gemini Live integrates naturally. If you are already using OpenAI for text-based LLMs, their Realtime API shares the same API key and billing.
Environment setup
To use Gemini Live, you need a Google AI API key:
```
GOOGLE_API_KEY=your-google-ai-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```

Try both providers side by side
The fastest way to evaluate Gemini Live vs OpenAI Realtime for your use case is to build the same agent with both and run them in parallel rooms. Ask the same questions, test the same scenarios, and compare the experience directly. The code differences are minimal — mostly import paths and model names.
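Since both providers read credentials from the environment, a quick startup check for missing variables can turn a confusing connection failure into a clear error. This is a minimal sketch using the variable names from the env file above; it accepts a mapping so the check is easy to test, and defaults to os.environ at runtime.

```python
import os

# Variable names from the env file shown in "Environment setup".
REQUIRED_KEYS = (
    "GOOGLE_API_KEY",
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
)


def missing_keys(env=None) -> list:
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with a partial configuration:
partial = {"GOOGLE_API_KEY": "abc", "LIVEKIT_URL": "wss://example.livekit.cloud"}
print(missing_keys(partial))  # ['LIVEKIT_API_KEY', 'LIVEKIT_API_SECRET']
```

Calling missing_keys() with no arguments at the top of your entrypoint and raising if the list is non-empty gives you a fail-fast check before any room connection is attempted.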
What you learned
- Gemini Live uses the same AgentSession pattern as OpenAI Realtime, with a different plugin import and model configuration
- Gemini's native multimodal support lets the agent process audio and video together without a separate vision pipeline
- The two realtime providers have different strengths: Gemini excels at speed and vision, OpenAI at reasoning depth
- Video input increases token consumption and should be monitored in production
Next up
You now have three implementations of the same agent: pipeline, OpenAI Realtime, and Gemini Live. In the next chapter, you will learn how to combine these approaches — using realtime for simple conversational queries and falling back to the pipeline for complex tool use, giving you the best of both architectures.