Chapter 4

Gemini Live

Google's Gemini Live is the second major realtime model available in LiveKit. Where OpenAI's Realtime API focuses on audio-only speech-to-speech, Gemini Live is natively multimodal — it can process audio and video input simultaneously, opening up use cases that neither the pipeline approach nor OpenAI Realtime can match. In this chapter, you will implement a Gemini Live agent, configure its multimodal capabilities, and compare it directly with the OpenAI Realtime agent you built in Chapter 3.


What you'll learn

  • How to set up a Gemini Live realtime agent in LiveKit
  • Configuring voice, modalities, and multimodal input
  • Sending video frames alongside audio for vision-aware conversations
  • Key differences between Gemini Live and OpenAI Realtime

Setting up Gemini Live

The basic Gemini Live agent follows the same pattern as OpenAI Realtime — pass the realtime model as the llm parameter:

gemini_realtime_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a helpful assistant powered by Gemini. "
              "Keep responses concise — two sentences maximum. "
              "Never use markdown or bullet points."
          ),
      ),
      room=session.room,
      llm=google_realtime.RealtimeModel(
          model="gemini-2.0-flash",
          voice="Puck",
      ),
  )


if __name__ == "__main__":
  server.run()

geminiRealtimeAgent.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a helpful assistant powered by Gemini. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Puck",
      }),
    });
  },
});

The structure is identical to the OpenAI Realtime agent from Chapter 3. The framework handles the differences between providers internally — you swap the import and the model configuration, and everything else stays the same.

What's happening

Gemini 2.0 Flash is optimized for low latency and multimodal input. Unlike GPT-4o Realtime, which is based on a large reasoning model, Gemini Flash prioritizes speed. This makes it a strong choice when response time matters more than deep reasoning, and especially when you want to combine audio with visual context.

Configuring voice and behavior

Gemini Live offers its own set of voices and configuration options:

gemini_config.py (Python)
from livekit.plugins.google import realtime as google_realtime

model = google_realtime.RealtimeModel(
  model="gemini-2.0-flash",
  voice="Puck",              # Voice: Puck, Charon, Kore, Fenrir, Aoede
  temperature=0.7,           # Controls response creativity
  modalities=["AUDIO"],      # Output modality
)

geminiConfig.ts (TypeScript)
import { RealtimeModel } from "@livekit/agents-plugins-google";

const model = new RealtimeModel({
  model: "gemini-2.0-flash",
  voice: "Puck",              // Voice: Puck, Charon, Kore, Fenrir, Aoede
  temperature: 0.7,           // Controls response creativity
  modalities: ["AUDIO"],      // Output modality
});

Available voices

  • Puck: Playful, energetic
  • Charon: Deep, measured
  • Kore: Warm, friendly
  • Fenrir: Strong, authoritative
  • Aoede: Melodic, expressive

Different voices, different strengths

Gemini's voices are distinct from OpenAI's. Neither set is objectively better — they have different tonal qualities. Test both providers with your target audience to determine which voice best fits your application's personality.

Multimodal input: audio plus vision

The defining feature of Gemini Live is native multimodal input. The model can process the user's camera feed or screen share alongside their voice, enabling vision-aware conversations without bolting on a separate vision model.

gemini_multimodal.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a visual assistant. When the user shares their camera, "
              "describe what you see and answer questions about it. "
              "When no video is available, respond to audio only."
          ),
      ),
      room=session.room,
      llm=google_realtime.RealtimeModel(
          model="gemini-2.0-flash",
          voice="Kore",
          # Video frames from the user's track are sent automatically
          # when the model is configured for multimodal input
      ),
  )


if __name__ == "__main__":
  server.run()

geminiMultimodal.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a visual assistant. When the user shares their camera, " +
          "describe what you see and answer questions about it. " +
          "When no video is available, respond to audio only.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Kore",
      }),
    });
  },
});

What's happening

When a user publishes a video track to the LiveKit room, the framework captures frames and sends them to Gemini alongside the audio stream. The model processes both modalities together, so it can answer questions like "What color is the object I'm holding?" without needing a separate vision pipeline. This is fundamentally different from a pipeline approach, where you would need to capture frames, send them to a vision model, inject the description into the LLM context, and coordinate timing — all as separate steps.
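The framework handles frame capture for you, but if you ever sample frames yourself — for example, to cap how many reach the model — the core logic is a simple time-based throttle. The sketch below is a standalone illustration; the class name and integration point are my own, not part of the LiveKit API:

```python
class FrameThrottle:
    """Let at most target_fps frames per second through to the model."""

    def __init__(self, target_fps: float):
        self.min_interval = 1.0 / target_fps
        self._last_sent = float("-inf")  # no frame sent yet

    def should_send(self, timestamp: float) -> bool:
        # Forward the frame only if enough time has passed since the last one.
        if timestamp - self._last_sent >= self.min_interval:
            self._last_sent = timestamp
            return True
        return False


throttle = FrameThrottle(target_fps=1.0)
# A 30 fps camera produces timestamps every ~0.033s; only ~1 per second passes.
```

Because it keys off the frame's timestamp rather than wall-clock sleeps, the same throttle works for live capture and for replayed recordings.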

Video increases token consumption

Each video frame sent to Gemini consumes tokens. At the default frame rate, a 10-minute conversation with video can consume significantly more tokens than audio alone. Monitor your usage closely during development and consider reducing the frame rate for cost-sensitive applications.
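To build intuition for the cost difference, you can sketch a rough estimate. The per-frame and per-second token figures below are illustrative assumptions for the sake of the arithmetic, not official Gemini pricing — check the current documentation for real numbers:

```python
def estimate_session_tokens(
    minutes: float,
    fps: float = 1.0,
    tokens_per_frame: int = 258,        # assumption for illustration only
    audio_tokens_per_second: int = 32,  # assumption for illustration only
    include_video: bool = True,
) -> int:
    """Rough input-token estimate for a realtime session."""
    seconds = minutes * 60
    audio = int(seconds * audio_tokens_per_second)
    video = int(seconds * fps * tokens_per_frame) if include_video else 0
    return audio + video


audio_only = estimate_session_tokens(10, include_video=False)
with_video = estimate_session_tokens(10, fps=1.0)
ratio = with_video / audio_only
```

Even at a modest 1 frame per second, video dominates the total under these assumptions — which is why reducing the frame rate is the first lever to pull in cost-sensitive applications.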

Gemini Live vs OpenAI Realtime

Having built both agents, here is a concrete comparison:

Dimension        | OpenAI Realtime                              | Gemini Live
Model            | gpt-4o-realtime-preview                      | gemini-2.0-flash
Multimodal input | Audio only                                   | Audio + video natively
Voices           | 6 (alloy, echo, fable, onyx, nova, shimmer)  | 5 (Puck, Charon, Kore, Fenrir, Aoede)
Latency profile  | Consistent, slightly higher baseline         | Generally faster for simple queries
Reasoning depth  | Stronger for complex multi-step reasoning    | Optimized for speed over depth
Tool calling     | Mature, well-documented                      | Supported, rapidly improving
Pricing model    | Per-minute audio pricing                     | Token-based pricing
1. Choose Gemini Live when you need vision

If your agent needs to see the user's camera, a shared screen, or uploaded images during a live conversation, Gemini Live handles this natively. With OpenAI Realtime, you would need a separate vision pipeline running alongside the audio model.

2. Choose OpenAI Realtime when reasoning depth matters

For complex multi-turn conversations requiring careful reasoning, GPT-4o Realtime currently has an edge. If the agent needs to follow intricate instructions or handle nuanced edge cases, OpenAI's model tends to be more reliable.

3. Choose based on your existing infrastructure

If you already use Google Cloud and have Gemini API access, Gemini Live integrates naturally. If you are already using OpenAI for text-based LLMs, their Realtime API shares the same API key and billing.

Environment setup

To use Gemini Live, you need a Google AI API key:

.env
GOOGLE_API_KEY=your-google-ai-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
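A missing or empty variable usually surfaces as a confusing auth error deep inside the session, so it is worth failing fast at startup. This helper is a sketch of my own, not part of the framework:

```python
import os

REQUIRED_VARS = ("GOOGLE_API_KEY", "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")


def missing_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# At startup, before server.run():
# if missing := missing_env():
#     raise SystemExit(f"missing environment variables: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the check testable without touching the real process environment.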

Try both providers side by side

The fastest way to evaluate Gemini Live vs OpenAI Realtime for your use case is to build the same agent with both and run them in parallel rooms. Ask the same questions, test the same scenarios, and compare the experience directly. The code differences are minimal — mostly import paths and model names.
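One way to keep the two variants in a single codebase is a small provider-selection helper — switch with something like a hypothetical REALTIME_PROVIDER environment variable and feed the result into the matching RealtimeModel constructor. The function below is a self-contained sketch; the defaults simply mirror the comparison table above:

```python
def realtime_config(provider: str) -> dict:
    """Return the model name and default voice for a realtime provider.

    Wire the returned values into the corresponding RealtimeModel
    constructor (OpenAI or Google plugin) in your entrypoint.
    """
    configs = {
        "openai": {"model": "gpt-4o-realtime-preview", "voice": "alloy"},
        "google": {"model": "gemini-2.0-flash", "voice": "Puck"},
    }
    try:
        return configs[provider]
    except KeyError:
        raise ValueError(f"unknown realtime provider: {provider!r}") from None
```

With this in place, running the same agent against both providers is a one-variable change, which makes side-by-side rooms easy to spin up.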

Test your knowledge

What is the defining capability that distinguishes Gemini Live from OpenAI Realtime?

What you learned

  • Gemini Live uses the same AgentSession pattern as OpenAI Realtime, with a different plugin import and model configuration
  • Gemini's native multimodal support lets the agent process audio and video together without a separate vision pipeline
  • The two realtime providers have different strengths: Gemini excels at speed and vision, OpenAI at reasoning depth
  • Video input increases token consumption and should be monitored in production

Next up

You now have three implementations of the same agent: pipeline, OpenAI Realtime, and Gemini Live. In the next chapter, you will learn how to combine these approaches — using realtime for simple conversational queries and falling back to the pipeline for complex tool use, giving you the best of both architectures.

Concepts covered: Gemini Live, Google AI, Multimodal realtime