Chapter 4

Gemini Live

Google's Gemini Live is the second major realtime model available in LiveKit. Where OpenAI's Realtime API focuses on audio-only speech-to-speech, Gemini Live is natively multimodal — it can process audio and video input simultaneously, opening up use cases that neither the pipeline approach nor OpenAI Realtime can match. In this chapter, you will implement a Gemini Live agent, configure its multimodal capabilities, and compare it directly with the OpenAI Realtime agent you built in Chapter 3.


What you'll learn

  • How to set up a Gemini Live realtime agent in LiveKit
  • Configuring voice, modalities, and multimodal input
  • Sending video frames alongside audio for vision-aware conversations
  • Key differences between Gemini Live and OpenAI Realtime

Setting up Gemini Live

The basic Gemini Live agent follows the same pattern as OpenAI Realtime — pass the realtime model as the llm parameter:

gemini_realtime_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a helpful assistant powered by Gemini. "
              "Keep responses concise — two sentences maximum. "
              "Never use markdown or bullet points."
          ),
      ),
      room=session.room,
      llm=google_realtime.RealtimeModel(
          model="gemini-2.0-flash",
          voice="Puck",
      ),
  )


if __name__ == "__main__":
  server.run()

geminiRealtimeAgent.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a helpful assistant powered by Gemini. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Puck",
      }),
    });
  },
});

The structure is identical to the OpenAI Realtime agent from Chapter 3. The framework handles the differences between providers internally — you swap the import and the model configuration, and everything else stays the same.

What's happening

Gemini 2.0 Flash is optimized for low latency and multimodal input. Unlike GPT-4o Realtime, which is based on a large reasoning model, Gemini Flash prioritizes speed. This makes it a strong choice when response time matters more than deep reasoning, and especially when you want to combine audio with visual context.

Configuring voice and behavior

Gemini Live offers its own set of voices and configuration options:

gemini_config.py (Python)
from livekit.plugins.google import realtime as google_realtime

model = google_realtime.RealtimeModel(
  model="gemini-2.0-flash",
  voice="Puck",              # Voice: Puck, Charon, Kore, Fenrir, Aoede
  temperature=0.7,           # Controls response creativity
  modalities=["AUDIO"],      # Output modality
)

geminiConfig.ts (TypeScript)
import { RealtimeModel } from "@livekit/agents-plugins-google";

const model = new RealtimeModel({
  model: "gemini-2.0-flash",
  voice: "Puck",              // Voice: Puck, Charon, Kore, Fenrir, Aoede
  temperature: 0.7,           // Controls response creativity
  modalities: ["AUDIO"],      // Output modality
});

Available voices

  • Puck: Playful, energetic
  • Charon: Deep, measured
  • Kore: Warm, friendly
  • Fenrir: Strong, authoritative
  • Aoede: Melodic, expressive

Different voices, different strengths

Gemini's voices are distinct from OpenAI's. Neither set is objectively better — they have different tonal qualities. Test both providers with your target audience to determine which voice best fits your application's personality.

Multimodal input: audio plus vision

The defining feature of Gemini Live is native multimodal input. The model can process the user's camera feed or screen share alongside their voice, enabling vision-aware conversations without bolting on a separate vision model.

gemini_multimodal.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.google import realtime as google_realtime

server = AgentServer()


@server.rtc_session
async def entrypoint(session: AgentSession):
  await session.start(
      agent=Agent(
          instructions=(
              "You are a visual assistant. When the user shares their camera, "
              "describe what you see and answer questions about it. "
              "When no video is available, respond to audio only."
          ),
      ),
      room=session.room,
      llm=google_realtime.RealtimeModel(
          model="gemini-2.0-flash",
          voice="Kore",
          # Video frames from the user's track are sent automatically
          # when the model is configured for multimodal input
      ),
  )


if __name__ == "__main__":
  server.run()

geminiMultimodal.ts (TypeScript)
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { RealtimeModel } from "@livekit/agents-plugins-google";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a visual assistant. When the user shares their camera, " +
          "describe what you see and answer questions about it. " +
          "When no video is available, respond to audio only.",
      }),
      room: session.room,
      llm: new RealtimeModel({
        model: "gemini-2.0-flash",
        voice: "Kore",
      }),
    });
  },
});

What's happening

When a user publishes a video track to the LiveKit room, the framework captures frames and sends them to Gemini alongside the audio stream. The model processes both modalities together, so it can answer questions like "What color is the object I'm holding?" without needing a separate vision pipeline. This is fundamentally different from a pipeline approach, where you would need to capture frames, send them to a vision model, inject the description into the LLM context, and coordinate timing — all as separate steps.
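The framework handles frame capture for you, but if you ever sample frames yourself — for example, to cap how many reach the model — the core logic is a simple time-based throttle. The sketch below is a standalone illustration; the class name and integration point are my own, not part of the LiveKit API:

```python
class FrameThrottle:
    """Let at most target_fps frames per second through to the model."""

    def __init__(self, target_fps: float):
        self.min_interval = 1.0 / target_fps
        self._last_sent = float("-inf")  # no frame sent yet

    def should_send(self, timestamp: float) -> bool:
        # Forward the frame only if enough time has passed since the last one.
        if timestamp - self._last_sent >= self.min_interval:
            self._last_sent = timestamp
            return True
        return False


throttle = FrameThrottle(target_fps=1.0)
# A 30 fps camera produces timestamps every ~0.033s; only ~1 per second passes.
```

Because it keys off the frame's timestamp rather than wall-clock sleeps, the same throttle works for live capture and for replayed recordings.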

Video increases token consumption

Each video frame sent to Gemini consumes tokens. At the default frame rate, a 10-minute conversation with video can consume significantly more tokens than audio alone. Monitor your usage closely during development and consider reducing the frame rate for cost-sensitive applications.
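To build intuition for the cost difference, you can sketch a rough estimate. The per-frame and per-second token figures below are illustrative assumptions for the sake of the arithmetic, not official Gemini pricing — check the current documentation for real numbers:

```python
def estimate_session_tokens(
    minutes: float,
    fps: float = 1.0,
    tokens_per_frame: int = 258,        # assumption for illustration only
    audio_tokens_per_second: int = 32,  # assumption for illustration only
    include_video: bool = True,
) -> int:
    """Rough input-token estimate for a realtime session."""
    seconds = minutes * 60
    audio = int(seconds * audio_tokens_per_second)
    video = int(seconds * fps * tokens_per_frame) if include_video else 0
    return audio + video


audio_only = estimate_session_tokens(10, include_video=False)
with_video = estimate_session_tokens(10, fps=1.0)
ratio = with_video / audio_only
```

Even at a modest 1 frame per second, video dominates the total under these assumptions — which is why reducing the frame rate is the first lever to pull in cost-sensitive applications.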

Gemini Live vs OpenAI Realtime

Having built both agents, here is a concrete comparison:

Dimension        | OpenAI Realtime                              | Gemini Live
Model            | gpt-4o-realtime-preview                      | gemini-2.0-flash
Multimodal input | Audio only                                   | Audio + video natively
Voices           | 6 (alloy, echo, fable, onyx, nova, shimmer)  | 5 (Puck, Charon, Kore, Fenrir, Aoede)
Latency profile  | Consistent, slightly higher baseline         | Generally faster for simple queries
Reasoning depth  | Stronger for complex multi-step reasoning    | Optimized for speed over depth
Tool calling     | Mature, well-documented                      | Supported, rapidly improving
Pricing model    | Per-minute audio pricing                     | Token-based pricing
1. Choose Gemini Live when you need vision

If your agent needs to see the user's camera, a shared screen, or uploaded images during a live conversation, Gemini Live handles this natively. With OpenAI Realtime, you would need a separate vision pipeline running alongside the audio model.

2. Choose OpenAI Realtime when reasoning depth matters

For complex multi-turn conversations requiring careful reasoning, GPT-4o Realtime currently has an edge. If the agent needs to follow intricate instructions or handle nuanced edge cases, OpenAI's model tends to be more reliable.

3. Choose based on your existing infrastructure

If you already use Google Cloud and have Gemini API access, Gemini Live integrates naturally. If you are already using OpenAI for text-based LLMs, their Realtime API shares the same API key and billing.

Environment setup

To use Gemini Live, you need a Google AI API key:

.env
GOOGLE_API_KEY=your-google-ai-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
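A missing or empty variable usually surfaces as a confusing auth error deep inside the session, so it is worth failing fast at startup. This helper is a sketch of my own, not part of the framework:

```python
import os

REQUIRED_VARS = ("GOOGLE_API_KEY", "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")


def missing_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# At startup, before server.run():
# if missing := missing_env():
#     raise SystemExit(f"missing environment variables: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the check testable without touching the real process environment.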

Try both providers side by side

The fastest way to evaluate Gemini Live vs OpenAI Realtime for your use case is to build the same agent with both and run them in parallel rooms. Ask the same questions, test the same scenarios, and compare the experience directly. The code differences are minimal — mostly import paths and model names.
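One way to keep the two variants in a single codebase is a small provider-selection helper — switch with something like a hypothetical REALTIME_PROVIDER environment variable and feed the result into the matching RealtimeModel constructor. The function below is a self-contained sketch; the defaults simply mirror the comparison table above:

```python
def realtime_config(provider: str) -> dict:
    """Return the model name and default voice for a realtime provider.

    Wire the returned values into the corresponding RealtimeModel
    constructor (OpenAI or Google plugin) in your entrypoint.
    """
    configs = {
        "openai": {"model": "gpt-4o-realtime-preview", "voice": "alloy"},
        "google": {"model": "gemini-2.0-flash", "voice": "Puck"},
    }
    try:
        return configs[provider]
    except KeyError:
        raise ValueError(f"unknown realtime provider: {provider!r}") from None
```

With this in place, running the same agent against both providers is a one-variable change, which makes side-by-side rooms easy to spin up.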

Test your knowledge

What is the defining capability that distinguishes Gemini Live from OpenAI Realtime?

What you learned

  • Gemini Live uses the same AgentSession pattern as OpenAI Realtime, with a different plugin import and model configuration
  • Gemini's native multimodal support lets the agent process audio and video together without a separate vision pipeline
  • The two realtime providers have different strengths: Gemini excels at speed and vision, OpenAI at reasoning depth
  • Video input increases token consumption and should be monitored in production

Next up

You now have three implementations of the same agent: pipeline, OpenAI Realtime, and Gemini Live. In the next chapter, you will learn how to combine these approaches — using realtime for simple conversational queries and falling back to the pipeline for complex tool use, giving you the best of both architectures.

Concepts covered: Gemini Live, Google AI, Multimodal realtime