OpenAI Realtime
OpenAI's Realtime API is the first widely available speech-to-speech offering: instead of converting audio to text, processing the text, and converting the response back to audio, a single model handles the entire conversation. In this chapter, you will implement a realtime agent using LiveKit, configure its voice and behavior, add tool use, and understand the key differences from the pipeline approach you built in Chapter 2.
Setting up the realtime model
The simplest realtime agent requires remarkably little code:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a customer support agent for Acme Corp. "
                "Help customers with orders, returns, and product questions. "
                "Keep responses concise — two sentences maximum. "
                "Never use markdown or bullet points."
            ),
        ),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a customer support agent for Acme Corp. " +
          "Help customers with orders, returns, and product questions. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
      }),
    });
  },
});
```

Compare this to the pipeline version from Chapter 2. No stt parameter. No tts parameter. No vad parameter. The realtime model handles speech recognition, language understanding, response generation, and speech synthesis in a single pass.
The RealtimeModel is passed as the llm parameter even though it does far more than a traditional LLM. LiveKit's AgentSession detects that it is a realtime model and routes audio directly to it, bypassing the STT and TTS stages entirely. This design keeps the API surface consistent — you always call session.start() with the same parameter structure, regardless of architecture.
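The dispatch itself happens inside LiveKit, but the idea can be sketched as a simple type check. Everything in this sketch — the class names and the stage lists — is illustrative, not LiveKit's actual internals:

```python
# Conceptual sketch only (not LiveKit's real code): one parameter slot
# accepts either model type, and the session builds the matching audio path.
class LLM:
    """Text-in, text-out model (needs STT and TTS around it)."""

class RealtimeModel(LLM):
    """Speech-to-speech model (audio is routed to it directly)."""

def build_audio_path(llm: LLM) -> list[str]:
    # A realtime model replaces the STT -> LLM -> TTS chain entirely.
    if isinstance(llm, RealtimeModel):
        return ["audio in", "realtime model", "audio out"]
    return ["audio in", "stt", "llm", "tts", "audio out"]

print(build_audio_path(RealtimeModel()))  # three stages
print(build_audio_path(LLM()))            # five stages
```

Because RealtimeModel subclasses the plain LLM type in this sketch, both fit the same parameter, which mirrors why session.start() keeps a single consistent signature.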
Configuring voice and behavior
The gpt-4o-realtime-preview model supports several voices and configuration options:
```python
from livekit.plugins.openai import realtime

model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",                 # Voice: alloy, echo, fable, onyx, nova, shimmer
    temperature=0.6,               # Lower = more consistent, higher = more creative
    modalities=["audio", "text"],  # Enable both audio and text output
)
```

```typescript
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",                // Voice: alloy, echo, fable, onyx, nova, shimmer
  temperature: 0.6,              // Lower = more consistent, higher = more creative
  modalities: ["audio", "text"], // Enable both audio and text output
});
```

Available voices
| Voice | Character |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Bright, energetic |
| shimmer | Soft, calm |
Voice selection is more limited than TTS providers
Six voices is a far smaller selection than what dedicated TTS providers offer. Cartesia alone has hundreds of voices with fine-grained control over emotion, speed, and style. If voice selection is critical to your application — if you need a specific accent, a cloned voice, or precise emotional control — the pipeline approach with a dedicated TTS gives you more options.
Modalities
The modalities parameter controls what the model outputs:
- ["audio", "text"] — the model produces both audio and a text transcript of its response. Useful when you need transcripts for logging or display.
- ["audio"] — audio only, no text transcript. Slightly lower latency since the model does not need to generate aligned text.
```python
# Audio + text transcript (recommended for production)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    modalities=["audio", "text"],
)

# Audio only (lowest latency, no transcript)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    modalities=["audio"],
)
```

```typescript
// Audio + text transcript (recommended for production)
const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  modalities: ["audio", "text"],
});

// Audio only (lowest latency, no transcript)
const modelFast = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  modalities: ["audio"],
});
```

Turn detection in realtime models
One of the most significant differences between pipeline and realtime architectures is how they handle turn detection — determining when the user has stopped speaking and the agent should respond.
In a pipeline, you configure VAD (Voice Activity Detection) explicitly and tune its sensitivity. In a realtime model, turn detection is built into the model itself. The model uses semantic understanding, not just silence detection, to determine when the user has finished their thought.
```python
# Server-side VAD (the model decides when the user is done)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    turn_detection=realtime.ServerVad(
        threshold=0.5,            # Speech detection sensitivity (0-1)
        prefix_padding_ms=300,    # Audio to include before speech starts
        silence_duration_ms=500,  # Silence before triggering end-of-turn
    ),
)
```

```typescript
const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  turnDetection: {
    type: "server_vad",
    threshold: 0.5,          // Speech detection sensitivity (0-1)
    prefixPaddingMs: 300,    // Audio to include before speech starts
    silenceDurationMs: 500,  // Silence before triggering end-of-turn
  },
});
```

The realtime model's turn detection has access to context that a standalone VAD does not. It understands that "I want to order a..." is an incomplete sentence even if the user pauses. A standalone VAD would detect silence and trigger a response. The realtime model waits because it knows the sentence is not finished. This leads to more natural conversations with fewer accidental interruptions.
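The difference between the two approaches can be made concrete with a toy sketch. This is not the model's actual algorithm — a real semantic endpointer is learned, not rule-based — but it shows why word-level context changes the decision:

```python
# Toy illustration: a plain VAD only checks silence, while a semantic
# endpoint also checks whether the words look like a finished thought.
# The word list and threshold are illustrative placeholders.
INCOMPLETE_ENDINGS = {"a", "an", "the", "to", "and", "or", "um", "uh"}

def plain_vad_done(silence_ms: int) -> bool:
    # Acoustic signal only: enough silence means end of turn.
    return silence_ms >= 500

def semantic_done(transcript: str, silence_ms: int) -> bool:
    # Acoustic signal first, then a semantic check on the last word.
    if silence_ms < 500:
        return False
    last_word = transcript.rstrip(" .").split()[-1].lower()
    return last_word not in INCOMPLETE_ENDINGS

print(plain_vad_done(800))                             # True — silence alone triggers
print(semantic_done("I want to order a", 800))         # False — sentence incomplete
print(semantic_done("I want to order a widget", 800))  # True — thought is finished
```

With the same 800 ms pause, the silence-only detector interrupts mid-sentence while the semantic check holds back — the behavior the chapter describes.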
Adding tools to realtime models
Realtime models support function calling (tools), and LiveKit uses the same tool definition pattern as pipeline agents:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session, function_tool, RunContext
from livekit.plugins.openai import realtime

server = AgentServer()

class AcmeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a customer support agent for Acme Corp. "
                "Use the lookup_order tool when customers ask about an order. "
                "Keep responses concise."
            ),
        )

    @function_tool()
    async def lookup_order(self, ctx: RunContext, order_id: str) -> str:
        """Look up the status of a customer order by order ID."""
        # In production, query your database here
        return f"Order {order_id}: Shipped on March 25, arriving March 29."

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=AcmeAgent(),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, functionTool, type RunContext, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { z } from "zod";

class AcmeAgent extends Agent {
  constructor() {
    super({
      instructions:
        "You are a customer support agent for Acme Corp. " +
        "Use the lookup_order tool when customers ask about an order. " +
        "Keep responses concise.",
    });
  }

  tools = [
    functionTool({
      name: "lookup_order",
      description: "Look up the status of a customer order by order ID.",
      parameters: z.object({
        orderId: z.string().describe("The order ID to look up"),
      }),
      execute: async (ctx: RunContext, params: { orderId: string }) => {
        // In production, query your database here
        return `Order ${params.orderId}: Shipped on March 25, arriving March 29.`;
      },
    }),
  ];
}

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new AcmeAgent(),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
      }),
    });
  },
});
```

Tool execution pauses audio
When a realtime model calls a tool, the audio stream pauses while the tool executes. The user hears silence until the tool returns and the model resumes speaking. For fast tools (database lookups, API calls under 500ms), this is barely noticeable. For slow tools (complex computations, external APIs with high latency), the pause can be jarring. If your use case requires heavy tool use with slow execution, the pipeline approach may provide a better experience because the LLM can generate filler text while waiting.
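The filler pattern a pipeline agent can use is simple to sketch. This is a generic illustration, not LiveKit API: say() and slow_lookup() are stand-in helpers showing the ordering — acknowledge immediately, then run the slow tool:

```python
import asyncio

async def say(text: str) -> None:
    # Stand-in for streaming TTS output to the caller.
    print(f"agent: {text}")

async def slow_lookup(order_id: str) -> str:
    # Simulate a slow external API (imagine several seconds in production).
    await asyncio.sleep(0.1)
    return f"Order {order_id}: shipped"

async def lookup_with_filler(order_id: str) -> str:
    # The user hears this acknowledgment instead of silence
    # while the slow tool runs.
    await say("One moment while I look that up.")
    return await slow_lookup(order_id)

result = asyncio.run(lookup_with_filler("A123"))
print(result)  # Order A123: shipped
```

A realtime model offers no equivalent hook today: once it decides to call the tool, audio output stops until the result comes back, which is why slow tools hurt more in that architecture.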
Pipeline vs realtime: side-by-side
Here is the same agent implemented both ways:
Pipeline version:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are Acme Corp support. Two sentences max."),
        room=session.room,
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )

if __name__ == "__main__":
    server.run()
```

Realtime version:

```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are Acme Corp support. Two sentences max."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

The key differences:
Fewer dependencies
The pipeline imports four plugins (deepgram, openai, cartesia, silero). The realtime version imports one (openai.realtime). Fewer dependencies means fewer API keys, fewer potential failures, and simpler deployment.
No explicit STT or TTS
The realtime model handles speech recognition and synthesis internally. You do not choose an STT engine or a TTS voice from an external provider.
No explicit VAD
Turn detection is built into the realtime model. The model uses both acoustic and semantic signals to determine when the user is done speaking.
Different configuration surface
With a pipeline, you tune each component independently. With realtime, you configure the model as a whole — voice, temperature, modalities, turn detection.
Handling events
Realtime agents use the same event API as pipeline agents; the difference is that transcripts come from the model itself rather than a dedicated STT:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    @session.on("agent_state_changed")
    def on_state(state: str):
        print(f"Agent state: {state}")

    @session.on("user_input_transcribed")
    def on_transcript(transcript):
        # With realtime models, transcripts come from the model itself,
        # not a separate STT engine
        print(f"User said: {transcript.text}")

    @session.on("conversation_item_added")
    def on_item(item):
        print(f"Conversation item: {item}")

    await session.start(
        agent=Agent(instructions="You are Acme Corp support."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
            modalities=["audio", "text"],
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

export default defineAgent({
  entry: async (session: RtcSession) => {
    session.on("agentStateChanged", (state) => {
      console.log("Agent state:", state);
    });
    session.on("userInputTranscribed", (transcript) => {
      console.log("User said:", transcript.text);
    });
    session.on("conversationItemAdded", (item) => {
      console.log("Conversation item:", item);
    });

    await session.start({
      agent: new Agent({ instructions: "You are Acme Corp support." }),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
        modalities: ["audio", "text"],
      }),
    });
  },
});
```

The user_input_transcribed event still fires with realtime models, but the transcript comes from the realtime model's own speech recognition rather than a dedicated STT engine. The quality is generally good but may differ from what Deepgram or Google would produce for the same audio. If transcript accuracy is critical for your application — for compliance logging, for example — you may want to run a separate STT in parallel or use the pipeline approach.
Environment setup
To use OpenAI's Realtime API, you need an OpenAI API key with access to the realtime models:
```
OPENAI_API_KEY=sk-your-openai-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```

Realtime API pricing
OpenAI's Realtime API is priced per minute of audio input and output, which differs from the per-token pricing of text models. As of early 2026, realtime is significantly more expensive per conversation minute than a well-optimized pipeline. Check OpenAI's current pricing before committing to a production deployment.
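A back-of-envelope cost model makes the per-minute pricing concrete. The rates below are illustrative placeholders, not OpenAI's actual prices — substitute the current numbers from the pricing page before relying on the result:

```python
# Placeholder rates in dollars per minute of audio — NOT actual prices.
AUDIO_IN_PER_MIN = 0.06   # user speech sent to the model
AUDIO_OUT_PER_MIN = 0.24  # agent speech generated by the model

def call_cost(total_minutes: float, agent_talk_ratio: float = 0.5) -> float:
    """Rough cost of one call, splitting time between user and agent speech."""
    agent_min = total_minutes * agent_talk_ratio
    user_min = total_minutes - agent_min
    return user_min * AUDIO_IN_PER_MIN + agent_min * AUDIO_OUT_PER_MIN

# A 5-minute call where the agent speaks half the time:
print(round(call_cost(5), 2))  # 0.75 with these placeholder rates
```

Even a simplified model like this scales linearly with call volume, so a few cents per call becomes significant at thousands of calls per day — which is why the comparison with a per-token pipeline is worth running before production.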
Test your knowledge
Question 1 of 3
Why does LiveKit's AgentSession accept a realtime model via the 'llm' parameter even though it handles far more than just language understanding?
What comes next
You have built an OpenAI Realtime agent and understand how it differs from the pipeline approach. In the next chapter, you will implement the same agent using Google's Gemini Live — another realtime model with its own strengths, including native multimodal capabilities. Having three implementations of the same agent will give you a concrete basis for comparison.