OpenAI Realtime
OpenAI's Realtime API is the first widely available speech-to-speech offering: instead of converting audio to text, processing the text, and converting the response back to audio, a single model handles the entire conversation. In this chapter, you will implement a realtime agent using LiveKit, configure its voice and behavior, add tool use, and understand the key differences from the pipeline approach you built in Chapter 2.
Setting up the realtime model
The simplest realtime agent requires remarkably little code:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a customer support agent for Acme Corp. "
                "Help customers with orders, returns, and product questions. "
                "Keep responses concise — two sentences maximum. "
                "Never use markdown or bullet points."
            ),
        ),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a customer support agent for Acme Corp. " +
          "Help customers with orders, returns, and product questions. " +
          "Keep responses concise — two sentences maximum. " +
          "Never use markdown or bullet points.",
      }),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
      }),
    });
  },
});
```

Compare this to the pipeline version from Chapter 2. No stt parameter. No tts parameter. No vad parameter. The realtime model handles speech recognition, language understanding, response generation, and speech synthesis in a single pass.
The RealtimeModel is passed as the llm parameter even though it does far more than a traditional LLM. LiveKit's AgentSession detects that it is a realtime model and routes audio directly to it, bypassing the STT and TTS stages entirely. This design keeps the API surface consistent — you always call session.start() with the same parameter structure, regardless of architecture.
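The dispatch itself happens inside LiveKit, but the idea can be sketched as a simple type check. Everything in this sketch — the class names and the stage lists — is illustrative, not LiveKit's actual internals:

```python
# Conceptual sketch only (not LiveKit's real code): one parameter slot
# accepts either model type, and the session builds the matching audio path.
class LLM:
    """Text-in, text-out model (needs STT and TTS around it)."""

class RealtimeModel(LLM):
    """Speech-to-speech model (audio is routed to it directly)."""

def build_audio_path(llm: LLM) -> list[str]:
    # A realtime model replaces the STT -> LLM -> TTS chain entirely.
    if isinstance(llm, RealtimeModel):
        return ["audio in", "realtime model", "audio out"]
    return ["audio in", "stt", "llm", "tts", "audio out"]

print(build_audio_path(RealtimeModel()))  # three stages
print(build_audio_path(LLM()))            # five stages
```

Because RealtimeModel subclasses the plain LLM type in this sketch, both fit the same parameter, which mirrors why session.start() keeps a single consistent signature.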
Configuring voice and behavior
The gpt-4o-realtime-preview model supports several voices and configuration options:
```python
from livekit.plugins.openai import realtime

model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",                 # Voice: alloy, echo, fable, onyx, nova, shimmer
    temperature=0.6,               # Lower = more consistent, higher = more creative
    modalities=["audio", "text"],  # Enable both audio and text output
)
```

```typescript
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",                // Voice: alloy, echo, fable, onyx, nova, shimmer
  temperature: 0.6,              // Lower = more consistent, higher = more creative
  modalities: ["audio", "text"], // Enable both audio and text output
});
```

Available voices
| Voice | Character |
|---|---|
| alloy | Neutral, balanced |
| echo | Warm, conversational |
| fable | Expressive, storytelling |
| onyx | Deep, authoritative |
| nova | Bright, energetic |
| shimmer | Soft, calm |
Voice selection is more limited than TTS providers
Six voices is a far smaller selection than what dedicated TTS providers offer. Cartesia alone has hundreds of voices with fine-grained control over emotion, speed, and style. If voice selection is critical to your application — if you need a specific accent, a cloned voice, or precise emotional control — the pipeline approach with a dedicated TTS gives you more options.
Modalities
The modalities parameter controls what the model outputs:
- ["audio", "text"] — the model produces both audio and a text transcript of its response. Useful when you need transcripts for logging or display.
- ["audio"] — audio only, no text transcript. Slightly lower latency since the model does not need to generate aligned text.
```python
# Audio + text transcript (recommended for production)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    modalities=["audio", "text"],
)

# Audio only (lowest latency, no transcript)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    modalities=["audio"],
)
```

```typescript
// Audio + text transcript (recommended for production)
const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  modalities: ["audio", "text"],
});

// Audio only (lowest latency, no transcript)
const modelFast = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  modalities: ["audio"],
});
```

Turn detection in realtime models
One of the most significant differences between pipeline and realtime architectures is how they handle turn detection — determining when the user has stopped speaking and the agent should respond.
In a pipeline, you configure VAD (Voice Activity Detection) explicitly and tune its sensitivity. In a realtime model, turn detection is built into the model itself. The model uses semantic understanding, not just silence detection, to determine when the user has finished their thought.
```python
# Server-side VAD (the model decides when the user is done)
model = realtime.RealtimeModel(
    model="gpt-4o-realtime-preview",
    voice="alloy",
    turn_detection=realtime.ServerVad(
        threshold=0.5,            # Speech detection sensitivity (0-1)
        prefix_padding_ms=300,    # Audio to include before speech starts
        silence_duration_ms=500,  # Silence before triggering end-of-turn
    ),
)
```

```typescript
const model = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  turnDetection: {
    type: "server_vad",
    threshold: 0.5,          // Speech detection sensitivity (0-1)
    prefixPaddingMs: 300,    // Audio to include before speech starts
    silenceDurationMs: 500,  // Silence before triggering end-of-turn
  },
});
```

The realtime model's turn detection has access to context that a standalone VAD does not. It understands that "I want to order a..." is an incomplete sentence even if the user pauses. A standalone VAD would detect silence and trigger a response. The realtime model waits because it knows the sentence is not finished. This leads to more natural conversations with fewer accidental interruptions.
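The difference between the two approaches can be made concrete with a toy sketch. This is not the model's actual algorithm — a real semantic endpointer is learned, not rule-based — but it shows why word-level context changes the decision:

```python
# Toy illustration: a plain VAD only checks silence, while a semantic
# endpoint also checks whether the words look like a finished thought.
# The word list and threshold are illustrative placeholders.
INCOMPLETE_ENDINGS = {"a", "an", "the", "to", "and", "or", "um", "uh"}

def plain_vad_done(silence_ms: int) -> bool:
    # Acoustic signal only: enough silence means end of turn.
    return silence_ms >= 500

def semantic_done(transcript: str, silence_ms: int) -> bool:
    # Acoustic signal first, then a semantic check on the last word.
    if silence_ms < 500:
        return False
    last_word = transcript.rstrip(" .").split()[-1].lower()
    return last_word not in INCOMPLETE_ENDINGS

print(plain_vad_done(800))                             # True — silence alone triggers
print(semantic_done("I want to order a", 800))         # False — sentence incomplete
print(semantic_done("I want to order a widget", 800))  # True — thought is finished
```

With the same 800 ms pause, the silence-only detector interrupts mid-sentence while the semantic check holds back — the behavior the chapter describes.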
Adding tools to realtime models
Realtime models support function calling (tools), and LiveKit uses the same tool definition pattern as pipeline agents:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session, function_tool, RunContext
from livekit.plugins.openai import realtime

server = AgentServer()

class AcmeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a customer support agent for Acme Corp. "
                "Use the lookup_order tool when customers ask about an order. "
                "Keep responses concise."
            ),
        )

    @function_tool()
    async def lookup_order(self, ctx: RunContext, order_id: str) -> str:
        """Look up the status of a customer order by order ID."""
        # In production, query your database here
        return f"Order {order_id}: Shipped on March 25, arriving March 29."

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=AcmeAgent(),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, functionTool, type RunContext, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { z } from "zod";

class AcmeAgent extends Agent {
  constructor() {
    super({
      instructions:
        "You are a customer support agent for Acme Corp. " +
        "Use the lookup_order tool when customers ask about an order. " +
        "Keep responses concise.",
    });
  }

  tools = [
    functionTool({
      name: "lookup_order",
      description: "Look up the status of a customer order by order ID.",
      parameters: z.object({
        orderId: z.string().describe("The order ID to look up"),
      }),
      execute: async (ctx: RunContext, params: { orderId: string }) => {
        // In production, query your database here
        return `Order ${params.orderId}: Shipped on March 25, arriving March 29.`;
      },
    }),
  ];
}

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new AcmeAgent(),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
      }),
    });
  },
});
```

Tool execution pauses audio
When a realtime model calls a tool, the audio stream pauses while the tool executes. The user hears silence until the tool returns and the model resumes speaking. For fast tools (database lookups, API calls under 500ms), this is barely noticeable. For slow tools (complex computations, external APIs with high latency), the pause can be jarring. If your use case requires heavy tool use with slow execution, the pipeline approach may provide a better experience because the LLM can generate filler text while waiting.
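The filler pattern a pipeline agent can use is simple to sketch. This is a generic illustration, not LiveKit API: say() and slow_lookup() are stand-in helpers showing the ordering — acknowledge immediately, then run the slow tool:

```python
import asyncio

async def say(text: str) -> None:
    # Stand-in for streaming TTS output to the caller.
    print(f"agent: {text}")

async def slow_lookup(order_id: str) -> str:
    # Simulate a slow external API (imagine several seconds in production).
    await asyncio.sleep(0.1)
    return f"Order {order_id}: shipped"

async def lookup_with_filler(order_id: str) -> str:
    # The user hears this acknowledgment instead of silence
    # while the slow tool runs.
    await say("One moment while I look that up.")
    return await slow_lookup(order_id)

result = asyncio.run(lookup_with_filler("A123"))
print(result)  # Order A123: shipped
```

A realtime model offers no equivalent hook today: once it decides to call the tool, audio output stops until the result comes back, which is why slow tools hurt more in that architecture.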
Pipeline vs realtime: side-by-side
Here is the same agent implemented both ways:
Pipeline version:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins import deepgram, openai, cartesia, silero

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are Acme Corp support. Two sentences max."),
        room=session.room,
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )

if __name__ == "__main__":
    server.run()
```

Realtime version:

```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are Acme Corp support. Two sentences max."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
        ),
    )

if __name__ == "__main__":
    server.run()
```

The key differences:
Fewer dependencies
The pipeline imports four plugins (deepgram, openai, cartesia, silero). The realtime version imports one (openai.realtime). Fewer dependencies means fewer API keys, fewer potential failures, and simpler deployment.
No explicit STT or TTS
The realtime model handles speech recognition and synthesis internally. You do not choose an STT engine or a TTS voice from an external provider.
No explicit VAD
Turn detection is built into the realtime model. The model uses both acoustic and semantic signals to determine when the user is done speaking.
Different configuration surface
With a pipeline, you tune each component independently. With realtime, you configure the model as a whole — voice, temperature, modalities, turn detection.
Handling events
Realtime agents use the same event API as pipeline agents; the difference is that transcripts come from the model itself rather than a dedicated STT:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    @session.on("agent_state_changed")
    def on_state(state: str):
        print(f"Agent state: {state}")

    @session.on("user_input_transcribed")
    def on_transcript(transcript):
        # With realtime models, transcripts come from the model itself,
        # not a separate STT engine
        print(f"User said: {transcript.text}")

    @session.on("conversation_item_added")
    def on_item(item):
        print(f"Conversation item: {item}")

    await session.start(
        agent=Agent(instructions="You are Acme Corp support."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            voice="alloy",
            modalities=["audio", "text"],
        ),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";

export default defineAgent({
  entry: async (session: RtcSession) => {
    session.on("agentStateChanged", (state) => {
      console.log("Agent state:", state);
    });
    session.on("userInputTranscribed", (transcript) => {
      console.log("User said:", transcript.text);
    });
    session.on("conversationItemAdded", (item) => {
      console.log("Conversation item:", item);
    });

    await session.start({
      agent: new Agent({ instructions: "You are Acme Corp support." }),
      room: session.room,
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        voice: "alloy",
        modalities: ["audio", "text"],
      }),
    });
  },
});
```

The user_input_transcribed event still fires with realtime models, but the transcript comes from the realtime model's own speech recognition rather than a dedicated STT engine. The quality is generally good but may differ from what Deepgram or Google would produce for the same audio. If transcript accuracy is critical for your application — for compliance logging, for example — you may want to run a separate STT in parallel or use the pipeline approach.
Environment setup
To use OpenAI's Realtime API, you need an OpenAI API key with access to the realtime models:
```
OPENAI_API_KEY=sk-your-openai-api-key
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```

Realtime API pricing
OpenAI's Realtime API is priced per minute of audio input and output, which differs from the per-token pricing of text models. As of early 2026, realtime is significantly more expensive per conversation minute than a well-optimized pipeline. Check OpenAI's current pricing before committing to a production deployment.
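A back-of-envelope cost model makes the per-minute pricing concrete. The rates below are illustrative placeholders, not OpenAI's actual prices — substitute the current numbers from the pricing page before relying on the result:

```python
# Placeholder rates in dollars per minute of audio — NOT actual prices.
AUDIO_IN_PER_MIN = 0.06   # user speech sent to the model
AUDIO_OUT_PER_MIN = 0.24  # agent speech generated by the model

def call_cost(total_minutes: float, agent_talk_ratio: float = 0.5) -> float:
    """Rough cost of one call, splitting time between user and agent speech."""
    agent_min = total_minutes * agent_talk_ratio
    user_min = total_minutes - agent_min
    return user_min * AUDIO_IN_PER_MIN + agent_min * AUDIO_OUT_PER_MIN

# A 5-minute call where the agent speaks half the time:
print(round(call_cost(5), 2))  # 0.75 with these placeholder rates
```

Even a simplified model like this scales linearly with call volume, so a few cents per call becomes significant at thousands of calls per day — which is why the comparison with a per-token pipeline is worth running before production.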
Test your knowledge
Question 1 of 3
Why does LiveKit's AgentSession accept a realtime model via the 'llm' parameter even though it handles far more than just language understanding?
What comes next
You have built an OpenAI Realtime agent and understand how it differs from the pipeline approach. In the next chapter, you will implement the same agent using Google's Gemini Live — another realtime model with its own strengths, including native multimodal capabilities. Having three implementations of the same agent will give you a concrete basis for comparison.