Realtime model nodes
Everything you have built so far uses the pipeline architecture: separate STT, LLM, and TTS services connected in sequence. Realtime models like OpenAI Realtime and Google Gemini Live take a fundamentally different approach — a single model accepts audio and produces audio directly. In this chapter, you will learn how the node override system adapts for realtime models and how to customize their output.
What you'll learn
- How the pipeline differs when using realtime (speech-to-speech) models
- Which nodes exist in realtime mode and which do not
- How to override realtime_audio_output_node to process realtime audio output
- The trade-offs between pipeline node overrides and realtime node overrides
Pipeline vs realtime: different node structures
In the standard pipeline, audio flows through five distinct stages with a node at each one. Realtime models collapse the middle three stages into a single model, which changes the override points available to you.
Pipeline model (STT + LLM + TTS):

Audio In → stt_node → on_user_turn_completed → llm_node → tts_node → Audio Out

Realtime model (speech-to-speech):

Audio In → Realtime Model → realtime_audio_output_node → Audio Out
The realtime model handles speech recognition, reasoning, and speech synthesis internally. You cannot intercept between these stages because they happen inside a single model inference. Instead, you get realtime_audio_output_node — an override point for processing the audio output after the model generates it.
Not all nodes disappear
When using a realtime model, stt_node, llm_node, and tts_node are not called. However, on_user_turn_completed may still fire depending on the model and configuration, giving you a hook for turn-level logic like logging.
Setting up a realtime model agent
Before overriding nodes, here is how to configure an agent with a realtime model:
```python
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


class RealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant with a warm, conversational tone.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )
```

```typescript
import { Agent } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class RealtimeAgent extends Agent {
  constructor() {
    super({
      instructions:
        "You are a helpful assistant with a warm, conversational tone.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }
}
```

With a realtime model, you configure the voice on the LLM itself rather than on a separate TTS plugin. The model generates audio directly, so voice selection is part of the model configuration. There is no separate STT or TTS plugin.
Overriding realtime_audio_output_node
The realtime_audio_output_node receives audio frames as the realtime model generates them. You can use this override to process, analyze, or transform the output audio.
```python
import logging
import time
import typing

import numpy as np

from livekit import rtc
from livekit.agents import Agent
from livekit.plugins import openai

logger = logging.getLogger("realtime-agent")


class ProcessedRealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )

    async def realtime_audio_output_node(
        self, audio: rtc.AudioStream
    ) -> typing.AsyncGenerator[rtc.AudioFrame, None]:
        """Process audio output from the realtime model."""
        frame_count = 0
        start_time = time.time()
        async for frame in Agent.default.realtime_audio_output_node(self, audio):
            frame_count += 1
            if frame_count == 1:
                elapsed = time.time() - start_time
                logger.info(f"First audio frame in {elapsed:.3f}s")
            # Process the audio frame (e.g., adjust volume)
            processed = self.adjust_volume(frame, gain=0.8)
            yield processed
        duration = time.time() - start_time
        logger.info(
            f"Response complete: {frame_count} frames in {duration:.3f}s"
        )

    def adjust_volume(self, frame: rtc.AudioFrame, gain: float) -> rtc.AudioFrame:
        """Apply a volume gain to an audio frame."""
        samples = np.frombuffer(frame.data, dtype=np.int16)
        adjusted = np.clip(samples * gain, -32768, 32767).astype(np.int16)
        return rtc.AudioFrame(
            data=adjusted.tobytes(),
            sample_rate=frame.sample_rate,
            num_channels=frame.num_channels,
            samples_per_channel=frame.samples_per_channel,
        )
```

```typescript
import { Agent, rtc } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class ProcessedRealtimeAgent extends Agent {
  constructor() {
    super({
      instructions: "You are a helpful assistant.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }

  async *realtimeAudioOutputNode(
    audio: rtc.AudioStream
  ): AsyncGenerator<rtc.AudioFrame> {
    let frameCount = 0;
    const startTime = Date.now();
    for await (const frame of Agent.default.realtimeAudioOutputNode(
      this,
      audio
    )) {
      frameCount++;
      if (frameCount === 1) {
        const elapsed = (Date.now() - startTime) / 1000;
        console.log(`First audio frame in ${elapsed.toFixed(3)}s`);
      }
      const processed = this.adjustVolume(frame, 0.8);
      yield processed;
    }
    const duration = (Date.now() - startTime) / 1000;
    console.log(
      `Response complete: ${frameCount} frames in ${duration.toFixed(3)}s`
    );
  }

  private adjustVolume(
    frame: rtc.AudioFrame,
    gain: number
  ): rtc.AudioFrame {
    const samples = new Int16Array(frame.data.buffer);
    const adjusted = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      adjusted[i] = Math.max(
        -32768,
        Math.min(32767, Math.round(samples[i] * gain))
      );
    }
    return new rtc.AudioFrame({
      data: Buffer.from(adjusted.buffer),
      sampleRate: frame.sampleRate,
      numChannels: frame.numChannels,
      samplesPerChannel: frame.samplesPerChannel,
    });
  }
}
```

Receive audio frames from the realtime model
The realtime_audio_output_node receives a stream of audio frames as the model generates its response. Each frame is a small chunk of PCM audio.
Track timing metrics
Measuring time-to-first-frame gives you the model's latency. Tracking total frames and duration helps you monitor response length and generation speed.
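The time-to-first-frame measurement can be exercised on its own with a stub async frame source, with no LiveKit types involved. This is a minimal sketch: the 0.05 s delay and the string frames stand in for model latency and rtc.AudioFrame objects.

```python
import asyncio
import time


async def stub_model_audio(num_frames: int = 3, first_frame_delay: float = 0.05):
    """Simulate a realtime model: a delay before the first frame, then steady output."""
    await asyncio.sleep(first_frame_delay)
    for i in range(num_frames):
        yield f"frame-{i}"  # stand-in for an audio frame


async def measure_ttff():
    start = time.monotonic()
    ttff = None
    frame_count = 0
    async for _frame in stub_model_audio():
        frame_count += 1
        if frame_count == 1:
            # Time-to-first-frame: how long the user waited before hearing audio
            ttff = time.monotonic() - start
    return ttff, frame_count


ttff, frames = asyncio.run(measure_ttff())
print(f"first frame after {ttff:.3f}s, {frames} frames total")
```

The same pattern drops into the override above: record a monotonic start time when the node is entered, and capture the elapsed time on the first iteration of the loop.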
Process each frame
You can apply any audio processing — volume adjustment, noise reduction, audio effects — to each frame before yielding it. The example shows simple gain control.
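The gain step itself is plain NumPy. This standalone check shows why the clipping matters: without np.clip, scaling near-full-scale int16 samples would overflow when cast back to int16.

```python
import numpy as np


def adjust_gain(samples: np.ndarray, gain: float) -> np.ndarray:
    """Scale int16 PCM samples, clipping to the valid range instead of overflowing."""
    return np.clip(samples.astype(np.float64) * gain, -32768, 32767).astype(np.int16)


pcm = np.array([1000, -1000, 30000, -30000], dtype=np.int16)

quieter = adjust_gain(pcm, 0.8)   # values become 800, -800, 24000, -24000
print(quieter)

louder = adjust_gain(pcm, 1.5)    # 30000 * 1.5 clips to 32767 instead of wrapping
print(louder)
```

Clipping at a gain above 1.0 distorts the waveform, so boosting volume this way is only safe for modest gains; attenuating (gain below 1.0) never clips.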
Yield the processed frame
The processed frame continues to the user's speaker. If you do not yield a frame, it is dropped and the user hears silence for that segment.
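The yield-or-drop behavior can be seen in isolation with a stub stream: only yielded frames continue downstream, and anything the override skips simply never reaches the speaker. This sketch (no LiveKit dependency) drops every other frame:

```python
import asyncio


async def model_frames(n: int = 6):
    """Stub source of audio frames (integers stand in for frames)."""
    for i in range(n):
        yield i


async def drop_every_other(frames):
    index = 0
    async for frame in frames:
        if index % 2 == 0:
            yield frame  # this frame reaches the user's speaker
        # Odd-indexed frames are not yielded: the user hears silence for them
        index += 1


async def main():
    return [f async for f in drop_every_other(model_frames())]


heard = asyncio.run(main())
print(heard)  # [0, 2, 4]
```

In a real override you would rarely drop frames outright, since gaps are audible; replacing a frame with a same-length silent frame is usually the better way to mute a segment.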
What you can and cannot do with realtime nodes
The realtime architecture gives you less granular control than the pipeline. Here is a clear comparison:
| Capability | Pipeline | Realtime |
|---|---|---|
| Filter user speech before processing | stt_node override | Not available — model processes audio directly |
| Inject RAG context | on_user_turn_completed | Possible via model context, but less flexible |
| Parse structured output | llm_node override | Not available — model produces audio, not text |
| Control voice/emotion | tts_node override per chunk | Voice set at model configuration time |
| Process output audio | tts_node yields audio | realtime_audio_output_node |
| Content filter on output text | llm_node override | Requires separate transcription of model audio |
Realtime models trade control for simplicity
If your application requires fine-grained pipeline control — structured output parsing, dynamic voice switching, multi-layer content filtering — the pipeline architecture is the better choice. Realtime models excel when you want low-latency, natural conversation with minimal customization.
Audio analysis on realtime output
Even though you cannot intercept text in realtime mode, you can analyze the audio output for monitoring purposes:
```python
import logging

import numpy as np

from livekit import rtc
from livekit.agents import Agent
from livekit.plugins import openai

logger = logging.getLogger("realtime-agent")


class AnalyticsRealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )

    async def realtime_audio_output_node(self, audio):
        """Analyze output audio for quality monitoring."""
        total_energy = 0.0
        frame_count = 0
        silence_frames = 0
        async for frame in Agent.default.realtime_audio_output_node(self, audio):
            frame_count += 1
            samples = np.frombuffer(frame.data, dtype=np.int16)
            # Calculate RMS energy
            rms = np.sqrt(np.mean(samples.astype(float) ** 2))
            total_energy += rms
            # Detect silence (very low energy)
            if rms < 100:
                silence_frames += 1
            yield frame
        if frame_count > 0:
            avg_energy = total_energy / frame_count
            silence_pct = (silence_frames / frame_count) * 100
            logger.info(
                f"Audio stats: avg_energy={avg_energy:.1f}, "
                f"silence={silence_pct:.1f}%, frames={frame_count}"
            )
```

```typescript
import { Agent, rtc } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class AnalyticsRealtimeAgent extends Agent {
  constructor() {
    super({
      instructions: "You are a helpful assistant.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }

  async *realtimeAudioOutputNode(
    audio: rtc.AudioStream
  ): AsyncGenerator<rtc.AudioFrame> {
    let totalEnergy = 0;
    let frameCount = 0;
    let silenceFrames = 0;
    for await (const frame of Agent.default.realtimeAudioOutputNode(
      this,
      audio
    )) {
      frameCount++;
      const samples = new Int16Array(frame.data.buffer);
      let sumSquares = 0;
      for (let i = 0; i < samples.length; i++) {
        sumSquares += samples[i] * samples[i];
      }
      const rms = Math.sqrt(sumSquares / samples.length);
      totalEnergy += rms;
      if (rms < 100) {
        silenceFrames++;
      }
      yield frame;
    }
    if (frameCount > 0) {
      const avgEnergy = totalEnergy / frameCount;
      const silencePct = (silenceFrames / frameCount) * 100;
      console.log(
        `Audio stats: avg_energy=${avgEnergy.toFixed(1)}, ` +
          `silence=${silencePct.toFixed(1)}%, frames=${frameCount}`
      );
    }
  }
}
```

Audio analysis on the output stream lets you monitor quality even when you cannot inspect the text. High silence percentages might indicate model issues. Abnormally low energy suggests volume problems. These metrics feed into the monitoring system you will build in the final chapter.
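Turning those per-response stats into an alert is straightforward. A minimal sketch, using the same RMS silence threshold as the agent above; the 40% alert threshold is an arbitrary assumption for illustration, not a LiveKit default:

```python
import numpy as np

SILENCE_RMS = 100          # same per-frame silence threshold as the agent above
SILENCE_ALERT_PCT = 40.0   # arbitrary alert threshold for this sketch


def audio_stats(frames: list[np.ndarray]) -> dict:
    """Compute average RMS energy and silence percentage over one response."""
    rms_values = [np.sqrt(np.mean(f.astype(float) ** 2)) for f in frames]
    silence = sum(1 for r in rms_values if r < SILENCE_RMS)
    return {
        "avg_energy": float(np.mean(rms_values)),
        "silence_pct": 100.0 * silence / len(rms_values),
    }


# Three loud frames and two near-silent ones -> 40% silence
loud = np.full(160, 2000, dtype=np.int16)
quiet = np.full(160, 10, dtype=np.int16)
stats = audio_stats([loud, loud, loud, quiet, quiet])
print(stats)

if stats["silence_pct"] >= SILENCE_ALERT_PCT:
    print("alert: unusually high silence in model output")
```

In production you would emit these numbers to your metrics backend per response rather than printing them, and tune the thresholds against real traffic.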
Choosing between pipeline and realtime
Use this decision framework when starting a new agent project:
Choose pipeline when you need:
- Structured output (JSON, chain-of-thought)
- Multi-layer content filtering
- Dynamic voice switching per response
- RAG injection with full control
- Best-in-class component selection (mix different STT, LLM, TTS providers)
Choose realtime when you need:
- Lowest possible latency
- Natural conversational flow with overlapping speech
- Emotional responsiveness to user tone of voice
- Simple configuration with minimal custom logic
What you learned
- Realtime models collapse STT, LLM, and TTS into a single model, changing the available override points
- realtime_audio_output_node is the primary override for processing realtime model output
- You can adjust volume, analyze audio quality, and track metrics on the output stream
- Pipeline models offer more granular control; realtime models offer lower latency and simpler setup
Next up
In the final chapter, you will assemble every customization from this course into a complete chain-of-thought agent. You will wire together all the node overrides, add tests for each node, and instrument the pipeline with metrics and logging.