Realtime model nodes
Everything you have built so far uses the pipeline architecture: separate STT, LLM, and TTS services connected in sequence. Realtime models like OpenAI Realtime and Google Gemini Live take a fundamentally different approach — a single model accepts audio and produces audio directly. In this chapter, you will learn how the node override system adapts for realtime models and how to customize their output.
What you'll learn
- How the pipeline differs when using realtime (speech-to-speech) models
- Which nodes exist in realtime mode and which do not
- How to override realtime_audio_output_node to process realtime audio output
- The trade-offs between pipeline node overrides and realtime node overrides
Pipeline vs realtime: different node structures
In the standard pipeline, audio flows through five distinct stages with a node at each one. Realtime models collapse the middle three stages into a single model, which changes the override points available to you.
Pipeline model (STT + LLM + TTS):

Audio In → stt_node → on_user_turn_completed → llm_node → tts_node → Audio Out

Realtime model (speech-to-speech):

Audio In → Realtime Model → realtime_audio_output_node → Audio Out
The realtime model handles speech recognition, reasoning, and speech synthesis internally. You cannot intercept between these stages because they happen inside a single model inference. Instead, you get realtime_audio_output_node — an override point for processing the audio output after the model generates it.
Not all nodes disappear
When using a realtime model, stt_node, llm_node, and tts_node are not called. However, on_user_turn_completed may still fire depending on the model and configuration, giving you a hook for turn-level logic like logging.
Setting up a realtime model agent
Before overriding nodes, here is how to configure an agent with a realtime model:
```python
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai


class RealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant with a warm, conversational tone.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )
```

```typescript
import { Agent } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class RealtimeAgent extends Agent {
  constructor() {
    super({
      instructions:
        "You are a helpful assistant with a warm, conversational tone.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }
}
```

With a realtime model, you configure the voice on the LLM itself rather than on a separate TTS plugin. The model generates audio directly, so voice selection is part of the model configuration. There is no separate STT or TTS plugin.
Overriding realtime_audio_output_node
The realtime_audio_output_node receives audio frames as the realtime model generates them. You can use this override to process, analyze, or transform the output audio.
```python
import logging
import time
import typing

import numpy as np

from livekit import rtc
from livekit.agents import Agent
from livekit.plugins import openai

logger = logging.getLogger("realtime-agent")


class ProcessedRealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )

    async def realtime_audio_output_node(
        self, audio: rtc.AudioStream
    ) -> typing.AsyncGenerator[rtc.AudioFrame, None]:
        """Process audio output from the realtime model."""
        frame_count = 0
        start_time = time.time()
        async for frame in Agent.default.realtime_audio_output_node(self, audio):
            frame_count += 1
            if frame_count == 1:
                elapsed = time.time() - start_time
                logger.info(f"First audio frame in {elapsed:.3f}s")
            # Process the audio frame (e.g., adjust volume)
            processed = self.adjust_volume(frame, gain=0.8)
            yield processed
        duration = time.time() - start_time
        logger.info(
            f"Response complete: {frame_count} frames in {duration:.3f}s"
        )

    def adjust_volume(self, frame: rtc.AudioFrame, gain: float) -> rtc.AudioFrame:
        """Apply a volume gain to an audio frame."""
        samples = np.frombuffer(frame.data, dtype=np.int16)
        adjusted = np.clip(samples * gain, -32768, 32767).astype(np.int16)
        return rtc.AudioFrame(
            data=adjusted.tobytes(),
            sample_rate=frame.sample_rate,
            num_channels=frame.num_channels,
            samples_per_channel=frame.samples_per_channel,
        )
```

```typescript
import { Agent, rtc } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class ProcessedRealtimeAgent extends Agent {
  constructor() {
    super({
      instructions: "You are a helpful assistant.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }

  async *realtimeAudioOutputNode(
    audio: rtc.AudioStream
  ): AsyncGenerator<rtc.AudioFrame> {
    let frameCount = 0;
    const startTime = Date.now();
    for await (const frame of Agent.default.realtimeAudioOutputNode(
      this,
      audio
    )) {
      frameCount++;
      if (frameCount === 1) {
        const elapsed = (Date.now() - startTime) / 1000;
        console.log(`First audio frame in ${elapsed.toFixed(3)}s`);
      }
      const processed = this.adjustVolume(frame, 0.8);
      yield processed;
    }
    const duration = (Date.now() - startTime) / 1000;
    console.log(
      `Response complete: ${frameCount} frames in ${duration.toFixed(3)}s`
    );
  }

  private adjustVolume(
    frame: rtc.AudioFrame,
    gain: number
  ): rtc.AudioFrame {
    const samples = new Int16Array(frame.data.buffer);
    const adjusted = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      adjusted[i] = Math.max(
        -32768,
        Math.min(32767, Math.round(samples[i] * gain))
      );
    }
    return new rtc.AudioFrame({
      data: Buffer.from(adjusted.buffer),
      sampleRate: frame.sampleRate,
      numChannels: frame.numChannels,
      samplesPerChannel: frame.samplesPerChannel,
    });
  }
}
```

Receive audio frames from the realtime model
The realtime_audio_output_node receives a stream of audio frames as the model generates its response. Each frame is a small chunk of PCM audio.
Track timing metrics
Measuring time-to-first-frame gives you the model's latency. Tracking total frames and duration helps you monitor response length and generation speed.
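The time-to-first-frame measurement can be exercised on its own with a stub async frame source, with no LiveKit types involved. This is a minimal sketch: the 0.05 s delay and the string frames stand in for model latency and rtc.AudioFrame objects.

```python
import asyncio
import time


async def stub_model_audio(num_frames: int = 3, first_frame_delay: float = 0.05):
    """Simulate a realtime model: a delay before the first frame, then steady output."""
    await asyncio.sleep(first_frame_delay)
    for i in range(num_frames):
        yield f"frame-{i}"  # stand-in for an audio frame


async def measure_ttff():
    start = time.monotonic()
    ttff = None
    frame_count = 0
    async for _frame in stub_model_audio():
        frame_count += 1
        if frame_count == 1:
            # Time-to-first-frame: how long the user waited before hearing audio
            ttff = time.monotonic() - start
    return ttff, frame_count


ttff, frames = asyncio.run(measure_ttff())
print(f"first frame after {ttff:.3f}s, {frames} frames total")
```

The same pattern drops into the override above: record a monotonic start time when the node is entered, and capture the elapsed time on the first iteration of the loop.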
Process each frame
You can apply any audio processing — volume adjustment, noise reduction, audio effects — to each frame before yielding it. The example shows simple gain control.
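The gain step itself is plain NumPy. This standalone check shows why the clipping matters: without np.clip, scaling near-full-scale int16 samples would overflow when cast back to int16.

```python
import numpy as np


def adjust_gain(samples: np.ndarray, gain: float) -> np.ndarray:
    """Scale int16 PCM samples, clipping to the valid range instead of overflowing."""
    return np.clip(samples.astype(np.float64) * gain, -32768, 32767).astype(np.int16)


pcm = np.array([1000, -1000, 30000, -30000], dtype=np.int16)

quieter = adjust_gain(pcm, 0.8)   # values become 800, -800, 24000, -24000
print(quieter)

louder = adjust_gain(pcm, 1.5)    # 30000 * 1.5 clips to 32767 instead of wrapping
print(louder)
```

Clipping at a gain above 1.0 distorts the waveform, so boosting volume this way is only safe for modest gains; attenuating (gain below 1.0) never clips.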
Yield the processed frame
The processed frame continues to the user's speaker. If you do not yield a frame, it is dropped and the user hears silence for that segment.
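The yield-or-drop behavior can be seen in isolation with a stub stream: only yielded frames continue downstream, and anything the override skips simply never reaches the speaker. This sketch (no LiveKit dependency) drops every other frame:

```python
import asyncio


async def model_frames(n: int = 6):
    """Stub source of audio frames (integers stand in for frames)."""
    for i in range(n):
        yield i


async def drop_every_other(frames):
    index = 0
    async for frame in frames:
        if index % 2 == 0:
            yield frame  # this frame reaches the user's speaker
        # Odd-indexed frames are not yielded: the user hears silence for them
        index += 1


async def main():
    return [f async for f in drop_every_other(model_frames())]


heard = asyncio.run(main())
print(heard)  # [0, 2, 4]
```

In a real override you would rarely drop frames outright, since gaps are audible; replacing a frame with a same-length silent frame is usually the better way to mute a segment.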
What you can and cannot do with realtime nodes
The realtime architecture gives you less granular control than the pipeline. Here is a clear comparison:
| Capability | Pipeline | Realtime |
|---|---|---|
| Filter user speech before processing | stt_node override | Not available — model processes audio directly |
| Inject RAG context | on_user_turn_completed | Possible via model context, but less flexible |
| Parse structured output | llm_node override | Not available — model produces audio, not text |
| Control voice/emotion | tts_node override per chunk | Voice set at model configuration time |
| Process output audio | tts_node yields audio | realtime_audio_output_node |
| Content filter on output text | llm_node override | Requires separate transcription of model audio |
Realtime models trade control for simplicity
If your application requires fine-grained pipeline control — structured output parsing, dynamic voice switching, multi-layer content filtering — the pipeline architecture is the better choice. Realtime models excel when you want low-latency, natural conversation with minimal customization.
Audio analysis on realtime output
Even though you cannot intercept text in realtime mode, you can analyze the audio output for monitoring purposes:
```python
import logging

import numpy as np

from livekit import rtc
from livekit.agents import Agent
from livekit.plugins import openai

logger = logging.getLogger("realtime-agent")


class AnalyticsRealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant.",
            llm=openai.RealtimeLLM(
                model="gpt-4o-realtime",
                voice="alloy",
            ),
        )

    async def realtime_audio_output_node(self, audio):
        """Analyze output audio for quality monitoring."""
        total_energy = 0.0
        frame_count = 0
        silence_frames = 0
        async for frame in Agent.default.realtime_audio_output_node(self, audio):
            frame_count += 1
            samples = np.frombuffer(frame.data, dtype=np.int16)
            # Calculate RMS energy
            rms = np.sqrt(np.mean(samples.astype(float) ** 2))
            total_energy += rms
            # Detect silence (very low energy)
            if rms < 100:
                silence_frames += 1
            yield frame
        if frame_count > 0:
            avg_energy = total_energy / frame_count
            silence_pct = (silence_frames / frame_count) * 100
            logger.info(
                f"Audio stats: avg_energy={avg_energy:.1f}, "
                f"silence={silence_pct:.1f}%, frames={frame_count}"
            )
```

```typescript
import { Agent, rtc } from "@livekit/agents";
import { openai } from "@livekit/plugins-openai";

class AnalyticsRealtimeAgent extends Agent {
  constructor() {
    super({
      instructions: "You are a helpful assistant.",
      llm: new openai.RealtimeLLM({
        model: "gpt-4o-realtime",
        voice: "alloy",
      }),
    });
  }

  async *realtimeAudioOutputNode(
    audio: rtc.AudioStream
  ): AsyncGenerator<rtc.AudioFrame> {
    let totalEnergy = 0;
    let frameCount = 0;
    let silenceFrames = 0;
    for await (const frame of Agent.default.realtimeAudioOutputNode(
      this,
      audio
    )) {
      frameCount++;
      const samples = new Int16Array(frame.data.buffer);
      let sumSquares = 0;
      for (let i = 0; i < samples.length; i++) {
        sumSquares += samples[i] * samples[i];
      }
      const rms = Math.sqrt(sumSquares / samples.length);
      totalEnergy += rms;
      if (rms < 100) {
        silenceFrames++;
      }
      yield frame;
    }
    if (frameCount > 0) {
      const avgEnergy = totalEnergy / frameCount;
      const silencePct = (silenceFrames / frameCount) * 100;
      console.log(
        `Audio stats: avg_energy=${avgEnergy.toFixed(1)}, ` +
          `silence=${silencePct.toFixed(1)}%, frames=${frameCount}`
      );
    }
  }
}
```

Audio analysis on the output stream lets you monitor quality even when you cannot inspect the text. High silence percentages might indicate model issues. Abnormally low energy suggests volume problems. These metrics feed into the monitoring system you will build in the final chapter.
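Turning those per-response stats into an alert is straightforward. A minimal sketch, using the same RMS silence threshold as the agent above; the 40% alert threshold is an arbitrary assumption for illustration, not a LiveKit default:

```python
import numpy as np

SILENCE_RMS = 100          # same per-frame silence threshold as the agent above
SILENCE_ALERT_PCT = 40.0   # arbitrary alert threshold for this sketch


def audio_stats(frames: list[np.ndarray]) -> dict:
    """Compute average RMS energy and silence percentage over one response."""
    rms_values = [np.sqrt(np.mean(f.astype(float) ** 2)) for f in frames]
    silence = sum(1 for r in rms_values if r < SILENCE_RMS)
    return {
        "avg_energy": float(np.mean(rms_values)),
        "silence_pct": 100.0 * silence / len(rms_values),
    }


# Three loud frames and two near-silent ones -> 40% silence
loud = np.full(160, 2000, dtype=np.int16)
quiet = np.full(160, 10, dtype=np.int16)
stats = audio_stats([loud, loud, loud, quiet, quiet])
print(stats)

if stats["silence_pct"] >= SILENCE_ALERT_PCT:
    print("alert: unusually high silence in model output")
```

In production you would emit these numbers to your metrics backend per response rather than printing them, and tune the thresholds against real traffic.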
Choosing between pipeline and realtime
Use this decision framework when starting a new agent project:
Choose pipeline when you need:
- Structured output (JSON, chain-of-thought)
- Multi-layer content filtering
- Dynamic voice switching per response
- RAG injection with full control
- Best-in-class component selection (mix different STT, LLM, TTS providers)
Choose realtime when you need:
- Lowest possible latency
- Natural conversational flow with overlapping speech
- Emotional responsiveness to user tone of voice
- Simple configuration with minimal custom logic
What you learned
- Realtime models collapse STT, LLM, and TTS into a single model, changing the available override points
- realtime_audio_output_node is the primary override for processing realtime model output
- You can adjust volume, analyze audio quality, and track metrics on the output stream
- Pipeline models offer more granular control; realtime models offer lower latency and simpler setup
Next up
In the final chapter, you will assemble every customization from this course into a complete chain-of-thought agent. You will wire together all the node overrides, add tests for each node, and instrument the pipeline with metrics and logging.