Custom TTS with realtime models
Realtime models give you low latency and natural speech comprehension — but they lock you into a handful of built-in voices. Pipelines give you any voice from any TTS provider — but add latency from three model hops. What if you could get the best of both? LiveKit supports exactly this: use a realtime model for speech understanding and reasoning, but route its text output through a dedicated TTS for speech synthesis. This is called the half-cascade architecture, and it is one of the most practical patterns for production voice agents.
What you'll learn
- How the half-cascade architecture works and why it exists
- Configuring a realtime model for text-only output
- Pairing a realtime model with any TTS provider
- When half-cascade is the right choice vs pure realtime or full pipeline
The problem: voices vs latency
In Chapters 3 and 4, you saw that realtime models offer a small set of built-in voices — six from OpenAI, five from Gemini. For many applications, that is not enough. You might need a specific brand voice, a cloned voice, fine-grained emotion control, or a provider like Cartesia or ElevenLabs with hundreds of voice options.
The obvious solution is to use a pipeline — but then you lose the realtime model's advantages: lower latency on speech comprehension, audio-native understanding of tone and emotion, and built-in turn detection that uses semantic context.
The half-cascade gives you a middle path.
How half-cascade works
Instead of sending audio to and from the realtime model, you configure it to output text only. The realtime model still receives raw audio input — it still "hears" the user with all the tonal and emotional context that provides. But instead of generating audio output, it generates text. That text is then routed to a dedicated TTS of your choice.
Half-cascade architecture: User Audio → Realtime Model (audio in, text out) → Custom TTS → Agent Audio
Compare this to the three architectures you already know:
| Architecture | Input processing | Reasoning | Output speech |
|---|---|---|---|
| Pipeline | STT model (text) | LLM (text) | TTS model |
| Pure realtime | Realtime model (audio) | Realtime model | Realtime model |
| Half-cascade | Realtime model (audio) | Realtime model | TTS model |
The half-cascade keeps the realtime model's advantage on the input side — it processes raw audio, so it understands tone, emphasis, and emotion that STT would lose in transcription. But on the output side, it delegates to a dedicated TTS, giving you full control over voice selection, emotion, speed, and provider choice. The tradeoff is slightly higher output latency compared to pure realtime, since the text must be synthesized by a separate model.
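The data flow can be sketched with stand-in components before looking at the real implementation. This is a conceptual sketch only: the stub classes and method names are invented for illustration and are not LiveKit APIs.

```python
import asyncio

class StubRealtimeModel:
    """Stands in for a realtime model configured for text-only output:
    it consumes raw audio frames and emits text tokens."""
    async def respond(self, audio_frames):
        # A real model reasons over tone and emphasis in the raw audio;
        # here we just emit a canned token stream.
        for token in ["Thanks", " for", " calling!"]:
            yield token

class StubTTS:
    """Stands in for a dedicated TTS provider (any vendor could sit here)."""
    async def synthesize(self, text: str) -> bytes:
        return f"<audio:{text}>".encode()

async def half_cascade(audio_frames: list) -> bytes:
    model, tts = StubRealtimeModel(), StubTTS()
    # The realtime model handles comprehension and reasoning on audio...
    text = "".join([tok async for tok in model.respond(audio_frames)])
    # ...while the dedicated TTS owns all speech output.
    return await tts.synthesize(text)

print(asyncio.run(half_cascade([b"frame-1"])))  # b'<audio:Thanks for calling!>'
```

The point of the split is visible in the two awaits: comprehension never touches the TTS, and synthesis never touches the audio input, so either side can be swapped independently.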
Implementation
The implementation is straightforward. Set the realtime model's `modalities` to `["text"]` so it produces text instead of audio, and provide a `tts` parameter with any TTS instance.
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime
from livekit.plugins import cartesia

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a brand ambassador for Luxe Cosmetics. "
                "Speak warmly and enthusiastically about products. "
                "Keep responses concise — two sentences maximum."
            ),
        ),
        room=session.room,
        # Realtime model handles audio input and reasoning
        # but outputs TEXT only — no built-in voice synthesis
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            modalities=["text"],
        ),
        # Cartesia handles all speech output with your chosen voice
        tts=cartesia.TTS(voice="your-brand-voice-id"),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a brand ambassador for Luxe Cosmetics. " +
          "Speak warmly and enthusiastically about products. " +
          "Keep responses concise — two sentences maximum.",
      }),
      room: session.room,
      // Realtime model handles audio input and reasoning
      // but outputs TEXT only — no built-in voice synthesis
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        modalities: ["text"],
      }),
      // Cartesia handles all speech output with your chosen voice
      tts: new CartesiaTTS({ voice: "your-brand-voice-id" }),
    });
  },
});
```

The key is `modalities=["text"]`. This single parameter changes the realtime model from an end-to-end speech model into a "speech-in, text-out" model. The framework detects that TTS is configured and routes the text output through it automatically.
Works with any realtime model that supports text modality
Not all realtime model providers support text-only output. OpenAI Realtime supports it. Check the relevant LiveKit plugin page for your provider to confirm support before using this pattern.
Using any TTS provider
Because the TTS is now a separate component, you can use any provider LiveKit supports — the same providers available in a full pipeline.
```python
from livekit.plugins import cartesia, elevenlabs, google
from livekit.plugins.openai import realtime

# Option 1: Cartesia — hundreds of voices, fine-grained emotion control
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = cartesia.TTS(voice="your-voice-id")

# Option 2: ElevenLabs — voice cloning, multilingual
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = elevenlabs.TTS(voice="your-cloned-voice-id")

# Option 3: Google Cloud TTS — affordable, production-grade
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = google.TTS(voice="en-US-Journey-F")
```

```typescript
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { ElevenLabsTTS } from "@livekit/agents-plugin-elevenlabs";

// Option 1: Cartesia — hundreds of voices, fine-grained emotion control
const llm = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  modalities: ["text"],
});
const tts1 = new CartesiaTTS({ voice: "your-voice-id" });

// Option 2: ElevenLabs — voice cloning, multilingual
const tts2 = new ElevenLabsTTS({ voice: "your-cloned-voice-id" });
```

This is the same TTS flexibility you get with a full pipeline, but without the STT stage. The realtime model replaces both STT and LLM — it understands audio directly and produces text — while TTS is handled by a specialist provider. You get voice cloning, custom voices, emotion control, and every other TTS feature that pure realtime models cannot offer.
What about adding a separate STT?
You can also go the other direction: keep the realtime model's built-in voice output but add a dedicated STT running in parallel for better transcriptions. Realtime models have a known "delayed transcription" problem — user input transcripts can arrive late or after the agent has already responded. Adding a parallel STT (like Deepgram Nova-3) gives you real-time interim transcripts for frontend display and compliance logging. This is also required if you want to use LiveKit's context-aware turn detector model, which needs live STT results to operate. Unlike half-cascade, this does not change the core architecture — the realtime model still does all comprehension and reasoning. The parallel STT is a supplementary stream, not a replacement.
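A parallel-STT setup might look like the sketch below. Treat it as a configuration sketch, not a definitive implementation: the `stt` parameter alongside an audio-output realtime model, and the Deepgram plugin's constructor arguments, are assumptions to verify against the current LiveKit plugin documentation.

```python
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a helpful support agent."),
        room=session.room,
        # Realtime model keeps its built-in voice: audio in, audio out.
        # Note: no modalities=["text"] here — this is NOT half-cascade.
        llm=realtime.RealtimeModel(model="gpt-4o-realtime-preview"),
        # Parallel STT supplies live interim transcripts for frontend
        # display, compliance logging, and the turn detector model.
        stt=deepgram.STT(model="nova-3"),
    )
```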
Gemini Live half-cascade
The same pattern works with Gemini Live. Use text-only output modality and pair it with a TTS:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.google import realtime as google_realtime
from livekit.plugins import elevenlabs

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a visual shopping assistant. "
                "Describe products the user shows on camera. "
                "Keep descriptions brief and enthusiastic."
            ),
        ),
        room=session.room,
        llm=google_realtime.RealtimeModel(
            model="gemini-2.0-flash",
            modalities=["TEXT"],
        ),
        tts=elevenlabs.TTS(voice="your-voice-id"),
    )

if __name__ == "__main__":
    server.run()
```

With Gemini Live, this is especially powerful: you get multimodal input (audio + video) combined with any voice you want for output. The model sees the user's camera, hears their voice with full emotional context, reasons about both, and responds through a premium TTS voice.
Gemini + vision + custom voice
Half-cascade with Gemini Live is one of the most capable configurations available: multimodal input (audio + video), strong reasoning, and unlimited voice selection. No other single architecture gives you all three.
Latency impact
Half-cascade sits between pure realtime and full pipeline in latency:
| Architecture | Typical TTFB | Why |
|---|---|---|
| Pure realtime | 200-500ms | Single model, audio output directly |
| Half-cascade | 350-700ms | Realtime reasoning + TTS synthesis |
| Full pipeline | 500-1200ms | STT + LLM + TTS chain |
The additional latency comes from TTS synthesis — the realtime model's text output must be converted to audio by a separate model. But you skip the STT stage entirely, so half-cascade is still meaningfully faster than a full pipeline.
The latency numbers tell the story clearly. Half-cascade adds roughly 150-200ms over pure realtime (the TTS synthesis step), but saves roughly 150-500ms compared to a full pipeline (by skipping STT entirely). For applications where you need a specific voice but also want faster responses than a pipeline can deliver, half-cascade is the sweet spot.
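The arithmetic can be made explicit by computing the deltas directly from the table's illustrative bounds (these are the ballpark figures above, not measurements):

```python
# Illustrative TTFB budgets in ms, taken from the table: (low, high).
BUDGETS = {
    "pure_realtime": (200, 500),
    "half_cascade": (350, 700),
    "full_pipeline": (500, 1200),
}

def overhead(a: str, b: str) -> tuple:
    """Range of extra latency that architecture `a` carries over `b`,
    comparing low bound to low bound and high bound to high bound."""
    (a_lo, a_hi), (b_lo, b_hi) = BUDGETS[a], BUDGETS[b]
    return (a_lo - b_lo, a_hi - b_hi)

# Cost of half-cascade versus pure realtime: the TTS synthesis step.
print(overhead("half_cascade", "pure_realtime"))  # (150, 200)

# Savings of half-cascade versus a full pipeline: no STT stage.
print(overhead("full_pipeline", "half_cascade"))  # (150, 500)
```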
When to use half-cascade
Half-cascade is the right choice when:
- You need a specific voice but want faster-than-pipeline latency. Your brand requires a particular voice or you use voice cloning, but 500-1200ms pipeline latency is too slow.
- You want audio-native input understanding with voice control. The realtime model hears tone and emotion — and you want a premium TTS for the output.
- You need the `say()` method for scripted speech. Pure realtime models cannot reliably produce scripted output. With a separate TTS, you can use `say()` to speak exact text when needed (greetings, legal disclaimers, hold messages).
- You need to load conversation history reliably. Some realtime models (notably OpenAI) can become text-only after loading extensive history. Using a separate TTS from the start avoids this failure mode entirely.
- You want Gemini's vision with a better voice. Gemini Live's built-in voices may not match your needs, but its multimodal input is unmatched. Half-cascade lets you pair Gemini's vision with any TTS.
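The scripted-speech case from the list above might look like the following sketch. It assumes a `session.say()` method on the session object; verify the exact name and signature against the current LiveKit Agents API before relying on it.

```python
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.openai import realtime
from livekit.plugins import cartesia

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a brand ambassador for Luxe Cosmetics."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            modalities=["text"],
        ),
        tts=cartesia.TTS(voice="your-brand-voice-id"),
    )
    # With a dedicated TTS, scripted speech is deterministic: the exact
    # text below is synthesized verbatim, which a pure realtime model
    # cannot guarantee.
    await session.say(
        "Hi, thanks for calling Luxe Cosmetics. This call may be recorded."
    )
```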
Half-cascade is NOT the right choice when:
- Pure realtime latency is essential. If every millisecond counts, the extra TTS hop matters.
- The built-in voices are good enough. If one of the six OpenAI voices or five Gemini voices works for your application, pure realtime is simpler and faster.
- You need custom STT accuracy. Half-cascade still relies on the realtime model's built-in speech recognition. If you need a specific STT model (for domain-specific vocabulary, accent handling, or compliance), use a full pipeline.
Comparing all four architectures
You now know four distinct architectures. Here is a summary:
| | Pipeline | Pure Realtime | Half-Cascade | Hybrid |
|---|---|---|---|---|
| STT | Dedicated model | Built into realtime | Built into realtime | Mixed |
| LLM/Reasoning | Dedicated text LLM | Built into realtime | Built into realtime | Mixed |
| TTS | Dedicated model | Built into realtime | Dedicated model | Mixed |
| Voice selection | Unlimited | Limited (5-6 voices) | Unlimited | Mixed |
| Audio understanding | Text only (loses tone) | Audio-native | Audio-native | Mixed |
| Latency | 500-1200ms | 200-500ms | 350-700ms | Varies |
| Complexity | Medium | Low | Low-Medium | High |
What you learned
- The half-cascade architecture pairs a realtime model (audio in, text out) with a dedicated TTS
- Setting `modalities=["text"]` converts a realtime model to text-only output while preserving audio input
- Half-cascade gives you unlimited voice selection with faster-than-pipeline latency
- The pattern works with OpenAI Realtime, Gemini Live, and other providers that support text-only modality
- Half-cascade is especially powerful with Gemini Live: multimodal input plus any voice
Next up
You have now seen four architectures: pipeline, pure realtime, half-cascade, and (up next) hybrid approaches that switch between architectures dynamically. In the next chapter, you will build agents that combine multiple architectures in a single conversation, routing to the right one based on what the user needs.