Custom TTS with realtime models
Realtime models give you low latency and natural speech comprehension — but they lock you into a handful of built-in voices. Pipelines give you any voice from any TTS provider — but add latency from three model hops. What if you could get the best of both? LiveKit supports exactly this: use a realtime model for speech understanding and reasoning, but route its text output through a dedicated TTS for speech synthesis. This is called the half-cascade architecture, and it is one of the most practical patterns for production voice agents.
What you'll learn
- How the half-cascade architecture works and why it exists
- Configuring a realtime model for text-only output
- Pairing a realtime model with any TTS provider
- When half-cascade is the right choice vs pure realtime or full pipeline
The problem: voices vs latency
In Chapters 3 and 4, you saw that realtime models offer a small set of built-in voices — six from OpenAI, five from Gemini. For many applications, that is not enough. You might need a specific brand voice, a cloned voice, fine-grained emotion control, or a provider like Cartesia or ElevenLabs with hundreds of voice options.
The obvious solution is to use a pipeline — but then you lose the realtime model's advantages: lower latency on speech comprehension, audio-native understanding of tone and emotion, and built-in turn detection that uses semantic context.
The half-cascade gives you a middle path.
How half-cascade works
Instead of sending audio to and from the realtime model, you configure it to output text only. The realtime model still receives raw audio input — it still "hears" the user with all the tonal and emotional context that provides. But instead of generating audio output, it generates text. That text is then routed to a dedicated TTS of your choice.
Half-cascade architecture: User Audio → Realtime Model (audio in, text out) → Custom TTS → Agent Audio
Compare this to the three architectures you already know:
| Architecture | Input processing | Reasoning | Output speech |
|---|---|---|---|
| Pipeline | STT model (text) | LLM (text) | TTS model |
| Pure realtime | Realtime model (audio) | Realtime model | Realtime model |
| Half-cascade | Realtime model (audio) | Realtime model | TTS model |
The half-cascade keeps the realtime model's advantage on the input side — it processes raw audio, so it understands tone, emphasis, and emotion that STT would lose in transcription. But on the output side, it delegates to a dedicated TTS, giving you full control over voice selection, emotion, speed, and provider choice. The tradeoff is slightly higher output latency compared to pure realtime, since the text must be synthesized by a separate model.
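The data flow can be sketched with stand-in components before looking at the real implementation. This is a conceptual sketch only: the stub classes and method names are invented for illustration and are not LiveKit APIs.

```python
import asyncio

class StubRealtimeModel:
    """Stands in for a realtime model configured for text-only output:
    it consumes raw audio frames and emits text tokens."""
    async def respond(self, audio_frames):
        # A real model reasons over tone and emphasis in the raw audio;
        # here we just emit a canned token stream.
        for token in ["Thanks", " for", " calling!"]:
            yield token

class StubTTS:
    """Stands in for a dedicated TTS provider (any vendor could sit here)."""
    async def synthesize(self, text: str) -> bytes:
        return f"<audio:{text}>".encode()

async def half_cascade(audio_frames: list) -> bytes:
    model, tts = StubRealtimeModel(), StubTTS()
    # The realtime model handles comprehension and reasoning on audio...
    text = "".join([tok async for tok in model.respond(audio_frames)])
    # ...while the dedicated TTS owns all speech output.
    return await tts.synthesize(text)

print(asyncio.run(half_cascade([b"frame-1"])))  # b'<audio:Thanks for calling!>'
```

The point of the split is visible in the two awaits: comprehension never touches the TTS, and synthesis never touches the audio input, so either side can be swapped independently.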
Implementation
The implementation is straightforward. Set the realtime model's `modalities` to `["text"]` so it produces text instead of audio, and provide a `tts` parameter with any TTS instance.
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.openai import realtime
from livekit.plugins import cartesia

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a brand ambassador for Luxe Cosmetics. "
                "Speak warmly and enthusiastically about products. "
                "Keep responses concise — two sentences maximum."
            ),
        ),
        room=session.room,
        # Realtime model handles audio input and reasoning
        # but outputs TEXT only — no built-in voice synthesis
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            modalities=["text"],
        ),
        # Cartesia handles all speech output with your chosen voice
        tts=cartesia.TTS(voice="your-brand-voice-id"),
    )

if __name__ == "__main__":
    server.run()
```

```typescript
import { AgentSession, Agent, defineAgent, type RtcSession } from "@livekit/agents";
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";

export default defineAgent({
  entry: async (session: RtcSession) => {
    await session.start({
      agent: new Agent({
        instructions:
          "You are a brand ambassador for Luxe Cosmetics. " +
          "Speak warmly and enthusiastically about products. " +
          "Keep responses concise — two sentences maximum.",
      }),
      room: session.room,
      // Realtime model handles audio input and reasoning
      // but outputs TEXT only — no built-in voice synthesis
      llm: new OpenAIRealtime({
        model: "gpt-4o-realtime-preview",
        modalities: ["text"],
      }),
      // Cartesia handles all speech output with your chosen voice
      tts: new CartesiaTTS({ voice: "your-brand-voice-id" }),
    });
  },
});
```

The key is `modalities=["text"]`. This single parameter changes the realtime model from an end-to-end speech model into a "speech-in, text-out" model. The framework detects that TTS is configured and routes the text output through it automatically.
Works with any realtime model that supports text modality
Not all realtime model providers support text-only output. OpenAI Realtime supports it. Check the relevant LiveKit plugin page for your provider to confirm support before using this pattern.
Using any TTS provider
Because the TTS is now a separate component, you can use any provider LiveKit supports — the same providers available in a full pipeline.
```python
from livekit.plugins import cartesia, elevenlabs, google
from livekit.plugins.openai import realtime

# Option 1: Cartesia — hundreds of voices, fine-grained emotion control
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = cartesia.TTS(voice="your-voice-id")

# Option 2: ElevenLabs — voice cloning, multilingual
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = elevenlabs.TTS(voice="your-cloned-voice-id")

# Option 3: Google Cloud TTS — affordable, production-grade
llm = realtime.RealtimeModel(model="gpt-4o-realtime-preview", modalities=["text"])
tts = google.TTS(voice="en-US-Journey-F")
```

```typescript
import { OpenAIRealtime } from "@livekit/agents-plugin-openai";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { ElevenLabsTTS } from "@livekit/agents-plugin-elevenlabs";

// Option 1: Cartesia — hundreds of voices, fine-grained emotion control
const llm = new OpenAIRealtime({
  model: "gpt-4o-realtime-preview",
  modalities: ["text"],
});
const tts1 = new CartesiaTTS({ voice: "your-voice-id" });

// Option 2: ElevenLabs — voice cloning, multilingual
const tts2 = new ElevenLabsTTS({ voice: "your-cloned-voice-id" });
```

This is the same TTS flexibility you get with a full pipeline, but without the STT stage. The realtime model replaces both STT and LLM — it understands audio directly and produces text — while TTS is handled by a specialist provider. You get voice cloning, custom voices, emotion control, and every other TTS feature that pure realtime models cannot offer.
What about adding a separate STT?
You can also go the other direction: keep the realtime model's built-in voice output but add a dedicated STT running in parallel for better transcriptions. Realtime models have a known "delayed transcription" problem — user input transcripts can arrive late or after the agent has already responded. Adding a parallel STT (like Deepgram Nova-3) gives you real-time interim transcripts for frontend display and compliance logging. This is also required if you want to use LiveKit's context-aware turn detector model, which needs live STT results to operate. Unlike half-cascade, this does not change the core architecture — the realtime model still does all comprehension and reasoning. The parallel STT is a supplementary stream, not a replacement.
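A parallel-STT setup might look like the sketch below. Treat it as a configuration sketch, not a definitive implementation: the `stt` parameter alongside an audio-output realtime model, and the Deepgram plugin's constructor arguments, are assumptions to verify against the current LiveKit plugin documentation.

```python
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins import deepgram
from livekit.plugins.openai import realtime

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a helpful support agent."),
        room=session.room,
        # Realtime model keeps its built-in voice: audio in, audio out.
        # Note: no modalities=["text"] here — this is NOT half-cascade.
        llm=realtime.RealtimeModel(model="gpt-4o-realtime-preview"),
        # Parallel STT supplies live interim transcripts for frontend
        # display, compliance logging, and the turn detector model.
        stt=deepgram.STT(model="nova-3"),
    )
```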
Gemini Live half-cascade
The same pattern works with Gemini Live. Use text-only output modality and pair it with a TTS:
```python
from livekit.agents import AgentSession, Agent, AgentServer, rtc_session
from livekit.plugins.google import realtime as google_realtime
from livekit.plugins import elevenlabs

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(
            instructions=(
                "You are a visual shopping assistant. "
                "Describe products the user shows on camera. "
                "Keep descriptions brief and enthusiastic."
            ),
        ),
        room=session.room,
        llm=google_realtime.RealtimeModel(
            model="gemini-2.0-flash",
            modalities=["TEXT"],
        ),
        tts=elevenlabs.TTS(voice="your-voice-id"),
    )

if __name__ == "__main__":
    server.run()
```

With Gemini Live, this is especially powerful: you get multimodal input (audio + video) combined with any voice you want for output. The model sees the user's camera, hears their voice with full emotional context, reasons about both, and responds through a premium TTS voice.
Gemini + vision + custom voice
Half-cascade with Gemini Live is one of the most capable configurations available: multimodal input (audio + video), strong reasoning, and unlimited voice selection. No other single architecture gives you all three.
Latency impact
Half-cascade sits between pure realtime and full pipeline in latency:
| Architecture | Typical TTFB | Why |
|---|---|---|
| Pure realtime | 200-500ms | Single model, audio output directly |
| Half-cascade | 350-700ms | Realtime reasoning + TTS synthesis |
| Full pipeline | 500-1200ms | STT + LLM + TTS chain |
The additional latency comes from TTS synthesis — the realtime model's text output must be converted to audio by a separate model. But you skip the STT stage entirely, so half-cascade is still meaningfully faster than a full pipeline.
The latency numbers tell the story clearly. Half-cascade adds roughly 150-200ms over pure realtime (the TTS synthesis step), but saves roughly 150-500ms compared to a full pipeline (by skipping STT entirely). For applications where you need a specific voice but also want faster responses than a pipeline can deliver, half-cascade is the sweet spot.
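The arithmetic can be made explicit by computing the deltas directly from the table's illustrative bounds (these are the ballpark figures above, not measurements):

```python
# Illustrative TTFB budgets in ms, taken from the table: (low, high).
BUDGETS = {
    "pure_realtime": (200, 500),
    "half_cascade": (350, 700),
    "full_pipeline": (500, 1200),
}

def overhead(a: str, b: str) -> tuple:
    """Range of extra latency that architecture `a` carries over `b`,
    comparing low bound to low bound and high bound to high bound."""
    (a_lo, a_hi), (b_lo, b_hi) = BUDGETS[a], BUDGETS[b]
    return (a_lo - b_lo, a_hi - b_hi)

# Cost of half-cascade versus pure realtime: the TTS synthesis step.
print(overhead("half_cascade", "pure_realtime"))  # (150, 200)

# Savings of half-cascade versus a full pipeline: no STT stage.
print(overhead("full_pipeline", "half_cascade"))  # (150, 500)
```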
When to use half-cascade
Half-cascade is the right choice when:
- You need a specific voice but want faster-than-pipeline latency. Your brand requires a particular voice or you use voice cloning, but 500-1200ms pipeline latency is too slow.
- You want audio-native input understanding with voice control. The realtime model hears tone and emotion — and you want a premium TTS for the output.
- You need the `say()` method for scripted speech. Pure realtime models cannot reliably produce scripted output. With a separate TTS, you can use `say()` to speak exact text when needed (greetings, legal disclaimers, hold messages).
- You need to load conversation history reliably. Some realtime models (notably OpenAI) can become text-only after loading extensive history. Using a separate TTS from the start avoids this failure mode entirely.
- You want Gemini's vision with a better voice. Gemini Live's built-in voices may not match your needs, but its multimodal input is unmatched. Half-cascade lets you pair Gemini's vision with any TTS.
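The scripted-speech case from the list above might look like the following sketch. It assumes a `session.say()` method on the session object; verify the exact name and signature against the current LiveKit Agents API before relying on it.

```python
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.openai import realtime
from livekit.plugins import cartesia

server = AgentServer()

@server.rtc_session
async def entrypoint(session: AgentSession):
    await session.start(
        agent=Agent(instructions="You are a brand ambassador for Luxe Cosmetics."),
        room=session.room,
        llm=realtime.RealtimeModel(
            model="gpt-4o-realtime-preview",
            modalities=["text"],
        ),
        tts=cartesia.TTS(voice="your-brand-voice-id"),
    )
    # With a dedicated TTS, scripted speech is deterministic: the exact
    # text below is synthesized verbatim, which a pure realtime model
    # cannot guarantee.
    await session.say(
        "Hi, thanks for calling Luxe Cosmetics. This call may be recorded."
    )
```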
Half-cascade is NOT the right choice when:
- Pure realtime latency is essential. If every millisecond counts, the extra TTS hop matters.
- The built-in voices are good enough. If one of the six OpenAI voices or five Gemini voices works for your application, pure realtime is simpler and faster.
- You need custom STT accuracy. Half-cascade still relies on the realtime model's built-in speech recognition. If you need a specific STT model (for domain-specific vocabulary, accent handling, or compliance), use a full pipeline.
Comparing all four architectures
You now know four distinct architectures. Here is a summary:
| | Pipeline | Pure Realtime | Half-Cascade | Hybrid |
|---|---|---|---|---|
| STT | Dedicated model | Built into realtime | Built into realtime | Mixed |
| LLM/Reasoning | Dedicated text LLM | Built into realtime | Built into realtime | Mixed |
| TTS | Dedicated model | Built into realtime | Dedicated model | Mixed |
| Voice selection | Unlimited | Limited (5-6 voices) | Unlimited | Mixed |
| Audio understanding | Text only (loses tone) | Audio-native | Audio-native | Mixed |
| Latency | 500-1200ms | 200-500ms | 350-700ms | Varies |
| Complexity | Medium | Low | Low-Medium | High |
What you learned
- The half-cascade architecture pairs a realtime model (audio in, text out) with a dedicated TTS
- Setting `modalities=["text"]` converts a realtime model to text-only output while preserving audio input
- Half-cascade gives you unlimited voice selection with faster-than-pipeline latency
- The pattern works with OpenAI Realtime, Gemini Live, and other providers that support text-only modality
- Half-cascade is especially powerful with Gemini Live: multimodal input plus any voice
Next up
You have now seen four architectures: pipeline, pure realtime, half-cascade, and (up next) hybrid approaches that switch between architectures dynamically. In the next chapter, you will build agents that combine multiple architectures in a single conversation, routing to the right one based on what the user needs.