Chapter 5

Hybrid approaches

You do not have to choose one architecture for every interaction. The most capable production agents use realtime models for fast conversational exchanges and switch to the pipeline when the conversation demands complex tool use, specialized STT accuracy, or a particular voice. In this chapter, you will build a hybrid agent that switches between architectures dynamically, implement fallback patterns for reliability, and learn when hybrid complexity is worth the engineering cost.


What you'll learn

  • How to build a hybrid agent that switches between realtime and pipeline
  • Dynamic model switching based on conversation state and intent
  • Fallback patterns for handling provider outages gracefully
  • When hybrid architecture is worth the added complexity

Why hybrid?

Each architecture has clear strengths. Realtime models deliver lower latency and more natural conversation flow. Pipelines give you control over each component — better STT accuracy, wider voice selection, mature tool calling. A hybrid approach lets you use realtime for the 80% of interactions that are simple conversational exchanges, and switch to the pipeline for the 20% that require heavy lifting.

Hybrid routing

User speaks → route based on complexity. Simple query? Yes: realtime model (fast, low-latency response). No: pipeline (STT → LLM → TTS; full control and reliability). Either way, the agent responds.

What's happening

The key insight behind hybrid architecture is that most voice conversations are dominated by simple exchanges — greetings, confirmations, short factual questions. These do not need the full power of a pipeline with a top-tier LLM. A fast realtime model handles them with lower latency and less cost. But when the user asks something that requires a database lookup, a multi-step calculation, or a specialized model, the pipeline takes over.

Dynamic model switching via agent handoff

You cannot swap the LLM inside a tool call and expect it to help — the current model has already done its reasoning by the time the tool executes. Instead, the correct pattern is agent handoff: a lightweight realtime agent handles fast conversational turns, and when it detects a complex request, it hands off to a pipeline agent that processes the next turn with full STT/LLM/TTS control.

The realtime agent's only job for complex requests is to recognize them and hand off. The pipeline agent handles the actual reasoning, tool calling, and response. When the complex work is done, the pipeline agent hands back to the realtime agent for fast follow-up conversation.

hybrid_agent.py (Python)
from livekit.agents import (
  AgentSession, Agent, AgentServer,
  function_tool, RunContext,
)
from livekit.plugins.openai import realtime, LLM
from livekit.plugins import deepgram, cartesia

server = AgentServer()


class RealtimeAgent(Agent):
  """Fast conversational agent using a realtime model.
  Handles greetings, simple questions, and chitchat.
  Hands off to PipelineAgent for complex operations."""

  def __init__(self):
      super().__init__(
          instructions=(
              "You are a customer support agent. Handle simple questions "
              "conversationally. When the user asks about orders, accounts, "
              "or billing, use the hand_to_specialist tool — do NOT try to "
              "answer those questions yourself."
          ),
      )

  @function_tool()
  async def hand_to_specialist(self, ctx: RunContext, reason: str) -> str:
      """Hand off to the specialist agent for complex queries.

      Args:
          reason: Brief description of what the user needs.
      """
      ctx.session.update_agent(PipelineAgent())
      return f"Handing off: {reason}"


class PipelineAgent(Agent):
  """Pipeline agent for complex operations.
  Uses dedicated STT, a powerful text LLM, and high-quality TTS.
  Hands back to RealtimeAgent when done."""

  def __init__(self):
      super().__init__(
          instructions=(
              "You are an order specialist. Help the user with their "
              "order inquiry. Use the lookup_order tool to find details. "
              "When the user's issue is resolved, use hand_back_to_chat "
              "to return to normal conversation mode."
          ),
          # Per-agent overrides: this agent runs the full pipeline
          # instead of the session's realtime model.
          stt=deepgram.STT(model="nova-3"),
          llm=LLM(model="gpt-4o"),
          tts=cartesia.TTS(),
      )

  @function_tool()
  async def lookup_order(self, ctx: RunContext, order_id: str) -> str:
      """Look up order status by ID."""
      # Real database query here
      return f"Order {order_id}: 3 items, shipped March 27, arriving April 1."

  @function_tool()
  async def hand_back_to_chat(self, ctx: RunContext) -> str:
      """Return to fast conversational mode after resolving the issue."""
      ctx.session.update_agent(RealtimeAgent())
      return "Switching back to conversational mode."


@server.rtc_session
async def entrypoint(session: AgentSession):
  # Start with realtime for fast initial interaction
  await session.start(
      agent=RealtimeAgent(),
      room=session.room,
      # RealtimeAgent uses the realtime model — no STT/TTS needed
      llm=realtime.RealtimeModel(
          model="gpt-4o-realtime-preview",
          voice="alloy",
      ),
  )


if __name__ == "__main__":
  server.run()
hybridAgent.ts (TypeScript)
import {
AgentSession, Agent, defineAgent, functionTool,
type RunContext, type RtcSession,
} from "@livekit/agents";
import { OpenAIRealtime, OpenAILLM } from "@livekit/agents-plugin-openai";
import { DeepgramSTT } from "@livekit/agents-plugin-deepgram";
import { CartesiaTTS } from "@livekit/agents-plugin-cartesia";
import { z } from "zod";

class RealtimeAgent extends Agent {
constructor() {
  super({
    instructions:
      "You are a customer support agent. Handle simple questions " +
      "conversationally. When the user asks about orders, accounts, " +
      "or billing, use the hand_to_specialist tool — do NOT try to " +
      "answer those questions yourself.",
  });
}

tools = [
  functionTool({
    name: "hand_to_specialist",
    description: "Hand off to the specialist agent for complex queries.",
    parameters: z.object({
      reason: z.string().describe("What the user needs help with"),
    }),
    execute: async (ctx: RunContext, { reason }: { reason: string }) => {
      ctx.session.updateAgent(new PipelineAgent());
      return `Handing off: ${reason}`;
    },
  }),
];
}

class PipelineAgent extends Agent {
constructor() {
  super({
    instructions:
      "You are an order specialist. Help the user with their " +
      "order inquiry. Use the lookup_order tool to find details. " +
      "When resolved, use hand_back_to_chat to return to normal mode.",
    // Per-agent overrides: this agent runs the full pipeline
    // instead of the session's realtime model.
    stt: new DeepgramSTT({ model: "nova-3" }),
    llm: new OpenAILLM({ model: "gpt-4o" }),
    tts: new CartesiaTTS(),
  });
}

tools = [
  functionTool({
    name: "lookup_order",
    description: "Look up order status by ID.",
    parameters: z.object({
      orderId: z.string().describe("The order ID to look up"),
    }),
    execute: async (ctx: RunContext, { orderId }: { orderId: string }) => {
      return `Order ${orderId}: 3 items, shipped March 27, arriving April 1.`;
    },
  }),
  functionTool({
    name: "hand_back_to_chat",
    description: "Return to fast conversational mode.",
    parameters: z.object({}),
    execute: async (ctx: RunContext) => {
      ctx.session.updateAgent(new RealtimeAgent());
      return "Switching back to conversational mode.";
    },
  }),
];
}

export default defineAgent({
entry: async (session: RtcSession) => {
  await session.start({
    agent: new RealtimeAgent(),
    room: session.room,
    llm: new OpenAIRealtime({
      model: "gpt-4o-realtime-preview",
      voice: "alloy",
    }),
  });
},
});

Why agent handoff, not mid-turn model swapping?

You might think you can swap the LLM inside a tool call — but by the time a tool executes, the current model has already done its reasoning. The swap would only affect the next turn, and you would have wasted the current turn's processing. Agent handoff is cleaner: the realtime agent recognizes "this is complex," hands off immediately, and the pipeline agent handles the next turn from scratch with the right architecture.

Session continuity during handoff

When you hand off between agents, LiveKit preserves the conversation history. The new agent receives the full context of what has been said so far, so the user does not need to repeat themselves. The handoff introduces a brief pause — typically 200-500ms — which is noticeable but acceptable for most use cases.

Fallback patterns

Provider outages happen. A robust production agent needs fallback paths so a single provider going down does not take your entire system offline.

fallback_agent.py (Python)
from livekit.agents import AgentSession, Agent, AgentServer
from livekit.plugins.openai import realtime as openai_realtime
from livekit.plugins.google import realtime as google_realtime
from livekit.plugins import deepgram, openai, cartesia
import logging

logger = logging.getLogger("fallback-agent")

server = AgentServer()


async def create_llm_with_fallback():
  """Try realtime providers in order, fall back to pipeline."""
  providers = [
      ("OpenAI Realtime", lambda: openai_realtime.RealtimeModel(
          model="gpt-4o-realtime-preview",
          voice="alloy",
      )),
      ("Gemini Live", lambda: google_realtime.RealtimeModel(
          model="gemini-2.0-flash",
          voice="Puck",
      )),
  ]

  for name, factory in providers:
      try:
          model = factory()
          logger.info(f"Using {name}")
          return model, None, None
      except Exception as e:
          logger.warning(f"{name} unavailable: {e}")

  # All realtime providers failed — fall back to pipeline
  logger.warning("All realtime providers unavailable, using pipeline")
  return (
      openai.LLM(model="gpt-4o-mini"),
      deepgram.STT(model="nova-3"),
      cartesia.TTS(),
  )


@server.rtc_session
async def entrypoint(session: AgentSession):
  llm, stt, tts = await create_llm_with_fallback()

  await session.start(
      agent=Agent(instructions="You are a helpful assistant."),
      room=session.room,
      llm=llm,
      stt=stt,
      tts=tts,
  )


if __name__ == "__main__":
  server.run()
What's happening

The fallback pattern tries providers in priority order. If the preferred realtime model is unavailable, it tries the next one. If all realtime providers fail, it falls back to a pipeline. The user always gets a working agent — the experience may differ slightly, but the conversation never fails entirely. In production, you would add health checks and circuit breakers rather than catching exceptions at connection time.
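The circuit-breaker idea can be sketched without any LiveKit dependency. The `CircuitBreaker` class below is a hypothetical minimal implementation, not a library API: after a run of consecutive failures it skips a provider for a cooldown period before allowing a retry.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    skip the provider for `cooldown` seconds before trying it again."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: allow one trial request ("half-open")
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

In the fallback loop above, you would keep one breaker per provider, check `available()` before calling its factory, and record the outcome so a flapping provider stops being retried on every new session.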

Deciding when to switch

The hardest part of hybrid architecture is deciding when to switch. Here are three common strategies:

1. Capability-based handoff

The realtime agent has a hand-off tool for each complex domain. When the user asks about orders, billing, or account changes, the realtime agent recognizes the domain and hands off to a pipeline agent with the right tools and a more capable LLM. The pipeline agent hands back when the task is complete. This is the pattern shown above.

2. Intent-based handoff

Give the realtime agent a single escalate tool with a reason parameter. Instruct it to escalate when the user's request involves multi-step reasoning, detailed analysis, or domain-specific knowledge. A router in the entrypoint reads the reason and dispatches to the appropriate pipeline agent. This is more flexible than capability-based handoff but harder to test.
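The dispatch step can be sketched independently of any framework. The `ROUTES` table and `route_escalation` helper below are hypothetical names; a real router would read the escalate tool's `reason` argument and hand off to the matching pipeline agent.

```python
# Hypothetical keyword -> pipeline-agent routing table.
ROUTES = {
    "order": "OrderPipelineAgent",
    "billing": "BillingPipelineAgent",
    "account": "AccountPipelineAgent",
}


def route_escalation(reason: str, default: str = "GeneralPipelineAgent") -> str:
    """Pick a pipeline agent name from the escalate tool's free-text reason."""
    reason_lower = reason.lower()
    for keyword, agent_name in ROUTES.items():
        if keyword in reason_lower:
            return agent_name
    # No keyword matched: fall through to a general-purpose pipeline agent
    return default
```

Because the realtime agent only has to produce a short free-text reason, its prompt stays simple; the routing logic lives in ordinary code you can unit-test.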

3. Phase-based switching

Some conversations have clear phases: a fast greeting phase (realtime), a complex transaction phase (pipeline), and a wrap-up phase (realtime). Structure the flow as a sequence of agents, each using the architecture that fits its phase. No runtime detection needed — the handoffs are predetermined by the conversation design.

Hybrid adds complexity

Every switching strategy adds code paths, failure modes, and testing surface. A straightforward realtime-only or pipeline-only agent is easier to build, test, deploy, and debug. Only adopt a hybrid approach when you have concrete evidence that a single architecture cannot meet your requirements. Start simple, measure, and add complexity only when the data justifies it.


What you learned

  • Hybrid agents combine the low latency of realtime models with the control and reliability of pipelines
  • Dynamic model switching lets you change architectures mid-conversation based on the task at hand
  • Fallback patterns with ordered provider lists keep the agent running even when individual providers fail
  • Three switching strategies — capability-based, intent-based, and phase-based — each suit different use cases
  • Hybrid complexity is only worth it when a single architecture demonstrably falls short

Next up

You have built agents with every architecture: pipeline, OpenAI Realtime, Gemini Live, and hybrid. In the next chapter, you will put them all to the test — measuring latency, quality, and cost across each approach with real benchmarks so you can make data-driven decisions.

Concepts covered
Hybrid architecture · Fallback · Model switching