Choosing your architecture
You have built pipeline agents, OpenAI Realtime agents, Gemini Live agents, and hybrid agents. You have benchmarked their latency, compared their audio quality, and estimated their cost. Now it is time to pull it all together into a decision framework that you can apply to any voice AI project. This final chapter gives you a structured way to choose the right architecture for your specific requirements.
What you'll learn
- A repeatable decision framework for choosing between pipeline, realtime, and hybrid
- Which architecture fits which use case
- How to weigh cost, latency, quality, and control against your priorities
- A decision tree you can use for future projects
The decision framework
Architecture decisions should be driven by requirements, not hype. Work through these five questions in order:
1. What is your latency budget?
If your users expect sub-500ms TTFB (gaming, live translation, interactive characters), realtime is likely required. If 800-1200ms is acceptable (customer support, scheduling, information lookup), pipeline is viable. If you are unsure, test with real users — perceived latency tolerance varies by use case.
2. How important is voice control?
If you need a specific brand voice, voice cloning, fine-grained emotion control, or a particular accent, pipeline with a dedicated TTS provider (Cartesia, ElevenLabs) is the only option today. If one of the built-in realtime voices works for your application, this constraint disappears.
3. How complex is your tool use?
Simple tools (single lookups, short API calls) work well in both architectures. Complex tool chains (multi-step workflows, conditional branching, retries, database transactions) are more reliable in pipeline mode where text-based LLMs have mature function-calling patterns. If your agent is tool-heavy, lean toward pipeline or hybrid.
4. Do you need multimodal input?
If your agent needs to process video or images during live conversation, Gemini Live handles this natively. Pipeline requires bolting on a separate vision model. OpenAI Realtime is audio-only. This single requirement can drive the entire decision.
5. What is your cost constraint?
Pipeline with a small LLM (GPT-4o-mini) is the cheapest per conversation. Gemini Live is competitive. OpenAI Realtime is the most expensive. If you are building for high volume (thousands of concurrent conversations), cost differences compound quickly.
Decision tree
Use this tree as a starting point. Enter at the top and follow the branches based on your answers:
Architecture decision tree:

```
Need sub-500ms TTFB?
├─ Yes → Need multimodal vision?
│        ├─ Yes → Gemini Live
│        └─ No  → Heavy tool use?
│                 ├─ Yes → Hybrid (realtime + pipeline fallback)
│                 └─ No  → OpenAI Realtime or Gemini Live
└─ No  → Need a specific voice/TTS?
         ├─ Yes → Pipeline
         └─ No  → Heavy tool use?
                  ├─ Yes → Pipeline
                  └─ No  → Cost sensitive?
                           ├─ Yes → Pipeline with small LLM
                           └─ No  → Realtime or Hybrid
```
This decision tree is a simplification. Real projects often have competing requirements — you might need low latency AND a specific voice, which means neither pure approach is perfect. In those cases, hybrid architectures or creative workarounds (like using realtime for the conversation and a separate TTS for specific branded audio clips) can bridge the gap. Use the tree as a starting point, not a final answer.
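The tree's branches can be encoded as a small helper function, which is handy for documenting the decision alongside your project requirements. This is a sketch only — the function name and parameters are illustrative, not part of any library:

```python
def choose_architecture(
    needs_sub_500ms_ttfb: bool,
    needs_vision: bool = False,
    needs_specific_voice: bool = False,
    heavy_tool_use: bool = False,
    cost_sensitive: bool = False,
) -> str:
    """Walk the decision tree and return a recommended architecture."""
    if needs_sub_500ms_ttfb:
        if needs_vision:
            return "Gemini Live"
        if heavy_tool_use:
            return "Hybrid (realtime + pipeline fallback)"
        return "OpenAI Realtime or Gemini Live"
    # Latency is flexible: voice control and tool complexity favor pipeline.
    if needs_specific_voice or heavy_tool_use:
        return "Pipeline"
    if cost_sensitive:
        return "Pipeline with small LLM"
    return "Realtime or Hybrid"


# Example: an outbound calling system with no hard latency floor where cost dominates.
print(choose_architecture(needs_sub_500ms_ttfb=False, cost_sensitive=True))
# → Pipeline with small LLM
```

Encoding the tree this way also makes competing requirements visible: if two flags pull toward different branches, you know you are in hybrid territory.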
Use case recommendations
Here are concrete recommendations for common voice AI use cases:
| Use case | Recommended architecture | Why |
|---|---|---|
| Customer support hotline | Pipeline | Needs transcript logging, reliable tool use for order lookup, and specific brand voice |
| Interactive game character | Realtime (OpenAI or Gemini) | Latency is critical, natural conversation dynamics matter, tool use is minimal |
| Healthcare triage | Pipeline | Regulatory requirement for accurate transcripts, complex decision trees, audit trail |
| Language tutor | Realtime (Gemini Live) | Benefits from hearing pronunciation nuance, video can show written text or gestures |
| Virtual receptionist | Hybrid | Fast greetings and simple routing via realtime, switch to pipeline for appointment scheduling with calendar tools |
| Technical support with screen sharing | Gemini Live | Vision capability lets the agent see the user's screen and guide them visually |
| High-volume outbound calls | Pipeline (small LLM) | Cost optimization at scale, 800ms TTFB is acceptable for outbound, needs telephony integration |
| Real-time translation | Realtime | Lowest possible latency is essential, audio-native processing preserves tone |
Recommendations evolve with the models
These recommendations reflect the state of realtime models as of early 2026. Both OpenAI and Google are rapidly improving their realtime offerings — adding more voices, better tool support, and lower prices. Revisit your architecture decision every 6-12 months as the landscape shifts.
Cost analysis at scale
Cost differences that seem small per conversation become significant at scale. Here is what the monthly bill looks like at different volumes:
| Daily conversations | Pipeline (GPT-4o-mini) | OpenAI Realtime | Gemini Live |
|---|---|---|---|
| 100 | $300 - $450 | $900 - $1,800 | $300 - $750 |
| 1,000 | $3,000 - $4,500 | $9,000 - $18,000 | $3,000 - $7,500 |
| 10,000 | $30,000 - $45,000 | $90,000 - $180,000 | $30,000 - $75,000 |
These are estimates based on 10-minute average conversations. Actual costs depend on conversation length, response verbosity, tool call frequency, and current provider pricing. Use these numbers for order-of-magnitude planning, not budgeting. Run your own cost analysis with realistic conversation samples before committing to a production architecture.
At 10,000 daily conversations, the difference between pipeline and OpenAI Realtime can be $50,000-$130,000 per month. That is enough to fund a dedicated engineering team to optimize your pipeline. Cost analysis should not be an afterthought — for high-volume applications, it is often the deciding factor.
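As a sanity check on the table above, the per-conversation rates it implies can be turned into a quick calculator. The rates below are back-derived from the table's low-end figures (e.g. $300/month at 100 daily conversations is roughly $0.10 per conversation) and are illustrative, not provider pricing:

```python
def monthly_cost(per_conversation_usd: float, daily_conversations: int,
                 days: int = 30) -> float:
    """Estimated monthly spend for a given per-conversation rate."""
    return per_conversation_usd * daily_conversations * days

# Low-end rates implied by the table: pipeline ~$0.10, realtime ~$0.30 per conversation.
pipeline = monthly_cost(0.10, 10_000)   # ≈ $30,000 / month
realtime = monthly_cost(0.30, 10_000)   # ≈ $90,000 / month
print(f"Monthly gap at 10k daily calls: ${realtime - pipeline:,.0f}")
```

Plugging in your own measured per-conversation cost (from real transcripts, not averages pulled from pricing pages) turns this from an illustration into a budgeting tool.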
Making the final decision
After working through the framework, you should have a clear primary architecture and a fallback plan:
1. Build a prototype with your primary choice
Implement your agent with the architecture the decision tree suggests. Use your actual prompts, tools, and conversation flows — not toy examples.
2. Benchmark with real scenarios
Run the benchmarking harness from Chapter 6 with your actual use case. Measure TTFB, E2E latency, and audio quality with your target user population if possible.
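Whatever harness you use, summarize latency with percentiles rather than averages — tail latency is what users notice. A minimal sketch (plain Python, no harness dependencies; the sample values are made up) for turning raw TTFB measurements into the numbers worth tracking:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# Hypothetical raw TTFB measurements in milliseconds from a benchmark run.
ttfb_ms = [420, 510, 480, 950, 460, 505, 470, 1200, 430, 490]
print(f"p50: {percentile(ttfb_ms, 50):.0f} ms")
print(f"p95: {percentile(ttfb_ms, 95):.0f} ms")
```

A p50 inside your latency budget with a p95 far outside it is a common pattern — often caused by tool calls or cold starts — and it points at a different fix than a uniformly slow pipeline.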
3. Test the fallback
Build a second implementation with your fallback architecture. You do not need to polish it — just prove it works so you have a backup if your primary choice hits a wall.
4. Deploy, measure, iterate
Ship the primary architecture. Monitor latency, quality, cost, and user satisfaction in production. Be prepared to switch architectures if the data tells you to — the code changes are minimal thanks to LiveKit's consistent API surface.
What you learned
- A five-question decision framework covers latency, voice control, tool complexity, multimodal needs, and cost
- The decision tree provides a structured starting point for architecture selection
- Use case recommendations map common scenarios to their best-fit architectures
- Cost differences compound at scale — pipeline with a small LLM is 3-6x cheaper than OpenAI Realtime
- The best approach is to prototype, benchmark, and iterate rather than committing to an architecture based on theory alone
Course summary
Over seven chapters, you have gone from understanding the fundamental difference between pipeline and realtime architectures to building production-ready agents with both approaches. You implemented a pipeline agent with optimized model selection, built realtime agents with both OpenAI and Gemini Live, designed hybrid architectures that combine the strengths of each, and measured their performance with real benchmarks. You now have the knowledge and the code to choose the right architecture for any voice AI project — and the framework to revisit that decision as the technology evolves.
Keep experimenting
The voice AI landscape is moving fast. New realtime models, new TTS providers, and new hybrid patterns emerge regularly. The architectures you learned in this course give you a stable foundation — but the specific model choices and provider recommendations will change. Stay curious, keep benchmarking, and let the data guide your decisions.