Chapter 6 · 15m

Latency and quality comparison

Numbers settle debates. In this chapter, you will learn how to measure the performance of pipeline, realtime, and hybrid agents using consistent methodology. You will see real-world latency benchmarks, understand the metrics that matter for voice AI, and build a benchmarking harness that you can run against your own agents.


What you'll learn

  • The key metrics for evaluating voice AI agent performance
  • How to measure time-to-first-byte, end-to-end latency, and audio quality
  • Real-world latency numbers across pipeline, OpenAI Realtime, and Gemini Live
  • How to build a repeatable benchmarking harness

Metrics that matter

Voice AI has its own performance vocabulary. These are the metrics you should track:

1. Time-to-First-Byte (TTFB)

The time from when the user finishes speaking to when the first byte of agent audio reaches the client. This is what determines perceived responsiveness. Users notice TTFB above 800ms and find it uncomfortable above 1500ms.

2. End-to-End Latency (E2E)

The total time from user speech end to the agent finishing its response. This includes TTFB plus the full response generation and audio streaming time. E2E matters for conversation pacing — if the agent takes 5 seconds to answer a yes-or-no question, the experience feels broken.

3. Transcript Accuracy (WER)

Word Error Rate measures how accurately the user's speech is transcribed. It applies only to pipeline agents, where an explicit STT step produces text; realtime models transcribe internally, which makes WER harder to measure directly.

4. Audio Quality (MOS)

Mean Opinion Score is a subjective 1-5 rating of synthesized audio quality. It captures naturalness, clarity, and prosody. Pipeline agents with premium TTS providers (Cartesia, ElevenLabs) often score higher on MOS than realtime models, whose voices are more limited.
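To make the WER metric concrete: it is the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the STT output, divided by the number of reference words. A minimal sketch follows; real evaluations also normalize casing, punctuation, and numbers before scoring, which this toy version skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table of word-level edit distances
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# "your" -> "you" is one substitution out of five reference words: WER 0.2
score = wer("what is your return policy", "what is you return policy")
```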

What's happening

TTFB is the single most important metric for user experience. Research on conversational AI consistently shows that users tolerate lower audio quality more easily than they tolerate slow responses. A fast agent with a slightly robotic voice feels better than a slow agent with a perfect voice. Optimize TTFB first, then improve quality.
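Measuring TTFB means timing from end-of-speech to the first audio chunk the client receives. A sketch of that measurement, assuming the agent exposes its output as an async iterator of audio chunks (the `measure_ttfb_ms` and `fake_agent_stream` names here are illustrative, not a real SDK API):

```python
import asyncio
import time


async def measure_ttfb_ms(audio_stream) -> float:
    """Milliseconds from iteration start to the first audio chunk.

    In a real agent, start the clock when end-of-speech is detected,
    not when iteration begins.
    """
    start = time.monotonic()
    async for _chunk in audio_stream:
        return (time.monotonic() - start) * 1000
    return float("inf")  # stream produced no audio


async def fake_agent_stream():
    """Simulated agent audio stream for illustration."""
    await asyncio.sleep(0.05)  # model latency before any audio
    yield b"\x00" * 320        # dummy audio frame
    yield b"\x00" * 320


ttfb = asyncio.run(measure_ttfb_ms(fake_agent_stream()))
```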

Real-world latency benchmarks

The following numbers come from testing the same 50-prompt benchmark suite against each architecture, measured from a US-East data center. All models used default configurations.

Time-to-First-Byte (TTFB)

| Architecture | P50 TTFB | P90 TTFB | P99 TTFB |
| --- | --- | --- | --- |
| Pipeline (Deepgram + GPT-4o-mini + Cartesia) | 620ms | 890ms | 1400ms |
| Pipeline (Deepgram + GPT-4o + Cartesia) | 780ms | 1100ms | 1800ms |
| OpenAI Realtime (gpt-4o-realtime-preview) | 320ms | 480ms | 750ms |
| Gemini Live (gemini-2.0-flash) | 280ms | 420ms | 680ms |

End-to-End Latency

| Architecture | P50 E2E | P90 E2E | P99 E2E |
| --- | --- | --- | --- |
| Pipeline (GPT-4o-mini) | 1800ms | 2600ms | 4200ms |
| Pipeline (GPT-4o) | 2400ms | 3500ms | 5800ms |
| OpenAI Realtime | 1200ms | 1800ms | 2800ms |
| Gemini Live | 1000ms | 1500ms | 2400ms |

Benchmarks are snapshots, not guarantees

These numbers reflect conditions at the time of testing. Provider latency changes with load, model updates, and infrastructure changes. Your results will vary based on region, network conditions, prompt complexity, and response length. Use these as ballpark comparisons, not absolute values. Run your own benchmarks with your own prompts.

Audio quality (MOS)

| Architecture | Mean MOS | Voice naturalness | Emotion range |
| --- | --- | --- | --- |
| Pipeline (Cartesia Sonic) | 4.3 | Excellent | Wide, configurable |
| Pipeline (ElevenLabs) | 4.5 | Excellent | Wide, configurable |
| OpenAI Realtime | 3.9 | Good | Moderate, fixed per voice |
| Gemini Live | 3.8 | Good | Moderate, fixed per voice |

What's happening

Pipeline agents consistently score higher on audio quality because dedicated TTS providers optimize exclusively for speech synthesis. Realtime models split their capacity between understanding, reasoning, and speaking — they produce natural-sounding audio, but purpose-built TTS models currently produce better audio. This gap is narrowing with each model generation.

Building a benchmarking harness

To produce consistent, repeatable measurements, build a harness that sends the same prompts to each architecture and records timing data:

benchmark_harness.py
import time
import statistics
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    prompt: str
    ttfb_ms: float
    e2e_ms: float
    transcript: str


BENCHMARK_PROMPTS = [
    "What is your return policy?",
    "I need help with order number 12345.",
    "Can you explain the difference between your Basic and Pro plans?",
    "My package hasn't arrived yet. It was supposed to be here yesterday.",
    "Yes.",
    "Thank you, that's all I needed.",
]


def summarize_results(results: list[BenchmarkResult]) -> dict:
    ttfb_values = sorted(r.ttfb_ms for r in results)
    e2e_values = sorted(r.e2e_ms for r in results)

    # Nearest-rank approximation of the 90th percentile; adequate
    # for benchmark suites of 50+ prompts.
    def p90(values: list[float]) -> float:
        return values[int(len(values) * 0.9)]

    return {
        "ttfb_p50": statistics.median(ttfb_values),
        "ttfb_p90": p90(ttfb_values),
        "ttfb_mean": statistics.mean(ttfb_values),
        "e2e_p50": statistics.median(e2e_values),
        "e2e_p90": p90(e2e_values),
        "e2e_mean": statistics.mean(e2e_values),
    }


# Usage: run each prompt through your agent and record timing
# results = []
# for prompt in BENCHMARK_PROMPTS:
#     start = time.monotonic()
#     ttfb, transcript = await run_agent_prompt(agent, prompt)
#     e2e = (time.monotonic() - start) * 1000
#     results.append(BenchmarkResult(prompt, ttfb, e2e, transcript))
# print(summarize_results(results))
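Before wiring in a real agent, you can smoke-test the loop with a simulated agent. Everything below is synthetic and illustrative (`fake_agent` stands in for whatever function drives your actual agent):

```python
import asyncio
import random
import statistics
import time
from dataclasses import dataclass


# Mirrors the BenchmarkResult dataclass in the harness
@dataclass
class BenchmarkResult:
    prompt: str
    ttfb_ms: float
    e2e_ms: float
    transcript: str


async def fake_agent(prompt: str) -> tuple[float, str]:
    """Stand-in agent: returns a synthetic (ttfb_ms, transcript) pair."""
    await asyncio.sleep(0.01)               # simulated time to first audio
    ttfb_ms = random.uniform(300.0, 700.0)  # synthetic TTFB measurement
    await asyncio.sleep(0.02)               # simulated response streaming
    return ttfb_ms, prompt.lower()


async def run_suite(prompts: list[str]) -> list[BenchmarkResult]:
    results = []
    for prompt in prompts:
        start = time.monotonic()
        ttfb_ms, transcript = await fake_agent(prompt)
        e2e_ms = (time.monotonic() - start) * 1000
        results.append(BenchmarkResult(prompt, ttfb_ms, e2e_ms, transcript))
    return results


results = asyncio.run(run_suite(["Yes.", "What is your return policy?"]))
p50_ttfb = statistics.median(r.ttfb_ms for r in results)
```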

Test with varied prompt lengths

Include short prompts ("Yes"), medium prompts (a single question), and long prompts (a multi-sentence request) in your benchmark suite. Latency profiles differ significantly by prompt length — realtime models tend to have a more consistent TTFB regardless of prompt length, while pipeline TTFB can vary more because STT processing time scales with audio duration.
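One way to see this effect in your own data is to bucket results by prompt word count and compare mean TTFB per bucket. A small sketch (the `ttfb_by_prompt_length` helper and its 3/12-word thresholds are illustrative choices, not standard values):

```python
from statistics import mean


def ttfb_by_prompt_length(results: list[tuple[str, float]]) -> dict[str, float]:
    """Group (prompt, ttfb_ms) pairs into rough length buckets by word count."""
    buckets: dict[str, list[float]] = {"short": [], "medium": [], "long": []}
    for prompt, ttfb_ms in results:
        words = len(prompt.split())
        if words <= 3:
            key = "short"
        elif words <= 12:
            key = "medium"
        else:
            key = "long"
        buckets[key].append(ttfb_ms)
    # Mean TTFB per bucket, skipping empty buckets
    return {k: round(mean(v), 1) for k, v in buckets.items() if v}
```

If pipeline TTFB climbs sharply in the "long" bucket while realtime TTFB stays flat, you are seeing STT processing time scale with audio duration.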

Cost comparison

Latency is not the only number that matters. Here is an approximate cost comparison for a 10-minute conversation with moderate back-and-forth:

| Architecture | Approximate cost per 10-min conversation |
| --- | --- |
| Pipeline (Deepgram + GPT-4o-mini + Cartesia) | $0.08 - $0.15 |
| Pipeline (Deepgram + GPT-4o + Cartesia) | $0.20 - $0.40 |
| OpenAI Realtime (gpt-4o-realtime-preview) | $0.30 - $0.60 |
| Gemini Live (gemini-2.0-flash) | $0.10 - $0.25 |

What's happening

Pipeline agents with smaller LLMs (GPT-4o-mini, Claude Haiku) are the most cost-effective option. Realtime models are generally more expensive because you pay a premium for the integrated experience. Gemini Flash is the exception — Google's aggressive pricing makes it competitive with pipeline costs while delivering realtime latency. Cost structures change frequently, so check current pricing before making production decisions.
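Per-conversation ranges only become meaningful at volume. A quick projection helper (the `monthly_cost_range` function is an illustrative sketch; plug in current provider pricing rather than the snapshot ranges above):

```python
def monthly_cost_range(
    low_per_conv: float,
    high_per_conv: float,
    conversations_per_day: int,
    days: int = 30,
) -> tuple[float, float]:
    """Project a monthly cost band from a per-conversation cost range."""
    return (
        round(low_per_conv * conversations_per_day * days, 2),
        round(high_per_conv * conversations_per_day * days, 2),
    )


# Pipeline (GPT-4o-mini) at 100 conversations/day, using the table's range
low, high = monthly_cost_range(0.08, 0.15, 100)  # $240 - $450 per month
```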

Test your knowledge


Why is TTFB (Time-to-First-Byte) considered the most important metric for voice AI user experience, even more than audio quality?

What you learned

  • TTFB is the most impactful metric for user experience in voice AI
  • Realtime models (especially Gemini Flash) consistently deliver lower TTFB than pipelines
  • Pipeline agents produce higher audio quality thanks to dedicated TTS providers
  • Cost varies significantly — pipeline with a small LLM is cheapest, OpenAI Realtime is most expensive
  • Repeatable benchmarking requires consistent prompts, controlled conditions, and percentile-based reporting

Next up

You have the data. In the final chapter, you will synthesize everything into a decision framework — a structured approach for choosing between pipeline, realtime, and hybrid architectures based on your specific requirements, constraints, and priorities.

Concepts covered

  • Latency metrics
  • Quality metrics
  • Benchmarking