Cost analysis & use case recommendations
You know the providers and their tradeoffs. Now it is time to put numbers behind those decisions. This chapter models the real cost of voice AI conversations across different stacks, then gives concrete recommendations for four common use cases, each with an exact LiveKit configuration you can copy into your project. All examples use LiveKit Inference where possible for the simplest setup, with plugin alternatives noted.
What you will learn
- How costs break down across STT, LLM, and TTS for a typical 5-minute conversation
- How to compare total cost across four stack profiles: Budget, Balanced, Premium, and Enterprise
- When realtime mode is cheaper or more expensive than pipeline mode
- Specific stack recommendations with LiveKit plugin code for common use cases
- Cost optimization strategies that do not sacrifice quality
Anatomy of a voice AI conversation cost
Here is a typical 5-minute customer service conversation broken into billable units:
| Metric | Typical value | Notes |
|---|---|---|
| User speaking time | ~2 minutes | Questions, clarifications |
| Agent speaking time | ~3 minutes | Responses, confirmations |
| LLM input tokens | ~2,000 tokens | System prompt + conversation history |
| LLM output tokens | ~600 tokens | Agent responses across all turns |
| TTS characters | ~2,500 chars | Agent response text |
These numbers represent a typical conversation. Your volumes will vary based on user talkativeness, agent verbosity, and system prompt length. Use these as a starting point and adjust based on real data from your application.
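These line items can be wrapped in a small cost model. The sketch below plugs in the Budget-stack rates used later in this chapter; any other stack's rates can be substituted the same way:

```python
def call_cost(user_min, llm_in_tokens, llm_out_tokens, tts_chars, rates):
    """Estimate the per-call cost of a pipeline voice agent.

    rates: STT in $/min, LLM in $/1M tokens (input and output), TTS in $/1K chars.
    """
    return (
        user_min * rates["stt_per_min"]
        + llm_in_tokens / 1_000_000 * rates["llm_in_per_1m"]
        + llm_out_tokens / 1_000_000 * rates["llm_out_per_1m"]
        + tts_chars / 1_000 * rates["tts_per_1k"]
    )

# Budget-stack rates (Deepgram + GPT-4.1 mini + Cartesia)
budget = {"stt_per_min": 0.006, "llm_in_per_1m": 0.15,
          "llm_out_per_1m": 0.60, "tts_per_1k": 0.007}

# The typical 5-minute call from the table above
total = call_cost(2, 2_000, 600, 2_500, budget)
print(f"${total:.4f}")  # ≈ $0.0302 per call
```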
Four stack profiles
Budget: Deepgram + GPT-4.1 mini + Cartesia
The lowest-cost pipeline that still delivers a good experience. Ideal for high-volume, straightforward interactions. All three models are available via LiveKit Inference.
| Component | Unit cost | Volume | Cost |
|---|---|---|---|
| STT (Deepgram) | $0.006/min | 2 min | $0.012 |
| LLM input (GPT-4.1 mini) | $0.15/1M tokens | 2,000 tokens | $0.0003 |
| LLM output (GPT-4.1 mini) | $0.60/1M tokens | 600 tokens | $0.0004 |
| TTS (Cartesia) | $0.007/1K chars | 2,500 chars | $0.018 |
| Total | | | ~$0.03 |
```python
from livekit.agents import AgentSession, Agent, RoomInputOptions, inference

# Via LiveKit Inference — no per-provider API keys needed
session = AgentSession(
    stt=inference.STT(model="deepgram/nova-3", language="en"),
    llm=inference.LLM(model="openai/gpt-4.1-mini"),
    tts=inference.TTS(model="cartesia/sonic-3", voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"),
)

await session.start(
    agent=Agent(instructions="You are a helpful customer service agent. Be concise."),
    room=ctx.room,
    room_input_options=RoomInputOptions(),
)
```
Balanced: Deepgram + GPT-4.1 + Cartesia
Strong reasoning with the fastest TTS. Good for agents that need reliable function calling and tool use. All available via LiveKit Inference.
| Component | Unit cost | Volume | Cost |
|---|---|---|---|
| STT (Deepgram) | $0.006/min | 2 min | $0.012 |
| LLM input (GPT-4.1) | $2.50/1M tokens | 2,000 tokens | $0.005 |
| LLM output (GPT-4.1) | $10.00/1M tokens | 600 tokens | $0.006 |
| TTS (Cartesia) | $0.007/1K chars | 2,500 chars | $0.018 |
| Total | | | ~$0.04 |
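The Balanced stack uses the same Inference setup as the Budget configuration, with only the LLM model string changed; a sketch:

```python
from livekit.agents import AgentSession, Agent, RoomInputOptions, inference

# All three models via LiveKit Inference; only the LLM differs from Budget
session = AgentSession(
    stt=inference.STT(model="deepgram/nova-3", language="en"),
    llm=inference.LLM(model="openai/gpt-4.1"),
    tts=inference.TTS(model="cartesia/sonic-3", voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"),
)

await session.start(
    agent=Agent(instructions="You are a helpful customer service agent. Be concise."),
    room=ctx.room,
    room_input_options=RoomInputOptions(),
)
```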
Premium: Deepgram + Claude Sonnet + ElevenLabs
Best reasoning with the highest voice quality. Ideal for sensitive domains and brand-critical experiences. STT and TTS are available via Inference; Claude requires the Anthropic plugin.
| Component | Unit cost | Volume | Cost |
|---|---|---|---|
| STT (Deepgram) | $0.006/min | 2 min | $0.012 |
| LLM input (Claude Sonnet) | $3.00/1M tokens | 2,000 tokens | $0.006 |
| LLM output (Claude Sonnet) | $15.00/1M tokens | 600 tokens | $0.009 |
| TTS (ElevenLabs) | $0.018/1K chars | 2,500 chars | $0.045 |
| Total | | | ~$0.07 |
```python
from livekit.agents import AgentSession, Agent, RoomInputOptions, inference
from livekit.plugins import anthropic

# Mix Inference (STT, TTS) with plugin (LLM) as needed
session = AgentSession(
    stt=inference.STT(model="deepgram/nova-3", language="en"),
    llm=anthropic.LLM(model="claude-sonnet-4-20250514"),
    tts=inference.TTS(model="elevenlabs/eleven_flash_v2_5", voice="your-voice-id"),
)

await session.start(
    agent=Agent(instructions="You are a medical office assistant. Never provide medical advice."),
    room=ctx.room,
    room_input_options=RoomInputOptions(),
)
```
Enterprise: Azure STT + GPT-4.1 + Azure TTS
Compliance-first stack for regulated industries. HIPAA, SOC 2, and GDPR certifications across STT and TTS. Azure STT/TTS require the Azure plugin for private endpoints and compliance features; the LLM can use Inference.
| Component | Unit cost | Volume | Cost |
|---|---|---|---|
| STT (Azure) | $0.006-0.017/min | 2 min | $0.012-0.034 |
| LLM input (GPT-4.1) | $2.50/1M tokens | 2,000 tokens | $0.005 |
| LLM output (GPT-4.1) | $10.00/1M tokens | 600 tokens | $0.006 |
| TTS (Azure) | $0.004-0.016/1K chars | 2,500 chars | $0.010-0.040 |
| Total | | | ~$0.03-0.09 |
```python
from livekit.agents import AgentSession, Agent, RoomInputOptions, inference
from livekit.plugins import azure

# Azure plugin for compliance; Inference for LLM
session = AgentSession(
    stt=azure.STT(speech_key="your-key", speech_region="eastus", language="en-US"),
    llm=inference.LLM(model="openai/gpt-4.1"),
    tts=azure.TTS(speech_key="your-key", speech_region="eastus", voice="en-US-JennyNeural"),
)

await session.start(
    agent=Agent(instructions="You are a compliant enterprise assistant."),
    room=ctx.room,
    room_input_options=RoomInputOptions(),
)
```
Cost per minute summary
| Stack | Cost per 5-min call | Cost per minute | Monthly cost (1K calls/day) |
|---|---|---|---|
| Budget (GPT-4.1 mini + Cartesia) | ~$0.03 | ~$0.006 | ~$900 |
| Balanced (GPT-4.1 + Cartesia) | ~$0.04 | ~$0.008 | ~$1,200 |
| Premium (Claude + ElevenLabs) | ~$0.07 | ~$0.014 | ~$2,100 |
| Enterprise (Azure + GPT-4.1 + Azure) | ~$0.03-0.09 | ~$0.006-0.018 | ~$900-2,700 |
LLM is not always the biggest cost
A common misconception is that the LLM dominates voice AI costs. For short conversations with compact prompts, STT and TTS costs are often comparable to or greater than LLM costs. The LLM becomes dominant only with long conversations or verbose system prompts.
Pipeline vs realtime: cost comparison
Realtime models (OpenAI Realtime API, Gemini Live) charge per minute of audio rather than per token. Here is how they compare:
| Mode | Cost model | 5-min call cost | Notes |
|---|---|---|---|
| Pipeline (Budget) | STT + LLM tokens + TTS | ~$0.03 | Cheapest option |
| Pipeline (Premium) | STT + LLM tokens + TTS | ~$0.07 | Most flexible |
| OpenAI Realtime | ~$0.06/min audio + tokens | ~$0.30-0.50 | Significantly more expensive |
| Gemini Live | Per-minute audio pricing | ~$0.05-0.15 | More competitive pricing |
Realtime mode costs more
OpenAI's Realtime API is significantly more expensive than an equivalent pipeline setup — often 5-10x more per conversation. The lower latency comes at a steep price premium. Gemini Live is more competitive but still typically costs more than a well-tuned pipeline. Factor this into your decision.
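The 5-10x figure follows directly from the per-minute audio rate. A quick check using the approximate prices from the table above (actual Realtime billing adds token charges on top):

```python
PIPELINE_BUDGET_PER_CALL = 0.03   # Budget pipeline, 5-min call
REALTIME_AUDIO_PER_MIN = 0.06     # approximate OpenAI Realtime audio rate

# Audio charges alone, before any token costs
realtime_call = 5 * REALTIME_AUDIO_PER_MIN
print(f"~{realtime_call / PIPELINE_BUDGET_PER_CALL:.0f}x the pipeline cost")
```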
Cost optimization strategies
Shorten your system prompt
Your system prompt is resent with every LLM turn, so a 2,000-token prompt can cost as much per turn as the entire conversation history. Cut it to 500 tokens and your LLM input costs drop on every turn.
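A back-of-envelope calculation at GPT-4.1 input rates, assuming a 10-turn conversation, shows the compounding effect:

```python
GPT41_INPUT = 2.50 / 1_000_000  # $ per input token (GPT-4.1)

def prompt_cost_per_call(prompt_tokens: int, turns: int) -> float:
    # The system prompt is resent on every LLM turn
    return prompt_tokens * turns * GPT41_INPUT

long_prompt = prompt_cost_per_call(2_000, 10)
short_prompt = prompt_cost_per_call(500, 10)
print(f"${long_prompt:.4f} vs ${short_prompt:.4f}")  # $0.0500 vs $0.0125
```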
Use model tiering
Route simple queries (greetings, FAQs, straightforward tool calls) to GPT-4.1 mini and only escalate complex queries to GPT-4.1 or Claude. With LiveKit Inference, switching models is as simple as changing the model string. This can cut LLM costs by 50-70%.
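A minimal routing sketch; the intent labels and the heuristic itself are hypothetical, and a real application would derive them from an intent classifier or conversation state:

```python
# Hypothetical intent labels for illustration only
SIMPLE_INTENTS = {"greeting", "faq", "business_hours", "order_status"}

def pick_model(intent: str) -> str:
    """Route routine turns to GPT-4.1 mini; escalate everything else."""
    if intent in SIMPLE_INTENTS:
        return "openai/gpt-4.1-mini"
    return "openai/gpt-4.1"

# The returned string drops straight into inference.LLM(model=...)
print(pick_model("greeting"))        # openai/gpt-4.1-mini
print(pick_model("refund_dispute"))  # openai/gpt-4.1
```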
Keep responses concise
Instruct your agent to respond in 2-3 sentences. A 50-word response costs 66% less in TTS than a 150-word response and is a better voice experience.
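A rough check of that claim, assuming an average of six characters per word (including spaces) and ElevenLabs pricing:

```python
TTS_PER_1K_CHARS = 0.018  # ElevenLabs, $/1K characters
CHARS_PER_WORD = 6        # rough average, including spaces

def tts_cost(words: int) -> float:
    return words * CHARS_PER_WORD / 1_000 * TTS_PER_1K_CHARS

saving = 1 - tts_cost(50) / tts_cost(150)
print(f"{saving:.0%} cheaper")  # 67% cheaper
```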
Monitor cost per conversation
Track STT, LLM, and TTS costs separately per conversation. You cannot optimize what you do not measure. Look at cost per resolution, not just cost per minute.
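A minimal per-conversation tracker might look like this; the figures plugged in are the Budget-stack numbers from earlier in the chapter, and in practice you would accumulate them from each component's usage events:

```python
from dataclasses import dataclass

@dataclass
class ConversationCost:
    stt: float = 0.0
    llm: float = 0.0
    tts: float = 0.0

    @property
    def total(self) -> float:
        return self.stt + self.llm + self.tts

cost = ConversationCost()
cost.stt += 2 * 0.006        # 2 min of Deepgram STT
cost.llm += 0.0003 + 0.0004  # GPT-4.1 mini input + output
cost.tts += 2.5 * 0.007      # 2,500 Cartesia characters
print(f"${cost.total:.4f}")  # ≈ $0.0302
```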
Use case recommendations
Customer service
High volume, straightforward interactions. Latency and reliability matter most. Fully available via LiveKit Inference.
| Component | Choice | Why |
|---|---|---|
| STT | Deepgram Nova-3 (Inference) | Fastest streaming, best endpointing |
| LLM | GPT-4.1 mini (Inference), GPT-4.1 for complex | Fast and cheap with escalation path |
| TTS | Cartesia Sonic 3 (Inference) | Lowest latency, cost-effective |
Expected: 300-500ms latency, ~$0.03 per 5-min call.
Healthcare
Sensitive information, accuracy and safety paramount, compliance required.
| Component | Choice | Why |
|---|---|---|
| STT | Deepgram Nova-3 (Inference) or Azure Speech (plugin) | High accuracy; Azure for HIPAA |
| LLM | Claude Sonnet (Anthropic plugin) | Best safety, excellent instruction following |
| TTS | ElevenLabs (Inference) or Azure Neural TTS (plugin) | Natural trust-building voice; Azure for compliance |
Expected: 400-700ms latency, ~$0.07 per 5-min call.
Education
Patient tutoring, clear explanations, engaging voice, long sessions.
| Component | Choice | Why |
|---|---|---|
| STT | Deepgram Nova-3 (Inference) | Fast, good with diverse accents |
| LLM | GPT-4.1 (Inference) | Strong reasoning for explanations |
| TTS | ElevenLabs (Inference) | Natural, engaging voice for learning |
Expected: 400-700ms latency, ~$0.07 per 5-min session.
Entertainment
Character voices, personality, emotional range, immersion.
| Component | Choice | Why |
|---|---|---|
| STT | Deepgram Nova-3 (Inference) | Fast, reliable |
| LLM | GPT-4.1 (Inference) | Creative, good at staying in character |
| TTS | ElevenLabs (plugin) or Inworld (Inference) | Best voice quality/cloning; Inworld for gaming characters |
Expected: 400-700ms latency, ~$0.07 per 5-min session.
Quick reference: all stacks
| Use case | STT | LLM | TTS | Latency | Cost/5min |
|---|---|---|---|---|---|
| Budget | Deepgram Nova-3 | GPT-4.1 mini | Cartesia Sonic 3 | 300-500ms | ~$0.03 |
| Balanced | Deepgram Nova-3 | GPT-4.1 | Cartesia Sonic 3 | 350-600ms | ~$0.04 |
| Premium | Deepgram Nova-3 | Claude Sonnet | ElevenLabs | 400-700ms | ~$0.07 |
| Enterprise | Azure Speech | GPT-4.1 | Azure Neural TTS | 400-700ms | ~$0.03-0.09 |
| Multilingual | ElevenLabs Scribe | Gemini 2.5 Flash | Cartesia Sonic 3 | 400-700ms | ~$0.05 |
| Lowest cost | Deepgram Nova-3 | Gemini 2.5 Flash | Deepgram Aura 2 | 350-600ms | ~$0.02 |
When in doubt, start here
If you are not sure which stack to choose, go with Deepgram Nova-3 + GPT-4.1 mini + Cartesia Sonic 3 via LiveKit Inference. It is the fastest, cheapest, and most forgiving combination — and requires zero per-provider API keys. You can upgrade any single component later by changing one model string.
What you learned
- A typical 5-minute voice AI call costs $0.02-0.07 depending on your stack
- The Budget stack (Deepgram + GPT-4.1 mini + Cartesia via LiveKit Inference) delivers the best cost-to-quality ratio for most use cases
- LiveKit Inference simplifies cost management with a single billing relationship across all providers
- Realtime mode (OpenAI Realtime API) costs 5-10x more than pipeline mode
- System prompt length, model tiering, and response conciseness are the biggest cost levers
- Choose your stack based on your top priority: cost, quality, compliance, or latency
What to explore next
Now that you understand the AI stack, consider exploring: Voice AI Foundations to build your first agent, Pipeline Nodes to customize STT/LLM/TTS behavior at a deeper level, or Realtime vs Pipeline for an in-depth comparison of speech-to-speech models.