Chapter 9 · 15m

Cost optimization

Optimizing voice AI costs

A voice agent that works perfectly but costs three dollars per conversation will not survive its first invoice review. The three biggest cost drivers are LLM tokens, STT minutes, and TTS characters. In this chapter, you will learn to track, analyze, and reduce each one without sacrificing conversation quality.
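For a rough sense of scale, here is a back-of-the-envelope estimate for a three-minute call. The token and character counts are illustrative assumptions; the rates match the reference table later in this chapter.

```python
# Illustrative three-minute call; token/character counts are assumptions
llm_cost = 5_000 * 2.50 / 1_000_000 + 1_000 * 10.00 / 1_000_000  # GPT-4o in/out
stt_cost = 3 * 0.006                    # 3 minutes of STT at $0.006/min
tts_cost = 2_000 * 15.00 / 1_000_000    # 2,000 TTS characters at $15/M
total = llm_cost + stt_cost + tts_cost
print(f"${total:.4f}")  # roughly $0.07 per call
```

At thousands of calls per day, shaving even a cent off that figure compounds quickly.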

Token usage · Model selection · Caching

What you'll learn

  • How to track per-session costs across all AI providers
  • How to use model tiering to route simple queries to cheaper models
  • How to implement caching strategies that cut repeated work
  • How different provider choices compare on cost

Tracking per-session costs

You cannot optimize what you do not measure. Instrument your agent to emit cost metrics for every session.

agent.py (Python)
from dataclasses import dataclass

from livekit.agents import Agent


@dataclass
class SessionCosts:
    llm_input_tokens: int = 0
    llm_output_tokens: int = 0
    stt_audio_seconds: float = 0.0
    tts_characters: int = 0

    @property
    def estimated_cost_usd(self) -> float:
        # Adjust rates to match your provider pricing
        llm_input = self.llm_input_tokens * 0.15 / 1_000_000    # $0.15/M input
        llm_output = self.llm_output_tokens * 0.60 / 1_000_000  # $0.60/M output
        stt = self.stt_audio_seconds * 0.006 / 60               # $0.006/min
        tts = self.tts_characters * 15.00 / 1_000_000           # $15/M chars
        return llm_input + llm_output + stt + tts


class CostTrackedAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant.",
        )
        self.costs = SessionCosts()

    async def llm_node(self, chat_ctx, tools, model_settings):
        async for chunk in Agent.default.llm_node(self, chat_ctx, tools, model_settings):
            # Usage typically arrives on the final chunk of the stream
            if getattr(chunk, "usage", None):
                self.costs.llm_input_tokens += chunk.usage.prompt_tokens or 0
                self.costs.llm_output_tokens += chunk.usage.completion_tokens or 0
            yield chunk

    async def on_exit(self):
        print(f"Session cost: ${self.costs.estimated_cost_usd:.4f}")
        # Send to your analytics pipeline (report_cost is your own function)
        await report_cost(self.costs)
agent.ts (TypeScript)
import { Agent } from "@livekit/agents";

interface SessionCosts {
  llmInputTokens: number;
  llmOutputTokens: number;
  sttAudioSeconds: number;
  ttsCharacters: number;
}

function estimateCostUsd(costs: SessionCosts): number {
  // Adjust rates to match your provider pricing
  const llmInput = (costs.llmInputTokens * 0.15) / 1_000_000;
  const llmOutput = (costs.llmOutputTokens * 0.6) / 1_000_000;
  const stt = (costs.sttAudioSeconds * 0.006) / 60;
  const tts = (costs.ttsCharacters * 15.0) / 1_000_000;
  return llmInput + llmOutput + stt + tts;
}

class CostTrackedAgent extends Agent {
  private costs: SessionCosts = {
    llmInputTokens: 0,
    llmOutputTokens: 0,
    sttAudioSeconds: 0,
    ttsCharacters: 0,
  };

  constructor() {
    super({ instructions: "You are a helpful voice assistant." });
  }

  // Accumulate usage in your LLM node override, mirroring the Python example
}

Model tiering

Not every caller utterance needs your most powerful model. A simple "yes" or "what are your hours?" can be handled by a fast, cheap model. Reserve the expensive model for complex reasoning.

agent.py (Python)
from livekit.agents import Agent
from livekit.plugins import openai

FAST_LLM = openai.LLM(model="gpt-4o-mini")  # ~10x cheaper
POWERFUL_LLM = openai.LLM(model="gpt-4o")   # Higher quality

# Simple heuristic: short messages with common intents use the fast model
SIMPLE_PATTERNS = [
    "yes", "no", "thanks", "bye", "hello", "hi",
    "what are your hours", "where are you located",
    "can i speak to someone",
]


class TieredAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant.",
        )

    async def llm_node(self, chat_ctx, tools, model_settings):
        last_item = chat_ctx.items[-1]
        last_message = (getattr(last_item, "text_content", None) or "").lower().strip()
        # Conservative: route to the fast model only when the message is
        # both short and matches a known simple intent
        is_simple = (
            len(last_message.split()) < 8
            and any(p in last_message for p in SIMPLE_PATTERNS)
        )
        model = FAST_LLM if is_simple else POWERFUL_LLM
        async for chunk in model.chat(chat_ctx=chat_ctx, tools=tools):
            yield chunk

Test tiering thoroughly

A misrouted complex query to the fast model produces a bad answer the caller hears immediately. Start with conservative routing (only the most obvious simple patterns) and expand gradually based on quality metrics.
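One low-effort way to expand safely is to record every routing decision and watch the fast-model fraction as you add patterns. A minimal sketch; the `record_route` helper and its wiring are assumptions, not part of any LiveKit API:

```python
from collections import Counter

routing_stats: Counter = Counter()


def record_route(is_simple: bool) -> None:
    # Call this from your llm_node right after computing is_simple
    routing_stats["fast" if is_simple else "powerful"] += 1


def fast_route_fraction() -> float:
    # Fraction of turns served by the cheap model
    total = sum(routing_stats.values())
    return routing_stats["fast"] / total if total else 0.0
```

If the fraction jumps after a pattern change while quality metrics dip, you have found the culprit.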

Caching strategies

Many voice agents answer the same questions repeatedly. Caching avoids paying for the same LLM call twice.

1. Semantic response cache

Hash the last user message plus system instructions. If you have seen a sufficiently similar query before, return the cached response. Use an embedding similarity threshold (e.g., cosine similarity above 0.95) rather than exact string matching, since callers phrase the same question differently.
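A minimal in-memory sketch of the idea. The `embed` callable is a placeholder for your embedding provider, and a linear scan stands in for a real vector index (both assumptions):

```python
import math
from typing import Callable, Optional


class SemanticCache:
    """Response cache keyed on embedding similarity, not exact text.

    `embed` is any function mapping text to a fixed-length vector;
    plug in your embedding provider here (placeholder, not a real API).
    """

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []  # (embedding, response)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        # Return the first cached response above the similarity threshold
        for vec, response in self._entries:
            if self._cosine(q, vec) >= self.threshold:
                return response
        return None  # cache miss: call the LLM, then put() the result

    def put(self, query: str, response: str) -> None:
        self._entries.append((self.embed(query), response))
```

In production, replace the linear scan with a vector store and add TTL-based expiry so stale answers age out.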

2. TTS audio cache

The same response text produces the same audio. Cache synthesized audio keyed on the text content and voice ID. This is especially effective for greetings, hold messages, and common answers. TTS caching alone can reduce costs by 20-30% for repetitive workloads.

3. STT result deduplication

If your agent handles many simultaneous calls to the same IVR menu, callers often say the same things. While STT caching is harder (audio varies), you can cache the downstream processing of common transcription results.
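A sketch of that downstream deduplication: cache results (here, intent labels) keyed on a normalized transcript, so "Yes." and "yes" share one entry. The `classify` callable stands in for your own LLM-backed classifier (an assumption):

```python
import re
from typing import Callable

# Downstream results cached on a normalized transcript
_intent_cache: dict[str, str] = {}


def normalize_transcript(text: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace
    return " ".join(re.sub(r"[^a-z0-9 ]", "", text.lower()).split())


def classify_with_cache(transcript: str, classify: Callable[[str], str]) -> str:
    key = normalize_transcript(transcript)
    if key not in _intent_cache:
        _intent_cache[key] = classify(transcript)  # pay for this call once
    return _intent_cache[key]
```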

tts_cache.py (Python)
import hashlib
from typing import Optional

# Simple in-memory TTS cache — use Redis in production
_tts_cache: dict[str, bytes] = {}


def cache_key(text: str, voice_id: str) -> str:
  return hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()


def get_cached_audio(text: str, voice_id: str) -> Optional[bytes]:
  return _tts_cache.get(cache_key(text, voice_id))


def store_cached_audio(text: str, voice_id: str, audio: bytes) -> None:
  key = cache_key(text, voice_id)
  _tts_cache[key] = audio

Cost comparison reference

Use this table as a starting point when choosing providers. Prices change frequently, so verify against current provider pricing.

| Component | Provider | Cost | Notes |
| --- | --- | --- | --- |
| LLM | GPT-4o-mini | $0.15 / $0.60 per M tokens (in/out) | Good for simple routing |
| LLM | GPT-4o | $2.50 / $10.00 per M tokens (in/out) | Complex reasoning |
| STT | Deepgram Nova-3 | $0.0059 / min | Low-latency streaming |
| STT | Google Chirp | $0.016 / min | High accuracy |
| TTS | Cartesia Sonic | $0.015 / 1K chars | Fast, natural voice |
| TTS | ElevenLabs | $0.30 / 1K chars | Premium voice quality |
What's happening

Cost optimization is a continuous process, not a one-time setup. Track costs per session, set budgets and alerts when per-session costs exceed thresholds, and revisit your model tiering and caching strategies monthly as provider pricing and model capabilities evolve.
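A per-session budget check can be as simple as the sketch below; the threshold is illustrative, and `alert` is a placeholder for your own paging or metrics hook:

```python
SESSION_BUDGET_USD = 0.05  # illustrative threshold; tune to your economics


def check_budget(session_cost_usd: float, alert) -> bool:
    """Call `alert` (your paging/metrics hook) when a session exceeds
    the budget; returns True if the session was over budget."""
    if session_cost_usd > SESSION_BUDGET_USD:
        alert(
            f"Session cost ${session_cost_usd:.4f} exceeded "
            f"budget ${SESSION_BUDGET_USD:.2f}"
        )
        return True
    return False
```

Wire it into your session-end hook so every over-budget conversation surfaces while the logs are still fresh.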
