Embedded voice architecture
Voice AI is not confined to browsers and phone lines. Smart speakers, drive-through kiosks, warehouse robots, and wearable badges all run on microcontrollers talking to LiveKit Cloud over WiFi. This chapter covers the full architecture — from wiring an I2S microphone to an ESP32, through Opus encoding, to bidirectional audio streaming with a LiveKit Room.
What you'll build
By the end of this chapter you will have an ESP32-S3 connected to LiveKit Cloud, streaming bidirectional audio with a voice agent. You will understand the hardware wiring, I2S configuration, Opus codec tuning, and buffer management that make reliable voice on a microcontroller possible.
Why ESP32-S3
The ESP32-S3 is a dual-core Xtensa LX7 at 240 MHz with native I2S peripherals, WiFi, and optional 8 MB PSRAM — all for under $5. The I2S bus lets you wire a MEMS microphone and Class-D amplifier directly to the chip with no external audio codec. WiFi gives you a direct path to LiveKit Cloud. Dual cores let you dedicate one to audio capture while the other handles networking.
| Spec | ESP32-S3 |
|---|---|
| CPU | Dual-core Xtensa LX7 @ 240 MHz |
| SRAM | 512 KB |
| PSRAM | Up to 8 MB (optional) |
| WiFi | 802.11 b/g/n, 2.4 GHz |
| Audio | I2S (2 independent peripherals) |
| Power | ~100 mA active, ~10 uA deep sleep |
| Cost | ~$3-5 USD in volume |
The three constraints
Every design decision on a microcontroller is governed by:
Memory. 512 KB SRAM shared between audio buffers, WiFi stack, TLS, and your code. With PSRAM you get 8 MB of headroom. Without it, every allocation is deliberate.
Bandwidth. Real-world WiFi on congested 2.4 GHz is 2-5 Mbps. Opus at 16 kbps is well within budget, but you cannot afford uncompressed audio or chatty protocols.
Power. Battery devices need wake word detection at ~30 mA, WiFi streaming at ~100 mA, and deep sleep at ~10 uA. Your architecture determines battery life.
Architecture: ESP32 → LiveKit Cloud → Agent
Audio capture
The INMP441 MEMS microphone captures 16-bit PCM at 16 kHz over the I2S bus.
Encode and transmit
The ESP32 encodes audio with Opus, connects to a LiveKit Room over WebRTC, and publishes an audio Track.
Cloud processing
LiveKit routes the audio Track to your Agent. The Agent runs STT → LLM → TTS and publishes a response audio Track.
Playback
The ESP32 subscribes to the Agent's audio Track, decodes Opus, and sends PCM to the MAX98357A I2S amplifier.
The ESP32 is a translator at the network edge. It captures audio, compresses it into a tiny stream, ships it to the cloud where the intelligence lives, and plays back the response. All heavy processing (STT, LLM, TTS) happens on your LiveKit agent in the cloud.
Hardware wiring
The INMP441 microphone and MAX98357A amplifier connect directly to ESP32 GPIO pins via I2S.
| INMP441 Pin | ESP32-S3 GPIO | Description |
|---|---|---|
| SCK | GPIO 14 | I2S bit clock |
| WS | GPIO 15 | I2S word select |
| SD | GPIO 32 | I2S data out (mic → ESP32) |
| VDD | 3.3V | Power |
| GND | GND | Ground |
| L/R | GND | Channel select (low = left) |
| MAX98357A Pin | ESP32-S3 GPIO | Description |
|---|---|---|
| BCLK | GPIO 26 | I2S bit clock |
| LRC | GPIO 25 | I2S word select |
| DIN | GPIO 22 | I2S data in (ESP32 → speaker) |
| VIN | 5V | Power |
| GND | GND | Ground |
The ESP32-S3 has two independent I2S peripherals — run the microphone on I2S_NUM_0 and the speaker on I2S_NUM_1 for full-duplex audio without conflicts.
I2S configuration
Voice AI needs 16 kHz sample rate, 16-bit depth, mono. Higher rates waste bandwidth without improving STT quality.
#include <driver/i2s.h>

void configureI2SMicrophone() {
    i2s_config_t mic_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
        .sample_rate = 16000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 4,   // ~256ms buffer for mic
        .dma_buf_len = 1024,
        .use_apll = false,
    };
    i2s_pin_config_t mic_pins = {
        .bck_io_num = 14,
        .ws_io_num = 15,
        .data_out_num = I2S_PIN_NO_CHANGE,
        .data_in_num = 32,
    };
    i2s_driver_install(I2S_NUM_0, &mic_config, 0, NULL);
    i2s_set_pin(I2S_NUM_0, &mic_pins);
}

void configureI2SSpeaker() {
    i2s_config_t spk_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_TX),
        .sample_rate = 16000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 8,   // More buffers to absorb network jitter
        .dma_buf_len = 1024,
        .tx_desc_auto_clear = true,
    };
    i2s_pin_config_t spk_pins = {
        .bck_io_num = 26,
        .ws_io_num = 25,
        .data_out_num = 22,
        .data_in_num = I2S_PIN_NO_CHANGE,
    };
    i2s_driver_install(I2S_NUM_1, &spk_config, 0, NULL);
    i2s_set_pin(I2S_NUM_1, &spk_pins);
}

DMA buffer tuning
The speaker uses 8 DMA buffers vs 4 for the mic because playback audio arrives over the network with variable timing (jitter). More buffers absorb timing variations. The mic captures locally with predictable timing so fewer buffers suffice.
Opus codec configuration
Raw 16-bit PCM at 16 kHz is 256 kbps. Opus compresses it to 16-24 kbps with excellent speech quality. The LiveKit ESP32 SDK handles encoding, but you tune the parameters:
// Configure Opus encoding
lk.setOpusBitrate(16000); // 16 kbps — good quality for speech
lk.setOpusFrameSize(20); // 20ms frames — standard for voice
lk.setOpusComplexity(5); // 0-10, lower = less CPU
// Battery device: reduce CPU at slight quality cost
// lk.setOpusComplexity(2); // Saves ~15% CPU

| Bitrate | Quality | Use case |
|---|---|---|
| 12 kbps | Acceptable | Battery-constrained wearable |
| 16 kbps | Good | Standard voice device |
| 24 kbps | Excellent | Kiosk with reliable WiFi |
Buffer management and latency
Audio flows through several buffers, each adding latency:
| Buffer | Latency | Purpose |
|---|---|---|
| I2S DMA (mic) | ~20-60ms | Capture read granularity (4 × 1024-sample buffers = 256ms total capacity) |
| Opus frame | 20ms | Encode one frame |
| Network send | ~20-40ms | WebRTC packet queue |
| Jitter buffer (playback) | 40-80ms | Smooth network timing |
| Total one-way | ~100-200ms | Device audio latency |
Combined with cloud processing (STT + LLM + TTS), expect 400-700ms end-to-end for a full voice interaction.
Monitor your heap
Combined I2S buffers, Opus encoder, and network stack consume ~300 KB of 512 KB SRAM on boards without PSRAM. Monitor with ESP.getFreeHeap() and alert below 50 KB.
Complete connection example
#include <WiFi.h>
#include <LiveKitClient.h>
#include <driver/i2s.h>
const char* ssid = "your-wifi";
const char* password = "your-password";
const char* lk_url = "wss://your-project.livekit.cloud";
LiveKitClient lk;
void setup() {
    Serial.begin(115200);

    configureI2SMicrophone();
    configureI2SSpeaker();

    WiFi.begin(ssid, password);
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("\nWiFi connected");

    // Connect to LiveKit room
    // Generate token with: lk token create --api-key KEY --api-secret SECRET
    //   --join --room esp32-room --identity esp32-device
    lk.begin(lk_url, getToken());
    lk.setAudioInput(I2S_NUM_0);    // Microphone
    lk.setAudioOutput(I2S_NUM_1);   // Speaker
    Serial.println("LiveKit connected — listening");
}

void loop() {
    lk.update();  // Process bidirectional audio
}

The lk.update() call handles everything: reading I2S samples, Opus encoding, WebRTC transmission, receiving agent audio, decoding, and writing to the speaker. Deploy a simple agent and speak into the microphone to verify the full round-trip.
The agent side
Your cloud agent connects to the same LiveKit Room. It receives the ESP32's audio track and responds through its own audio track:
from livekit.agents import Agent, AgentSession, RoomInputOptions
from livekit.plugins import deepgram, openai, cartesia

class DeviceAssistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a voice assistant running on a physical device. "
                "Keep responses short and conversational — the user is "
                "speaking to a small speaker, not reading a screen."
            ),
        )

async def entrypoint(ctx):
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),  # Fast + cheap for device use
        tts=cartesia.TTS(model="sonic", voice="friendly-assistant"),
    )
    await session.start(
        agent=DeviceAssistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(),
    )
What you learned
- The ESP32-S3 connects to LiveKit Cloud as a room participant, publishing and subscribing to audio tracks over WebRTC
- I2S configuration for voice: 16 kHz, 16-bit, mono, with asymmetric DMA buffers for mic vs speaker
- Opus codec tuning trades bitrate, CPU, and quality — 16 kbps is the sweet spot for speech
- The full latency budget is roughly 100-200ms device-side plus cloud processing time
- PSRAM is strongly recommended to avoid memory pressure from combined audio/network buffers
Next up
In the next chapter, you will add wake word detection so the device only connects to LiveKit Cloud when the user actually speaks — saving bandwidth and battery.