Embedded voice architecture
Voice AI is not confined to browsers and phone lines. Smart speakers, drive-through kiosks, warehouse robots, and wearable badges all run on microcontrollers talking to LiveKit Cloud over WiFi. This chapter covers the full architecture — from wiring an I2S microphone to an ESP32, through Opus encoding, to bidirectional audio streaming with a LiveKit Room.
What you'll build
By the end of this chapter you will have an ESP32-S3 connected to LiveKit Cloud, streaming bidirectional audio with a voice agent. You will understand the hardware wiring, I2S configuration, Opus codec tuning, and buffer management that make reliable voice on a microcontroller possible.
Why ESP32-S3
The ESP32-S3 is a dual-core Xtensa LX7 at 240 MHz with native I2S peripherals, WiFi, and optional 8 MB PSRAM — all for under $5. The I2S bus lets you wire a MEMS microphone and Class-D amplifier directly to the chip with no external audio codec. WiFi gives you a direct path to LiveKit Cloud. Dual cores let you dedicate one to audio capture while the other handles networking.
| Spec | ESP32-S3 |
|---|---|
| CPU | Dual-core Xtensa LX7 @ 240 MHz |
| SRAM | 512 KB |
| PSRAM | Up to 8 MB (optional) |
| WiFi | 802.11 b/g/n, 2.4 GHz |
| Audio | I2S (2 independent peripherals) |
| Power | ~100 mA active, ~10 uA deep sleep |
| Cost | ~$3-5 USD in volume |
The three constraints
Every design decision on a microcontroller is governed by:
Memory. 512 KB SRAM shared between audio buffers, WiFi stack, TLS, and your code. With PSRAM you get 8 MB of headroom. Without it, every allocation is deliberate.
Bandwidth. Real-world WiFi on congested 2.4 GHz is 2-5 Mbps. Opus at 16 kbps is well within budget, but you cannot afford uncompressed audio or chatty protocols.
Power. Battery devices need wake word detection at ~30 mA, WiFi streaming at ~100 mA, and deep sleep at ~10 uA. Your architecture determines battery life.
Architecture: ESP32 → LiveKit Cloud → Agent
Audio capture
The INMP441 MEMS microphone captures 16-bit PCM at 16 kHz over the I2S bus.
Encode and transmit
The ESP32 encodes audio with Opus, connects to a LiveKit Room over WebRTC, and publishes an audio Track.
Cloud processing
LiveKit routes the audio Track to your Agent. The Agent runs STT → LLM → TTS and publishes a response audio Track.
Playback
The ESP32 subscribes to the Agent's audio Track, decodes Opus, and sends PCM to the MAX98357A I2S amplifier.
The ESP32 is a translator at the network edge. It captures audio, compresses it into a tiny stream, ships it to the cloud where the intelligence lives, and plays back the response. All heavy processing (STT, LLM, TTS) happens on your LiveKit agent in the cloud.
Hardware wiring
The INMP441 microphone and MAX98357A amplifier connect directly to ESP32 GPIO pins via I2S.
| INMP441 Pin | ESP32-S3 GPIO | Description |
|---|---|---|
| SCK | GPIO 14 | I2S bit clock |
| WS | GPIO 15 | I2S word select |
| SD | GPIO 32 | I2S data out (mic → ESP32) |
| VDD | 3.3V | Power |
| GND | GND | Ground |
| L/R | GND | Channel select (low = left) |
| MAX98357A Pin | ESP32-S3 GPIO | Description |
|---|---|---|
| BCLK | GPIO 26 | I2S bit clock |
| LRC | GPIO 25 | I2S word select |
| DIN | GPIO 22 | I2S data in (ESP32 → speaker) |
| VIN | 5V | Power |
| GND | GND | Ground |
The ESP32-S3 has two independent I2S peripherals — run the microphone on I2S_NUM_0 and the speaker on I2S_NUM_1 for full-duplex audio without conflicts.
I2S configuration
Voice AI needs 16 kHz sample rate, 16-bit depth, mono. Higher rates waste bandwidth without improving STT quality.
#include <driver/i2s.h>

void configureI2SMicrophone() {
    i2s_config_t mic_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
        .sample_rate = 16000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 4,   // ~256ms buffer for mic
        .dma_buf_len = 1024,
        .use_apll = false,
    };
    i2s_pin_config_t mic_pins = {
        .bck_io_num = 14,
        .ws_io_num = 15,
        .data_out_num = I2S_PIN_NO_CHANGE,
        .data_in_num = 32,
    };
    i2s_driver_install(I2S_NUM_0, &mic_config, 0, NULL);
    i2s_set_pin(I2S_NUM_0, &mic_pins);
}

void configureI2SSpeaker() {
    i2s_config_t spk_config = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_TX),
        .sample_rate = 16000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 8,   // More buffers to absorb network jitter
        .dma_buf_len = 1024,
        .tx_desc_auto_clear = true,
    };
    i2s_pin_config_t spk_pins = {
        .bck_io_num = 26,
        .ws_io_num = 25,
        .data_out_num = 22,
        .data_in_num = I2S_PIN_NO_CHANGE,
    };
    i2s_driver_install(I2S_NUM_1, &spk_config, 0, NULL);
    i2s_set_pin(I2S_NUM_1, &spk_pins);
}

DMA buffer tuning
The speaker uses 8 DMA buffers vs 4 for the mic because playback audio arrives over the network with variable timing (jitter). More buffers absorb timing variations. The mic captures locally with predictable timing so fewer buffers suffice.
Opus codec configuration
Raw 16-bit PCM at 16 kHz is 256 kbps. Opus compresses it to 16-24 kbps with excellent speech quality. The LiveKit ESP32 SDK handles encoding, but you tune the parameters:
// Configure Opus encoding
lk.setOpusBitrate(16000); // 16 kbps — good quality for speech
lk.setOpusFrameSize(20); // 20ms frames — standard for voice
lk.setOpusComplexity(5); // 0-10, lower = less CPU
// Battery device: reduce CPU at slight quality cost
// lk.setOpusComplexity(2); // Saves ~15% CPU

| Bitrate | Quality | Use case |
|---|---|---|
| 12 kbps | Acceptable | Battery-constrained wearable |
| 16 kbps | Good | Standard voice device |
| 24 kbps | Excellent | Kiosk with reliable WiFi |
Buffer management and latency
Audio flows through several buffers, each adding latency:
| Buffer | Latency | Purpose |
|---|---|---|
| I2S DMA (mic) | ~20-60ms | Capture read granularity (4 × 1024-sample buffers = 256ms total capacity) |
| Opus frame | 20ms | Encode one frame |
| Network send | ~20-40ms | WebRTC packet queue |
| Jitter buffer (playback) | 40-80ms | Smooth network timing |
| Total one-way | ~100-200ms | Device audio latency |
Combined with cloud processing (STT + LLM + TTS), expect 400-700ms end-to-end for a full voice interaction.
Monitor your heap
Combined I2S buffers, Opus encoder, and network stack consume ~300 KB of 512 KB SRAM on boards without PSRAM. Monitor with ESP.getFreeHeap() and alert below 50 KB.
Complete connection example
#include <WiFi.h>
#include <LiveKitClient.h>
#include <driver/i2s.h>
const char* ssid = "your-wifi";
const char* password = "your-password";
const char* lk_url = "wss://your-project.livekit.cloud";
LiveKitClient lk;
void setup() {
    Serial.begin(115200);

    configureI2SMicrophone();
    configureI2SSpeaker();

    WiFi.begin(ssid, password);
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("\nWiFi connected");

    // Connect to LiveKit room
    // Generate token with: lk token create --api-key KEY --api-secret SECRET
    //   --join --room esp32-room --identity esp32-device
    lk.begin(lk_url, getToken());
    lk.setAudioInput(I2S_NUM_0);    // Microphone
    lk.setAudioOutput(I2S_NUM_1);   // Speaker
    Serial.println("LiveKit connected — listening");
}

void loop() {
    lk.update();  // Process bidirectional audio
}

The lk.update() call handles everything: reading I2S samples, Opus encoding, WebRTC transmission, receiving agent audio, decoding, and writing to the speaker. Deploy a simple agent and speak into the microphone to verify the full round-trip.
The agent side
Your cloud agent connects to the same LiveKit Room. It receives the ESP32's audio track and responds through its own audio track:
from livekit.agents import Agent, AgentSession, RoomInputOptions
from livekit.plugins import deepgram, openai, cartesia

class DeviceAssistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a voice assistant running on a physical device. "
                "Keep responses short and conversational — the user is "
                "speaking to a small speaker, not reading a screen."
            ),
        )

async def entrypoint(ctx):
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),  # Fast + cheap for device use
        tts=cartesia.TTS(model="sonic", voice="friendly-assistant"),
    )
    await session.start(
        agent=DeviceAssistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(),
    )
What you learned
- The ESP32-S3 connects to LiveKit Cloud as a room participant, publishing and subscribing to audio tracks over WebRTC
- I2S configuration for voice: 16 kHz, 16-bit, mono, with asymmetric DMA buffers for mic vs speaker
- Opus codec tuning trades bitrate, CPU, and quality — 16 kbps is the sweet spot for speech
- The full latency budget is roughly 100-200ms device-side plus cloud processing time
- PSRAM is strongly recommended to avoid memory pressure from combined audio/network buffers
Next up
In the next chapter, you will add wake word detection so the device only connects to LiveKit Cloud when the user actually speaks — saving bandwidth and battery.