Wake word detection & audio streaming
A voice device that streams to the cloud continuously wastes bandwidth, battery, and money. Wake word detection runs a lightweight neural network locally on the ESP32, listening for a trigger phrase and activating the full LiveKit pipeline only when the user says it. This chapter covers engine selection, implementation, the wake-to-stream transition, and power management.
What you'll build
A device that listens for a wake word at ~30 mA, connects to WiFi and LiveKit only when triggered, streams a full voice conversation, and returns to low-power listening after the interaction ends.
Wake word engines
| Engine | Provider | Custom words | Model size | Latency | License |
|---|---|---|---|---|---|
| ESP-SR | Espressif | Yes (retrain) | ~1.5 MB | ~200ms | Apache 2.0 |
| Porcupine | Picovoice | Yes (web tool) | ~500 KB | ~100ms | Free tier available |
ESP-SR uses the ESP32-S3's vector instructions for on-chip inference. Its model is larger and its latency slightly higher, but it is fully open source. Porcupine is smaller and faster, and it offers a web-based tool for creating custom wake words. Its free tier covers non-commercial use.
Wake word detection is binary classification: "did I just hear the trigger phrase?" The model runs on 512-sample audio frames (about 32 ms at 16 kHz, as in the Porcupine example below) and produces a confidence score. When the score exceeds a threshold, the device wakes. Because the model is tiny and the task narrow, it runs on a microcontroller without cloud connectivity.
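The thresholding step described above can be sketched in plain C++. This is an illustration only, not the internals of Porcupine or ESP-SR: the `WakeDetector` name and the cooldown that suppresses repeat triggers from a single utterance are assumptions added for the example.

```cpp
#include <cstdint>

// Hypothetical threshold detector (not part of any wake word library).
// Fires when a confidence score crosses the threshold, then ignores
// further frames for a cooldown period so one utterance wakes once.
class WakeDetector {
public:
    WakeDetector(float threshold, uint32_t cooldownFrames)
        : threshold_(threshold), cooldownFrames_(cooldownFrames) {}

    // Feed one confidence score (0.0-1.0) per audio frame.
    // Returns true only on the frame where the device should wake.
    bool onScore(float score) {
        if (cooldownLeft_ > 0) {
            --cooldownLeft_;  // Still inside the cooldown window
            return false;
        }
        if (score > threshold_) {
            cooldownLeft_ = cooldownFrames_;
            return true;
        }
        return false;
    }

private:
    float threshold_;
    uint32_t cooldownFrames_;
    uint32_t cooldownLeft_ = 0;
};
```

In practice the engine applies its own threshold internally, so a wrapper like this would only matter if you post-process raw scores yourself.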
Implementation with Porcupine
#include <WiFi.h>
#include <LiveKitClient.h>
#include <pv_porcupine.h>
#include <driver/i2s.h>

// ssid, password, lk_url, getToken(), readI2SAudio(), and
// keyword_model_data are assumed to be defined elsewhere in the sketch.

pv_porcupine_t* porcupine = NULL;
LiveKitClient lk;
bool isStreaming = false;

void setupWakeWord() {
    float sensitivities[] = {0.5f};
    pv_status_t status = pv_porcupine_init(
        "your-picovoice-key",
        1,                   // Number of keywords
        keyword_model_data,  // Embedded keyword model
        sensitivities,       // 0.0-1.0: lower = fewer false triggers
        &porcupine
    );
    if (status != PV_STATUS_SUCCESS) {
        Serial.println("Porcupine init failed");
        porcupine = NULL;  // loop() checks this before processing audio
    }
}

void loop() {
    if (!isStreaming) {
        if (porcupine == NULL) return;  // Init failed: nothing to run

        // Low-power mode: only wake word detection
        int16_t pcm[512];
        readI2SAudio(pcm, 512);
        int32_t keyword_index = -1;
        pv_status_t status = pv_porcupine_process(porcupine, pcm, &keyword_index);
        if (status == PV_STATUS_SUCCESS && keyword_index >= 0) {
            Serial.println("Wake word detected!");
            startStreaming();
        }
    } else {
        // Active mode: stream to LiveKit
        lk.update();
        // Return to listening after 3s silence
        if (lk.silenceDurationMs() > 3000) {
            stopStreaming();
        }
    }
}

void startStreaming() {
    WiFi.begin(ssid, password);
    // Give up after 10 s (arbitrary limit) so a missing network
    // cannot hang the device in this loop forever
    uint32_t start = millis();
    while (WiFi.status() != WL_CONNECTED) {
        if (millis() - start > 10000) {
            Serial.println("WiFi connect timed out");
            WiFi.disconnect();
            return;  // Stay in wake word listening
        }
        delay(100);
    }
    lk.begin(lk_url, getToken());
    lk.setAudioInput(I2S_NUM_0);
    lk.setAudioOutput(I2S_NUM_1);
    isStreaming = true;
}

void stopStreaming() {
    lk.disconnect();
    WiFi.disconnect();
    isStreaming = false;
    Serial.println("Returning to wake word listening");
}
Sensitivity tuning
Start at 0.5 for development. In noisy environments (kitchens, workshops), lower to 0.3 to reduce false triggers. In quiet environments, raise to 0.7 for more responsive detection.
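One way to keep these values out of scattered magic numbers is a small preset lookup. The sketch below is a suggestion rather than part of the Porcupine API: the `Environment` enum and `sensitivityFor` helper are hypothetical names, and the values simply mirror the guidance above.

```cpp
// Hypothetical environment presets for the wake word sensitivity.
enum class Environment { Quiet, Normal, Noisy };

// Returns the sensitivity suggested for each environment:
// lower values mean fewer false triggers but more missed wake words.
float sensitivityFor(Environment env) {
    switch (env) {
        case Environment::Noisy:  return 0.3f;  // Kitchens, workshops
        case Environment::Quiet:  return 0.7f;  // More responsive detection
        case Environment::Normal: return 0.5f;  // Development default
    }
    return 0.5f;  // Fallback for out-of-range values
}
```

The result would be passed in the `sensitivities` array at initialization, so changing environments means re-initializing the engine.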
Connect-on-wake vs always-connected
Two streaming architectures, each with tradeoffs:
| Strategy | Latency after wake | Power | Best for |
|---|---|---|---|
| Connect on wake | 1-3s (WiFi + LiveKit connect) | Low | Battery devices, infrequent use |
| Always connected | Under 100ms (already in room) | High | Wall-powered kiosks, frequent use |
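Both strategies in the table share the same two-state control flow and differ only in what happens on each transition (connect and disconnect, or unmute and mute). A minimal sketch of that shared logic, assuming a hypothetical `ModeController` type and the 3-second silence timeout used earlier in this chapter:

```cpp
#include <cstdint>

enum class Mode { Listening, Streaming };

// Hypothetical controller: tracks which mode the device is in.
// What "start" and "stop" actually do is left to the caller.
struct ModeController {
    Mode mode = Mode::Listening;
    uint32_t silenceTimeoutMs = 3000;

    // Call once per loop iteration. Returns true if the mode changed,
    // so the caller knows to run its start or stop action.
    bool update(bool wakeDetected, uint32_t silenceMs) {
        if (mode == Mode::Listening && wakeDetected) {
            mode = Mode::Streaming;
            return true;
        }
        if (mode == Mode::Streaming && silenceMs > silenceTimeoutMs) {
            mode = Mode::Listening;
            return true;
        }
        return false;
    }
};
```

Keeping the transition logic separate from the network calls makes it possible to switch strategies without touching the main loop.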
For always-connected mode, keep the LiveKit room connection alive but mute the audio track. On wake word, unmute and start streaming:
// Always-connected: stay in room, mute when idle
void startStreaming() {
    lk.unmuteAudioInput();
    isStreaming = true;
}

void stopStreaming() {
    lk.muteAudioInput();  // Stay connected, stop sending audio
    isStreaming = false;
}
Power management
| Mode | Current | Battery life (1000 mAh) |
|---|---|---|
| Deep sleep | ~10 µA | ~11 years (ideal, ignoring self-discharge) |
| Wake word listening (80 MHz) | ~30 mA | ~33 hours |
| Active streaming (WiFi) | ~100 mA | ~10 hours |
| Mixed (5 min active/hour) | ~36 mA | ~28 hours |
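The mixed row follows from a time-weighted average of the listening and streaming currents. The helpers below are a back-of-the-envelope sketch with hypothetical names. They assume an ideal battery and ignore self-discharge, regulator losses, and connection overhead.

```cpp
// Average current (mA) for a device that streams for part of each hour
// and listens for the wake word the rest of the time.
double averageCurrentMa(double activeMinutesPerHour,
                        double activeMa, double idleMa) {
    double idleMinutes = 60.0 - activeMinutesPerHour;
    return (activeMinutesPerHour * activeMa + idleMinutes * idleMa) / 60.0;
}

// Ideal battery life in hours for a given capacity and average draw.
double batteryLifeHours(double capacityMah, double currentMa) {
    return capacityMah / currentMa;
}
```

For 5 active minutes per hour at 100 mA and 55 listening minutes at 30 mA, `averageCurrentMa(5, 100, 30)` gives (5 × 100 + 55 × 30) / 60 ≈ 35.8 mA, and a 1000 mAh battery then lasts roughly 28 hours.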
Deep sleep between interactions
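After the conversation ends and the timeout passes, enter deep sleep. The wake word model does not run while the main CPU is in deep sleep, so use a GPIO interrupt (such as a button press) or the ULP coprocessor to wake the device.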
WiFi only when needed
Do not connect WiFi during wake word listening. Connecting takes 1-3 seconds, which is acceptable after a wake word trigger.
Reduce clock speed while listening
Drop from 240 MHz to 80 MHz during wake word mode. The model runs fine at the lower speed, and current draw falls substantially.
#include <WiFi.h>
#include <esp_sleep.h>

void enterLowPowerListening() {
    // Reduce clock to 80 MHz for wake word mode
    setCpuFrequencyMhz(80);
    // Shut the radio down completely while listening
    WiFi.disconnect();
    WiFi.mode(WIFI_OFF);
}

void enterActiveMode() {
    // Full speed for streaming
    setCpuFrequencyMhz(240);
    WiFi.mode(WIFI_STA);
}

void enterDeepSleep(uint32_t timeout_sec) {
    // GPIO 0 button press (active low) will wake the device
    esp_sleep_enable_ext0_wakeup(GPIO_NUM_0, 0);
    // Also wake after timeout_sec seconds (the argument is in microseconds)
    esp_sleep_enable_timer_wakeup(timeout_sec * 1000000ULL);
    esp_deep_sleep_start();  // Does not return; the chip resets on wake
}
Test your knowledge
Why is WiFi only connected after the wake word is detected rather than kept on continuously for battery devices?
What you learned
- Wake word detection runs locally on the ESP32 using Porcupine (~500 KB, ~100ms) or ESP-SR (~1.5 MB, ~200ms)
- Connect-on-wake saves battery (1-3s latency); always-connected gives instant response for wall-powered devices
- Power management: 80 MHz clock + WiFi off during listening extends battery to ~28 hours in mixed use
- The `lk.silenceDurationMs()` timeout returns the device to listening mode after conversation ends
Next up
In the next chapter, you will connect voice commands to physical hardware: controlling LEDs, relays, and servos through your LiveKit agent's function tools.