Chapter 225m

Wake word detection & audio streaming

A voice device that streams to the cloud continuously wastes bandwidth, battery, and money. Wake word detection runs a lightweight neural network locally on the ESP32, listens for a trigger phrase, and activates the full LiveKit pipeline only when the user says it. This chapter covers engine selection, implementation, the wake-to-stream transition, and power management.

Porcupine, ESP-SR, Power management, Low-power listening

What you'll build

A device that listens for a wake word at ~30 mA, connects to WiFi and LiveKit only when triggered, streams a full voice conversation, and returns to low-power listening after the interaction ends.

Wake word engines

| Engine | Provider | Custom words | Model size | Latency | License |
| --- | --- | --- | --- | --- | --- |
| ESP-SR | Espressif | Yes (retrain) | ~1.5 MB | ~200 ms | Apache 2.0 |
| Porcupine | Picovoice | Yes (web tool) | ~500 KB | ~100 ms | Free tier available |

ESP-SR uses the ESP32-S3's vector instructions for on-chip inference. Its model is larger and its latency slightly higher, but it is fully open source. Porcupine is smaller and faster, and offers a web-based tool for creating custom wake words. Its free tier covers non-commercial use.

What's happening

Wake word detection is binary classification: "did I just hear the trigger phrase?" The model runs on ~512ms audio windows and produces a confidence score. When the score exceeds a threshold, the device wakes. Because the model is tiny and the task narrow, it runs on a microcontroller without cloud connectivity.
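The threshold step can be sketched in a few lines. This is a minimal illustration only: `firstDetection` is a hypothetical helper, not part of Porcupine or ESP-SR, and it assumes the model's per-window confidence scores are already available as floats.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Porcupine or ESP-SR API): returns the index of
// the first audio window whose confidence score exceeds the threshold, or -1
// if no window does.
int firstDetection(const std::vector<float>& scores, float threshold) {
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] > threshold) {
            return static_cast<int>(i);  // The device would wake here
        }
    }
    return -1;  // No window crossed the threshold
}
```

Real engines run this comparison internally and expose only the result, which is why the application code later in this chapter sees a keyword index rather than a raw score.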

Implementation with Porcupine

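wake_word.cpp
#include <Arduino.h>
#include <WiFi.h>
#include <LiveKitClient.h>
#include <pv_porcupine.h>
#include <driver/i2s.h>

// Assumed to be defined elsewhere in the project: ssid, password, lk_url,
// getToken(), keyword_model_data, and readI2SAudio().

// Forward declarations: loop() calls these before they are defined.
void startStreaming();
void stopStreaming();

pv_porcupine_t* porcupine = NULL;
LiveKitClient lk;
bool isStreaming = false;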

void setupWakeWord() {
  float sensitivities[] = {0.5f};
  pv_status_t status = pv_porcupine_init(
      "your-picovoice-key",
      1,                    // Number of keywords
      keyword_model_data,   // Embedded keyword model
      sensitivities,        // 0.0-1.0: lower = fewer false triggers
      &porcupine
  );
  if (status != PV_STATUS_SUCCESS) {
      Serial.println("Porcupine init failed");
  }
}

void loop() {
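  if (!isStreaming) {
      // Low-power mode: only wake word detection
      if (porcupine == NULL) return;  // Init failed, nothing to run

      int16_t pcm[512];  // One frame of 16-bit PCM samples
      readI2SAudio(pcm, 512);

      // keyword_index stays -1 unless a keyword is detected in this frame
      int32_t keyword_index = -1;
      pv_status_t status = pv_porcupine_process(porcupine, pcm, &keyword_index);

      if (status == PV_STATUS_SUCCESS && keyword_index >= 0) {
          Serial.println("Wake word detected!");
          startStreaming();
      }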
  } else {
      // Active mode: stream to LiveKit
      lk.update();

      // Return to listening after 3s silence
      if (lk.silenceDurationMs() > 3000) {
          stopStreaming();
      }
  }
}

void startStreaming() {
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) delay(100);

  lk.begin(lk_url, getToken());
  lk.setAudioInput(I2S_NUM_0);
  lk.setAudioOutput(I2S_NUM_1);
  isStreaming = true;
}

void stopStreaming() {
  lk.disconnect();
  WiFi.disconnect();
  isStreaming = false;
  Serial.println("Returning to wake word listening");
}
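The 3-second timeout in `loop()` depends on something tracking how long the input has been quiet. The sketch below shows one way such a counter could work. It is an assumption for illustration, not the implementation behind `lk.silenceDurationMs()`: the class name, the amplitude threshold, and the per-frame interface are all hypothetical.

```cpp
#include <cstdint>

// Hypothetical silence tracker. Names and thresholds are illustrative and are
// not taken from the LiveKit client used above.
class SilenceTimer {
public:
    explicit SilenceTimer(int32_t amplitudeThreshold)
        : amplitudeThreshold_(amplitudeThreshold) {}

    // Feed one frame's mean absolute amplitude and its duration in ms.
    // A frame at or above the threshold counts as speech and resets the timer.
    void addFrame(int32_t meanAbsAmplitude, uint32_t frameMs) {
        if (meanAbsAmplitude >= amplitudeThreshold_) {
            silenceMs_ = 0;
        } else {
            silenceMs_ += frameMs;
        }
    }

    // Milliseconds of continuous silence since the last speech frame.
    uint32_t silenceDurationMs() const { return silenceMs_; }

private:
    int32_t amplitudeThreshold_;
    uint32_t silenceMs_ = 0;
};
```

With 32 ms frames, the counter passes 3000 ms after 94 consecutive quiet frames, and a single loud frame resets it to zero.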

Sensitivity tuning

Start at 0.5 for development. Sensitivity trades false triggers against missed detections. In noisy environments (kitchens, workshops), lower it to 0.3 to reduce false triggers. In quiet environments, raise it to 0.7 so fewer wake words are missed.
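If the device ships to more than one kind of room, these presets can live in one place. A small sketch, assuming a hypothetical `Environment` enum and `sensitivityFor` helper (neither is part of Porcupine); the values are the ones suggested above.

```cpp
// Hypothetical environment presets; not part of the Porcupine API.
enum class Environment { Quiet, Default, Noisy };

// Maps an environment to the sensitivity value suggested in this section.
float sensitivityFor(Environment env) {
    switch (env) {
        case Environment::Quiet:   return 0.7f;  // Fewer missed wake words
        case Environment::Noisy:   return 0.3f;  // Fewer false triggers
        case Environment::Default: return 0.5f;  // Development starting point
    }
    return 0.5f;  // Unreachable for valid enum values; keeps compilers quiet
}
```

The returned value would fill the `sensitivities` array passed during engine initialization.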

Connect-on-wake vs always-connected

Two streaming architectures, each with tradeoffs:

| Strategy | Latency after wake | Power | Best for |
| --- | --- | --- | --- |
| Connect on wake | 1-3 s (WiFi + LiveKit connect) | Low | Battery devices, infrequent use |
| Always connected | Under 100 ms (already in room) | High | Wall-powered kiosks, frequent use |

For always-connected mode, keep the LiveKit room connection alive but mute the audio track. On wake word, unmute and start streaming:

always_connected.cpp
// Always-connected: stay in room, mute when idle
void startStreaming() {
  lk.unmuteAudioInput();
  isStreaming = true;
}

void stopStreaming() {
  lk.muteAudioInput();  // Stay connected, stop sending audio
  isStreaming = false;
}

Power management

| Mode | Current | Battery life (1000 mAh) |
| --- | --- | --- |
| Deep sleep | ~10 µA | ~11 years |
| Wake word listening (80 MHz) | ~30 mA | ~33 hours |
| Active streaming (WiFi) | ~100 mA | ~10 hours |
| Mixed (5 min active/hour) | ~35 mA | ~28 hours |
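
The mixed-use row is a duty-cycle average of the listening and streaming rows. The sketch below reproduces that arithmetic using the currents from the table. The function names are illustrative, and the result is an idealized estimate that ignores regulator losses, self-discharge, and capacity derating.

```cpp
// Average current in mA for a duty cycle: activeMinPerHour minutes of
// streaming each hour, with the remainder spent in wake word listening.
double averageCurrentMa(double activeMinPerHour, double activeMa, double listenMa) {
    const double listenMinPerHour = 60.0 - activeMinPerHour;
    return (activeMinPerHour * activeMa + listenMinPerHour * listenMa) / 60.0;
}

// Idealized battery life in hours: capacity divided by average current.
double batteryLifeHours(double capacityMah, double currentMa) {
    return capacityMah / currentMa;
}
```

For 5 active minutes per hour at 100 mA and 55 listening minutes at 30 mA, the average is (5 × 100 + 55 × 30) / 60 ≈ 35.8 mA, and 1000 mAh / 35.8 mA ≈ 27.9 hours, which matches the rounded figures in the table.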

1. Deep sleep between interactions

After the conversation ends and the timeout passes, enter deep sleep. Use a GPIO interrupt or the ULP coprocessor to wake the device.

2. WiFi only when needed

Do not connect WiFi during wake word listening. Connecting takes 1-3 seconds, which is acceptable after a wake word trigger.

3. Reduce clock speed while listening

Drop from 240 MHz to 80 MHz during wake word mode. The model runs fine at the lower clock speed, and current draw drops with it.

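power_management.cpp
#include <Arduino.h>    // setCpuFrequencyMhz()
#include <WiFi.h>       // WiFi.disconnect(), WiFi.mode()
#include <esp_sleep.h>  // Deep sleep wakeup sources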

void enterLowPowerListening() {
  // Reduce clock to 80 MHz for wake word mode
  setCpuFrequencyMhz(80);
  WiFi.disconnect();
  WiFi.mode(WIFI_OFF);
}

void enterActiveMode() {
  // Full speed for streaming
  setCpuFrequencyMhz(240);
  WiFi.mode(WIFI_STA);
}

void enterDeepSleep(uint32_t timeout_sec) {
  // A button press that pulls GPIO 0 low will wake the device
  esp_sleep_enable_ext0_wakeup(GPIO_NUM_0, 0);
  // Also wake after timeout_sec seconds (the timer takes microseconds)
  esp_sleep_enable_timer_wakeup(timeout_sec * 1000000ULL);
  esp_deep_sleep_start();
}
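
The three functions above imply a small state machine: listening, active, and deep sleep, with specific events moving between them. The sketch below models those transitions in plain C++ so they can be reasoned about off-device. It is an assumption about how the modes fit together, not code from the chapter's firmware: the `PowerMode` and `Event` names are hypothetical, and on real hardware a deep sleep wake restarts the program rather than returning to a function.

```cpp
// Hypothetical power modes and events; names are illustrative only.
enum class PowerMode { Listening, Active, DeepSleep };
enum class Event { WakeWord, ConversationEnded, IdleTimeout, ButtonPress };

// Returns the mode the device should be in after an event.
PowerMode nextMode(PowerMode mode, Event event) {
    switch (mode) {
        case PowerMode::Listening:
            if (event == Event::WakeWord) return PowerMode::Active;
            if (event == Event::IdleTimeout) return PowerMode::DeepSleep;
            break;
        case PowerMode::Active:
            if (event == Event::ConversationEnded) return PowerMode::Listening;
            break;
        case PowerMode::DeepSleep:
            // Wake word detection does not run in deep sleep, so only the
            // button (or another configured wake source) brings the device back.
            if (event == Event::ButtonPress) return PowerMode::Listening;
            break;
    }
    return mode;  // Events that do not apply in the current mode are ignored
}
```

In firmware, each transition would call the matching function: `enterActiveMode()` on a wake word, `enterLowPowerListening()` when the conversation ends, and `enterDeepSleep()` after the idle timeout.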

Test your knowledge

Why is WiFi only connected after the wake word is detected rather than kept on continuously for battery devices?

What you learned

  • Wake word detection runs locally on the ESP32 using Porcupine (~500 KB, ~100 ms) or ESP-SR (~1.5 MB, ~200 ms)
  • Connect-on-wake saves battery (1-3 s latency); always-connected gives instant response for wall-powered devices
  • Power management: 80 MHz clock + WiFi off during listening extends battery to ~28 hours in mixed use
  • The lk.silenceDurationMs() timeout returns the device to listening mode after conversation ends

Next up

In the next chapter, you will connect voice commands to physical hardware — controlling LEDs, relays, and servos through your LiveKit agent's function tools.

Concepts covered
Porcupine, ESP-SR, Power management, Low-power listening