Wake word detection & audio streaming

A voice device that streams to the cloud continuously wastes bandwidth, battery, and money. Wake word detection runs a lightweight neural network locally on the ESP32, listening for a trigger phrase and activating the full LiveKit pipeline only when the user says it. This chapter covers engine selection, implementation, the wake-to-stream transition, and power management.

Porcupine · ESP-SR · Power management · Low-power listening

What you'll build

A device that listens for a wake word at ~30 mA, connects to WiFi and LiveKit only when triggered, streams a full voice conversation, and returns to low-power listening after the interaction ends.

Wake word engines

Engine	Provider	Custom words	Model size	Latency	License
ESP-SR	Espressif	Yes (retrain)	~1.5 MB	~200ms	Apache 2.0
Porcupine	Picovoice	Yes (web tool)	~500 KB	~100ms	Free tier available

ESP-SR uses the ESP32-S3's vector instructions for on-chip inference. Its model is larger and its latency slightly higher, but it is fully open source. Porcupine is smaller and faster, and offers a web-based tool for creating custom wake words; its free tier covers non-commercial use.

What's happening

Wake word detection is binary classification: "did I just hear the trigger phrase?" The engine consumes audio in short frames (512 samples at a time in the Porcupine example below, about 32 ms at Porcupine's 16 kHz sample rate) and produces a confidence score. When the score exceeds a threshold, the device wakes. Because the model is tiny and the task narrow, it runs on a microcontroller without cloud connectivity.
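
To make the scoring step concrete, here is a minimal host-side sketch of turning a stream of confidence scores into wake events. `WakeDetector`, `countWakeEvents`, and the refractory period are hypothetical names and choices made for illustration; engines such as Porcupine and ESP-SR handle this internally and only report a detection.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of Porcupine or ESP-SR: converts per-frame
// confidence scores into discrete wake events.
class WakeDetector {
public:
    // threshold: score above which a frame counts as the wake word.
    // refractoryFrames: frames to ignore after a detection, so one spoken
    // phrase that spans several frames triggers only once.
    WakeDetector(float threshold, int refractoryFrames)
        : threshold_(threshold), refractoryFrames_(refractoryFrames) {
        assert(threshold >= 0.0f && threshold <= 1.0f);
        assert(refractoryFrames >= 0);
    }

    // Feed one score; returns true only when a new wake event fires.
    bool update(float score) {
        if (cooldown_ > 0) {
            --cooldown_;  // still inside the refractory period
            return false;
        }
        if (score > threshold_) {
            cooldown_ = refractoryFrames_;
            return true;
        }
        return false;
    }

private:
    float threshold_;
    int refractoryFrames_;
    int cooldown_ = 0;
};

// Counts wake events in a recorded sequence of scores.
int countWakeEvents(const std::vector<float>& scores, float threshold,
                    int refractoryFrames) {
    WakeDetector detector(threshold, refractoryFrames);
    int events = 0;
    for (float score : scores) {
        if (detector.update(score)) ++events;
    }
    return events;
}
```

Without the refractory period, every frame of a single utterance that scores above the threshold would count as a separate trigger, which is why the sketch suppresses detections for a few frames after each event.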

Implementation with Porcupine

wake_word.cpp

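#include <Arduino.h>
#include <WiFi.h>
#include <driver/i2s.h>
#include <LiveKitClient.h>
#include <pv_porcupine.h>

// ssid, password, lk_url, keyword_model_data, readI2SAudio() and getToken()
// are assumed to be defined elsewhere in the project.

// Forward declarations: required in a .cpp file because loop() calls these
// before they are defined.
void startStreaming();
void stopStreaming();

pv_porcupine_t* porcupine = NULL;
LiveKitClient lk;
bool isStreaming = false;

void setupWakeWord() {
  float sensitivities[] = {0.5f};
  // Simplified call: check the exact argument list against the
  // pv_porcupine.h shipped with your Porcupine SDK version.
  pv_status_t status = pv_porcupine_init(
      "your-picovoice-key",
      1,                    // Number of keywords
      keyword_model_data,   // Embedded keyword model
      sensitivities,        // 0.0-1.0: lower = fewer false triggers, more misses
      &porcupine
  );
  if (status != PV_STATUS_SUCCESS) {
    Serial.println("Porcupine init failed");
    porcupine = NULL;  // loop() checks this before processing audio
  }
}

void setup() {
  Serial.begin(115200);
  // I2S microphone and speaker setup omitted here
  setupWakeWord();
}

void loop() {
  if (!isStreaming) {
    if (porcupine == NULL) {
      delay(1000);  // Init failed: nothing to run
      return;
    }

    // Low-power mode: only wake word detection
    int16_t pcm[512];
    readI2SAudio(pcm, 512);

    int32_t keyword_index = -1;
    pv_status_t status = pv_porcupine_process(porcupine, pcm, &keyword_index);

    // keyword_index stays -1 unless a keyword was detected in this frame
    if (status == PV_STATUS_SUCCESS && keyword_index >= 0) {
      Serial.println("Wake word detected!");
      startStreaming();
    }
  } else {
    // Active mode: stream to LiveKit
    lk.update();

    // Return to listening after 3s silence
    if (lk.silenceDurationMs() > 3000) {
      stopStreaming();
    }
  }
}

void startStreaming() {
  WiFi.begin(ssid, password);

  // Give up after 10 s instead of blocking forever if the network is down
  uint32_t start = millis();
  while (WiFi.status() != WL_CONNECTED) {
    if (millis() - start > 10000) {
      Serial.println("WiFi connect timed out");
      WiFi.disconnect();
      return;  // Stay in wake word listening
    }
    delay(100);
  }

  lk.begin(lk_url, getToken());
  lk.setAudioInput(I2S_NUM_0);
  lk.setAudioOutput(I2S_NUM_1);
  isStreaming = true;
}

void stopStreaming() {
  lk.disconnect();
  WiFi.disconnect();
  isStreaming = false;
  Serial.println("Returning to wake word listening");
}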
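
The wake-to-stream cycle above can also be viewed as a small state machine, which makes the transitions easy to test off-device. The following is a sketch under assumptions: `DeviceState`, `Inputs`, and `nextState` are hypothetical names, and the explicit `Connecting` state is an addition that the blocking example above does not have.

```cpp
#include <cstdint>

// Hypothetical device states for the wake-to-stream cycle.
enum class DeviceState { Listening, Connecting, Streaming };

// Inputs sampled once per loop iteration. All fields are assumptions that
// stand in for the wake word engine, the network stack, and a silence timer.
struct Inputs {
    bool wakeWordDetected;
    bool connected;
    bool connectFailed;
    uint32_t silenceMs;
};

constexpr uint32_t kSilenceTimeoutMs = 3000;  // same 3 s timeout as above

// Pure transition function: no I/O, so it can be unit-tested on a host.
DeviceState nextState(DeviceState current, const Inputs& in) {
    switch (current) {
        case DeviceState::Listening:
            return in.wakeWordDetected ? DeviceState::Connecting
                                       : DeviceState::Listening;
        case DeviceState::Connecting:
            if (in.connectFailed) return DeviceState::Listening;
            return in.connected ? DeviceState::Streaming
                                : DeviceState::Connecting;
        case DeviceState::Streaming:
            return in.silenceMs > kSilenceTimeoutMs ? DeviceState::Listening
                                                    : DeviceState::Streaming;
    }
    return DeviceState::Listening;  // not reached; satisfies the compiler
}
```

Keeping the transition logic free of WiFi and LiveKit calls means the device code only has to perform the side effects (connect, disconnect) when the returned state differs from the current one.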

Sensitivity tuning

Start at 0.5 for development. In noisy environments (kitchens, workshops), lower to 0.3 to reduce false triggers, at the cost of more missed detections. In quiet environments, raise to 0.7 for more responsive detection.

Connect-on-wake vs always-connected

Two streaming architectures, each with tradeoffs:

Strategy	Latency after wake	Power	Best for
Connect on wake	1-3s (WiFi + LiveKit connect)	Low	Battery devices, infrequent use
Always connected	Under 100ms (already in room)	High	Wall-powered kiosks, frequent use

For always-connected mode, keep the LiveKit room connection alive but mute the audio track. On wake word, unmute and start streaming:

always_connected.cpp

// Always-connected: stay in room, mute when idle
void startStreaming() {
  lk.unmuteAudioInput();
  isStreaming = true;
}

void stopStreaming() {
  lk.muteAudioInput();  // Stay connected, stop sending audio
  isStreaming = false;
}

Power management

Mode	Current	Battery life (1000 mAh)
Deep sleep	~10 µA	~11 years
Wake word listening (80 MHz)	~30 mA	~33 hours
Active streaming (WiFi)	~100 mA	~10 hours
Mixed (5 min active/hour)	~36 mA	~28 hours
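
The battery-life column is capacity divided by average current, and the mixed row is a time-weighted average of the listening and streaming currents. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the result is an idealized upper bound.

```cpp
#include <cassert>

// Time-weighted average current for a device that streams for
// activeMinutesPerHour minutes each hour and listens the rest of the time.
double averageCurrentMa(double activeMinutesPerHour, double activeMa,
                        double listeningMa) {
    assert(activeMinutesPerHour >= 0.0 && activeMinutesPerHour <= 60.0);
    const double listeningMinutes = 60.0 - activeMinutesPerHour;
    return (activeMinutesPerHour * activeMa +
            listeningMinutes * listeningMa) / 60.0;
}

// Idealized runtime: ignores self-discharge, regulator losses, and capacity
// loss under load, so a real device will fall short of this figure.
double batteryLifeHours(double capacityMah, double currentMa) {
    assert(currentMa > 0.0);
    return capacityMah / currentMa;
}
```

With 5 minutes at 100 mA and 55 minutes at 30 mA, `averageCurrentMa(5, 100, 30)` gives about 35.8 mA, and `batteryLifeHours(1000, 35.8)` gives roughly 28 hours, which is where the mixed-use figure comes from.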

Deep sleep between interactions

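After the conversation ends and the timeout passes, enter deep sleep. The main CPU is powered down in deep sleep, so wake word detection stops; use a GPIO interrupt (for example, a button) or the ULP coprocessor to wake the device.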

WiFi only when needed

Do not connect WiFi during wake word listening. Connecting takes 1-3 seconds, which is acceptable after a wake word trigger.

Reduce clock speed while listening

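Drop the CPU clock from 240 MHz to 80 MHz during wake word mode. The model runs fine at the lower speed, and current draw falls substantially, though not strictly in proportion, because part of the draw does not scale with clock frequency.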

power_management.cpp

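#include <Arduino.h>    // setCpuFrequencyMhz()
#include <WiFi.h>
#include <esp_sleep.h>  // esp_sleep_* and esp_deep_sleep_start()

void enterLowPowerListening() {
  // Turn the radio off, then reduce the clock to 80 MHz for wake word mode
  WiFi.disconnect();
  WiFi.mode(WIFI_OFF);
  setCpuFrequencyMhz(80);
}

void enterActiveMode() {
  // Full speed for streaming
  setCpuFrequencyMhz(240);
  WiFi.mode(WIFI_STA);
}

void enterDeepSleep(uint32_t timeout_sec) {
  // A GPIO 0 button press (pin pulled low) will wake the device
  esp_sleep_enable_ext0_wakeup(GPIO_NUM_0, 0);
  // Also wake after timeout_sec seconds; the timer takes microseconds
  esp_sleep_enable_timer_wakeup(timeout_sec * 1000000ULL);
  // Does not return: the chip restarts from setup() on wake
  esp_deep_sleep_start();
}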

Test your knowledge

Why is WiFi only connected after the wake word is detected rather than kept on continuously for battery devices?

What you learned

Wake word detection runs locally on the ESP32 using Porcupine (~500 KB, ~100ms) or ESP-SR (~1.5 MB, ~200ms)
Connect-on-wake saves battery (1-3s latency); always-connected gives instant response for wall-powered devices
Power management: 80 MHz clock + WiFi off during listening extends battery to ~28 hours in mixed use
The lk.silenceDurationMs() timeout returns the device to listening mode after conversation ends

Next up

In the next chapter, you will connect voice commands to physical hardware — controlling LEDs, relays, and servos through your LiveKit agent's function tools.