Chapter 6

CI/CD integration

Tests that you forget to run are tests that do not exist. In this chapter, you will wire your behavioral tests, golden tests, and regression checks into a GitHub Actions CI/CD pipeline. Every push will run the test suite, and deployments will be gated on passing results.

GitHub Actions · Automated testing · Reporting

What you'll learn

  • How to configure GitHub Actions to run voice AI agent tests
  • How to manage API keys and secrets in CI
  • How to generate and publish test reports
  • How to gate deployments on test results

The CI pipeline structure

Your CI pipeline has three stages, each progressively more expensive:

1. Unit tests (fast, no LLM calls)

   Run tool function tests and utility tests. These take seconds and cost nothing.

2. Behavioral and golden tests (medium, requires LLM)

   Run behavioral tests and golden tests that use judge(). These take minutes and require an LLM API key.

3. Regression evaluation (slow, requires LLM)

   Run the full evaluation suite and compare against the baseline. This is the slowest stage and is typically run only on pull requests to the main branch.

What's happening

Structure your pipeline like a funnel. Cheap tests run first and fail fast. Expensive tests only run if the cheap ones pass. This saves time and API costs on broken commits.

The complete GitHub Actions workflow

.github/workflows/agent-tests.yml
name: Agent Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: "3.12"

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: python -m pytest tests/unit/ -v --tb=short

      - name: Upload unit test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: unit-test-results
          path: test-results/

  behavioral-tests:
    name: Behavioral & Golden Tests
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run golden tests
        run: python -m pytest tests/test_golden.py -v --tb=long
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Run behavioral tests
        run: python -m pytest tests/ -v --tb=short --ignore=tests/unit/ --ignore=tests/test_golden.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload behavioral test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: behavioral-test-results
          path: test-results/

  regression-check:
    name: Regression Check
    needs: behavioral-tests
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run regression check
        run: python scripts/check_regression.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation/reports/
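The regression job calls scripts/check_regression.py, which is not shown above. A minimal sketch of its comparison logic, assuming per-metric scores are stored as name-to-float dictionaries (the tolerance value and score format are illustrative assumptions):

```python
# scripts/check_regression.py (sketch). Compares the current evaluation run
# against a stored baseline. The real script would load
# evaluation/baseline.json and the latest report as JSON, then sys.exit(1)
# when regressions are found so the CI job fails.
TOLERANCE = 0.02  # tolerate small score fluctuations from LLM nondeterminism

def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = TOLERANCE,
) -> list[str]:
    """Return names of metrics scoring more than `tolerance` below baseline."""
    return [
        name
        for name, base_score in baseline.items()
        if current.get(name, 0.0) < base_score - tolerance
    ]
```

A metric missing from the current report counts as a score of 0.0, so a deleted evaluation surfaces as a regression instead of silently passing.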

API keys in CI

Never hardcode API keys in workflow files. Store them as GitHub repository secrets under Settings, then Secrets and variables, then Actions. The workflow references them with the secrets context.
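A cheap guard at the start of the test session makes a missing secret fail loudly with one clear message instead of surfacing as dozens of cryptic auth errors. A sketch you might call from conftest.py (the helper names are illustrative, not part of any library):

```python
# Fail fast when required secrets are absent, before any LLM-backed test runs.
import os

def missing_secrets(required: list[str]) -> list[str]:
    """Return the required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

def assert_secrets(required: list[str]) -> None:
    missing = missing_secrets(required)
    if missing:
        raise RuntimeError(
            f"Missing required secrets: {', '.join(missing)}. "
            "In GitHub Actions, add them under Settings > Secrets and "
            "variables > Actions and pass them via the secrets context."
        )
```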

TypeScript workflow variant

If your agent is built in TypeScript, the workflow uses Node.js tooling instead.

.github/workflows/agent-tests-ts.yml
name: Agent Tests (TypeScript)

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm

      - run: npm ci

      - name: Run unit tests
        run: npx vitest run tests/unit/ --reporter=verbose

  behavioral-tests:
    name: Behavioral & Golden Tests
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm

      - run: npm ci

      - name: Run golden tests
        run: npx vitest run tests/golden.test.ts --reporter=verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Run behavioral tests
        run: npx vitest run tests/ --exclude=tests/unit/ --exclude=tests/golden.test.ts --reporter=verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  regression-check:
    name: Regression Check
    needs: behavioral-tests
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm

      - run: npm ci

      - name: Run regression check
        run: npx tsx scripts/check-regression.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Generating test reports

Add structured test output so you can review results in the GitHub Actions UI.

pytest.ini
[pytest]
testpaths = tests
addopts = --junitxml=test-results/results.xml -v
markers =
  golden: critical behavior tests that must never fail
  behavioral: behavioral conversation tests
  unit: fast unit tests with no LLM calls

tests/test_golden.py
import pytest

@pytest.mark.golden
class TestGoldenBehaviors:
    async def test_never_gives_medical_advice(self):
        # ... test implementation ...
        pass

    async def test_always_confirms_before_booking(self):
        # ... test implementation ...
        pass

You can then add a step to your workflow that publishes the JUnit XML results:

.github/workflows/agent-tests.yml (report step)
      - name: Publish test results
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: Agent Test Results
          path: test-results/results.xml
          reporter: java-junit

Test result visibility

The test reporter action creates a check run on your pull request with a summary of all test results. Failures show the test name, the assertion message, and a link to the full log. This makes it easy to diagnose failures without digging through raw CI output.
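If you only need a quick pass/fail summary in the job log rather than a full check run, the JUnit XML is easy to total up with the standard library. A sketch (pytest writes either a single testsuite root or a testsuites wrapper depending on configuration, so both are handled):

```python
# Summarize a JUnit XML report (e.g. test-results/results.xml) using only
# the stdlib, as a lightweight alternative to a third-party reporter action.
import xml.etree.ElementTree as ET

def summarize_junit(xml_text: str) -> dict[str, int]:
    """Total tests, failures, errors, and skips across all test suites."""
    root = ET.fromstring(xml_text)
    # Handle both a bare <testsuite> root and a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals
```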

Gating deployments

The final piece is preventing deployment when tests fail. Use GitHub branch protection rules combined with the workflow.

1. Configure branch protection

   In your repository settings, go to Branches and add a branch protection rule for main. Enable "Require status checks to pass before merging."

2. Select required checks

   Add "Unit Tests", "Behavioral & Golden Tests", and "Regression Check" as required status checks. Pull requests cannot merge until all three pass.

3. Add a deployment workflow

   Create a separate deployment workflow that only triggers on pushes to main. Since main is protected, only tested code reaches deployment.

.github/workflows/deploy.yml
name: Deploy Agent

on:
  push:
    branches: [main]

jobs:
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Deploy agent
        run: python scripts/deploy.py
        env:
          LIVEKIT_URL: ${{ secrets.LIVEKIT_URL }}
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}

Cost control

Behavioral tests and evaluations make LLM API calls, which cost money. Run the full regression suite only on pull requests to main, not on every push to feature branches. The if: github.event_name == 'pull_request' condition in the workflow above handles this.


What you learned

  • Structure your CI pipeline as a funnel: unit tests first, then behavioral tests, then regression checks
  • Use GitHub repository secrets to store API keys safely
  • Generate JUnit XML reports for visibility in the GitHub Actions UI
  • Gate deployments using branch protection rules that require test status checks to pass
  • Run expensive regression checks only on pull requests to main to control API costs

Next up

Your tests run automatically and deployments are gated. In the final chapter, you will learn how to monitor agent quality in production with live evaluation, A/B testing, and quality gates.
