CI/CD integration
Tests that you forget to run are tests that do not exist. In this chapter, you will wire your behavioral tests, golden tests, and regression checks into a GitHub Actions CI/CD pipeline. Every push will run the test suite, and deployments will be gated on passing results.
What you'll learn
- How to configure GitHub Actions to run voice AI agent tests
- How to manage API keys and secrets in CI
- How to generate and publish test reports
- How to gate deployments on test results
The CI pipeline structure
Your CI pipeline has three stages, each progressively more expensive:
1. Unit tests (fast, no LLM calls). Run tool function tests and utility tests. These take seconds and cost nothing.
2. Behavioral and golden tests (medium, requires an LLM). Run behavioral tests and golden tests that use judge(). These take minutes and require an LLM API key.
3. Regression evaluation (slow, requires an LLM). Run the full evaluation suite and compare against the baseline. This takes longer and is typically run only on pull requests to the main branch.
Structure your pipeline like a funnel. Cheap tests run first and fail fast. Expensive tests only run if the cheap ones pass. This saves time and API costs on broken commits.
The complete GitHub Actions workflow
```yaml
name: Agent Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: "3.12"
  NODE_VERSION: "20"

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: python -m pytest tests/unit/ -v --tb=short
      - name: Upload unit test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: unit-test-results
          path: test-results/

  behavioral-tests:
    name: Behavioral & Golden Tests
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run golden tests
        run: python -m pytest tests/test_golden.py -v --tb=long
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Run behavioral tests
        run: python -m pytest tests/ -v --tb=short --ignore=tests/unit/ --ignore=tests/test_golden.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Upload behavioral test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: behavioral-test-results
          path: test-results/

  regression-check:
    name: Regression Check
    needs: behavioral-tests
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: pip
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run regression check
        run: python scripts/check_regression.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation/reports/
```
API keys in CI
Never hardcode API keys in workflow files. Store them as GitHub repository secrets under Settings, then Secrets and variables, then Actions. The workflow references them with the secrets context.
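It also pays to fail fast when a secret is missing, rather than letting tests surface a confusing 401 from the API minutes into the job. Here is a small sketch of that idea; the helper name require_env is my own, not something this chapter or any library defines:

```python
import os
import sys

def require_env(name: str) -> str:
    """Fetch a required secret from the environment, exiting with a clear
    message if it is absent. Intended to be called at the top of any test
    or evaluation script that makes LLM API calls."""
    value = os.environ.get(name, "")
    if not value:
        sys.exit(f"Missing required environment variable: {name}")
    return value

# At the top of an LLM-backed script:
# api_key = require_env("OPENAI_API_KEY")
```

A guard like this turns a misconfigured secret into a one-line failure at the start of the CI step instead of a stack trace buried in test output.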
TypeScript workflow variant
If your agent is built in TypeScript, the workflow uses Node.js tooling instead.
```yaml
name: Agent Tests (TypeScript)

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
      - run: npm ci
      - name: Run unit tests
        run: npx vitest run tests/unit/ --reporter=verbose

  behavioral-tests:
    name: Behavioral & Golden Tests
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
      - run: npm ci
      - name: Run golden tests
        run: npx vitest run tests/golden.test.ts --reporter=verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Run behavioral tests
        run: npx vitest run tests/ --exclude=tests/unit/ --exclude=tests/golden.test.ts --reporter=verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  regression-check:
    name: Regression Check
    needs: behavioral-tests
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
      - run: npm ci
      - name: Run regression check
        run: npx tsx scripts/check-regression.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
Generating test reports
Add structured test output so you can review results in the GitHub Actions UI.
```ini
[pytest]
testpaths = tests
addopts = --junitxml=test-results/results.xml -v
markers =
    golden: critical behavior tests that must never fail
    behavioral: behavioral conversation tests
    unit: fast unit tests with no LLM calls
```
Tag your tests with the matching markers so they can be selected by category:

```python
import pytest

@pytest.mark.golden
class TestGoldenBehaviors:
    async def test_never_gives_medical_advice(self):
        # ... test implementation ...
        pass

    @pytest.mark.golden
    async def test_always_confirms_before_booking(self):
        # ... test implementation ...
        pass
```
You can then add a step to your workflow that publishes the JUnit XML results:
```yaml
- name: Publish test results
  if: always()
  uses: dorny/test-reporter@v1
  with:
    name: Agent Test Results
    path: test-results/results.xml
    reporter: java-junit
```
Test result visibility
The test reporter action creates a check run on your pull request with a summary of all test results. Failures show the test name, the assertion message, and a link to the full log. This makes it easy to diagnose failures without digging through raw CI output.
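The same JUnit XML can also be triaged locally before you push. As a rough sketch (the function name and the shape of the tally are mine, not from this chapter), totaling results across all suites takes only the standard library:

```python
import xml.etree.ElementTree as ET

def summarize_junit(junit_xml: str) -> dict:
    """Total tests/failures/errors across all <testsuite> elements.

    Works on the XML produced by pytest's --junitxml flag, whose root is
    a <testsuites> wrapper (or a bare <testsuite> on older versions)."""
    root = ET.fromstring(junit_xml)
    totals = {"tests": 0, "failures": 0, "errors": 0}
    # iter() visits the root itself too, so both layouts are handled.
    for suite in root.iter("testsuite"):
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals
```

Point it at the contents of test-results/results.xml for a quick pass/fail tally without opening the CI logs.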
Gating deployments
The final piece is preventing deployment when tests fail. Use GitHub branch protection rules combined with the workflow.
Configure branch protection
In your repository settings, go to Branches and add a branch protection rule for main. Enable "Require status checks to pass before merging."
Select required checks
Add "Unit Tests", "Behavioral & Golden Tests", and "Regression Check" as required status checks. Pull requests cannot merge until all three pass.
Add a deployment workflow
Create a separate deployment workflow that only triggers on pushes to main. Since main is protected, only tested code reaches deployment.
```yaml
name: Deploy Agent

on:
  push:
    branches: [main]

jobs:
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Deploy agent
        run: python scripts/deploy.py
        env:
          LIVEKIT_URL: ${{ secrets.LIVEKIT_URL }}
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}
```
Cost control
Behavioral tests and evaluations make LLM API calls, which cost money. Run the full regression suite only on pull requests to main, not on every push to feature branches. The if: github.event_name == 'pull_request' condition in the workflow above handles this.
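The chapter does not show scripts/check_regression.py itself. One plausible shape, under my own assumptions (the baseline and latest report are JSON maps of scenario name to score, and a fixed tolerance decides what counts as a regression; none of this is prescribed by the chapter):

```python
import json
import sys
from pathlib import Path

TOLERANCE = 0.05  # allowed score drop before we call it a regression

def find_regressions(baseline: dict, current: dict,
                     tolerance: float = TOLERANCE) -> list:
    """Names of scenarios whose score dropped by more than `tolerance`.

    A scenario missing from the current run counts as a regression."""
    return [
        name
        for name, base_score in baseline.items()
        if current.get(name, 0.0) < base_score - tolerance
    ]

def main(baseline_path: str = "evaluation/baseline.json",
         report_path: str = "evaluation/reports/latest.json") -> None:
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(report_path).read_text())
    regressions = find_regressions(baseline, current)
    if regressions:
        print("Regressions detected:", ", ".join(sorted(regressions)))
        sys.exit(1)  # non-zero exit fails the GitHub Actions step
    print("No regressions against baseline.")

# In the real script: if __name__ == "__main__": main()
```

The key design point is the non-zero exit code: GitHub Actions marks the step failed on any non-zero exit, which is what gates the merge.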
Test your knowledge
Why is the CI pipeline structured as a funnel where unit tests run before behavioral tests?
What you learned
- Structure your CI pipeline as a funnel: unit tests first, then behavioral tests, then regression checks
- Use GitHub repository secrets to store API keys safely
- Generate JUnit XML reports for visibility in the GitHub Actions UI
- Gate deployments using branch protection rules that require test status checks to pass
- Run expensive regression checks only on pull requests to main to control API costs
Next up
Your tests run automatically and deployments are gated. In the final chapter, you will learn how to monitor agent quality in production with live evaluation, A/B testing, and quality gates.