
CI/CD Testing Strategy

This document describes the testing strategy added to the Qubital backend CI/CD pipeline: what was implemented, why each decision was made, and how the pieces fit together.


Overview

Before this work, the CI pipeline only built Docker images and ran Trivy security scans. No Go tests, no linting, no compilation checks ran in CI. A broken constructor, a nil pointer in dependency wiring, or a new linter violation could reach production undetected.

The strategy introduces three test layers that run before the Docker build, gated by needs: test. If tests fail, the build never starts — no broken code gets packaged into an image.

┌───────────────────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│         test (gate)           │ --> │   build-push    │ --> │   trivy scan    │
│                               │     │    (Docker)     │     │   (security)    │
│   ├── Unit tests (go test)    │     └─────────────────┘     └─────────────────┘
│   ├── Smoke test (container)  │
│   ├── Race detection (-race)  │
│   └── Linter (golangci-lint)  │
└───────────────────────────────┘

Test Layers

1. Unit Tests (go test ./...)

Standard Go tests across all packages. These are fast (no external dependencies) and verify individual function behavior. Tests that require a database use testing.Short() to skip when -short is passed.

Command: go test -short -count=1 ./... (dev) or go test -count=1 -race ./... (test/prod)

2. Boot Smoke Test (//go:build smoke)

A single test that calls SetupDependencies() against a real PostgreSQL testcontainer and verifies every dependency field is initialized. This catches DI wiring regressions — the most common class of startup failures in this codebase.

Command: go test -tags smoke -count=1 -timeout 2m ./internal/app/...

Separated by build tag because it requires Docker (for testcontainers) and takes ~6 seconds. It must never run accidentally during fast dev loops.

3. Linter (golangci-lint)

Static analysis with 30+ linters covering security, correctness, complexity, and code quality. Runs the same configuration (.github/golangci.yml) locally and in CI.

Command: golangci-lint run --config .github/golangci.yml

4. Race Detection (-race)

A data race occurs when two goroutines access the same variable concurrently and at least one access is a write, with no synchronization between them. This causes undefined behavior: corrupted data, crashes, or bugs that only appear under specific timing.

Compiling with -race instruments every memory access in the binary. At runtime, it tracks which goroutine touched which address and when. If two unsynchronized accesses from different goroutines conflict, it reports the race with full stack traces. The cost is 5-10x memory and 2-20x execution time (per the Go race detector documentation), and it only catches races in code paths actually exercised during the run.
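A minimal illustration (not from this codebase): the shared counter below is safe only because of sync/atomic. Replacing the atomic add with a plain counter++ would be exactly the kind of unsynchronized read-modify-write that -race reports, with stack traces for both goroutines involved.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// count increments a shared counter from n goroutines. The atomic add is
// what makes it safe: a plain counter++ here would be an unsynchronized
// read-modify-write that `go test -race` flags as a data race.
func count(n int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&counter, 1) // synchronized: no data race
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(count(1000)) // deterministically 1000
}
```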

In this codebase, it targets concurrent state in the webhook recorder, presence manager, and similar goroutine-heavy code.

Flag: -race on go test (test and prod branches only — too slow for dev)


Branch Strategy: Dev vs Test vs Prod

Each branch has a different risk tolerance and corresponding test depth. The principle: the closer to production, the more thorough the checks.

Dev Branch (build-push-dev.yaml)

Trigger: Automatic on push to dev

Philosophy: Fast feedback. Developers push frequently to dev. The test gate must not add significant wait time to the build. We run only unit tests with -short to skip anything that needs a database or external service. No smoke test, no race detection, no linter — those are deferred to the test branch.

test:
  timeout-minutes: 10
  steps:
    - go test -short -count=1 ./...

Rationale for -short: The -short flag causes testing.Short() to return true. Tests that depend on databases (testcontainers) check this and skip themselves. This keeps the dev test job fast (seconds, not minutes) while still catching compilation errors, logic bugs in pure functions, and basic regressions.

What it catches:

  • Compilation errors
  • Unit test failures
  • Panics in pure logic

What it intentionally skips:

  • Database-dependent tests (testcontainers)
  • Boot smoke test (needs Docker)
  • Race detection (doubles execution time)
  • Linter violations (caught on test branch)

Test Branch (build-push-test.yaml)

Trigger: Automatic on push to test

Philosophy: Comprehensive validation. Code on the test branch is a release candidate. Run everything: all tests without -short (including database tests with testcontainers), smoke test, race detection, and the full linter suite. If anything fails, the Docker image is not built.

test:
  timeout-minutes: 15
  steps:
    - go test -count=1 -race ./...                           # all tests + race detection
    - go test -tags smoke -count=1 -timeout 2m ./internal/app/...  # smoke test
    - golangci-lint (v1.64, --config .github/golangci.yml)   # linter

Why race detection here but not dev: The -race flag roughly doubles execution time for this suite (the Go docs cite 2-20x in general) and raises memory usage several-fold. On dev, where pushes are frequent and speed matters, this cost isn't justified. On test, where we're validating a release candidate, thoroughness trumps speed.

Why the linter here but not dev: Linter violations should not block rapid iteration on dev. They should block promotion to test/prod. Developers are expected to run the linter locally, but CI enforces it on the test branch as a safety net.

Prod Branch (build-push-prod.yaml)

Trigger: Manual (workflow_dispatch with semver version input)

Philosophy: Defense in depth. The test branch should have caught everything, but prod runs the same full suite again. Code may have changed between the test branch validation and the manual prod release (cherry-picks, hotfixes). The redundant check costs ~2 minutes and provides certainty.

test:
  timeout-minutes: 15
  steps:
    - go test -count=1 -race ./...                           # all tests + race detection
    - go test -tags smoke -count=1 -timeout 2m ./internal/app/...  # smoke test
    - golangci-lint (v1.64, --config .github/golangci.yml)   # linter

The test job is identical to the test branch. The build-push-scan job has needs: test — it only runs after tests pass. The existing Trivy security scan (blocking on HIGH/CRITICAL) still runs after the Docker build.

Summary Table

Check                            Dev            Test           Prod
Unit tests                       -short         All            All
Database tests (testcontainers)  Skipped        Yes            Yes
Boot smoke test                  No             Yes            Yes
Race detection (-race)           No             Yes            Yes
Linter (golangci-lint)           No             Yes            Yes
Docker build                     After tests    After tests    After tests
Trivy scan                       Non-blocking   Non-blocking   Blocking (HIGH/CRITICAL)
SBOM / Provenance                No             No             Yes

SBOM / Provenance (prod only): The prod Docker build sets sbom: true and provenance: mode=max on docker/build-push-action. SBOM (Software Bill of Materials) is a machine-readable inventory of every dependency inside the image — useful for vulnerability tracking and answering "are we affected by CVE X?". Provenance is a signed attestation of how the image was built (which workflow, commit, runner) — proof that the image hasn't been tampered with. Both are disabled on dev/test to keep builds lighter.

📌 How SBOM & Provenance work in practice

Where do they live? They're not files on disk or printed reports. They're automatically attached as metadata to the Docker image in GHCR when the prod workflow pushes. You can inspect them anytime:

docker buildx imagetools inspect ghcr.io/<owner>/qubital-backend-prod:<version> --format '{{json .SBOM}}'
docker buildx imagetools inspect ghcr.io/<owner>/qubital-backend-prod:<version> --format '{{json .Provenance}}'

Do I need to do anything right now? No. The data is generated and stored automatically on every prod release. No setup needed.

When does it become useful?

  • Vulnerability response: A new CVE is announced. Instead of rebuilding or scanning every image, you query the SBOM to instantly check whether the affected package is in your prod image. Even better: connect a scanning tool (Snyk, Grype, Dependabot) to GHCR and it reads SBOMs automatically, alerting you when a dependency gets a new CVE.
  • Deployment security: Set up admission policies in your deployment target (e.g., Kubernetes) to reject images that lack valid provenance. This prevents rogue images — even if someone pushes one to GHCR, it can't be deployed.
  • Compliance: When enterprise customers ask about SOC 2, SLSA, or supply chain security, the provenance and SBOM are ready-made evidence. No extra work when the audit comes.

What's the next step? The commented-out cosign signing block in build-push-prod.yaml is the natural follow-up. It cryptographically signs the image so provenance can be verified without trusting GHCR itself. Worth enabling when you have a deployment pipeline that checks signatures.

Cost: Negligible — a few extra KB of metadata per prod image. Zero impact on dev/test builds.


Boot Smoke Test

Purpose

Verify that SetupDependencies() can wire all ~60 dependency fields end-to-end against a real PostgreSQL instance without panics, nil pointers, or initialization errors. This is a wiring test, not a behavior test.

What it catches

  • Missing field assignments in SetupDependencies() (forgot to wire a new repository)
  • Constructor signature changes that break initialization (added a required parameter)
  • New required environment variables that aren't set
  • Import cycles between packages
  • Close() panics on nil fields during teardown

What it does NOT cover (by design)

  • Migration SQL correctness (covered by internal/database/migrate_test.go)
  • Route registration (deterministic from non-nil deps)
  • Actual service behavior (unit tests per feature)
  • Network connectivity to external services (integration/E2E tests)

Implementation

Files:

  • internal/app/boot_smoke_test.go — test function + non-nil assertion
  • internal/app/boot_smoke_helpers_test.go — container setup, env vars, schema seeding

Build tag: //go:build smoke — excluded from regular go test ./..., must be explicitly enabled with -tags smoke.

File placement rationale: The test lives in internal/app/ (same package as SetupDependencies()), following Go convention: tests next to the code they test. A test/smoke/ location was considered but rejected because:

  • It tests a single function in a single package, not a cross-cutting E2E flow
  • Go convention places tests alongside the code
  • The project's existing pattern matches (database testcontainer tests live in internal/database/, not test/)
  • Discoverability is better — someone modifying app.go sees the smoke test immediately

Test Flow

TestBootSmoke
│
├── 1. Start postgres:16-alpine testcontainer (~3s)
│
├── 2. Pre-seed schema_migrations with latest version
│      (causes RunMigrations() to see ErrNoChange and skip)
│
├── 3. Set all ~30 required env vars via t.Setenv()
│      (POSTGRES_URL points to testcontainer, rest are dummies)
│
├── 4. Call SetupDependencies()
│      (connects to real PG, initializes all services)
│
├── 5. Assert ALL nillable fields are non-nil (reflection)
│
└── 6. Call deps.Close() (cleanup, verifies teardown path)

Key Design Decisions

Skipping real migrations

The migrations reference Supabase-specific extensions (pg_cron, pg_net, realtime schema) that don't exist in vanilla PostgreSQL. Instead of replicating the extension-stripping logic from migrate_test.go, we pre-seed the schema_migrations table with the latest migration version. When RunMigrations() calls m.Up() via golang-migrate, it reads the current version, finds no pending migrations, and returns ErrNoChange — which is handled gracefully.

The latest version is read dynamically from the embedded database.MigrationsFS filesystem, so the test stays in sync automatically when new migrations are added. No manual version bumps needed.

This separation of concerns is intentional — each test has one job:

  • internal/database/migrate_test.go answers: "Are the SQL migrations correct?" — It actually runs every migration file against a test database (with stubs for Supabase-specific extensions). If a migration has a syntax error, a bad column type, or breaks an existing table, this test catches it.

  • The boot smoke test answers: "Does the application start up correctly?" — It doesn't care whether migrations are valid SQL. It only checks that SetupDependencies() can wire all services together without crashing. That's why it skips migrations entirely (by faking the version number) — running them again here would add complexity and execution time for something that's already tested elsewhere.

In short: the migration test owns migration correctness, the smoke test owns startup wiring. Neither duplicates the other's job.

Non-nil field verification via reflection

The assertAllFieldsNonNil function uses reflect to walk every exported field of AppDependencies. For each field of kind Ptr, Interface, Func, Map, Slice, or Chan, it asserts non-nil. This is self-maintaining: when someone adds a new field to AppDependencies, the test automatically covers it without manual updates.

All nil fields are reported in a single failure message (not fail-fast) for easier debugging.
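A condensed sketch of the technique (the struct and field names below are invented; the real AppDependencies has ~60 fields):

```go
package main

import (
	"fmt"
	"reflect"
)

// Deps is a stand-in for AppDependencies (field names invented).
type Deps struct {
	DB     *struct{}
	Cache  map[string]string
	Logger interface{ Print(string) }
}

// nilFields walks every exported field and collects the names of nillable
// kinds (Ptr, Interface, Func, Map, Slice, Chan) that are still nil.
// Because it iterates via reflection, a newly added field is covered
// automatically with no manual update.
func nilFields(v any) []string {
	var nils []string
	rv := reflect.ValueOf(v).Elem()
	rt := rv.Type()
	for i := 0; i < rv.NumField(); i++ {
		switch rv.Field(i).Kind() {
		case reflect.Ptr, reflect.Interface, reflect.Func,
			reflect.Map, reflect.Slice, reflect.Chan:
			if rv.Field(i).IsNil() {
				nils = append(nils, rt.Field(i).Name)
			}
		}
	}
	return nils
}

func main() {
	d := &Deps{DB: &struct{}{}} // Cache and Logger forgotten on purpose
	fmt.Println(nilFields(d))   // all nil fields reported together, not fail-fast
}
```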

Environment variable handling

All ~30 required env vars are set via t.Setenv(), which:

  • Automatically restores original values when the test ends
  • Marks the test as not-parallel-safe (panics if t.Parallel() is called)
  • Is idiomatic Go for test env manipulation

Special dummy values:

  • COOKIE_HASH_KEY / COOKIE_BLOCK_KEY: Valid hex-encoded 32-byte strings (gorilla/securecookie requires valid hex for AES-256)
  • JWT_PRIVATE_KEY: A real RSA-2048 key generated at test time (lestrrat-go/jwx ParseKey requires valid PEM)
  • GIN_MODE=debug: Avoids secure cookie requirements that would need HTTPS
  • WORKOS_CLIENT_ID=client_smoke: The JWKS fetch hits api.workos.com/sso/jwks/client_smoke, gets a non-parseable response, logs a warning, and continues — this is the expected graceful degradation path
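Generating such throwaway values takes a few lines of stdlib code. A hedged sketch (the helper names are illustrative; in the real test the results would be passed to t.Setenv):

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/hex"
	"encoding/pem"
	"fmt"
)

// newHexKey returns a hex-encoded 32-byte key, the shape gorilla/securecookie
// expects for AES-256 (COOKIE_HASH_KEY / COOKIE_BLOCK_KEY).
func newHexKey() string {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	return hex.EncodeToString(key)
}

// newRSAPEM returns a throwaway RSA-2048 private key in PEM form, valid
// input for a JWT_PRIVATE_KEY-style variable.
func newRSAPEM() string {
	priv, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	block := &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(priv)}
	return string(pem.EncodeToMemory(block))
}

func main() {
	fmt.Println(len(newHexKey())) // 64 hex characters = 32 bytes
	fmt.Println(newRSAPEM()[:31]) // -----BEGIN RSA PRIVATE KEY-----
}
```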

Cleanup ordering (LIFO)

Both cleanup functions use t.Cleanup() (not defer) to ensure correct LIFO ordering:

t.Cleanup(cleanup)          // registered first → runs LAST (container termination)
t.Cleanup(func() {
    deps.Close(closeCtx)    // registered second → runs FIRST (DB disconnect)
})

This ensures the database connection is closed cleanly before the PostgreSQL container is terminated.
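The LIFO semantics can be modeled with a plain slice of functions. This is illustrative only; t.Cleanup implements the same reverse-registration order internally:

```go
package main

import "fmt"

// runCleanups executes registered functions in reverse registration order,
// the same LIFO semantics t.Cleanup guarantees.
func runCleanups(cleanups []func()) {
	for i := len(cleanups) - 1; i >= 0; i-- {
		cleanups[i]()
	}
}

func main() {
	runCleanups([]func(){
		func() { fmt.Println("terminate container") }, // registered first, runs last
		func() { fmt.Println("close DB connection") }, // registered second, runs first
	})
}
```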

Network calls during SetupDependencies()

Only two constructors make network calls:

Constructor                  What it does                   Smoke test impact
database.New()               Connects to PostgreSQL         Handled by testcontainer
NewWorkOSTokenValidator()    Fetches JWKS from WorkOS API   Fails gracefully with the fake client ID (logs a warning, continues); a 15s timeout prevents hangs

All other constructors (LiveKit, Google, R2, Cloudflare, WebSocket) are lazy — meaning they don't connect to anything when created. They just save the configuration (API keys, URLs, credentials) into a struct and return it. The actual network call happens later, only when the service is first used (e.g., when an API request triggers a LiveKit room creation or an R2 file upload). This is why the smoke test can initialize these services with fake credentials and not fail — no connection is attempted during startup, so there's nothing to fail.
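A generic sketch of this lazy pattern (all types and names below are invented for illustration; the real LiveKit/R2 clients differ in detail):

```go
package main

import "fmt"

// fakeConn stands in for a real network connection.
type fakeConn struct{ endpoint string }

// StorageClient models the lazy pattern: the constructor only stores
// configuration, so creating it with fake credentials cannot fail.
type StorageClient struct {
	endpoint, apiKey string
	conn             *fakeConn // nil until first use
}

func NewStorageClient(endpoint, apiKey string) *StorageClient {
	// No network I/O here: nothing to fail during startup.
	return &StorageClient{endpoint: endpoint, apiKey: apiKey}
}

func (c *StorageClient) Upload(name string) error {
	if c.conn == nil {
		// First real use: this is where a genuine client would dial out,
		// and where bad credentials would actually surface.
		c.conn = &fakeConn{endpoint: c.endpoint}
	}
	fmt.Printf("uploading %s via %s\n", name, c.conn.endpoint)
	return nil
}

func main() {
	c := NewStorageClient("https://example.invalid", "fake-key")
	fmt.Println(c.conn == nil) // true: the constructor made no connection
	_ = c.Upload("report.pdf") // first use triggers the (fake) connection
	fmt.Println(c.conn == nil) // false
}
```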


Linter Configuration

New Linters Added

Three linters were added to .github/golangci.yml:

Linter        Category             What it catches
errorlint     Bugs & Correctness   Using == instead of errors.Is() for error comparison. Critical for wrapped errors.
nilerr        Bugs & Correctness   Returning nil after checking an error (e.g., if err != nil { return nil }). Almost always a bug.
contextcheck  Security             Using context.Background() when a context.Context parameter is available. Ensures trace correlation IDs propagate through the call chain.

Running Tests Locally

# Unit tests only (fast, no Docker needed)
go test -short -count=1 ./...

# All tests including database tests (needs Docker for testcontainers)
go test -count=1 ./...

# All tests with race detection
go test -count=1 -race ./...

# Smoke test only (needs Docker)
go test -tags smoke -count=1 -timeout 2m ./internal/app/...

# Smoke test with verbose output
go test -tags smoke -count=1 -timeout 2m -v ./internal/app/...

# Linter
golangci-lint run --config .github/golangci.yml

# Full CI-equivalent check (what test/prod branches run)
go test -count=1 -race ./... && \
go test -tags smoke -count=1 -timeout 2m ./internal/app/... && \
golangci-lint run --config .github/golangci.yml