
Metrics System

Overview

This document describes the metrics observability system for the Qubital backend. The system is designed to provide real-time visibility into LiveKit-powered virtual office usage, enabling both operational monitoring and business intelligence for a multi-tenant SaaS platform.

Purpose and Goals

The metrics system serves three primary objectives:

  1. Operational Visibility: Monitor system health, detect anomalies, and troubleshoot issues in real-time. This includes tracking webhook delivery reliability, recording success rates, and identifying potential data integrity problems.

  2. Business Intelligence: Understand how customers use the platform - how long they spend in rooms, which features (camera, microphone, screen share) are most adopted, and how usage patterns vary across organizations.

  3. Multi-Tenant Isolation: Every metric is tagged with an organization identifier (org_id), enabling per-customer dashboards and billing analytics while maintaining strict data isolation between tenants.

Design Principles

The system follows several key design principles:

  • Event-Driven Architecture: Metrics are derived from LiveKit webhook events, ensuring real-time accuracy without polling overhead.
  • Stable Identifiers: Organization IDs (from WorkOS) are used instead of names to ensure time series continuity when customers rename their organizations.
  • Lightweight Footprint: The backend runs on resource-constrained pods (0.5 CPU, 300MB RAM), so the metrics system is designed to be stateless where possible and use minimal memory.
  • Separation of Concerns: The backend only collects and exposes metrics; storage, querying, and visualization are delegated to Grafana Cloud.

Architecture

High-Level Components

The metrics pipeline consists of four distinct components, each with a specific responsibility:

1. LiveKit Cloud (Event Source)

LiveKit Cloud is the real-time video/audio infrastructure that powers Qubital's virtual office rooms. It acts as the authoritative source of truth for all room, participant, track, and recording lifecycle events. When any significant event occurs (a user joins a room, enables their camera, starts a recording, etc.), LiveKit sends an HTTP webhook to our backend within milliseconds.

2. Qubital Backend (Event Processor)

The Go backend receives webhook events, validates their authenticity using cryptographic signatures, and transforms them into Prometheus metrics. It enriches each event with organization context by looking up the room's owner in the database. The backend exposes a /metrics endpoint in Prometheus text format that can be scraped by any compatible collector.

3. Grafana Alloy (Metrics Collector)

Grafana Alloy runs as a sidecar container alongside the backend. Every 15 seconds, it scrapes the /metrics endpoint, adds infrastructure labels (environment, region), batches the samples, and ships them to Grafana Cloud using the Prometheus remote write protocol. This decouples the backend from Grafana Cloud's specifics and provides retry logic, backpressure handling, and efficient batching.

4. Grafana Cloud (Storage and Visualization)

Grafana Cloud provides managed Prometheus storage with 13-month retention, PromQL query capabilities, and the Grafana dashboard/alerting interface. This is where metrics are persisted, queried, and visualized. The backend and Alloy are stateless - if they restart, no historical data is lost because it's already in Grafana Cloud.

Data Flow Diagram

The following diagram illustrates how data flows through the system, from a LiveKit event to a Grafana dashboard:

flowchart TD
    subgraph External["External Services"]
        LK["LiveKit Cloud<br/>(Video Infrastructure)"]
    end

    subgraph Backend["Qubital Backend Pod"]
        WH["/webhook<br/>HTTP POST Handler"]
        EL["EventListener<br/>Signature Validation"]
        REC["WebhookRecorder<br/>Event Processing"]

        subgraph Lookup["Organization Resolution"]
            OL["OrgLookup Service"]
            DB[(PostgreSQL)]
        end

        subgraph MetricsPkg["Prometheus Integration"]
            LKM["LiveKitMetrics<br/>Gauge/Counter/Histogram"]
            REG["Prometheus Registry"]
            ME["/metrics<br/>HTTP GET Handler"]
        end
    end

    subgraph Collector["Grafana Alloy Pod"]
        SCRAPE["prometheus.scrape<br/>15s interval"]
        LABELS["Add external_labels<br/>environment, region, service"]
        RW["prometheus.remote_write<br/>Batched shipping"]
    end

    subgraph Cloud["Grafana Cloud"]
        PROM[("Prometheus<br/>13-month retention")]
        DASH["Dashboards"]
        ALERT["Alerting"]
    end

    %% Event flow
    LK -->|"Webhook Event<br/>(room_started, participant_joined, etc.)"| WH
    WH --> EL
    EL -->|"Valid event"| REC

    %% Processing
    REC -->|"room_id lookup"| OL
    OL <-->|"SQL query"| DB
    REC -->|"Update metrics"| LKM
    LKM --> REG
    REG --> ME

    %% Scraping
    SCRAPE -->|"GET /metrics"| ME
    SCRAPE --> LABELS
    LABELS --> RW
    RW -->|"remote_write"| PROM

    %% Visualization
    PROM --> DASH
    PROM --> ALERT

Component Details

LiveKit Cloud

Role in the System

LiveKit Cloud serves as the event source for the entire metrics system. It is a managed WebRTC infrastructure service that handles all the complexity of real-time video/audio - media routing, bandwidth estimation, codec negotiation, etc. From a metrics perspective, LiveKit is valuable because it provides authoritative lifecycle events for everything happening in virtual office rooms.

How Webhooks Work

When configured with a webhook URL, LiveKit sends HTTP POST requests to that endpoint whenever significant events occur. Each webhook request includes:

  • Authentication: A cryptographic signature in the Authorization header, computed using the shared API secret. The backend validates this signature to ensure the webhook genuinely originated from LiveKit.
  • Event Type: A string identifier like room_started, participant_joined, or track_published.
  • Event ID: A unique identifier for deduplication. LiveKit may retry failed webhooks, so this ID prevents double-counting.
  • Timestamp: The Unix timestamp when the event actually occurred in LiveKit (not when the webhook was sent).
  • Payload: Event-specific data such as room information, participant details, or track metadata.
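The handler works with the event object parsed by the LiveKit server SDK. The simplified structure below illustrates only the fields the metrics pipeline cares about; the field and type names are illustrative, not the SDK's actual definitions:

package example

// SimplifiedWebhookEvent mirrors the parts of a LiveKit webhook payload that
// the metrics pipeline uses. Names are illustrative; the real handler works
// with the event type returned by the LiveKit server SDK.
type SimplifiedWebhookEvent struct {
    Event     string // e.g. "room_started", "participant_joined"
    ID        string // unique event ID, used for deduplication
    CreatedAt int64  // Unix timestamp of when the event occurred in LiveKit

    // Event-specific payload; only the relevant section is populated.
    Room        *RoomPayload
    Participant *ParticipantPayload
}

// RoomPayload carries the room fields referenced in this document.
type RoomPayload struct {
    SID  string // LiveKit's room SID
    Name string // our room UUID, used for the organization lookup
}

// ParticipantPayload carries the participant fields referenced in this document.
type ParticipantPayload struct {
    SID      string // LiveKit's participant SID
    Identity string // WorkOS user ID
}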

Webhook Events Reference

| Event | When It Fires | Payload Includes |
|---|---|---|
| room_started | A room is created (first participant joins or explicit creation) | Room SID, room name (our UUID), metadata |
| room_finished | A room closes (empty timeout or explicit deletion) | Room SID, room name, duration |
| participant_joined | A user successfully connects to a room | Participant SID, identity (WorkOS user ID), room info |
| participant_left | A user disconnects (intentional or timeout) | Participant SID, room info |
| participant_connection_aborted | Connection failed during setup | Participant SID, error details |
| track_published | A user enables camera/microphone/screen share | Track SID, source type, media type, participant info |
| track_unpublished | A user disables a media track | Track SID, source type, participant info |
| egress_started | A recording job begins | Egress ID, room name, output configuration |
| egress_ended | A recording completes or fails | Egress ID, status, file results, duration |
| ingress_started | An external stream import begins | Ingress ID, input type, room name |
| ingress_ended | A stream import ends | Ingress ID, status, error (if any) |

Qubital Backend

Webhook Endpoint (POST /webhook)

Location: internal/features/metrics/api/event_listener.go

The webhook endpoint is the entry point for all LiveKit events. It is intentionally simple and fast - the goal is to accept the webhook, validate it, and return a 200 OK as quickly as possible. Any slow processing would cause LiveKit to retry, potentially creating duplicate events.

The handler performs three steps:

  1. Receive and Parse: The LiveKit SDK's ReceiveWebhookEvent function reads the request body and parses it into a structured event object.

  2. Validate Signature: The SDK verifies the Authorization header against the shared API secret. Invalid signatures are rejected with an error, protecting against spoofed webhooks.

  3. Enqueue for Processing: The event is passed to the WebhookRecorder via a non-blocking Handle(event) call. This immediately returns, allowing the HTTP response to be sent while processing happens asynchronously.

func (h *WebhookApiHandler) EventListener(c *gin.Context) {
    ctx := c.Request.Context() // request-scoped context for logging

    event, err := h.client.ReceiveWebhookEvent(c, h.livekitService)
    if err != nil {
        logger.ErrorAPICtx(ctx, "Webhook event listener failed", err, nil)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to receive webhook event"})
        return
    }

    // Non-blocking enqueue - returns immediately
    h.recorder.Handle(event)

    logger.InfoCtx(ctx, "Webhook event received", []slog.Attr{
        slog.String("event_type", event.Event),
        slog.String("event_id", event.Id),
    })
}

Webhook Recorder Service

Location: internal/features/metrics/service/recorder.go

The WebhookRecorder is a background service that processes webhook events asynchronously. It runs as a single goroutine with a buffered channel, ensuring events are processed in order while decoupling the HTTP handler from potentially slow operations like database lookups.

Processing Pipeline:

  1. Deduplication: Each event has a unique ID. The recorder maintains a map of recently seen IDs (10-minute TTL) and skips duplicates. This handles LiveKit's retry behavior gracefully.

  2. Health Metrics: Before any business logic, the recorder updates system health metrics - incrementing the event counter and recording the delivery lag (time between event creation and receipt).

  3. Organization Resolution: The recorder looks up the organization that owns the room. This is necessary because LiveKit doesn't know about our multi-tenant structure - it just sends room IDs. We query the database to map room_id → organization → org_id.

  4. Metric Updates: Based on the event type, the recorder updates the appropriate Prometheus metrics. Counters are incremented, gauges are adjusted, and histograms observe duration values.
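A minimal sketch of that loop is shown below. The trimmed-down types and helper names (resolveOrg, updateMetrics) are illustrative stand-ins for the actual code in recorder.go:

package example

import "time"

// event is a trimmed-down view of a webhook event (illustrative only).
type event struct {
    Type      string
    ID        string
    CreatedAt int64
    RoomName  string
}

// recorder holds the single-goroutine processing state (illustrative only).
type recorder struct {
    events        chan event           // buffered channel fed by the HTTP handler
    seen          map[string]time.Time // event ID -> first-seen time, for dedup
    dedupTTL      time.Duration
    resolveOrg    func(roomName string) (orgID string, err error) // stands in for OrgLookup
    updateMetrics func(ev event, orgID string)                    // stands in for Prometheus updates
}

// run drains the channel in order; because only this goroutine touches the
// maps and metrics, no mutexes are required.
func (r *recorder) run() {
    for ev := range r.events {
        // 1. Deduplication: skip IDs seen within the TTL window.
        if t, ok := r.seen[ev.ID]; ok && time.Since(t) < r.dedupTTL {
            continue
        }
        r.seen[ev.ID] = time.Now()

        // 2. Health metrics would be recorded here (event counter, delivery lag).
        _ = time.Since(time.Unix(ev.CreatedAt, 0)) // delivery lag

        // 3. Organization resolution: room UUID -> org_id via the database.
        orgID, err := r.resolveOrg(ev.RoomName)
        if err != nil {
            continue // counted as lk_org_lookup_failures_total in the real service
        }

        // 4. Metric updates: counters, gauges, and histograms keyed by org_id.
        r.updateMetrics(ev, orgID)
    }
}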

Ephemeral State for Duration Calculations

Why This State Exists

To compute duration-based metrics (e.g., lk_room_duration_seconds, lk_participant_lifetime_seconds), the recorder needs to know when things started. When a room_finished webhook arrives, the recorder must calculate finished_time - started_time. But the room_finished event doesn't include the start time - that information was only available in the earlier room_started event.

The solution is simple: when room_started arrives, the recorder stores the start timestamp in memory. When room_finished arrives later, it retrieves that timestamp, calculates the duration, and then deletes the entry.
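A minimal sketch of that pairing for rooms (names are illustrative; the real state also carries org_id and covers participants, egress, and ingress, as the table below shows):

package example

import "time"

// roomState is what the recorder remembers between room_started and room_finished.
type roomState struct {
    startedAt time.Time
    orgID     string
}

// rooms maps room SID -> start info; entries exist only while the room is live.
var rooms = map[string]roomState{}

// onRoomStarted stores the start timestamp so the duration can be computed later.
func onRoomStarted(roomSID, orgID string, startedAt time.Time) {
    rooms[roomSID] = roomState{startedAt: startedAt, orgID: orgID}
}

// onRoomFinished looks up the start time, computes the duration, and deletes the
// entry so the map stays bounded by the number of currently active rooms.
func onRoomFinished(roomSID string, finishedAt time.Time) (duration time.Duration, ok bool) {
    st, ok := rooms[roomSID]
    if !ok {
        return 0, false // start event was lost (e.g. pod restart); skip the histogram observation
    }
    delete(rooms, roomSID)
    return finishedAt.Sub(st.startedAt), true
}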

What's Stored

| State Map | Key | Stored Value | Purpose |
|---|---|---|---|
| rooms | room SID | Start timestamp + org_id | Calculate room duration on room_finished |
| participants | participant SID | Join timestamp + room SID + org_id | Calculate session duration on participant_left |
| tracks | room SID + source + type | Integer count | Track active media streams per org_id |
| egress | egress ID | Start timestamp + org_id + request_type + room_name | Calculate recording duration on egress_ended |
| ingress | ingress ID | Start timestamp + org_id + input_type + room_name | Calculate stream duration on ingress_ended |

Key Characteristics

  • Ephemeral: This state exists only in memory. If the pod restarts, it's lost. This is acceptable because we only lose in-flight duration calculations - the counters in Prometheus (which are persisted in Grafana Cloud) remain accurate.

  • Bounded Size: The state only contains currently active entities. Once a room finishes or a participant leaves, their entry is deleted. The maps cannot grow unbounded - they shrink back down as activity ends.

  • Not Critical for Counters: Counters (*_total metrics) don't need this state - they just increment on each event. Only histograms (durations) require start/end pairing.

  • Trade-off: We could eliminate this state by storing start times in the database or Redis, but that would add latency and complexity for minimal benefit. The current approach is simple and fits the resource constraints.

Memory Usage (Not a Concern)

A common question is whether these in-memory maps will cause problems as customer usage grows. The answer is no - the memory usage is negligible:

| Scenario | Concurrent Entities | Estimated Memory |
|---|---|---|
| Small scale | 50 participants, 10 rooms | ~15 KB |
| Medium scale | 500 participants, 100 rooms | ~100 KB |
| Large scale | 5,000 participants, 1,000 rooms | ~1 MB |
| Extreme scale | 50,000 participants, 10,000 rooms | ~10 MB |

Each map entry is roughly 100-150 bytes (Go string headers, timestamps, a few pointers). Even at extreme scales that far exceed expected usage, the memory footprint is tiny compared to the pod's 300MB allocation.

Why This Isn't a Scaling Concern

  1. Maps are bounded by concurrent activity, not total activity: If 1 million participants join and leave throughout the day, but only 500 are ever connected simultaneously, the maps only ever hold ~500 entries.

  2. Entries are automatically cleaned up: When a participant_left event arrives, the entry is immediately deleted. There's no accumulation over time.

  3. The data is trivially small: Timestamps and short strings. No large objects, no nested structures, no media data.


📌 Note for Future Scaling

The in-memory maps are not the bottleneck for scaling this system. If scaling becomes a concern, the limiting factor would be the single-goroutine event processing model (see Scaling Considerations below), not memory usage.


Scaling Considerations

This section separates two distinct concerns that are sometimes conflated: memory usage and throughput. They are independent issues with different thresholds and solutions.

Memory Usage: Not a Concern

As explained above, the in-memory state maps use negligible memory (~1-2MB even at thousands of concurrent participants). This is a non-issue for the foreseeable future and doesn't require any action.

Throughput: The Single-Goroutine Model

The WebhookRecorder processes events in a single goroutine, reading from a buffered channel. This design is intentionally simple:

[Webhook Handler] --enqueue--> [Buffered Channel (1000)] --dequeue--> [Single Processing Goroutine]

Why Single-Threaded?

  1. No synchronization needed: Metric updates and state map operations don't require mutexes because only one goroutine accesses them.
  2. Predictable ordering: Events for the same room/participant are processed in order.
  3. Simple debugging: No race conditions, no deadlocks, no concurrent access bugs.

When Would This Become a Problem?

The single goroutine can process roughly 500-2,000 events per second (depending on database latency for org lookups). Each event involves:

  • Deduplication check (fast, in-memory)
  • Metric updates (fast, in-memory)
  • Optional database lookup (slower, ~5-50ms)

To stress this system, you'd need sustained event rates that exceed processing capacity. Here's a rough estimate:

| Concurrent Participants | Estimated Events/Second | System Status |
|---|---|---|
| 100 | ~5-10 | Comfortable |
| 1,000 | ~50-100 | Comfortable |
| 5,000 | ~250-500 | Approaching limit |
| 10,000+ | ~500-1000+ | May need optimization |

Events per second is estimated based on typical activity: participants joining/leaving, enabling/disabling tracks. Heavy activity (frequent track toggles) increases this.

Signs of Throughput Issues

  • "Webhook recorder: queue full, dropping event" in logs
  • Growing lag between event timestamps and processing time
  • lk_webhook_delivery_lag_seconds p99 increasing over time

Future Solutions (If Needed)

If throughput becomes a concern at 10,000+ concurrent participants:

  1. Shard by room: Multiple processing goroutines, each handling a subset of rooms. Maintains per-room ordering while increasing parallelism.
  2. Batch database lookups: Cache org_id by room for a short TTL, reducing database round-trips.
  3. Horizontal scaling: Multiple backend pods behind a load balancer, each handling a portion of webhooks.

These optimizations are not currently implemented because they add complexity that isn't justified at current scale. The single-goroutine model handles Qubital's expected load with significant headroom.


Organization Lookup

Location: internal/features/metrics/service/org_lookup.go

Every metric needs an org_id label for multi-tenant filtering. The OrgLookup service resolves room IDs (which LiveKit knows) to organization IDs (which LiveKit doesn't know).

Resolution Process:

  1. Receive the room name from the webhook (this is the room's UUID in our database)
  2. Query: SELECT organizations.workos_org_id, organizations.name FROM rooms JOIN organizations ON rooms.organization_id = organizations.id WHERE rooms.id = ?
  3. Return OrgData{OrgID, OrgName} for metric labeling
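A sketch of that lookup using database/sql; the query mirrors the one above (with a PostgreSQL-style placeholder), while the function wiring is illustrative:

package example

import (
    "context"
    "database/sql"
)

// OrgData is the result used for metric labeling.
type OrgData struct {
    OrgID   string
    OrgName string
}

// lookupOrg resolves a room UUID (the LiveKit room name) to its organization.
func lookupOrg(ctx context.Context, db *sql.DB, roomID string) (OrgData, error) {
    const q = `
        SELECT organizations.workos_org_id, organizations.name
        FROM rooms
        JOIN organizations ON rooms.organization_id = organizations.id
        WHERE rooms.id = $1`

    var out OrgData
    if err := db.QueryRowContext(ctx, q, roomID).Scan(&out.OrgID, &out.OrgName); err != nil {
        return OrgData{}, err // surfaces as lk_org_lookup_failures_total in the recorder
    }
    return out, nil
}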

Why org_id Instead of org_name?

Organization names are mutable - customers can rename their organization at any time. If we used names as metric labels, renaming would create a new time series and break historical continuity. WorkOS organization IDs are immutable, ensuring metrics remain linked to the same organization forever.

To display friendly names in Grafana, we emit a separate lk_org_info{org_id, org_name} metric that maps IDs to names. Grafana can join this using PromQL's group_left.
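A sketch of how such an info-style metric can be emitted with the Prometheus client library; the actual definition lives in pkg/metrics/livekit.go and the names here are illustrative:

package example

import "github.com/prometheus/client_golang/prometheus"

// orgInfo is the metadata gauge joined in Grafana; its value is always 1 and
// only the labels (org_id, org_name) matter.
var orgInfo = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "lk_org_info",
        Help: "Metadata metric mapping org_id to org_name for Grafana joins",
    },
    []string{"org_id", "org_name"},
)

// recordOrgInfo (re)asserts the mapping whenever an organization is resolved,
// so a renamed organization gets a fresh series under the same org_id.
func recordOrgInfo(orgID, orgName string) {
    orgInfo.WithLabelValues(orgID, orgName).Set(1)
}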

Metrics Endpoint (GET /metrics)

Location: pkg/metrics/handler.go and cmd/main.go

The /metrics endpoint exposes all registered Prometheus metrics in the standard text exposition format. When Grafana Alloy (or any Prometheus-compatible scraper) sends a GET request, it receives output like:

# HELP lk_participants_active active participants
# TYPE lk_participants_active gauge
lk_participants_active{org_id="org_01ABC",project="qubital",region="eu"} 15
lk_participants_active{org_id="org_02XYZ",project="qubital",region="eu"} 8

# HELP lk_participant_sessions_total participant sessions started
# TYPE lk_participant_sessions_total counter
lk_participant_sessions_total{org_id="org_01ABC",project="qubital",region="eu"} 1247

The endpoint is stateless - it simply reads current values from the Prometheus registry and formats them. There's no computation or database access during scraping.
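A sketch of how a Prometheus registry can be exposed through Gin using the promhttp handler; the actual wiring lives in pkg/metrics/handler.go and cmd/main.go:

package example

import (
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// registerMetricsEndpoint exposes the registry's current values on GET /metrics
// in the Prometheus text exposition format.
func registerMetricsEndpoint(r *gin.Engine, reg *prometheus.Registry) {
    handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
    r.GET("/metrics", gin.WrapH(handler))
}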


Grafana Alloy

Role in the System

Grafana Alloy is a vendor-neutral observability collector that bridges the gap between the backend and Grafana Cloud. While we could have the backend push metrics directly to Grafana Cloud, using Alloy provides several benefits:

  • Decoupling: The backend doesn't need Grafana Cloud credentials or awareness of the remote write protocol.
  • Reliability: Alloy handles retries, backpressure, and temporary network failures gracefully.
  • Batching: Instead of sending each metric individually, Alloy batches samples for efficient transmission.
  • Label Enrichment: Alloy adds infrastructure-level labels (environment, region) that the backend shouldn't need to know about.

Scraping Configuration

Alloy scrapes the backend every 15 seconds. This interval balances freshness (metrics are at most 15 seconds old) against load (240 scrapes per hour is a light load for a small pod).

prometheus.scrape "backend" {
  job_name        = "qubital-backend"
  scrape_interval = "15s"
  scrape_timeout  = "10s"
  scheme          = "http"
  metrics_path    = "/metrics"

  targets = [
    { "__address__" = env("BACKEND_SCRAPE_TARGET") },  // e.g., "backend-web:3001"
  ]

  forward_to = [prometheus.remote_write.grafanacloud.receiver]
}

Remote Write Configuration

After scraping, Alloy forwards metrics to Grafana Cloud's Prometheus endpoint. It adds external labels that apply to all metrics from this Alloy instance:

prometheus.remote_write "grafanacloud" {
  external_labels = {
    environment = env("ENVIRONMENT"),  // "test", "dev", or "prod"
    region      = env("REGION"),        // "eu", "us", etc.
    service     = "qubital-backend",
  }

  endpoint {
    url = env("GRAFANA_PROM_REMOTE_WRITE_URL")

    basic_auth {
      username = env("GRAFANA_PROM_USERNAME")
      password = env("GRAFANA_API_KEY")
    }

    queue_config {
      max_samples_per_send = 5000
      batch_send_deadline  = "20s"
      retry_on_http_429    = true
    }
  }
}

Label Strategy

Labels are added at two levels, creating a clear separation of concerns:

| Label | Added By | Scope | Purpose |
|---|---|---|---|
| org_id | Backend | Per-metric | Multi-tenant customer identification |
| project | Backend | Per-metric | LiveKit project identifier |
| source, type | Backend | Track metrics only | Media track classification |
| request_type | Backend | Egress metrics only | Recording type classification |
| environment | Alloy | All metrics | Deployment environment (test/dev/prod) |
| region | Alloy | All metrics | Infrastructure region |
| service | Alloy | All metrics | Service identifier for filtering |

This separation means the backend doesn't need to know which environment it's running in - that's infrastructure configuration handled by Alloy.


Grafana Cloud

Prometheus Storage

Grafana Cloud provides managed Prometheus storage with:

  • 13-Month Retention: Metrics are stored for over a year, enabling long-term trend analysis and year-over-year comparisons.
  • High Availability: Data is replicated across multiple availability zones.
  • Automatic Scaling: Storage and query capacity scale automatically based on usage.
  • PromQL Interface: Full PromQL support for complex queries and aggregations.

Querying Metrics

Example PromQL queries for common use cases:

# Active participants for a specific organization in production
lk_participants_active{org_id="org_01ABC", environment="prod"}

# Participant session duration percentiles (p50, p90, p99)
histogram_quantile(0.50, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))

# Recording success rate over the last hour
sum(rate(lk_egress_ended_total{result="success", org_id="$org_id"}[1h]))
/
sum(rate(lk_egress_ended_total{org_id="$org_id"}[1h])) * 100

# Camera adoption rate (% of participants with camera enabled)
sum(lk_tracks_active{source="camera", org_id="$org_id"})
/
sum(lk_participants_active{org_id="$org_id"}) * 100

Multi-Tenant Dashboard Variables

To display organization names while filtering by stable IDs:

  1. Create org_name variable: label_values(lk_org_info, org_name) - This shows a dropdown of human-readable organization names.

  2. Create hidden org_id variable: label_values(lk_org_info{org_name="$org_name"}, org_id) - This automatically resolves the selected name to its ID.

  3. Use in queries: lk_participants_active{org_id="$org_id"} - Filtering happens on the stable ID, but users see friendly names.


Metrics Reference

Room Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_room_active | Gauge | project, region, org_id | +1 on room_started, -1 on room_finished | Number of currently active rooms |
| lk_rooms_started_total | Counter | project, region, org_id | room_started | Cumulative count of rooms created |
| lk_rooms_finished_total | Counter | project, region, org_id | room_finished | Cumulative count of rooms closed |
| lk_room_duration_seconds | Histogram | project, region, org_id | room_finished | Distribution of room durations |

Histogram Buckets: Exponential from 30s to ~3 hours (30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953 seconds)
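For illustration, the room metrics above could be declared with the Prometheus client library roughly as follows. This is a sketch, not the actual code in pkg/metrics/livekit.go; note that ExponentialBuckets(30, 1.8, 10) reproduces the bucket boundaries listed above:

package example

import "github.com/prometheus/client_golang/prometheus"

var baseLabels = []string{"project", "region", "org_id"}

var (
    // lk_room_active: +1 on room_started, -1 on room_finished.
    roomActive = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "lk_room_active", Help: "Number of currently active rooms"},
        baseLabels,
    )

    // lk_rooms_started_total: incremented on every room_started event.
    roomsStarted = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "lk_rooms_started_total", Help: "Cumulative count of rooms created"},
        baseLabels,
    )

    // lk_room_duration_seconds: observed on room_finished.
    roomDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "lk_room_duration_seconds",
            Help:    "Distribution of room durations",
            Buckets: prometheus.ExponentialBuckets(30, 1.8, 10), // 30s .. ~5953s
        },
        baseLabels,
    )
)

// register adds the collectors to the registry exposed on /metrics (whether the
// backend uses a dedicated registry or the default one is an assumption here).
func register(reg *prometheus.Registry) {
    reg.MustRegister(roomActive, roomsStarted, roomDuration)
}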

Participant Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_participants_active | Gauge | project, region, org_id | +1 on participant_joined, -1 on participant_left/connection_aborted | Currently connected participants |
| lk_participant_sessions_total | Counter | project, region, org_id | participant_joined | Cumulative participant sessions |
| lk_participant_lifetime_seconds | Histogram | project, region, org_id | participant_left | Distribution of session durations |

Histogram Buckets: Exponential from 15s to ~3 hours (15, 27, 48, 87, 157, 283, 509, 916, 1649, 2969 seconds)

Track Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_tracks_active | Gauge | project, region, org_id, source, type | +1 on track_published, -1 on track_unpublished | Currently published media tracks |
| lk_tracks_published_total | Counter | project, region, org_id, source, type | track_published | Cumulative track publishes |

Source Values: camera, microphone, screen_share, screen_share_audio, unknown

Type Values: audio, video, unknown

Egress (Recording) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_egress_active | Gauge | project, region, org_id, request_type | +1 on egress_started, -1 on egress_ended | Currently running recordings |
| lk_egress_started_total | Counter | project, region, org_id, request_type | egress_started | Cumulative recordings started |
| lk_egress_ended_total | Counter | project, region, org_id, request_type, result | egress_ended | Cumulative recordings ended |
| lk_egress_duration_seconds | Histogram | project, region, org_id, request_type, result | egress_ended | Distribution of recording durations |

request_type Values: room_composite, web, track, participant, unknown

result Values: success, failed

Histogram Buckets: 30s, 60s, 120s, 300s, 600s, 1200s, 1800s, 3600s, 7200s, 14400s

Ingress (Stream Import) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_ingress_active | Gauge | project, region, org_id, input_type | +1 on ingress_started, -1 on ingress_ended | Currently running stream imports |
| lk_ingress_started_total | Counter | project, region, org_id, input_type | ingress_started | Cumulative stream imports started |
| lk_ingress_ended_total | Counter | project, region, org_id, input_type, result | ingress_ended | Cumulative stream imports ended |
| lk_ingress_duration_seconds | Histogram | project, region, org_id, input_type, result | ingress_ended | Distribution of stream durations |

input_type Values: rtmp, whip, url, unknown

result Values: success, failed

Webhook Health Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_webhook_events_total | Counter | event | Total webhook events received, by event type |
| lk_webhook_delivery_lag_seconds | Histogram | event | Time between event creation in LiveKit and receipt by backend |
| lk_webhook_duplicates_total | Counter | event | Duplicate webhook events detected (same event ID seen twice) |

System Health Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_org_lookup_failures_total | Counter | reason | Failed room → organization lookups (indicates data integrity issues) |
| lk_org_info | Gauge | org_id, org_name | Metadata metric mapping org_id to org_name for Grafana joins |

reason Values: not_configured, room_not_found, no_organization, no_org_id


Complete Data Flow

Sequence Diagram: Webhook to Dashboard

sequenceDiagram
    participant LK as LiveKit Cloud
    participant BE as Backend /webhook
    participant REC as WebhookRecorder
    participant DB as PostgreSQL
    participant REG as Prometheus Registry
    participant ME as Backend /metrics
    participant AL as Grafana Alloy
    participant GC as Grafana Cloud

    Note over LK,GC: 1. Event Occurs (participant joins room)

    LK->>BE: POST /webhook<br/>{event: "participant_joined", id: "evt_123", ...}
    BE->>BE: Validate webhook signature
    BE->>REC: Handle(event) [non-blocking enqueue]
    BE-->>LK: 200 OK

    Note over REC,REG: 2. Async Processing (in background goroutine)

    REC->>REC: Check deduplication (event.id not seen?)
    REC->>REC: Record lk_webhook_events_total++
    REC->>REC: Record lk_webhook_delivery_lag_seconds
    REC->>DB: SELECT org_id FROM rooms JOIN organizations
    DB-->>REC: {org_id: "org_01ABC", org_name: "Acme Corp"}
    REC->>REC: Store participant start time (for duration calc later)
    REC->>REG: lk_participants_active{org_id="org_01ABC"}.Inc()
    REC->>REG: lk_participant_sessions_total{org_id="org_01ABC"}.Inc()

    Note over AL,GC: 3. Periodic Scrape (every 15 seconds)

    AL->>ME: GET /metrics
    ME->>REG: Collect all metric values
    REG-->>ME: Prometheus text format
    ME-->>AL: lk_participants_active{org_id="org_01ABC"} 15\n...
    AL->>AL: Add external_labels (environment, region, service)
    AL->>GC: remote_write (batched, compressed, with retries)
    GC->>GC: Store in Prometheus

    Note over GC: 4. Query & Visualize

    GC->>GC: Dashboard executes PromQL queries
    GC->>GC: Render panels with current and historical data

Configuration Reference

Backend Environment Variables

| Variable | Required | Description | Example |
|---|---|---|---|
| LIVEKIT_API_KEY | Yes | LiveKit API key for webhook signature validation | APIxxxxxxxx |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret for webhook signature validation | secret-key |
| LIVEKIT_PROJECT | No | Project name for metric labels | qubital |

Grafana Alloy Environment Variables

| Variable | Required | Description | Example |
|---|---|---|---|
| ENVIRONMENT | Yes | Environment name for external label | prod |
| REGION | Yes | Deployment region for external label | eu |
| BACKEND_SCRAPE_TARGET | Yes | Backend service DNS and port | backend-web:3001 |
| GRAFANA_PROM_REMOTE_WRITE_URL | Yes | Grafana Cloud Prometheus endpoint | https://prometheus-xxx.grafana.net/api/prom/push |
| GRAFANA_PROM_USERNAME | Yes | Grafana Cloud instance ID | 123456 |
| GRAFANA_API_KEY | Yes | Grafana Cloud API key with write permissions | glc_eyJ... |

Recorder Configuration

The WebhookRecorder has two tunable parameters:

type Config struct {
    QueueSize int           // Default: 1000 events
    DedupTTL  time.Duration // Default: 10 minutes
}

  • QueueSize: Maximum number of events waiting to be processed. If exceeded, new events are dropped (and logged). Increase if you see "queue full" warnings.
  • DedupTTL: How long to remember event IDs for deduplication. Must be longer than LiveKit's retry window (typically a few minutes).
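As an illustration only (the real defaulting logic in the recorder may differ), zero-valued fields could be filled with the documented defaults like this:

package example

import "time"

// Config mirrors the tunables shown above.
type Config struct {
    QueueSize int
    DedupTTL  time.Duration
}

// defaultConfig applies the documented defaults when fields are left unset.
func defaultConfig(c Config) Config {
    if c.QueueSize == 0 {
        c.QueueSize = 1000
    }
    if c.DedupTTL == 0 {
        c.DedupTTL = 10 * time.Minute
    }
    return c
}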

Operational Considerations

Gauge Drift

Gauge metrics (lk_*_active) track current counts by incrementing on "start" events and decrementing on "end" events. This approach can drift from reality if events are lost:

Causes of Drift:

  • Network issues between LiveKit and the backend
  • Backend pod restarts (in-flight events are lost)
  • Queue overflow (events dropped due to backpressure)

Detection:

  • Compare lk_rooms_started_total - lk_rooms_finished_total with lk_room_active - they should match
  • Monitor lk_webhook_duplicates_total - high rates may indicate delivery issues
  • Alert on lk_org_lookup_failures_total > 0 - indicates data integrity problems

Mitigation:

  • Gauges self-correct over time as rooms finish and participants leave
  • Counter metrics (*_total) remain accurate regardless of drift
  • For critical accuracy, use counter differences: increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])

Deduplication Behavior

LiveKit may retry webhook delivery if it doesn't receive a timely response. The recorder handles this by tracking event IDs for 10 minutes:

  • First occurrence: Processed normally
  • Subsequent occurrences (same ID within 10 minutes): Skipped, counted in lk_webhook_duplicates_total
  • After 10 minutes: ID is forgotten (old events won't be reprocessed anyway)
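A sketch of a TTL-bounded dedup set of this kind; the actual data structure and cleanup strategy in recorder.go may differ:

package example

import "time"

// dedup remembers event IDs for a fixed TTL so retried webhooks are skipped.
type dedup struct {
    ttl  time.Duration
    seen map[string]time.Time // event ID -> first-seen time
}

// check returns true if the ID was already seen inside the TTL window,
// and records it otherwise.
func (d *dedup) check(eventID string, now time.Time) bool {
    if t, ok := d.seen[eventID]; ok && now.Sub(t) < d.ttl {
        return true // duplicate: counted in lk_webhook_duplicates_total
    }
    d.seen[eventID] = now
    return false
}

// sweep drops expired IDs; run periodically so the map stays bounded by the
// number of events received in the last TTL window.
func (d *dedup) sweep(now time.Time) {
    for id, t := range d.seen {
        if now.Sub(t) >= d.ttl {
            delete(d.seen, id)
        }
    }
}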

Backpressure Handling

If webhook events arrive faster than they can be processed, the queue (1000 events) acts as a buffer. If it fills completely:

  • New events are dropped (not queued)
  • A warning is logged: "Webhook recorder: queue full, dropping event"
  • The event_type and event_id are included in the log for debugging

This is a safety valve to prevent unbounded memory growth. If you see these warnings, investigate why processing is slow (database latency? high event volume?) or increase the queue size.
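A sketch of that safety valve, assuming the handler enqueues onto a buffered channel via a non-blocking send (names are illustrative):

package example

import "log"

// event carries only the fields needed for the drop log (illustrative).
type event struct {
    Type string
    ID   string
}

// recorder owns the buffered channel drained by the processing goroutine.
type recorder struct {
    events chan event // e.g. make(chan event, 1000)
}

// handle enqueues without blocking the webhook response; if the buffer is full
// the event is dropped and logged, which is the "queue full" warning above.
func (r *recorder) handle(ev event) {
    select {
    case r.events <- ev:
        // enqueued; processed asynchronously, preserving order
    default:
        log.Printf("Webhook recorder: queue full, dropping event type=%s id=%s", ev.Type, ev.ID)
    }
}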


Alerting Recommendations

| Alert Name | PromQL Condition | Severity | Description |
|---|---|---|---|
| WebhookDeliveryLagHigh | histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le)) > 10 | Warning | 99th percentile delivery lag exceeds 10 seconds |
| OrgLookupFailures | increase(lk_org_lookup_failures_total[1h]) > 0 | Critical | Any org lookup failure indicates data integrity issues |
| RecordingFailures | increase(lk_egress_ended_total{result="failed"}[1h]) > 0 | Warning | Recording jobs are failing |
| NoWebhookEvents | rate(lk_webhook_events_total[5m]) == 0 | Critical | No webhook events received (during business hours) |
| HighDuplicateRate | rate(lk_webhook_duplicates_total[5m]) > 1 | Warning | More than 1 duplicate per second indicates delivery issues |

File Reference

| File Path | Purpose |
|---|---|
| pkg/metrics/livekit.go | Metric definitions (names, types, labels, help text) and Prometheus registration |
| pkg/metrics/handler.go | HTTP handler wrapper for the /metrics endpoint |
| internal/features/metrics/api/event_listener.go | Webhook HTTP handler - receives and validates LiveKit webhooks |
| internal/features/metrics/service/recorder.go | Background service - processes events and updates metrics |
| internal/features/metrics/service/models.go | Data structures (Config, OrgData, RoomState, etc.) and interfaces |
| internal/features/metrics/service/org_lookup.go | Database queries to resolve room IDs to organization IDs |
| internal/features/metrics/service/presence_manager.go | Single-session enforcement (kicks user from old room when joining new one) |
| cmd/main.go | Application entry point - registers /metrics and /webhook endpoints |
| internal/app/app.go | Dependency injection - creates and wires all metrics components |