
Metrics System

Overview

This document describes the metrics observability system for the Qubital backend. The system is designed to provide real-time visibility into LiveKit-powered virtual office usage, enabling both operational monitoring and business intelligence for a multi-tenant SaaS platform.

Purpose and Goals

The metrics system serves three primary objectives:

  1. Operational Visibility: Monitor system health, detect anomalies, and troubleshoot issues in real-time. This includes tracking webhook delivery reliability, recording success rates, and identifying potential data integrity problems.

  2. Business Intelligence: Understand how customers use the platform - how long they spend in rooms, which features (camera, microphone, screen share) are most adopted, and how usage patterns vary across organizations.

  3. Multi-Tenant Isolation: Every metric is tagged with an organization identifier (org_id), enabling per-customer dashboards and billing analytics while maintaining strict data isolation between tenants.

Design Principles

The system follows several key design principles:

  • Event-Driven Architecture: Metrics are derived from LiveKit webhook events, ensuring real-time accuracy without polling overhead.
  • Stable Identifiers: Organization IDs (from WorkOS) are used instead of names to ensure time series continuity when customers rename their organizations.
  • Lightweight Footprint: The backend runs on resource-constrained pods (0.5 CPU, 300MB RAM), so the metrics system is designed to be stateless where possible and use minimal memory.
  • Separation of Concerns: The backend only collects and exposes metrics; storage, querying, and visualization are delegated to Grafana Cloud.
  • Hybrid State Model: Most metrics are derived purely from ephemeral in-memory state (webhook events). However, certain business KPIs require durable state in a relational database: WAU/MAU unique user counts (Prometheus cannot count distinct identities over sliding time windows), egress completion stats (in-memory counters reset on pod restart, losing billing-critical data), and connected participant counts (in-memory gauges drift if events are lost and reset to zero on restart).

Architecture

High-Level Components

The metrics pipeline consists of four distinct components, each with a specific responsibility:

1. LiveKit Cloud (Event Source)

LiveKit Cloud is the real-time video/audio infrastructure that powers Qubital's virtual office rooms. It acts as the authoritative source of truth for all room, participant, track, and recording lifecycle events. When any significant event occurs (a user joins a room, enables their camera, starts a recording, etc.), LiveKit sends an HTTP webhook to our backend within milliseconds.

2. Qubital Backend (Event Processor)

The Go backend receives webhook events, validates their authenticity using cryptographic signatures, and transforms them into Prometheus metrics. It enriches each event with organization context by looking up the room's owner in the database. The backend exposes a /metrics endpoint in Prometheus text format that can be scraped by any compatible collector.

3. Grafana Alloy (Metrics Collector)

Grafana Alloy runs as a sidecar container alongside the backend. Every 15 seconds, it scrapes the /metrics endpoint, adds infrastructure labels (environment, region), batches the samples, and ships them to Grafana Cloud using the Prometheus remote write protocol. This decouples the backend from Grafana Cloud's specifics and provides retry logic, backpressure handling, and efficient batching.

4. Grafana Cloud (Storage and Visualization)

Grafana Cloud provides managed Prometheus storage with 13-month retention, PromQL query capabilities, and the Grafana dashboard/alerting interface. This is where metrics are persisted, queried, and visualized. The backend and Alloy are stateless - if they restart, no historical data is lost because it's already in Grafana Cloud.

Data Flow Diagram

The following diagram illustrates how data flows through the system, from a LiveKit event to a Grafana dashboard:

flowchart TD
    subgraph External["External Services"]
        LK["LiveKit Cloud<br/>(Video Infrastructure)"]
    end

    subgraph Backend["Qubital Backend Pod"]
        WH["/webhook<br/>HTTP POST Handler"]
        EL["EventListener<br/>Signature Validation"]
        REC["WebhookRecorder<br/>Event Processing"]

        subgraph Lookup["Organization Resolution"]
            OL["OrgLookup Service"]
            DB[(PostgreSQL)]
        end

        subgraph MetricsPkg["Prometheus Integration"]
            LKM["LiveKitMetrics<br/>Gauge/Counter/Histogram"]
            REG["Prometheus Registry"]
            ME["/metrics<br/>HTTP GET Handler"]
        end
    end

    subgraph Collector["Grafana Alloy Pod"]
        SCRAPE["prometheus.scrape<br/>15s interval"]
        LABELS["Add external_labels<br/>environment, region, service"]
        RW["prometheus.remote_write<br/>Batched shipping"]
    end

    subgraph Cloud["Grafana Cloud"]
        PROM[("Prometheus<br/>13-month retention")]
        DASH["Dashboards"]
        ALERT["Alerting"]
    end

    %% Event flow
    LK -->|"Webhook Event<br/>(room_started, participant_joined, etc.)"| WH
    WH --> EL
    EL -->|"Valid event"| REC

    %% Processing
    REC -->|"room_id lookup"| OL
    OL <-->|"SQL query"| DB
    REC -->|"Update metrics"| LKM
    LKM --> REG
    REG --> ME

    %% Scraping
    SCRAPE -->|"GET /metrics"| ME
    SCRAPE --> LABELS
    LABELS --> RW
    RW -->|"remote_write"| PROM

    %% Visualization
    PROM --> DASH
    PROM --> ALERT

Component Details

LiveKit Cloud

Role in the System

LiveKit Cloud serves as the event source for the entire metrics system. It is a managed WebRTC infrastructure service that handles all the complexity of real-time video/audio - media routing, bandwidth estimation, codec negotiation, etc. From a metrics perspective, LiveKit is valuable because it provides authoritative lifecycle events for everything happening in virtual office rooms.

How Webhooks Work

When configured with a webhook URL, LiveKit sends HTTP POST requests to that endpoint whenever significant events occur. Each webhook request includes:

  • Authentication: A cryptographic signature in the Authorization header, computed using the shared API secret. The backend validates this signature to ensure the webhook genuinely originated from LiveKit.
  • Event Type: A string identifier like room_started, participant_joined, or track_published.
  • Event ID: A unique identifier for deduplication. LiveKit may retry failed webhooks, so this ID prevents double-counting.
  • Timestamp: The Unix timestamp when the event actually occurred in LiveKit (not when the webhook was sent).
  • Payload: Event-specific data such as room information, participant details, or track metadata.

Webhook Events Reference

Event When It Fires Payload Includes
room_started A room is created (first participant joins or explicit creation) Room SID, room name (our UUID), metadata
room_finished A room closes (empty timeout or explicit deletion) Room SID, room name, duration
participant_joined A user successfully connects to a room Participant SID, identity (internal DB user ID as string), room info
participant_left A user disconnects (intentional or timeout) Participant SID, room info
participant_connection_aborted Connection failed during setup Participant SID, error details
track_published A user enables camera/microphone/screen share Track SID, source type, media type, participant info
track_unpublished A user disables a media track Track SID, source type, participant info
egress_started A recording job begins Egress ID, room name, output configuration
egress_ended A recording completes or fails Egress ID, status, file results, duration
ingress_started An external stream import begins Ingress ID, input type, room name
ingress_ended A stream import ends Ingress ID, status, error (if any)

Qubital Backend

Webhook Endpoint (POST /webhook)

Location: internal/features/metrics/api/event_listener.go

The webhook endpoint is the entry point for all LiveKit events. It is intentionally simple and fast - the goal is to accept the webhook, validate it, and return a 200 OK as quickly as possible. Any slow processing would cause LiveKit to retry, potentially creating duplicate events.

The handler performs three steps:

  1. Receive and Parse: The LiveKit SDK's ReceiveWebhookEvent function reads the request body and parses it into a structured event object.

  2. Validate Signature: The SDK verifies the Authorization header against the shared API secret. Invalid signatures are rejected with an error, protecting against spoofed webhooks.

  3. Enqueue for Processing: The event is passed to the WebhookRecorder via a non-blocking Handle(event) call. This immediately returns, allowing the HTTP response to be sent while processing happens asynchronously.

func (h *WebhookApiHandler) EventListener(c *gin.Context) {
    ctx := c.Request.Context()

    event, err := h.client.ReceiveWebhookEvent(c, h.livekitService)
    if err != nil {
        logger.ErrorAPICtx(ctx, "Webhook event listener failed", err, nil)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to receive webhook event"})
        return
    }

    // Non-blocking enqueue - returns immediately
    h.recorder.Handle(event)

    logger.InfoCtx(ctx, "Webhook event received", []slog.Attr{
        slog.String("event_type", event.Event),
        slog.String("event_id", event.Id),
    })

    c.Status(http.StatusOK)
}

Webhook Recorder Service

Location: internal/features/metrics/service/recorder.go

The WebhookRecorder is a background service that processes webhook events asynchronously. It runs as a single goroutine with a buffered channel, ensuring events are processed in order while decoupling the HTTP handler from potentially slow operations like database lookups.

Processing Pipeline:

  1. Deduplication: Each event has a unique ID. The recorder maintains a map of recently seen IDs (10-minute TTL) and skips duplicates. This handles LiveKit's retry behavior gracefully.

  2. Health Metrics: Before any business logic, the recorder updates system health metrics - incrementing the event counter and recording the delivery lag (time between event creation and receipt).

  3. Organization Resolution: The recorder looks up the organization that owns the room. This is necessary because LiveKit doesn't know about our multi-tenant structure - it just sends room IDs. We query the database to map room_id → organization → org_id.

  4. Metric Updates: Based on the event type, the recorder updates the appropriate Prometheus metrics. Counters are incremented, gauges are adjusted, and histograms observe duration values.

  5. Participant Activity Persistence: On participant_left, after computing session duration, the recorder persists the completed session to the participant_activity_metrics database table via a fire-and-forget write. This durable state enables WAU/MAU computation that Prometheus alone cannot handle (see Participant Activity Metrics below).

Background Tickers

In addition to event-driven processing, the recorder's goroutine runs several periodic tickers:

Ticker Interval Purpose
Dedup cleanup 10 min (configurable) Removes expired event IDs from the deduplication map to prevent unbounded memory growth
License refresh 1 hour Re-reads each known organization's monthly_recording_minutes from the org_licenses table and updates the lk_org_recording_limit_minutes gauge. Picks up plan changes without restart.
Activity metrics refresh 5 min Queries the participant_activity_metrics table to compute per-org WAU (7-day) and MAU (30-day) aggregates, then updates the lk_unique_participants and lk_participant_total_online_seconds gauges.
Egress stats refresh 5 min Queries the recordings table to compute per-org completed egress count and total duration for sliding windows (1d, 7d, 30d), then updates the lk_egress_completed and lk_egress_total_duration_seconds gauges. Uses the partial index idx_recordings_completed_ended_at (migration 000020).
Presence count refresh 5 min Queries the room_presence table to count currently connected participants per organization, then updates the lk_participants_connected gauge.
Activity purge 24 hours Deletes rows older than 90 days from participant_activity_metrics to keep the table bounded. 90 days covers the MAU window (30d) with margin for future quarterly metrics.

Ephemeral State for Duration Calculations

Why This State Exists

To compute duration-based metrics (e.g., lk_room_duration_seconds, lk_participant_lifetime_seconds), the recorder needs to know when things started. When a room_finished webhook arrives, the recorder must calculate finished_time - started_time. But the room_finished event doesn't include the start time - that information was only available in the earlier room_started event.

The solution is simple: when room_started arrives, the recorder stores the start timestamp in memory. When room_finished arrives later, it retrieves that timestamp, calculates the duration, and then deletes the entry.

What's Stored

State Map Key Stored Value Purpose
rooms room SID Start timestamp + org_id Calculate room duration on room_finished
participants participant SID Join timestamp + room SID + org_id + identity (internal DB user ID as string) + userID (parsed int64) Calculate session duration on participant_left and persist session to DB for WAU/MAU

tracks room SID + source + type Integer count Track active media streams per org_id
egress egress ID Start timestamp + org_id + request_type + room_name Calculate recording duration on egress_ended
ingress ingress ID Start timestamp + org_id + input_type + room_name Calculate stream duration on ingress_ended

Key Characteristics

  • Ephemeral: This state exists only in memory. If the pod restarts, it's lost. This is acceptable because we only lose in-flight duration calculations - the counters in Prometheus (which are persisted in Grafana Cloud) remain accurate.

  • Bounded Size: The state only contains currently active entities. Once a room finishes or a participant leaves, their entry is deleted. The maps cannot grow unbounded - they shrink back down as activity ends.

  • Not Critical for Counters: Counters (*_total metrics) don't need this state - they just increment on each event. Only histograms (durations) require start/end pairing.

  • Trade-off: We could eliminate this state by storing start times in the database or Redis, but that would add latency and complexity for minimal benefit. The current approach is simple and fits the resource constraints.

Memory Usage (Not a Concern)

A common question is whether these in-memory maps will cause problems as customer usage grows. The answer is no - the memory usage is negligible:

Scenario Concurrent Entities Estimated Memory
Small scale 50 participants, 10 rooms ~15 KB
Medium scale 500 participants, 100 rooms ~100 KB
Large scale 5,000 participants, 1,000 rooms ~1 MB
Extreme scale 50,000 participants, 10,000 rooms ~10 MB

Each map entry is roughly 100-150 bytes (Go string headers, timestamps, a few pointers). Even at extreme scales that far exceed expected usage, the memory footprint is tiny compared to the pod's 300MB allocation.

Why This Isn't a Scaling Concern

  1. Maps are bounded by concurrent activity, not total activity: If 1 million participants join and leave throughout the day, but only 500 are ever connected simultaneously, the maps only ever hold ~500 entries.

  2. Entries are automatically cleaned up: When a participant_left event arrives, the entry is immediately deleted. There's no accumulation over time.

  3. The data is trivially small: Timestamps and short strings. No large objects, no nested structures, no media data.


📌 Note for Future Scaling

The in-memory maps are not the bottleneck for scaling this system. If scaling becomes a concern, the limiting factor would be the single-goroutine event processing model (see Scaling Considerations below), not memory usage.


Scaling Considerations

This section separates two distinct concerns that are sometimes conflated: memory usage and throughput. They are independent issues with different thresholds and solutions.

Memory Usage: Not a Concern

As explained above, the in-memory state maps use negligible memory (~1-2MB even at thousands of concurrent participants). This is a non-issue for the foreseeable future and doesn't require any action.

Throughput: The Single-Goroutine Model

The WebhookRecorder processes events in a single goroutine, reading from a buffered channel. This design is intentionally simple:

[Webhook Handler] --enqueue--> [Buffered Channel (1000)] --dequeue--> [Single Processing Goroutine]

Why Single-Threaded?

  1. No synchronization needed: Metric updates and state map operations don't require mutexes because only one goroutine accesses them.
  2. Predictable ordering: Events for the same room/participant are processed in order.
  3. Simple debugging: No race conditions, no deadlocks, no concurrent access bugs.
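The non-blocking enqueue into the buffered channel can be sketched with a select/default, which is also where the "queue full" log line comes from. The Recorder type here is illustrative, not the actual implementation.

```go
package main

import "fmt"

// Recorder sketches the buffered-channel handoff between the webhook HTTP
// handler and the single processing goroutine.
type Recorder struct {
	events chan string
}

func NewRecorder(buffer int) *Recorder {
	return &Recorder{events: make(chan string, buffer)}
}

// Handle enqueues without blocking: if the buffer is full, the event is
// dropped and logged rather than stalling the HTTP handler (which would
// trigger LiveKit retries).
func (r *Recorder) Handle(event string) bool {
	select {
	case r.events <- event:
		return true
	default:
		fmt.Println("Webhook recorder: queue full, dropping event")
		return false
	}
}

func main() {
	r := NewRecorder(2) // tiny buffer to demonstrate the drop path
	fmt.Println(r.Handle("a")) // true
	fmt.Println(r.Handle("b")) // true
	fmt.Println(r.Handle("c")) // false: buffer full, event dropped
}
```

Dropping under pressure trades completeness for handler latency, which is acceptable because lost events only skew gauges that the DB-backed refreshers later self-correct.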

When Would This Become a Problem?

The single goroutine can process roughly 500-2,000 events per second (depending on database latency for org lookups). Each event involves:

  • Deduplication check (fast, in-memory)
  • Metric updates (fast, in-memory)
  • Optional database lookup (slower, ~5-50ms)

To stress this system, you'd need sustained event rates that exceed processing capacity. Here's a rough estimate:

Concurrent Participants Estimated Events/Second System Status
100 ~5-10 Comfortable
1,000 ~50-100 Comfortable
5,000 ~250-500 Approaching limit
10,000+ ~500-1000+ May need optimization

The events-per-second figures are estimates based on typical activity: participants joining and leaving, enabling and disabling tracks. Heavy activity (frequent track toggles) increases them.

Signs of Throughput Issues

  • "Webhook recorder: queue full, dropping event" in logs
  • Growing lag between event timestamps and processing time
  • lk_webhook_delivery_lag_seconds p99 increasing over time

Future Solutions (If Needed)

If throughput becomes a concern at 10,000+ concurrent participants:

  1. Shard by room: Multiple processing goroutines, each handling a subset of rooms. Maintains per-room ordering while increasing parallelism.
  2. Batch database lookups: Cache org_id by room for a short TTL, reducing database round-trips.
  3. Horizontal scaling: Multiple backend pods behind a load balancer, each handling a portion of webhooks.
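Option 1 (shard by room) could be sketched as a hash of the room SID choosing one of N worker goroutines, which preserves per-room ordering while adding parallelism. This is not implemented; the function below is a hypothetical illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a room SID to one of N worker shards. All events for the
// same room hash to the same shard, so per-room ordering is preserved.
func shardFor(roomSID string, shards int) int {
	h := fnv.New32a()
	h.Write([]byte(roomSID))
	return int(h.Sum32()) % shards
}

func main() {
	a := shardFor("RM_abc", 4)
	b := shardFor("RM_abc", 4)
	fmt.Println(a == b)            // true: stable assignment per room
	fmt.Println(a >= 0 && a < 4)   // true: always a valid shard index
}
```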

These optimizations are not currently implemented because they add complexity that isn't justified at current scale. The single-goroutine model handles Qubital's expected load with significant headroom.


Organization Lookup

Location: internal/features/metrics/service/org_lookup.go

Every metric needs an org_id label for multi-tenant filtering. The OrgLookup service resolves room IDs (which LiveKit knows) to organization IDs (which LiveKit doesn't know).

Resolution Process:

  1. Receive the room name from the webhook (this is the room's UUID in our database)
  2. Query: SELECT organizations.workos_org_id, organizations.name FROM rooms JOIN organizations ON rooms.organization_id = organizations.id WHERE rooms.id = ?
  3. Return OrgData{OrgID, OrgName, MonthlyRecordingMinutes} for metric labeling and license tracking
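The resolution can be sketched as below. The querier/rowScanner interfaces and the fake DB are stand-ins so the sketch is self-contained; the real service queries PostgreSQL via database/sql, and MonthlyRecordingMinutes is populated separately from the org_licenses table.

```go
package main

import "fmt"

// OrgData carries the resolved organization identity used for metric labels.
type OrgData struct {
	OrgID                   string
	OrgName                 string
	MonthlyRecordingMinutes int // filled from org_licenses, not this query
}

// Minimal stand-ins for *sql.DB / *sql.Row, so this sketch runs without a driver.
type rowScanner interface{ Scan(dest ...any) error }
type querier interface {
	QueryRow(query string, args ...any) rowScanner
}

const lookupSQL = `SELECT organizations.workos_org_id, organizations.name
FROM rooms JOIN organizations ON rooms.organization_id = organizations.id
WHERE rooms.id = ?`

// Resolve maps a LiveKit room name (our room UUID) to its organization.
func Resolve(db querier, roomID string) (OrgData, error) {
	var d OrgData
	err := db.QueryRow(lookupSQL, roomID).Scan(&d.OrgID, &d.OrgName)
	return d, err
}

// fakeDB returns canned rows keyed by room ID, for demonstration only.
type fakeDB map[string][2]string

type fakeRow struct {
	vals [2]string
	err  error
}

func (r fakeRow) Scan(dest ...any) error {
	if r.err != nil {
		return r.err
	}
	*dest[0].(*string) = r.vals[0]
	*dest[1].(*string) = r.vals[1]
	return nil
}

func (db fakeDB) QueryRow(query string, args ...any) rowScanner {
	if v, ok := db[args[0].(string)]; ok {
		return fakeRow{vals: v}
	}
	return fakeRow{err: fmt.Errorf("room not found")}
}

func main() {
	db := fakeDB{"room-uuid-1": {"org_01ABC", "Acme"}}
	d, err := Resolve(db, "room-uuid-1")
	fmt.Println(d.OrgID, d.OrgName, err) // org_01ABC Acme <nil>
}
```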

Why org_id Instead of org_name?

Organization names are mutable - customers can rename their organization at any time. If we used names as metric labels, renaming would create a new time series and break historical continuity. WorkOS organization IDs are immutable, ensuring metrics remain linked to the same organization forever.

To display friendly names in Grafana, we emit a separate lk_org_info{org_id, org_name} metric that maps IDs to names. Grafana can join this using PromQL's group_left.
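A join of that shape can be sketched in PromQL as follows, assuming lk_org_info follows the usual info-metric convention of always having value 1 (so multiplying by it preserves the left-hand values while attaching org_name):

```promql
# Attach org_name to each org's participant count via the info metric.
sum by (org_id) (lk_participants_active)
  * on (org_id) group_left(org_name)
  lk_org_info
```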

Recording Limit Resolution

In addition to resolving org identity, OrgLookup also fetches each organization's monthly recording limit from the org_licenses table. This limit (e.g., 1200 minutes for a standard plan) is included in the OrgData struct and emitted as the lk_org_recording_limit_minutes gauge when an organization is first discovered. A background ticker re-reads the limits from the database every hour to pick up plan changes (upgrades or downgrades) without requiring a restart. This enables Grafana to draw dynamic threshold lines per tenant on the "Monthly Recording Minutes" panel, replacing the previous hardcoded vector(1200) approach.

Metrics Endpoint (GET /metrics)

Location: pkg/metrics/handler.go and cmd/main.go

The /metrics endpoint exposes all registered Prometheus metrics in the standard text exposition format. When Grafana Alloy (or any Prometheus-compatible scraper) sends a GET request, it receives output like:

# HELP lk_participants_active active participants
# TYPE lk_participants_active gauge
lk_participants_active{org_id="org_01ABC",project="qubital",region="eu"} 15
lk_participants_active{org_id="org_02XYZ",project="qubital",region="eu"} 8

# HELP lk_participant_sessions_total participant sessions started
# TYPE lk_participant_sessions_total counter
lk_participant_sessions_total{org_id="org_01ABC",project="qubital",region="eu"} 1247

The endpoint is stateless - it simply reads current values from the Prometheus registry and formats them. There's no computation or database access during scraping.


Grafana Alloy

Role in the System

Grafana Alloy is a vendor-neutral observability collector that bridges the gap between the backend and Grafana Cloud. While we could have the backend push metrics directly to Grafana Cloud, using Alloy provides several benefits:

  • Decoupling: The backend doesn't need Grafana Cloud credentials or awareness of the remote write protocol.
  • Reliability: Alloy handles retries, backpressure, and temporary network failures gracefully.
  • Batching: Instead of sending each metric individually, Alloy batches samples for efficient transmission.
  • Label Enrichment: Alloy adds infrastructure-level labels (environment, region) that the backend shouldn't need to know about.

Scraping Configuration

Alloy scrapes the backend every 15 seconds. This interval balances freshness (metrics are at most 15 seconds old) against load (240 scrapes per hour is a negligible burden for a small pod).

prometheus.scrape "backend" {
  job_name        = "qubital-backend"
  scrape_interval = "15s"
  scrape_timeout  = "10s"
  scheme          = "http"
  metrics_path    = "/metrics"

  targets = [
    { "__address__" = env("BACKEND_SCRAPE_TARGET") },  // e.g., "backend-web:3001"
  ]

  forward_to = [prometheus.remote_write.grafanacloud.receiver]
}

Remote Write Configuration

After scraping, Alloy forwards metrics to Grafana Cloud's Prometheus endpoint. It adds external labels that apply to all metrics from this Alloy instance:

prometheus.remote_write "grafanacloud" {
  external_labels = {
    environment = env("ENVIRONMENT"),  // "test", "dev", or "prod"
    region      = env("REGION"),        // "eu", "us", etc.
    service     = "qubital-backend",
  }

  endpoint {
    url = env("GRAFANA_PROM_REMOTE_WRITE_URL")

    basic_auth {
      username = env("GRAFANA_PROM_USERNAME")
      password = env("GRAFANA_API_KEY")
    }

    queue_config {
      max_samples_per_send = 5000
      batch_send_deadline  = "20s"
      retry_on_http_429    = true
    }
  }
}

Label Strategy

Labels are added at two levels, creating a clear separation of concerns:

Label Added By Scope Purpose
org_id Backend Per-metric Multi-tenant customer identification
project Backend Per-metric LiveKit project identifier
source, type Backend Track metrics only Media track classification
request_type Backend Egress metrics only Recording type classification
environment Alloy All metrics Deployment environment (test/dev/prod)
region Alloy All metrics Infrastructure region
service Alloy All metrics Service identifier for filtering

This separation means the backend doesn't need to know which environment it's running in - that's infrastructure configuration handled by Alloy.


Grafana Cloud

Prometheus Storage

Grafana Cloud provides managed Prometheus storage with:

  • 13-Month Retention: Metrics are stored for over a year, enabling long-term trend analysis and year-over-year comparisons.
  • High Availability: Data is replicated across multiple availability zones.
  • Automatic Scaling: Storage and query capacity scale automatically based on usage.
  • PromQL Interface: Full PromQL support for complex queries and aggregations.

Querying Metrics

Example PromQL queries for common use cases:

# Active participants for a specific organization in production
lk_participants_active{org_id="org_01ABC", environment="prod"}

# Participant session duration percentiles (p50, p90, p99)
histogram_quantile(0.50, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))

# Recording success rate over the last hour
sum(rate(lk_egress_ended_total{result="success", org_id="$org_id"}[1h]))
/
sum(rate(lk_egress_ended_total{org_id="$org_id"}[1h])) * 100

# Camera adoption rate (% of participants with camera enabled)
sum(lk_tracks_active{source="camera", org_id="$org_id"})
/
sum(lk_participants_active{org_id="$org_id"}) * 100

# Weekly Active Users (WAU) per tenant
lk_unique_participants{org_id="$org_id", window="7d"}

# Monthly Active Users (MAU) per tenant
lk_unique_participants{org_id="$org_id", window="30d"}

# Average time online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}

# Recording limit threshold line (dynamic per tenant, read from DB)
lk_org_recording_limit_minutes{org_id="$org_id"}

Multi-Tenant Dashboard Variables

To display organization names while filtering by stable IDs:

  1. Create org_name variable: label_values(lk_org_info, org_name) - This shows a dropdown of human-readable organization names.

  2. Create hidden org_id variable: label_values(lk_org_info{org_name="$org_name"}, org_id) - This automatically resolves the selected name to its ID.

  3. Use in queries: lk_participants_active{org_id="$org_id"} - Filtering happens on the stable ID, but users see friendly names.


Metrics Reference

Room Metrics

Metric Type Labels Webhook Trigger Description
lk_room_active Gauge project, region, org_id +1 on room_started, -1 on room_finished Number of currently active rooms. Incremented when a room is created and decremented when the last participant leaves (or the room is explicitly deleted). Subject to gauge drift if events are lost during pod restarts.
lk_rooms_started_total Counter project, region, org_id room_started Cumulative count of rooms created since the last pod restart. Survives Prometheus scrape resets via rate()/increase(). Use increase(lk_rooms_started_total[1h]) for rooms created in the last hour.
lk_rooms_finished_total Counter project, region, org_id room_finished Cumulative count of rooms closed. The difference increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h]) gives a drift-resistant approximation of active rooms.
lk_room_duration_seconds Histogram project, region, org_id room_finished Distribution of room lifetimes in seconds (time between room_started and room_finished). Use histogram_quantile() for percentiles. Requires the in-memory rooms state map to pair start/end events.

Histogram Buckets: Exponential from 30s to ~3 hours (30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953 seconds)

Participant Metrics

Metric Type Labels Webhook Trigger Description
lk_participants_active Gauge project, region, org_id +1 on participant_joined, -1 on participant_left/connection_aborted Currently connected participants. Only counts standard participants (STANDARD kind) — service participants like egress bots, ingress, SIP bridges, and agents are excluded.
lk_participant_sessions_total Counter project, region, org_id participant_joined Cumulative count of participant sessions started. One session = one join event. If a user disconnects and reconnects, that counts as two sessions.
lk_participant_lifetime_seconds Histogram project, region, org_id participant_left Distribution of individual session durations in seconds (time between participant_joined and participant_left). On participant_left, the completed session is also persisted to the participant_activity_metrics DB table for WAU/MAU computation.

Histogram Buckets: Exponential from 15s to ~3 hours (15, 27, 48, 87, 157, 283, 509, 916, 1649, 2969 seconds)

Participant Activity Metrics (WAU/MAU)

These metrics are not derived from ephemeral in-memory state like the other metrics above. They are computed from the participant_activity_metrics database table, which stores daily pre-aggregated participant sessions. A background ticker queries the table every 5 minutes to refresh these gauges.

Why a database table? Prometheus counters and histograms handle restarts via reset detection, but counting distinct identities over a sliding time window (e.g., "how many unique users were active in the last 7 days?") requires durable state. Prometheus has no COUNT(DISTINCT) equivalent. The database table provides the per-user granularity needed for these business KPIs.

Daily pre-aggregation: Instead of storing one row per session (which would scale to millions of rows), the table stores one row per user per organization per day via UPSERT. On each participant_left webhook, the session duration is added to the existing row for that user+org+day, or a new row is inserted if none exists. This reduces storage by 3-5x while preserving the per-user granularity needed for COUNT(DISTINCT user_id).

Data retention: Rows older than 90 days are purged daily by a background ticker. 90 days covers the current MAU window (30d) and WAU window (7d) with margin for future quarterly metrics. If a quarterly (90d) aggregation window is added, retention should be extended to at least 120 days.

Metric Type Labels Data Source Description
lk_unique_participants Gauge project, region, org_id, window DB query (5 min refresh) Number of distinct users active per organization within the sliding time window. COUNT(DISTINCT user_id) from participant_activity_metrics where activity_date > now() - window.
lk_participant_total_online_seconds Gauge project, region, org_id, window DB query (5 min refresh) Total seconds spent online by all participants per organization within the sliding time window. SUM(total_duration_seconds) from participant_activity_metrics.

window Values: 7d (WAU - Weekly Active Users), 30d (MAU - Monthly Active Users)

Grafana Usage:

# Weekly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="7d"}

# Monthly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="30d"}

# Average time spent online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}

Egress Stats Metrics (DB-Backed)

These metrics are not derived from ephemeral in-memory state. They are computed from the recordings database table, which stores one row per completed egress (recording) job. A background ticker queries the table every 5 minutes to refresh these gauges with sliding time windows (1d, 7d, 30d).

Why DB-backed gauges when we already have in-memory egress counters? The in-memory metrics (lk_egress_ended_total, lk_egress_duration_seconds) reset to zero on pod restart. For billing-critical data like "how many recordings did this customer complete this month" or "how many total recording minutes were consumed", losing state on restart is unacceptable. These DB-backed gauges read from the recordings table which persists across restarts, providing durable billing and usage metrics.

Sliding windows: Same pattern as the WAU/MAU gauges above. The ticker queries the recordings table three times per cycle — once for each window (1d, 7d, 30d) — using the GetCompletedStatsSince repository method. Each query filters on status = 'completed' AND ended_at >= now() - window and groups by organization. The partial index idx_recordings_completed_ended_at (migration 000020) ensures these queries use an index scan, not a sequential scan, even as the recordings table grows.
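The per-window query issued by GetCompletedStatsSince is roughly of this shape (a sketch; the organization grouping column name is assumed):

```sql
-- Executed once per window (1d, 7d, 30d) each 5-minute cycle.
-- The partial index idx_recordings_completed_ended_at keeps this an index scan.
SELECT org_id,
       COUNT(*)              AS completed_count,
       SUM(duration_seconds) AS total_duration_seconds
FROM recordings
WHERE status = 'completed'
  AND ended_at >= now() - interval '30 days'
GROUP BY org_id;
```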

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_egress_completed | Gauge | project, region, org_id, window | DB query (5 min refresh) | Number of completed egress (recording) jobs per organization within the sliding time window. COUNT(*) from recordings where status = 'completed' AND ended_at >= now() - window. |
| lk_egress_total_duration_seconds | Gauge | project, region, org_id, window | DB query (5 min refresh) | Total recording seconds per organization within the sliding time window. SUM(duration_seconds) from recordings where status = 'completed' AND ended_at >= now() - window. |

window Values: 1d (daily), 7d (weekly), 30d (monthly)

Grafana Usage:

# Monthly completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="30d"}

# Daily completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="1d"}

# Monthly recording minutes used (for billing)
lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60

# Recording usage as percentage of plan limit
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100

Connected Participants Metric (DB-Backed)

This metric is not derived from ephemeral in-memory state. It is a point-in-time snapshot computed from the room_presence database table, which stores one row per currently connected participant. A background ticker queries the table every 5 minutes to refresh this gauge.

Why DB-backed when we already have lk_participants_active? The in-memory gauge lk_participants_active tracks join/leave events and is subject to two failure modes: (1) it resets to zero on pod restart, and (2) it can drift from reality if events are lost. The DB-backed lk_participants_connected reads the actual room_presence table state, so it self-corrects every 5 minutes regardless of missed webhooks or restarts.

No time window needed: Unlike the egress stats above, this metric is inherently a snapshot — "how many participants are connected right now." Grafana can apply any time-based function on top of the scraped history: max_over_time for daily peaks, avg_over_time for averages, etc.
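The underlying CountByOrg query is essentially a grouped count over the presence table (a sketch; the grouping column name is assumed):

```sql
-- One row per currently connected participant; grouped count per tenant.
SELECT org_id, COUNT(*) AS connected_participants
FROM room_presence
GROUP BY org_id;
```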

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_participants_connected | Gauge | project, region, org_id | DB query (5 min refresh) | Currently connected participants per organization, read from the room_presence table. COUNT(*) grouped by organization. Self-corrects every 5 minutes — immune to gauge drift and pod restarts. |

Grafana Usage:

# Current connected participants for a tenant
lk_participants_connected{org_id="$org_id"}

# Peak connected participants over the last day
max_over_time(lk_participants_connected{org_id="$org_id"}[1d])

# Average connected participants over the last week
avg_over_time(lk_participants_connected{org_id="$org_id"}[7d])

Track Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_tracks_active | Gauge | project, region, org_id, source, type | +1 on track_published, -1 on track_unpublished | Currently published media tracks, broken down by source (camera, microphone, screen share) and type (audio, video). Useful for computing feature adoption rates, e.g., lk_tracks_active{source="camera"} / lk_participants_active gives the camera-on ratio. |
| lk_tracks_published_total | Counter | project, region, org_id, source, type | track_published | Cumulative count of tracks published. Each time a participant enables their camera, microphone, or starts a screen share, this counter increments. |

Source Values: camera, microphone, screen_share, screen_share_audio, unknown

Type Values: audio, video, unknown
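These track metrics compose into feature-adoption panels; the following Grafana queries are illustrative examples of the adoption-rate formula mentioned above:

```promql
# Camera-on share for a specific tenant (fraction of active participants)
sum(lk_tracks_active{org_id="$org_id", source="camera"})
/
sum(lk_participants_active{org_id="$org_id"})

# Screen shares started over the last hour
increase(lk_tracks_published_total{org_id="$org_id", source="screen_share"}[1h])
```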

Egress (Recording) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_egress_active | Gauge | project, region, org_id, request_type | +1 on egress_started, -1 on egress_ended | Currently running recording jobs. Broken down by request_type to distinguish room composite recordings from track-level or web recordings. |
| lk_egress_started_total | Counter | project, region, org_id, request_type | egress_started | Cumulative count of recording jobs started. On egress_started, the recorder also stores the start state (timestamp, org, request type) in-memory for duration calculation when the recording ends. |
| lk_egress_ended_total | Counter | project, region, org_id, request_type, result | egress_ended | Cumulative count of recording jobs completed, labeled by result (success or failed). On egress_ended, the recorder also persists the recording completion data (status, duration, expiry) to the recordings database table. |
| lk_egress_duration_seconds | Histogram | project, region, org_id, request_type, result | egress_ended | Distribution of recording durations in seconds. Useful for monitoring typical recording lengths and detecting abnormally short recordings (potential failures). |

request_type Values: room_composite, web, track, participant, unknown

result Values: success, failed

Histogram Buckets: 30s, 60s, 120s, 300s, 600s, 1200s, 1800s, 3600s, 7200s, 14400s
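Since lk_egress_duration_seconds is a Prometheus histogram, quantiles and short-recording shares follow the standard _bucket pattern; for example:

```promql
# p95 recording duration over the last day
histogram_quantile(0.95, sum(rate(lk_egress_duration_seconds_bucket[1d])) by (le))

# Share of recordings shorter than 60s (potential failures)
sum(rate(lk_egress_duration_seconds_bucket{le="60"}[1d]))
/
sum(rate(lk_egress_duration_seconds_bucket{le="+Inf"}[1d]))
```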

Ingress (Stream Import) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_ingress_active | Gauge | project, region, org_id, input_type | +1 on ingress_started, -1 on ingress_ended | Currently running stream import jobs, broken down by input protocol (RTMP, WHIP, URL). |
| lk_ingress_started_total | Counter | project, region, org_id, input_type | ingress_started | Cumulative count of stream import jobs started. |
| lk_ingress_ended_total | Counter | project, region, org_id, input_type, result | ingress_ended | Cumulative count of stream import jobs completed, labeled by result (success or failed). Failures are detected from the ingress state's error field or ENDPOINT_ERROR status. |
| lk_ingress_duration_seconds | Histogram | project, region, org_id, input_type, result | ingress_ended | Distribution of stream import durations in seconds. |

input_type Values: rtmp, whip, url, unknown

result Values: success, failed

Organization License Metrics

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_org_recording_limit_minutes | Gauge | project, region, org_id | DB query (hourly refresh) | Monthly recording limit in minutes for the organization, read from the org_licenses table. Emitted on first org discovery and refreshed hourly. Used in Grafana to draw a dynamic per-tenant threshold line on the "Monthly Recording Minutes" panel. Zero if the organization has no license. |

Grafana Usage:

# Dynamic threshold line per tenant on the recording minutes panel
lk_org_recording_limit_minutes{org_id="$org_id"}

# Recording usage as percentage of plan limit (uses DB-backed gauge)
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100

Webhook Health Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| lk_webhook_events_total | Counter | event | Total webhook events received, broken down by event type (e.g., room_started, participant_joined). Use rate(lk_webhook_events_total[5m]) to monitor event throughput. A sudden drop to zero during business hours indicates a connectivity issue between LiveKit and the backend. |
| lk_webhook_delivery_lag_seconds | Histogram | event | Time in seconds between when the event actually occurred in LiveKit (CreatedAt timestamp) and when the backend received the webhook. High p99 values indicate network latency or backend processing delays. Only recorded for events with a positive CreatedAt timestamp. |
| lk_webhook_duplicates_total | Counter | event | Count of duplicate webhook events detected via the deduplication map (same event ID seen within the 10-minute TTL window). LiveKit retries webhook delivery on timeout, so some duplicates are expected. A sustained high rate may indicate the backend is responding too slowly, causing LiveKit to retry aggressively. |
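Typical panels over these health metrics (illustrative queries):

```promql
# Event throughput by type
sum(rate(lk_webhook_events_total[5m])) by (event)

# p99 webhook delivery lag
histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le))

# Duplicate ratio (sustained high values suggest slow responses)
sum(rate(lk_webhook_duplicates_total[5m])) / sum(rate(lk_webhook_events_total[5m]))
```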

System Health Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| lk_org_lookup_failures_total | Counter | reason | Failed room → organization lookups. With proper FK constraints, any increment indicates a data integrity issue requiring immediate investigation. |
| lk_org_info | Gauge | org_id, org_name | Metadata metric mapping org_id to org_name for Grafana group_left joins. Always has value 1. Emitted once per organization on first discovery via a webhook event. |

reason Values: not_configured (OrgLookup interface not provided), room_not_found (room doesn't exist in database), no_organization (room exists but has no organization - FK violation), no_org_id (organization exists but has no WorkOS ID - sync issue)
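The lk_org_info metric exists precisely for joins like the following, which attach a human-readable org_name to any org_id-labeled series:

```promql
# Active participants, labeled with the organization name
lk_participants_active
  * on(org_id) group_left(org_name)
lk_org_info
```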


Complete Data Flow

Sequence Diagram: Webhook to Dashboard

sequenceDiagram
    participant LK as LiveKit Cloud
    participant BE as Backend /webhook
    participant REC as WebhookRecorder
    participant DB as PostgreSQL
    participant REG as Prometheus Registry
    participant ME as Backend /metrics
    participant AL as Grafana Alloy
    participant GC as Grafana Cloud

    Note over LK,GC: 1. Event Occurs (participant joins room)

    LK->>BE: POST /webhook<br/>{event: "participant_joined", id: "evt_123", ...}
    BE->>BE: Validate webhook signature
    BE->>REC: Handle(event) [non-blocking enqueue]
    BE-->>LK: 200 OK

    Note over REC,REG: 2. Async Processing (in background goroutine)

    REC->>REC: Check deduplication (event.id not seen?)
    REC->>REC: Record lk_webhook_events_total++
    REC->>REC: Record lk_webhook_delivery_lag_seconds
    REC->>DB: SELECT org_id FROM rooms JOIN organizations
    DB-->>REC: {org_id: "org_01ABC", org_name: "Acme Corp"}
    REC->>REC: Store participant start time (for duration calc later)
    REC->>REG: lk_participants_active{org_id="org_01ABC"}.Inc()
    REC->>REG: lk_participant_sessions_total{org_id="org_01ABC"}.Inc()

    Note over AL,GC: 3. Periodic Scrape (every 15 seconds)

    AL->>ME: GET /metrics
    ME->>REG: Collect all metric values
    REG-->>ME: Prometheus text format
    ME-->>AL: lk_participants_active{org_id="org_01ABC"} 15\n...
    AL->>AL: Add external_labels (environment, region, service)
    AL->>GC: remote_write (batched, compressed, with retries)
    GC->>GC: Store in Prometheus

    Note over GC: 4. Query & Visualize

    GC->>GC: Dashboard executes PromQL queries
    GC->>GC: Render panels with current and historical data

Configuration Reference

Backend Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| LIVEKIT_API_KEY | Yes | LiveKit API key for webhook signature validation | APIxxxxxxxx |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret for webhook signature validation | secret-key |
| LIVEKIT_PROJECT | No | Project name for metric labels | qubital |

Grafana Alloy Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| ENVIRONMENT | Yes | Environment name for external label | prod |
| REGION | Yes | Deployment region for external label | eu |
| BACKEND_SCRAPE_TARGET | Yes | Backend service DNS and port | backend-web:3001 |
| GRAFANA_PROM_REMOTE_WRITE_URL | Yes | Grafana Cloud Prometheus endpoint | https://prometheus-xxx.grafana.net/api/prom/push |
| GRAFANA_PROM_USERNAME | Yes | Grafana Cloud instance ID | 123456 |
| GRAFANA_API_KEY | Yes | Grafana Cloud API key with write permissions | glc_eyJ... |

Recorder Configuration

The WebhookRecorder has two tunable parameters:

type Config struct {
    QueueSize int           // Default: 1000 events
    DedupTTL  time.Duration // Default: 10 minutes
}
  • QueueSize: Maximum number of events waiting to be processed. If exceeded, new events are dropped (and logged). Increase if you see "queue full" warnings.
  • DedupTTL: How long to remember event IDs for deduplication. Must be longer than LiveKit's retry window (typically a few minutes).

Operational Considerations

Gauge Drift

Gauge metrics (lk_*_active) track current counts by incrementing on "start" events and decrementing on "end" events. This approach can drift from reality if events are lost:

Causes of Drift:

  • Network issues between LiveKit and the backend
  • Backend pod restarts (in-flight events are lost)
  • Queue overflow (events dropped due to backpressure)

Detection:

  • Compare lk_rooms_started_total - lk_rooms_finished_total with lk_room_active - they should match
  • Monitor lk_webhook_duplicates_total - high rates may indicate delivery issues
  • Alert on lk_org_lookup_failures_total > 0 - indicates data integrity problems

Mitigation:

  • Gauges self-correct over time as rooms finish and participants leave
  • Counter metrics (*_total) remain accurate regardless of drift
  • For critical accuracy, use counter differences: increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
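The detection and mitigation checks above can be expressed directly in PromQL (illustrative, using the metric names as given; note that counters also reset on pod restart, so evaluate over stable periods):

```promql
# Drift check: should hover near zero
(lk_rooms_started_total - lk_rooms_finished_total) - lk_room_active

# Drift-immune room activity over the last hour
increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
```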

Deduplication Behavior

LiveKit may retry webhook delivery if it doesn't receive a timely response. The recorder handles this by tracking event IDs for 10 minutes:

  • First occurrence: Processed normally
  • Subsequent occurrences (same ID within 10 minutes): Skipped, counted in lk_webhook_duplicates_total
  • After 10 minutes: ID is forgotten (old events won't be reprocessed anyway)

Backpressure Handling

If webhook events arrive faster than they can be processed, the queue (1000 events) acts as a buffer. If it fills completely:

  • New events are dropped (not queued)
  • A warning is logged: "Webhook recorder: queue full, dropping event"
  • The event_type and event_id are included in the log for debugging

This is a safety valve to prevent unbounded memory growth. If you see these warnings, investigate why processing is slow (database latency? high event volume?) or increase the queue size.
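The drop-on-full behavior is a standard non-blocking channel send; a sketch under the assumption of a buffered Go channel as the queue (the Event type and function name here are illustrative):

```go
package main

import (
	"fmt"
	"log"
)

// Event is a minimal stand-in for a parsed LiveKit webhook event
// (illustrative; the real event type is not shown in this document).
type Event struct {
	Type string
	ID   string
}

// Enqueue performs a non-blocking send into the bounded queue: if the queue
// has room the event is accepted; otherwise it is dropped and a warning is
// logged with the event type and ID for debugging.
func Enqueue(queue chan Event, ev Event) bool {
	select {
	case queue <- ev:
		return true
	default:
		log.Printf("Webhook recorder: queue full, dropping event (event_type=%s, event_id=%s)", ev.Type, ev.ID)
		return false
	}
}

func main() {
	queue := make(chan Event, 2) // the real default QueueSize is 1000
	fmt.Println(Enqueue(queue, Event{Type: "participant_joined", ID: "evt_1"}))
	fmt.Println(Enqueue(queue, Event{Type: "track_published", ID: "evt_2"}))
	fmt.Println(Enqueue(queue, Event{Type: "room_finished", ID: "evt_3"})) // dropped: queue full
}
```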


Alerting Recommendations

| Alert Name | PromQL Condition | Severity | Description |
|------------|------------------|----------|-------------|
| WebhookDeliveryLagHigh | histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le)) > 10 | Warning | 99th percentile delivery lag exceeds 10 seconds |
| OrgLookupFailures | increase(lk_org_lookup_failures_total[1h]) > 0 | Critical | Any org lookup failure indicates data integrity issues |
| RecordingFailures | increase(lk_egress_ended_total{result="failed"}[1h]) > 0 | Warning | Recording jobs are failing |
| NoWebhookEvents | rate(lk_webhook_events_total[5m]) == 0 | Critical | No webhook events received (during business hours) |
| HighDuplicateRate | rate(lk_webhook_duplicates_total[5m]) > 1 | Warning | More than 1 duplicate per second indicates delivery issues |
| RecordingLimitApproaching | lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60 > lk_org_recording_limit_minutes{org_id="$org_id"} * 0.8 | Warning | Organization has used 80% of monthly recording limit. Uses the DB-backed gauge which survives pod restarts, replacing the previous increase(lk_egress_duration_seconds_sum[30d]) approach. |
| WAUDropped | lk_unique_participants{window="7d"} == 0 | Warning | No active users in the last 7 days for an organization (during business periods) |

File Reference

| File Path | Purpose |
|-----------|---------|
| pkg/metrics/livekit.go | Metric definitions (names, types, labels, help text) and Prometheus registration |
| pkg/metrics/handler.go | HTTP handler wrapper for the /metrics endpoint |
| internal/features/metrics/api/event_listener.go | Webhook HTTP handler - receives and validates LiveKit webhooks |
| internal/features/metrics/service/recorder.go | Background service - processes events and updates metrics |
| internal/features/metrics/service/models.go | Data structures (Config, OrgData, RoomState, etc.) and interfaces |
| internal/features/metrics/service/org_lookup.go | Database queries to resolve room IDs to organization IDs and fetch recording limits |
| internal/features/metrics/service/activity_recorder.go | Adapters bridging repository interfaces to service-layer interfaces: participant activity (ActivityRecorder), egress stats (EgressStatsLoader), and connected participant counts (PresenceCounter). Each adapter defines a narrow repository interface, converts repository types to service-layer types, and provides a constructor for dependency injection. |
| internal/features/metrics/service/presence_manager.go | Single-session enforcement (kicks user from old room when joining new one) |
| internal/repository/participant_activity_metric_repository.go | Database operations for daily pre-aggregated participant activity (UPSERT, aggregate queries, purge) |
| internal/repository/recording_repository.go | Database operations for recordings. Includes GetCompletedStatsSince which powers the lk_egress_completed and lk_egress_total_duration_seconds DB-backed gauges via a GROUP BY query on completed recordings per organization. Also defines the OrgRecordingStats scan target struct. |
| internal/repository/room_presence_repository.go | Database operations for room presence tracking. Includes CountByOrg which powers the lk_participants_connected DB-backed gauge via a GROUP BY query on currently connected participants per organization. Also defines the OrgPresenceCount scan target struct. |
| internal/domain/database/participant_activity_metric.go | GORM model for the participant_activity_metrics table |
| internal/database/migrations/000010_participant_activity_metrics.up.sql | Migration creating the participant_activity_metrics table with daily pre-aggregation schema |
| internal/database/migrations/000020_recordings_completed_index.up.sql | Migration adding a partial index idx_recordings_completed_ended_at on recordings(ended_at) WHERE status = 'completed' to support efficient GetCompletedStatsSince queries for the egress stats DB-backed gauges |
| cmd/main.go | Application entry point - registers /metrics and /webhook endpoints |
| internal/app/app.go | Dependency injection - creates and wires all metrics components, including the egress stats and presence count adapters |