
Metrics System

Overview

This document describes the metrics observability system for the Qubital backend. The system is designed to provide real-time visibility into LiveKit-powered virtual office usage, enabling both operational monitoring and business intelligence for a multi-tenant SaaS platform.

Purpose and Goals

The metrics system serves three primary objectives:

  1. Operational Visibility: Monitor system health, detect anomalies, and troubleshoot issues in real-time. This includes tracking webhook delivery reliability, recording success rates, and identifying potential data integrity problems.

  2. Business Intelligence: Understand how customers use the platform - how long they spend in rooms, which features (camera, microphone, screen share) are most adopted, and how usage patterns vary across organizations.

  3. Multi-Tenant Isolation: Every metric is tagged with an organization identifier (org_id), enabling per-customer dashboards and billing analytics while maintaining strict data isolation between tenants.

Design Principles

The system follows several key design principles:

  • Event-Driven Architecture: Metrics are derived from LiveKit webhook events, ensuring real-time accuracy without polling overhead.
  • Stable Identifiers: Organization IDs (from WorkOS) are used instead of names to ensure time series continuity when customers rename their organizations.
  • Lightweight Footprint: The backend runs on resource-constrained pods (0.5 CPU, 300MB RAM), so the metrics system is designed to be stateless where possible and use minimal memory.
  • Separation of Concerns: The backend only collects and exposes metrics; storage, querying, and visualization are delegated to Grafana Cloud.
  • Hybrid State Model: Most metrics are derived purely from ephemeral in-memory state (webhook events). However, certain business KPIs require durable state in a relational database: WAU/MAU unique user counts (Prometheus cannot count distinct identities over sliding time windows), egress completion stats (in-memory counters reset on pod restart, losing billing-critical data), and connected participant counts (in-memory gauges drift if events are lost and reset to zero on restart).

Architecture

High-Level Components

The metrics pipeline consists of four distinct components, each with a specific responsibility:

1. LiveKit Cloud (Event Source)

LiveKit Cloud is the real-time video/audio infrastructure that powers Qubital's virtual office rooms. It acts as the authoritative source of truth for all room, participant, track, and recording lifecycle events. When any significant event occurs (a user joins a room, enables their camera, starts a recording, etc.), LiveKit sends an HTTP webhook to our backend within milliseconds.

2. Qubital Backend (Event Processor)

The Go backend receives webhook events, validates their authenticity using cryptographic signatures, and transforms them into Prometheus metrics. It enriches each event with organization context by looking up the room's owner in the database. The backend exposes a /metrics endpoint in Prometheus text format that can be scraped by any compatible collector.

3. Grafana Alloy (Metrics Collector)

Grafana Alloy runs as a sidecar container alongside the backend. Every 15 seconds, it scrapes the /metrics endpoint, adds infrastructure labels (environment, region), batches the samples, and ships them to Grafana Cloud using the Prometheus remote write protocol. This decouples the backend from Grafana Cloud's specifics and provides retry logic, backpressure handling, and efficient batching.

4. Grafana Cloud (Storage and Visualization)

Grafana Cloud provides managed Prometheus storage with 13-month retention, PromQL query capabilities, and the Grafana dashboard/alerting interface. This is where metrics are persisted, queried, and visualized. The backend and Alloy are stateless - if they restart, no historical data is lost because it's already in Grafana Cloud.

Data Flow Diagram

The following diagram illustrates how data flows through the system, from a LiveKit event to a Grafana dashboard:

flowchart TD
    subgraph External["External Services"]
        LK["LiveKit Cloud<br/>(Video Infrastructure)"]
    end

    subgraph Backend["Qubital Backend Pod"]
        WH["/webhook<br/>HTTP POST Handler"]
        EL["EventListener<br/>Signature Validation"]
        REC["WebhookRecorder<br/>Event Processing"]

        subgraph Lookup["Organization Resolution"]
            OL["OrgLookup Service"]
            DB[(PostgreSQL)]
        end

        subgraph MetricsPkg["Prometheus Integration"]
            LKM["LiveKitMetrics<br/>Gauge/Counter/Histogram"]
            REG["Prometheus Registry"]
            ME["/metrics<br/>HTTP GET Handler"]
        end
    end

    subgraph Collector["Grafana Alloy Pod"]
        SCRAPE["prometheus.scrape<br/>15s interval"]
        LABELS["Add external_labels<br/>environment, region, service"]
        RW["prometheus.remote_write<br/>Batched shipping"]
    end

    subgraph Cloud["Grafana Cloud"]
        PROM[("Prometheus<br/>13-month retention")]
        DASH["Dashboards"]
        ALERT["Alerting"]
    end

    %% Event flow
    LK -->|"Webhook Event<br/>(room_started, participant_joined, etc.)"| WH
    WH --> EL
    EL -->|"Valid event"| REC

    %% Processing
    REC -->|"room_id lookup"| OL
    OL <-->|"SQL query"| DB
    REC -->|"Update metrics"| LKM
    LKM --> REG
    REG --> ME

    %% Scraping
    SCRAPE -->|"GET /metrics"| ME
    SCRAPE --> LABELS
    LABELS --> RW
    RW -->|"remote_write"| PROM

    %% Visualization
    PROM --> DASH
    PROM --> ALERT

Component Details

LiveKit Cloud

Role in the System

LiveKit Cloud serves as the event source for the entire metrics system. It is a managed WebRTC infrastructure service that handles all the complexity of real-time video/audio - media routing, bandwidth estimation, codec negotiation, etc. From a metrics perspective, LiveKit is valuable because it provides authoritative lifecycle events for everything happening in virtual office rooms.

How Webhooks Work

When configured with a webhook URL, LiveKit sends HTTP POST requests to that endpoint whenever significant events occur. Each webhook request includes:

  • Authentication: A cryptographic signature in the Authorization header, computed using the shared API secret. The backend validates this signature to ensure the webhook genuinely originated from LiveKit.
  • Event Type: A string identifier like room_started, participant_joined, or track_published.
  • Event ID: A unique identifier for deduplication. LiveKit may retry failed webhooks, so this ID prevents double-counting.
  • Timestamp: The Unix timestamp when the event actually occurred in LiveKit (not when the webhook was sent).
  • Payload: Event-specific data such as room information, participant details, or track metadata.

Webhook Events Reference

Event When It Fires Payload Includes
room_started A room is created (first participant joins or explicit creation) Room SID, room name (our UUID), metadata
room_finished A room closes (empty timeout or explicit deletion) Room SID, room name, duration
participant_joined A user successfully connects to a room Participant SID, identity (internal DB user ID as string), room info
participant_left A user disconnects (intentional or timeout) Participant SID, room info
participant_connection_aborted Connection failed during setup Participant SID, error details
track_published A user enables camera/microphone/screen share Track SID, source type, media type, participant info
track_unpublished A user disables a media track Track SID, source type, participant info
egress_started A recording job begins Egress ID, room name, output configuration
egress_ended A recording completes or fails Egress ID, status, file results, duration
ingress_started An external stream import begins Ingress ID, input type, room name
ingress_ended A stream import ends Ingress ID, status, error (if any)

Qubital Backend

Webhook Endpoint (POST /webhook)

Location: internal/features/metrics/api/event_listener.go

The webhook endpoint is the entry point for all LiveKit events. It is intentionally simple and fast - the goal is to accept the webhook, validate it, and return a 200 OK as quickly as possible. Any slow processing would cause LiveKit to retry, potentially creating duplicate events.

The handler performs three steps:

  1. Receive and Parse: The LiveKit SDK's ReceiveWebhookEvent function reads the request body and parses it into a structured event object.

  2. Validate Signature: The SDK verifies the Authorization header against the shared API secret. Invalid signatures are rejected with an error, protecting against spoofed webhooks.

  3. Enqueue for Processing: The event is passed to the WebhookRecorder via a non-blocking Handle(event) call. This immediately returns, allowing the HTTP response to be sent while processing happens asynchronously.

func (h *WebhookApiHandler) EventListener(c *gin.Context) {
    ctx := c.Request.Context()

    event, err := h.client.ReceiveWebhookEvent(c, h.livekitService)
    if err != nil {
        logger.ErrorAPICtx(ctx, "Webhook event listener failed", err, nil)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to receive webhook event"})
        return
    }

    // Non-blocking enqueue - returns immediately
    h.recorder.Handle(event)

    logger.InfoCtx(ctx, "Webhook event received", []slog.Attr{
        slog.String("event_type", event.Event),
        slog.String("event_id", event.Id),
    })

    c.Status(http.StatusOK)
}

Webhook Recorder Service

Location: internal/features/metrics/service/recorder.go

The WebhookRecorder is a background service that processes webhook events asynchronously. It runs as a single goroutine with a buffered channel, ensuring events are processed in order while decoupling the HTTP handler from potentially slow operations like database lookups.

Processing Pipeline:

  1. Deduplication: Each event has a unique ID. The recorder maintains a map of recently seen IDs (10-minute TTL) and skips duplicates. This handles LiveKit's retry behavior gracefully.

  2. Health Metrics: Before any business logic, the recorder updates system health metrics - incrementing the event counter and recording the delivery lag (time between event creation and receipt).

  3. Organization Resolution: The recorder looks up the organization that owns the room. This is necessary because LiveKit doesn't know about our multi-tenant structure - it just sends room IDs. We query the database to map room_id → organization → org_id.

  4. Metric Updates: Based on the event type, the recorder updates the appropriate Prometheus metrics. Counters are incremented, gauges are adjusted, and histograms observe duration values.

  5. Participant Activity Persistence: On participant_left, after computing session duration, the recorder persists the completed session to the participant_activity_metrics database table via a fire-and-forget write. This durable state enables WAU/MAU computation that Prometheus alone cannot handle (see Participant Activity Metrics below).

Background Tickers

In addition to event-driven processing, the recorder's goroutine runs several periodic tickers:

Ticker Interval Purpose
Dedup cleanup 10 min (configurable) Removes expired event IDs from the deduplication map to prevent unbounded memory growth
License refresh 1 hour Re-reads each known organization's monthly_recording_minutes from the org_licenses table and updates the lk_org_recording_limit_minutes gauge. Picks up plan changes without restart.
Activity metrics refresh 5 min Queries the participant_activity_metrics table to compute per-org WAU (7-day) and MAU (30-day) aggregates, then updates the lk_unique_participants and lk_participant_total_online_seconds gauges.
Egress stats refresh 5 min Queries the recordings table to compute per-org completed egress count and total duration for sliding windows (1d, 7d, 30d), then updates the lk_egress_completed and lk_egress_total_duration_seconds gauges. Uses the partial index idx_recordings_completed_ended_at (migration 000020).
Presence count refresh 5 min Queries the room_presence table to count currently connected participants per organization, then updates the lk_participants_connected gauge.
Activity purge 24 hours Deletes rows older than 90 days from participant_activity_metrics to keep the table bounded. 90 days covers the MAU window (30d) with margin for future quarterly metrics.

Ephemeral State for Duration Calculations

Why This State Exists

To compute duration-based metrics (e.g., lk_room_duration_seconds, lk_participant_lifetime_seconds), the recorder needs to know when things started. When a room_finished webhook arrives, the recorder must calculate finished_time - started_time. But the room_finished event doesn't include the start time - that information was only available in the earlier room_started event.

The solution is simple: when room_started arrives, the recorder stores the start timestamp in memory. When room_finished arrives later, it retrieves that timestamp, calculates the duration, and then deletes the entry.

What's Stored

State Map Key Stored Value Purpose
rooms room SID Start timestamp + org_id Calculate room duration on room_finished
participants participant SID Join timestamp + room SID + org_id + identity (internal DB user ID as string) + userID (parsed int64) Calculate session duration on participant_left and persist session to DB for WAU/MAU

tracks room SID + source + type Integer count Track active media streams per org_id
egress egress ID Start timestamp + org_id + request_type + room_name Calculate recording duration on egress_ended
ingress ingress ID Start timestamp + org_id + input_type + room_name Calculate stream duration on ingress_ended

Key Characteristics

  • Ephemeral: This state exists only in memory. If the pod restarts, it's lost. This is acceptable because we only lose in-flight duration calculations - the counters in Prometheus (which are persisted in Grafana Cloud) remain accurate.

  • Bounded Size: The state only contains currently active entities. Once a room finishes or a participant leaves, their entry is deleted. The maps cannot grow unbounded - they shrink back down as activity ends.

  • Not Critical for Counters: Counters (*_total metrics) don't need this state - they just increment on each event. Only histograms (durations) require start/end pairing.

  • Trade-off: We could eliminate this state by storing start times in the database or Redis, but that would add latency and complexity for minimal benefit. The current approach is simple and fits the resource constraints.

Memory Usage (Not a Concern)

A common question is whether these in-memory maps will cause problems as customer usage grows. The answer is no - the memory usage is negligible:

Scenario Concurrent Entities Estimated Memory
Small scale 50 participants, 10 rooms ~15 KB
Medium scale 500 participants, 100 rooms ~100 KB
Large scale 5,000 participants, 1,000 rooms ~1 MB
Extreme scale 50,000 participants, 10,000 rooms ~10 MB

Each map entry is roughly 100-150 bytes (Go string headers, timestamps, a few pointers). Even at extreme scales that far exceed expected usage, the memory footprint is tiny compared to the pod's 300MB allocation.

Why This Isn't a Scaling Concern

  1. Maps are bounded by concurrent activity, not total activity: If 1 million participants join and leave throughout the day, but only 500 are ever connected simultaneously, the maps only ever hold ~500 entries.

  2. Entries are automatically cleaned up: When a participant_left event arrives, the entry is immediately deleted. There's no accumulation over time.

  3. The data is trivially small: Timestamps and short strings. No large objects, no nested structures, no media data.


📌 Note for Future Scaling

The in-memory maps are not the bottleneck for scaling this system. If scaling becomes a concern, the limiting factor would be the single-goroutine event processing model (see Scaling Considerations below), not memory usage.


Scaling Considerations

This section separates two distinct concerns that are sometimes conflated: memory usage and throughput. They are independent issues with different thresholds and solutions.

Memory Usage: Not a Concern

As explained above, the in-memory state maps use negligible memory (~1-2MB even at thousands of concurrent participants). This is a non-issue for the foreseeable future and doesn't require any action.

Throughput: The Single-Goroutine Model

The WebhookRecorder processes events in a single goroutine, reading from a buffered channel. This design is intentionally simple:

[Webhook Handler] --enqueue--> [Buffered Channel (1000)] --dequeue--> [Single Processing Goroutine]

Why Single-Threaded?

  1. No synchronization needed: Metric updates and state map operations don't require mutexes because only one goroutine accesses them.
  2. Predictable ordering: Events for the same room/participant are processed in order.
  3. Simple debugging: No race conditions, no deadlocks, no concurrent access bugs.
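The non-blocking enqueue into the buffered channel can be sketched with a select/default, which is also where the "queue full" log line comes from. The Recorder type here is illustrative, not the actual implementation.

```go
package main

import "fmt"

// Recorder sketches the buffered-channel handoff between the webhook HTTP
// handler and the single processing goroutine.
type Recorder struct {
	events chan string
}

func NewRecorder(buffer int) *Recorder {
	return &Recorder{events: make(chan string, buffer)}
}

// Handle enqueues without blocking: if the buffer is full, the event is
// dropped and logged rather than stalling the HTTP handler (which would
// trigger LiveKit retries).
func (r *Recorder) Handle(event string) bool {
	select {
	case r.events <- event:
		return true
	default:
		fmt.Println("Webhook recorder: queue full, dropping event")
		return false
	}
}

func main() {
	r := NewRecorder(2) // tiny buffer to demonstrate the drop path
	fmt.Println(r.Handle("a")) // true
	fmt.Println(r.Handle("b")) // true
	fmt.Println(r.Handle("c")) // false: buffer full, event dropped
}
```

Dropping under pressure trades completeness for handler latency, which is acceptable because lost events only skew gauges that the DB-backed refreshers later self-correct.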

When Would This Become a Problem?

The single goroutine can process roughly 500-2,000 events per second (depending on database latency for org lookups). Each event involves:

  • Deduplication check (fast, in-memory)
  • Metric updates (fast, in-memory)
  • Optional database lookup (slower, ~5-50ms)

To stress this system, you'd need sustained event rates that exceed processing capacity. Here's a rough estimate:

Concurrent Participants Estimated Events/Second System Status
100 ~5-10 Comfortable
1,000 ~50-100 Comfortable
5,000 ~250-500 Approaching limit
10,000+ ~500-1000+ May need optimization

The events-per-second figures are estimates based on typical activity: participants joining and leaving, enabling and disabling tracks. Heavy activity (frequent track toggles) increases them.

Signs of Throughput Issues

  • "Webhook recorder: queue full, dropping event" in logs
  • Growing lag between event timestamps and processing time
  • lk_webhook_delivery_lag_seconds p99 increasing over time

Future Solutions (If Needed)

If throughput becomes a concern at 10,000+ concurrent participants:

  1. Shard by room: Multiple processing goroutines, each handling a subset of rooms. Maintains per-room ordering while increasing parallelism.
  2. Batch database lookups: Cache org_id by room for a short TTL, reducing database round-trips.
  3. Horizontal scaling: Multiple backend pods behind a load balancer, each handling a portion of webhooks.
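Option 1 (shard by room) could be sketched as a hash of the room SID choosing one of N worker goroutines, which preserves per-room ordering while adding parallelism. This is not implemented; the function below is a hypothetical illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a room SID to one of N worker shards. All events for the
// same room hash to the same shard, so per-room ordering is preserved.
func shardFor(roomSID string, shards int) int {
	h := fnv.New32a()
	h.Write([]byte(roomSID))
	return int(h.Sum32()) % shards
}

func main() {
	a := shardFor("RM_abc", 4)
	b := shardFor("RM_abc", 4)
	fmt.Println(a == b)            // true: stable assignment per room
	fmt.Println(a >= 0 && a < 4)   // true: always a valid shard index
}
```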

These optimizations are not currently implemented because they add complexity that isn't justified at current scale. The single-goroutine model handles Qubital's expected load with significant headroom.


Organization Lookup

Location: internal/features/metrics/service/org_lookup.go

Every metric needs an org_id label for multi-tenant filtering. The OrgLookup service resolves room IDs (which LiveKit knows) to organization IDs (which LiveKit doesn't know).

Resolution Process:

  1. Receive the room name from the webhook (this is the room's UUID in our database)
  2. Query: SELECT organizations.workos_org_id, organizations.name FROM rooms JOIN organizations ON rooms.organization_id = organizations.id WHERE rooms.id = ?
  3. Return OrgData{OrgID, OrgName, MonthlyRecordingMinutes} for metric labeling and license tracking
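The resolution can be sketched as below. The querier/rowScanner interfaces and the fake DB are stand-ins so the sketch is self-contained; the real service queries PostgreSQL via database/sql, and MonthlyRecordingMinutes is populated separately from the org_licenses table.

```go
package main

import "fmt"

// OrgData carries the resolved organization identity used for metric labels.
type OrgData struct {
	OrgID                   string
	OrgName                 string
	MonthlyRecordingMinutes int // filled from org_licenses, not this query
}

// Minimal stand-ins for *sql.DB / *sql.Row, so this sketch runs without a driver.
type rowScanner interface{ Scan(dest ...any) error }
type querier interface {
	QueryRow(query string, args ...any) rowScanner
}

const lookupSQL = `SELECT organizations.workos_org_id, organizations.name
FROM rooms JOIN organizations ON rooms.organization_id = organizations.id
WHERE rooms.id = ?`

// Resolve maps a LiveKit room name (our room UUID) to its organization.
func Resolve(db querier, roomID string) (OrgData, error) {
	var d OrgData
	err := db.QueryRow(lookupSQL, roomID).Scan(&d.OrgID, &d.OrgName)
	return d, err
}

// fakeDB returns canned rows keyed by room ID, for demonstration only.
type fakeDB map[string][2]string

type fakeRow struct {
	vals [2]string
	err  error
}

func (r fakeRow) Scan(dest ...any) error {
	if r.err != nil {
		return r.err
	}
	*dest[0].(*string) = r.vals[0]
	*dest[1].(*string) = r.vals[1]
	return nil
}

func (db fakeDB) QueryRow(query string, args ...any) rowScanner {
	if v, ok := db[args[0].(string)]; ok {
		return fakeRow{vals: v}
	}
	return fakeRow{err: fmt.Errorf("room not found")}
}

func main() {
	db := fakeDB{"room-uuid-1": {"org_01ABC", "Acme"}}
	d, err := Resolve(db, "room-uuid-1")
	fmt.Println(d.OrgID, d.OrgName, err) // org_01ABC Acme <nil>
}
```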

Why org_id Instead of org_name?

Organization names are mutable - customers can rename their organization at any time. If we used names as metric labels, renaming would create a new time series and break historical continuity. WorkOS organization IDs are immutable, ensuring metrics remain linked to the same organization forever.

To display friendly names in Grafana, we emit a separate lk_org_info{org_id, org_name} metric that maps IDs to names. Grafana can join this using PromQL's group_left.
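A join of that shape can be sketched in PromQL as follows, assuming lk_org_info follows the usual info-metric convention of always having value 1 (so multiplying by it preserves the left-hand values while attaching org_name):

```promql
# Attach org_name to each org's participant count via the info metric.
sum by (org_id) (lk_participants_active)
  * on (org_id) group_left(org_name)
  lk_org_info
```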

Recording Limit Resolution

In addition to resolving org identity, OrgLookup also fetches each organization's monthly recording limit from the org_licenses table. This limit (e.g., 1200 minutes for a standard plan) is included in the OrgData struct and emitted as the lk_org_recording_limit_minutes gauge when an organization is first discovered. A background ticker re-reads the limits from the database every hour to pick up plan changes (upgrades or downgrades) without requiring a restart. This enables Grafana to draw dynamic threshold lines per tenant on the "Monthly Recording Minutes" panel, replacing the previous hardcoded vector(1200) approach.

Metrics Endpoint (GET /metrics)

Location: pkg/metrics/handler.go and cmd/main.go

The /metrics endpoint exposes all registered Prometheus metrics in the standard text exposition format. When Grafana Alloy (or any Prometheus-compatible scraper) sends a GET request, it receives output like:

# HELP lk_participants_active active participants
# TYPE lk_participants_active gauge
lk_participants_active{org_id="org_01ABC",project="qubital",region="eu"} 15
lk_participants_active{org_id="org_02XYZ",project="qubital",region="eu"} 8

# HELP lk_participant_sessions_total participant sessions started
# TYPE lk_participant_sessions_total counter
lk_participant_sessions_total{org_id="org_01ABC",project="qubital",region="eu"} 1247

The endpoint is stateless - it simply reads current values from the Prometheus registry and formats them. There's no computation or database access during scraping.


Grafana Alloy

Role in the System

Grafana Alloy is a vendor-neutral observability collector that bridges the gap between the backend and Grafana Cloud. While we could have the backend push metrics directly to Grafana Cloud, using Alloy provides several benefits:

  • Decoupling: The backend doesn't need Grafana Cloud credentials or awareness of the remote write protocol.
  • Reliability: Alloy handles retries, backpressure, and temporary network failures gracefully.
  • Batching: Instead of sending each metric individually, Alloy batches samples for efficient transmission.
  • Label Enrichment: Alloy adds infrastructure-level labels (environment, region) that the backend shouldn't need to know about.

Scraping Configuration

Alloy scrapes the backend every 15 seconds. This interval balances freshness (metrics are at most 15 seconds old) against load (240 scrapes per hour is a negligible burden for a small pod).

prometheus.scrape "backend" {
  job_name        = "qubital-backend"
  scrape_interval = "15s"
  scrape_timeout  = "10s"
  scheme          = "http"
  metrics_path    = "/metrics"

  targets = [
    { "__address__" = env("BACKEND_SCRAPE_TARGET") },  // e.g., "backend-web:3001"
  ]

  forward_to = [prometheus.remote_write.grafanacloud.receiver]
}

Remote Write Configuration

After scraping, Alloy forwards metrics to Grafana Cloud's Prometheus endpoint. It adds external labels that apply to all metrics from this Alloy instance:

prometheus.remote_write "grafanacloud" {
  external_labels = {
    environment = env("ENVIRONMENT"),  // "test", "dev", or "prod"
    region      = env("REGION"),        // "eu", "us", etc.
    service     = "qubital-backend",
  }

  endpoint {
    url = env("GRAFANA_PROM_REMOTE_WRITE_URL")

    basic_auth {
      username = env("GRAFANA_PROM_USERNAME")
      password = env("GRAFANA_API_KEY")
    }

    queue_config {
      max_samples_per_send = 5000
      batch_send_deadline  = "20s"
      retry_on_http_429    = true
    }
  }
}

Label Strategy

Labels are added at two levels, creating a clear separation of concerns:

Label Added By Scope Purpose
org_id Backend Per-metric Multi-tenant customer identification
project Backend Per-metric LiveKit project identifier
source, type Backend Track metrics only Media track classification
request_type Backend Egress metrics only Recording type classification
environment Alloy All metrics Deployment environment (test/dev/prod)
region Alloy All metrics Infrastructure region
service Alloy All metrics Service identifier for filtering

This separation means the backend doesn't need to know which environment it's running in - that's infrastructure configuration handled by Alloy.


Grafana Cloud

Prometheus Storage

Grafana Cloud provides managed Prometheus storage with:

  • 13-Month Retention: Metrics are stored for over a year, enabling long-term trend analysis and year-over-year comparisons.
  • High Availability: Data is replicated across multiple availability zones.
  • Automatic Scaling: Storage and query capacity scale automatically based on usage.
  • PromQL Interface: Full PromQL support for complex queries and aggregations.

Querying Metrics

Example PromQL queries for common use cases:

# Active participants for a specific organization in production
lk_participants_active{org_id="org_01ABC", environment="prod"}

# Participant session duration percentiles (p50, p90, p99)
histogram_quantile(0.50, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))

# Recording success rate over the last hour
sum(rate(lk_egress_ended_total{result="success", org_id="$org_id"}[1h]))
/
sum(rate(lk_egress_ended_total{org_id="$org_id"}[1h])) * 100

# Camera adoption rate (% of participants with camera enabled)
sum(lk_tracks_active{source="camera", org_id="$org_id"})
/
sum(lk_participants_active{org_id="$org_id"}) * 100

# Weekly Active Users (WAU) per tenant
lk_unique_participants{org_id="$org_id", window="7d"}

# Monthly Active Users (MAU) per tenant
lk_unique_participants{org_id="$org_id", window="30d"}

# Average time online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}

# Recording limit threshold line (dynamic per tenant, read from DB)
lk_org_recording_limit_minutes{org_id="$org_id"}

Multi-Tenant Dashboard Variables

To display organization names while filtering by stable IDs:

  1. Create org_name variable: label_values(lk_org_info, org_name) - This shows a dropdown of human-readable organization names.

  2. Create hidden org_id variable: label_values(lk_org_info{org_name="$org_name"}, org_id) - This automatically resolves the selected name to its ID.

  3. Use in queries: lk_participants_active{org_id="$org_id"} - Filtering happens on the stable ID, but users see friendly names.


Metrics Reference

Room Metrics

Metric Type Labels Webhook Trigger Description
lk_room_active Gauge project, region, org_id +1 on room_started, -1 on room_finished Number of currently active rooms. Incremented when a room is created and decremented when the last participant leaves (or the room is explicitly deleted). Subject to gauge drift if events are lost during pod restarts.
lk_rooms_started_total Counter project, region, org_id room_started Cumulative count of rooms created since the last pod restart. Survives Prometheus scrape resets via rate()/increase(). Use increase(lk_rooms_started_total[1h]) for rooms created in the last hour.
lk_rooms_finished_total Counter project, region, org_id room_finished Cumulative count of rooms closed. The difference increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h]) gives a drift-resistant approximation of active rooms.
lk_room_duration_seconds Histogram project, region, org_id room_finished Distribution of room lifetimes in seconds (time between room_started and room_finished). Use histogram_quantile() for percentiles. Requires the in-memory rooms state map to pair start/end events.

Histogram Buckets: Exponential from 30s to ~3 hours (30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953 seconds)

Participant Metrics

Metric Type Labels Webhook Trigger Description
lk_participants_active Gauge project, region, org_id +1 on participant_joined, -1 on participant_left/connection_aborted Currently connected participants. Only counts standard participants (STANDARD kind) — service participants like egress bots, ingress, SIP bridges, and agents are excluded.
lk_participant_sessions_total Counter project, region, org_id participant_joined Cumulative count of participant sessions started. One session = one join event. If a user disconnects and reconnects, that counts as two sessions.
lk_participant_lifetime_seconds Histogram project, region, org_id participant_left Distribution of individual session durations in seconds (time between participant_joined and participant_left). On participant_left, the completed session is also persisted to the participant_activity_metrics DB table for WAU/MAU computation.

Histogram Buckets: Exponential from 15s to ~3 hours (15, 27, 48, 87, 157, 283, 509, 916, 1649, 2969 seconds)

Participant Activity Metrics (WAU/MAU)

These metrics are not derived from ephemeral in-memory state like the other metrics above. They are computed from the participant_activity_metrics database table, which stores daily pre-aggregated participant sessions. A background ticker queries the table every 5 minutes to refresh these gauges.

Why a database table? Prometheus counters and histograms handle restarts via reset detection, but counting distinct identities over a sliding time window (e.g., "how many unique users were active in the last 7 days?") requires durable state. Prometheus has no COUNT(DISTINCT) equivalent. The database table provides the per-user granularity needed for these business KPIs.

Daily pre-aggregation: Instead of storing one row per session (which would scale to millions of rows), the table stores one row per user per organization per day via UPSERT. On each participant_left webhook, the session duration is added to the existing row for that user+org+day, or a new row is inserted if none exists. This reduces storage by 3-5x while preserving the per-user granularity needed for COUNT(DISTINCT user_id).

Data retention: Rows older than 90 days are purged daily by a background ticker. 90 days covers the current MAU window (30d) and WAU window (7d) with margin for future quarterly metrics. If a quarterly (90d) aggregation window is added, retention should be extended to at least 120 days.

Metric Type Labels Data Source Description
lk_unique_participants Gauge project, region, org_id, window DB query (5 min refresh) Number of distinct users active per organization within the sliding time window. COUNT(DISTINCT user_id) from participant_activity_metrics where activity_date > now() - window.
lk_participant_total_online_seconds Gauge project, region, org_id, window DB query (5 min refresh) Total seconds spent online by all participants per organization within the sliding time window. SUM(total_duration_seconds) from participant_activity_metrics.

window Values: 7d (WAU - Weekly Active Users), 30d (MAU - Monthly Active Users)

Grafana Usage:

# Weekly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="7d"}

# Monthly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="30d"}

# Average time spent online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}

Egress Stats Metrics (DB-Backed)

These metrics are not derived from ephemeral in-memory state. They are computed from the recordings database table, which stores one row per completed egress (recording) job. A background ticker queries the table every 5 minutes to refresh these gauges with sliding time windows (1d, 7d, 30d).

Why DB-backed gauges when we already have in-memory egress counters? The in-memory metrics (lk_egress_ended_total, lk_egress_duration_seconds) reset to zero on pod restart. For billing-critical data like "how many recordings did this customer complete this month" or "how many total recording minutes were consumed", losing state on restart is unacceptable. These DB-backed gauges read from the recordings table which persists across restarts, providing durable billing and usage metrics.

Sliding windows: Same pattern as the WAU/MAU gauges above. The ticker queries the recordings table three times per cycle — once for each window (1d, 7d, 30d) — using the GetCompletedStatsSince repository method. Each query filters on status = 'completed' AND ended_at >= now() - window and groups by organization. The partial index idx_recordings_completed_ended_at (migration 000020) ensures these queries use an index scan, not a sequential scan, even as the recordings table grows.
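The per-window query issued by GetCompletedStatsSince is roughly of this shape (a sketch; the organization grouping column name is assumed):

```sql
-- Executed once per window (1d, 7d, 30d) each 5-minute cycle.
-- The partial index idx_recordings_completed_ended_at keeps this an index scan.
SELECT org_id,
       COUNT(*)              AS completed_count,
       SUM(duration_seconds) AS total_duration_seconds
FROM recordings
WHERE status = 'completed'
  AND ended_at >= now() - interval '30 days'
GROUP BY org_id;
```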

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_egress_completed | Gauge | project, region, org_id, window | DB query (5 min refresh) | Number of completed egress (recording) jobs per organization within the sliding time window. COUNT(*) from recordings where status = 'completed' AND ended_at >= now() - window. |
| lk_egress_total_duration_seconds | Gauge | project, region, org_id, window | DB query (5 min refresh) | Total recording seconds per organization within the sliding time window. SUM(duration_seconds) from recordings where status = 'completed' AND ended_at >= now() - window. |

window Values: 1d (daily), 7d (weekly), 30d (monthly)

Grafana Usage:

# Monthly completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="30d"}

# Daily completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="1d"}

# Monthly recording minutes used (for billing)
lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60

# Recording usage as percentage of plan limit
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100

Connected Participants Metric (DB-Backed)

This metric is not derived from ephemeral in-memory state. It is a point-in-time snapshot computed from the room_presence database table, which stores one row per currently connected participant. A background ticker queries the table every 5 minutes to refresh this gauge.

Why DB-backed when we already have lk_participants_active? The in-memory gauge lk_participants_active tracks join/leave events and is subject to two failure modes: (1) it resets to zero on pod restart, and (2) it can drift from reality if events are lost. The DB-backed lk_participants_connected reads the actual room_presence table state, so it self-corrects every 5 minutes regardless of missed webhooks or restarts.

No time window needed: Unlike the egress stats above, this metric is inherently a snapshot — "how many participants are connected right now." Grafana can apply any time-based function on top of the scraped history: max_over_time for daily peaks, avg_over_time for averages, etc.
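The underlying CountByOrg query is essentially a grouped count over the presence table (a sketch; the grouping column name is assumed):

```sql
-- One row per currently connected participant; grouped count per tenant.
SELECT org_id, COUNT(*) AS connected_participants
FROM room_presence
GROUP BY org_id;
```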

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_participants_connected | Gauge | project, region, org_id | DB query (5 min refresh) | Currently connected participants per organization, read from the room_presence table. COUNT(*) grouped by organization. Self-corrects every 5 minutes — immune to gauge drift and pod restarts. |

Grafana Usage:

# Current connected participants for a tenant
lk_participants_connected{org_id="$org_id"}

# Peak connected participants over the last day
max_over_time(lk_participants_connected{org_id="$org_id"}[1d])

# Average connected participants over the last week
avg_over_time(lk_participants_connected{org_id="$org_id"}[7d])

Track Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_tracks_active | Gauge | project, region, org_id, source, type | +1 on track_published, -1 on track_unpublished | Currently published media tracks, broken down by source (camera, microphone, screen share) and type (audio, video). Useful for computing feature adoption rates, e.g., lk_tracks_active{source="camera"} / lk_participants_active gives the camera-on ratio. |
| lk_tracks_published_total | Counter | project, region, org_id, source, type | track_published | Cumulative count of tracks published. Each time a participant enables their camera, microphone, or starts a screen share, this counter increments. |

Source Values: camera, microphone, screen_share, screen_share_audio, unknown

Type Values: audio, video, unknown
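These track metrics compose into feature-adoption panels; the following Grafana queries are illustrative examples of the adoption-rate formula mentioned above:

```promql
# Camera-on share for a specific tenant (fraction of active participants)
sum(lk_tracks_active{org_id="$org_id", source="camera"})
/
sum(lk_participants_active{org_id="$org_id"})

# Screen shares started over the last hour
increase(lk_tracks_published_total{org_id="$org_id", source="screen_share"}[1h])
```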

Egress (Recording) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_egress_active | Gauge | project, region, org_id, request_type | +1 on egress_started, -1 on egress_ended | Currently running recording jobs. Broken down by request_type to distinguish room composite recordings from track-level or web recordings. |
| lk_egress_started_total | Counter | project, region, org_id, request_type | egress_started | Cumulative count of recording jobs started. On egress_started, the recorder also stores the start state (timestamp, org, request type) in-memory for duration calculation when the recording ends. |
| lk_egress_ended_total | Counter | project, region, org_id, request_type, result | egress_ended | Cumulative count of recording jobs completed, labeled by result (success or failed). On egress_ended, the recorder also persists the recording completion data (status, duration, expiry) to the recordings database table. |
| lk_egress_duration_seconds | Histogram | project, region, org_id, request_type, result | egress_ended | Distribution of recording durations in seconds. Useful for monitoring typical recording lengths and detecting abnormally short recordings (potential failures). |

request_type Values: room_composite, web, track, participant, unknown

result Values: success, failed

Histogram Buckets: 30s, 60s, 120s, 300s, 600s, 1200s, 1800s, 3600s, 7200s, 14400s
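Since lk_egress_duration_seconds is a Prometheus histogram, quantiles and short-recording shares follow the standard _bucket pattern; for example:

```promql
# p95 recording duration over the last day
histogram_quantile(0.95, sum(rate(lk_egress_duration_seconds_bucket[1d])) by (le))

# Share of recordings shorter than 60s (potential failures)
sum(rate(lk_egress_duration_seconds_bucket{le="60"}[1d]))
/
sum(rate(lk_egress_duration_seconds_bucket{le="+Inf"}[1d]))
```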

Ingress (Stream Import) Metrics

| Metric | Type | Labels | Webhook Trigger | Description |
|--------|------|--------|-----------------|-------------|
| lk_ingress_active | Gauge | project, region, org_id, input_type | +1 on ingress_started, -1 on ingress_ended | Currently running stream import jobs, broken down by input protocol (RTMP, WHIP, URL). |
| lk_ingress_started_total | Counter | project, region, org_id, input_type | ingress_started | Cumulative count of stream import jobs started. |
| lk_ingress_ended_total | Counter | project, region, org_id, input_type, result | ingress_ended | Cumulative count of stream import jobs completed, labeled by result (success or failed). Failures are detected from the ingress state's error field or ENDPOINT_ERROR status. |
| lk_ingress_duration_seconds | Histogram | project, region, org_id, input_type, result | ingress_ended | Distribution of stream import durations in seconds. |

input_type Values: rtmp, whip, url, unknown

result Values: success, failed

Organization License Metrics

| Metric | Type | Labels | Data Source | Description |
|--------|------|--------|-------------|-------------|
| lk_org_recording_limit_minutes | Gauge | project, region, org_id | DB query (hourly refresh) | Monthly recording limit in minutes for the organization, read from the org_licenses table. Emitted on first org discovery and refreshed hourly. Used in Grafana to draw a dynamic per-tenant threshold line on the "Monthly Recording Minutes" panel. Zero if the organization has no license. |

Grafana Usage:

# Dynamic threshold line per tenant on the recording minutes panel
lk_org_recording_limit_minutes{org_id="$org_id"}

# Recording usage as percentage of plan limit (uses DB-backed gauge)
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100

Webhook Health Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| lk_webhook_events_total | Counter | event | Total webhook events received, broken down by event type (e.g., room_started, participant_joined). Use rate(lk_webhook_events_total[5m]) to monitor event throughput. A sudden drop to zero during business hours indicates a connectivity issue between LiveKit and the backend. |
| lk_webhook_delivery_lag_seconds | Histogram | event | Time in seconds between when the event actually occurred in LiveKit (CreatedAt timestamp) and when the backend received the webhook. High p99 values indicate network latency or backend processing delays. Only recorded for events with a positive CreatedAt timestamp. |
| lk_webhook_duplicates_total | Counter | event | Count of duplicate webhook events detected via the deduplication map (same event ID seen within the 10-minute TTL window). LiveKit retries webhook delivery on timeout, so some duplicates are expected. A sustained high rate may indicate the backend is responding too slowly, causing LiveKit to retry aggressively. |
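Typical panels over these health metrics (illustrative queries):

```promql
# Event throughput by type
sum(rate(lk_webhook_events_total[5m])) by (event)

# p99 webhook delivery lag
histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le))

# Duplicate ratio (sustained high values suggest slow responses)
sum(rate(lk_webhook_duplicates_total[5m])) / sum(rate(lk_webhook_events_total[5m]))
```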

System Health Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| lk_org_lookup_failures_total | Counter | reason | Failed room → organization lookups. With proper FK constraints, any increment indicates a data integrity issue requiring immediate investigation. |
| lk_org_info | Gauge | org_id, org_name | Metadata metric mapping org_id to org_name for Grafana group_left joins. Always has value 1. Emitted once per organization on first discovery via a webhook event. |

reason Values: not_configured (OrgLookup interface not provided), room_not_found (room doesn't exist in database), no_organization (room exists but has no organization - FK violation), no_org_id (organization exists but has no WorkOS ID - sync issue)
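The lk_org_info metric exists precisely for joins like the following, which attach a human-readable org_name to any org_id-labeled series:

```promql
# Active participants, labeled with the organization name
lk_participants_active
  * on(org_id) group_left(org_name)
lk_org_info
```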


Complete Data Flow

Sequence Diagram: Webhook to Dashboard

sequenceDiagram
    participant LK as LiveKit Cloud
    participant BE as Backend /webhook
    participant REC as WebhookRecorder
    participant DB as PostgreSQL
    participant REG as Prometheus Registry
    participant ME as Backend /metrics
    participant AL as Grafana Alloy
    participant GC as Grafana Cloud

    Note over LK,GC: 1. Event Occurs (participant joins room)

    LK->>BE: POST /webhook<br/>{event: "participant_joined", id: "evt_123", ...}
    BE->>BE: Validate webhook signature
    BE->>REC: Handle(event) [non-blocking enqueue]
    BE-->>LK: 200 OK

    Note over REC,REG: 2. Async Processing (in background goroutine)

    REC->>REC: Check deduplication (event.id not seen?)
    REC->>REC: Record lk_webhook_events_total++
    REC->>REC: Record lk_webhook_delivery_lag_seconds
    REC->>DB: SELECT org_id FROM rooms JOIN organizations
    DB-->>REC: {org_id: "org_01ABC", org_name: "Acme Corp"}
    REC->>REC: Store participant start time (for duration calc later)
    REC->>REG: lk_participants_active{org_id="org_01ABC"}.Inc()
    REC->>REG: lk_participant_sessions_total{org_id="org_01ABC"}.Inc()

    Note over AL,GC: 3. Periodic Scrape (every 15 seconds)

    AL->>ME: GET /metrics
    ME->>REG: Collect all metric values
    REG-->>ME: Prometheus text format
    ME-->>AL: lk_participants_active{org_id="org_01ABC"} 15\n...
    AL->>AL: Add external_labels (environment, region, service)
    AL->>GC: remote_write (batched, compressed, with retries)
    GC->>GC: Store in Prometheus

    Note over GC: 4. Query & Visualize

    GC->>GC: Dashboard executes PromQL queries
    GC->>GC: Render panels with current and historical data

Configuration Reference

Backend Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| LIVEKIT_API_KEY | Yes | LiveKit API key for webhook signature validation | APIxxxxxxxx |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret for webhook signature validation | secret-key |
| LIVEKIT_PROJECT | No | Project name for metric labels | qubital |

Grafana Alloy Environment Variables

| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| ENVIRONMENT | Yes | Environment name for external label | prod |
| REGION | Yes | Deployment region for external label | eu |
| BACKEND_SCRAPE_TARGET | Yes | Backend service DNS and port | backend-web:3001 |
| GRAFANA_PROM_REMOTE_WRITE_URL | Yes | Grafana Cloud Prometheus endpoint | https://prometheus-xxx.grafana.net/api/prom/push |
| GRAFANA_PROM_USERNAME | Yes | Grafana Cloud instance ID | 123456 |
| GRAFANA_API_KEY | Yes | Grafana Cloud API key with write permissions | glc_eyJ... |

Recorder Configuration

The WebhookRecorder has two tunable parameters:

type Config struct {
    QueueSize int           // Default: 1000 events
    DedupTTL  time.Duration // Default: 10 minutes
}
  • QueueSize: Maximum number of events waiting to be processed. If exceeded, new events are dropped (and logged). Increase if you see "queue full" warnings.
  • DedupTTL: How long to remember event IDs for deduplication. Must be longer than LiveKit's retry window (typically a few minutes).

Operational Considerations

Gauge Drift

Gauge metrics (lk_*_active) track current counts by incrementing on "start" events and decrementing on "end" events. This approach can drift from reality if events are lost:

Causes of Drift:

  • Network issues between LiveKit and the backend
  • Backend pod restarts (in-flight events are lost)
  • Queue overflow (events dropped due to backpressure)

Detection:

  • Compare lk_rooms_started_total - lk_rooms_finished_total with lk_room_active - they should match
  • Monitor lk_webhook_duplicates_total - high rates may indicate delivery issues
  • Alert on lk_org_lookup_failures_total > 0 - indicates data integrity problems

Mitigation:

  • Gauges self-correct over time as rooms finish and participants leave
  • Counter metrics (*_total) remain accurate regardless of drift
  • For critical accuracy, use counter differences: increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
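The detection and mitigation checks above can be expressed directly in PromQL (illustrative, using the metric names as given; note that counters also reset on pod restart, so evaluate over stable periods):

```promql
# Drift check: should hover near zero
(lk_rooms_started_total - lk_rooms_finished_total) - lk_room_active

# Drift-immune room activity over the last hour
increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
```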

Deduplication Behavior

LiveKit may retry webhook delivery if it doesn't receive a timely response. The recorder handles this by tracking event IDs for 10 minutes:

  • First occurrence: Processed normally
  • Subsequent occurrences (same ID within 10 minutes): Skipped, counted in lk_webhook_duplicates_total
  • After 10 minutes: ID is forgotten (old events won't be reprocessed anyway)

Backpressure Handling

If webhook events arrive faster than they can be processed, the queue (1000 events) acts as a buffer. If it fills completely:

  • New events are dropped (not queued)
  • A warning is logged: "Webhook recorder: queue full, dropping event"
  • The event_type and event_id are included in the log for debugging

This is a safety valve to prevent unbounded memory growth. If you see these warnings, investigate why processing is slow (database latency? high event volume?) or increase the queue size.
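The drop-on-full behavior is a standard non-blocking channel send; a sketch under the assumption of a buffered Go channel as the queue (the Event type and function name here are illustrative):

```go
package main

import (
	"fmt"
	"log"
)

// Event is a minimal stand-in for a parsed LiveKit webhook event
// (illustrative; the real event type is not shown in this document).
type Event struct {
	Type string
	ID   string
}

// Enqueue performs a non-blocking send into the bounded queue: if the queue
// has room the event is accepted; otherwise it is dropped and a warning is
// logged with the event type and ID for debugging.
func Enqueue(queue chan Event, ev Event) bool {
	select {
	case queue <- ev:
		return true
	default:
		log.Printf("Webhook recorder: queue full, dropping event (event_type=%s, event_id=%s)", ev.Type, ev.ID)
		return false
	}
}

func main() {
	queue := make(chan Event, 2) // the real default QueueSize is 1000
	fmt.Println(Enqueue(queue, Event{Type: "participant_joined", ID: "evt_1"}))
	fmt.Println(Enqueue(queue, Event{Type: "track_published", ID: "evt_2"}))
	fmt.Println(Enqueue(queue, Event{Type: "room_finished", ID: "evt_3"})) // dropped: queue full
}
```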


Alerting Recommendations

| Alert Name | PromQL Condition | Severity | Description |
|------------|------------------|----------|-------------|
| WebhookDeliveryLagHigh | histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le)) > 10 | Warning | 99th percentile delivery lag exceeds 10 seconds |
| OrgLookupFailures | increase(lk_org_lookup_failures_total[1h]) > 0 | Critical | Any org lookup failure indicates data integrity issues |
| RecordingFailures | increase(lk_egress_ended_total{result="failed"}[1h]) > 0 | Warning | Recording jobs are failing |
| NoWebhookEvents | rate(lk_webhook_events_total[5m]) == 0 | Critical | No webhook events received (during business hours) |
| HighDuplicateRate | rate(lk_webhook_duplicates_total[5m]) > 1 | Warning | More than 1 duplicate per second indicates delivery issues |
| RecordingLimitApproaching | lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60 > lk_org_recording_limit_minutes{org_id="$org_id"} * 0.8 | Warning | Organization has used 80% of monthly recording limit. Uses the DB-backed gauge which survives pod restarts, replacing the previous increase(lk_egress_duration_seconds_sum[30d]) approach. |
| WAUDropped | lk_unique_participants{window="7d"} == 0 | Warning | No active users in the last 7 days for an organization (during business periods) |

File Reference

| File Path | Purpose |
|-----------|---------|
| pkg/metrics/livekit.go | Metric definitions (names, types, labels, help text) and Prometheus registration |
| pkg/metrics/handler.go | HTTP handler wrapper for the /metrics endpoint |
| internal/features/metrics/api/event_listener.go | Webhook HTTP handler - receives and validates LiveKit webhooks |
| internal/features/metrics/service/recorder.go | Background service - processes events and updates metrics |
| internal/features/metrics/service/models.go | Data structures (Config, OrgData, RoomState, etc.) and interfaces |
| internal/features/metrics/service/org_lookup.go | Database queries to resolve room IDs to organization IDs and fetch recording limits |
| internal/features/metrics/service/activity_recorder.go | Adapters bridging repository interfaces to service-layer interfaces: participant activity (ActivityRecorder), egress stats (EgressStatsLoader), and connected participant counts (PresenceCounter). Each adapter defines a narrow repository interface, converts repository types to service-layer types, and provides a constructor for dependency injection. |
| internal/features/metrics/service/presence_manager.go | Single-session enforcement (kicks user from old room when joining new one) |
| internal/repository/participant_activity_metric_repository.go | Database operations for daily pre-aggregated participant activity (UPSERT, aggregate queries, purge) |
| internal/repository/recording_repository.go | Database operations for recordings. Includes GetCompletedStatsSince which powers the lk_egress_completed and lk_egress_total_duration_seconds DB-backed gauges via a GROUP BY query on completed recordings per organization. Also defines the OrgRecordingStats scan target struct. |
| internal/repository/room_presence_repository.go | Database operations for room presence tracking. Includes CountByOrg which powers the lk_participants_connected DB-backed gauge via a GROUP BY query on currently connected participants per organization. Also defines the OrgPresenceCount scan target struct. |
| internal/domain/database/participant_activity_metric.go | GORM model for the participant_activity_metrics table |
| internal/database/migrations/000010_participant_activity_metrics.up.sql | Migration creating the participant_activity_metrics table with daily pre-aggregation schema |
| internal/database/migrations/000020_recordings_completed_index.up.sql | Migration adding a partial index idx_recordings_completed_ended_at on recordings(ended_at) WHERE status = 'completed' to support efficient GetCompletedStatsSince queries for the egress stats DB-backed gauges |
| cmd/main.go | Application entry point - registers /metrics and /webhook endpoints |
| internal/app/app.go | Dependency injection - creates and wires all metrics components, including the egress stats and presence count adapters |