Metrics System¶
Overview¶
This document describes the metrics observability system for the Qubital backend. The system is designed to provide real-time visibility into LiveKit-powered virtual office usage, enabling both operational monitoring and business intelligence for a multi-tenant SaaS platform.
Purpose and Goals¶
The metrics system serves three primary objectives:
- Operational Visibility: Monitor system health, detect anomalies, and troubleshoot issues in real-time. This includes tracking webhook delivery reliability, recording success rates, and identifying potential data integrity problems.
- Business Intelligence: Understand how customers use the platform - how long they spend in rooms, which features (camera, microphone, screen share) are most adopted, and how usage patterns vary across organizations.
- Multi-Tenant Isolation: Every metric is tagged with an organization identifier (org_id), enabling per-customer dashboards and billing analytics while maintaining strict data isolation between tenants.
Design Principles¶
The system follows several key design principles:
- Event-Driven Architecture: Metrics are derived from LiveKit webhook events, ensuring real-time accuracy without polling overhead.
- Stable Identifiers: Organization IDs (from WorkOS) are used instead of names to ensure time series continuity when customers rename their organizations.
- Lightweight Footprint: The backend runs on resource-constrained pods (0.5 CPU, 300MB RAM), so the metrics system is designed to be stateless where possible and use minimal memory.
- Separation of Concerns: The backend only collects and exposes metrics; storage, querying, and visualization are delegated to Grafana Cloud.
Architecture¶
High-Level Components¶
The metrics pipeline consists of four distinct components, each with a specific responsibility:
1. LiveKit Cloud (Event Source)
LiveKit Cloud is the real-time video/audio infrastructure that powers Qubital's virtual office rooms. It acts as the authoritative source of truth for all room, participant, track, and recording lifecycle events. When any significant event occurs (a user joins a room, enables their camera, starts a recording, etc.), LiveKit sends an HTTP webhook to our backend within milliseconds.
2. Qubital Backend (Event Processor)
The Go backend receives webhook events, validates their authenticity using cryptographic signatures, and transforms them into Prometheus metrics. It enriches each event with organization context by looking up the room's owner in the database. The backend exposes a /metrics endpoint in Prometheus text format that can be scraped by any compatible collector.
3. Grafana Alloy (Metrics Collector)
Grafana Alloy runs as a sidecar container alongside the backend. Every 15 seconds, it scrapes the /metrics endpoint, adds infrastructure labels (environment, region), batches the samples, and ships them to Grafana Cloud using the Prometheus remote write protocol. This decouples the backend from Grafana Cloud's specifics and provides retry logic, backpressure handling, and efficient batching.
4. Grafana Cloud (Storage and Visualization)
Grafana Cloud provides managed Prometheus storage with 13-month retention, PromQL query capabilities, and the Grafana dashboard/alerting interface. This is where metrics are persisted, queried, and visualized. The backend and Alloy are stateless - if they restart, no historical data is lost because it's already in Grafana Cloud.
Data Flow Diagram¶
The following diagram illustrates how data flows through the system, from a LiveKit event to a Grafana dashboard:
flowchart TD
subgraph External["External Services"]
LK["LiveKit Cloud<br/>(Video Infrastructure)"]
end
subgraph Backend["Qubital Backend Pod"]
WH["/webhook<br/>HTTP POST Handler"]
EL["EventListener<br/>Signature Validation"]
REC["WebhookRecorder<br/>Event Processing"]
subgraph Lookup["Organization Resolution"]
OL["OrgLookup Service"]
DB[(PostgreSQL)]
end
subgraph MetricsPkg["Prometheus Integration"]
LKM["LiveKitMetrics<br/>Gauge/Counter/Histogram"]
REG["Prometheus Registry"]
ME["/metrics<br/>HTTP GET Handler"]
end
end
subgraph Collector["Grafana Alloy Pod"]
SCRAPE["prometheus.scrape<br/>15s interval"]
LABELS["Add external_labels<br/>environment, region, service"]
RW["prometheus.remote_write<br/>Batched shipping"]
end
subgraph Cloud["Grafana Cloud"]
PROM[("Prometheus<br/>13-month retention")]
DASH["Dashboards"]
ALERT["Alerting"]
end
%% Event flow
LK -->|"Webhook Event<br/>(room_started, participant_joined, etc.)"| WH
WH --> EL
EL -->|"Valid event"| REC
%% Processing
REC -->|"room_id lookup"| OL
OL <-->|"SQL query"| DB
REC -->|"Update metrics"| LKM
LKM --> REG
REG --> ME
%% Scraping
SCRAPE -->|"GET /metrics"| ME
SCRAPE --> LABELS
LABELS --> RW
RW -->|"remote_write"| PROM
%% Visualization
PROM --> DASH
PROM --> ALERT
Component Details¶
LiveKit Cloud¶
Role in the System¶
LiveKit Cloud serves as the event source for the entire metrics system. It is a managed WebRTC infrastructure service that handles all the complexity of real-time video/audio - media routing, bandwidth estimation, codec negotiation, etc. From a metrics perspective, LiveKit is valuable because it provides authoritative lifecycle events for everything happening in virtual office rooms.
How Webhooks Work¶
When configured with a webhook URL, LiveKit sends HTTP POST requests to that endpoint whenever significant events occur. Each webhook request includes:
- Authentication: A cryptographic signature in the Authorization header, computed using the shared API secret. The backend validates this signature to ensure the webhook genuinely originated from LiveKit.
- Event Type: A string identifier like room_started, participant_joined, or track_published.
- Event ID: A unique identifier for deduplication. LiveKit may retry failed webhooks, so this ID prevents double-counting.
- Timestamp: The Unix timestamp when the event actually occurred in LiveKit (not when the webhook was sent).
- Payload: Event-specific data such as room information, participant details, or track metadata.
Webhook Events Reference¶
| Event | When It Fires | Payload Includes |
|---|---|---|
| room_started | A room is created (first participant joins or explicit creation) | Room SID, room name (our UUID), metadata |
| room_finished | A room closes (empty timeout or explicit deletion) | Room SID, room name, duration |
| participant_joined | A user successfully connects to a room | Participant SID, identity (WorkOS user ID), room info |
| participant_left | A user disconnects (intentional or timeout) | Participant SID, room info |
| participant_connection_aborted | Connection failed during setup | Participant SID, error details |
| track_published | A user enables camera/microphone/screen share | Track SID, source type, media type, participant info |
| track_unpublished | A user disables a media track | Track SID, source type, participant info |
| egress_started | A recording job begins | Egress ID, room name, output configuration |
| egress_ended | A recording completes or fails | Egress ID, status, file results, duration |
| ingress_started | An external stream import begins | Ingress ID, input type, room name |
| ingress_ended | A stream import ends | Ingress ID, status, error (if any) |
Qubital Backend¶
Webhook Endpoint (POST /webhook)¶
Location: internal/features/metrics/api/event_listener.go
The webhook endpoint is the entry point for all LiveKit events. It is intentionally simple and fast - the goal is to accept the webhook, validate it, and return a 200 OK as quickly as possible. Any slow processing would cause LiveKit to retry, potentially creating duplicate events.
The handler performs three steps:
- Receive and Parse: The LiveKit SDK's ReceiveWebhookEvent function reads the request body and parses it into a structured event object.
- Validate Signature: The SDK verifies the Authorization header against the shared API secret. Invalid signatures are rejected with an error, protecting against spoofed webhooks.
- Enqueue for Processing: The event is passed to the WebhookRecorder via a non-blocking Handle(event) call. This immediately returns, allowing the HTTP response to be sent while processing happens asynchronously.
func (h *WebhookApiHandler) EventListener(c *gin.Context) {
	ctx := c.Request.Context()
	event, err := h.client.ReceiveWebhookEvent(c, h.livekitService)
	if err != nil {
		logger.ErrorAPICtx(ctx, "Webhook event listener failed", err, nil)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to receive webhook event"})
		return
	}
	// Non-blocking enqueue - returns immediately
	h.recorder.Handle(event)
	logger.InfoCtx(ctx, "Webhook event received", []slog.Attr{
		slog.String("event_type", event.Event),
		slog.String("event_id", event.Id),
	})
	// Acknowledge quickly so LiveKit doesn't retry the delivery
	c.Status(http.StatusOK)
}
Webhook Recorder Service¶
Location: internal/features/metrics/service/recorder.go
The WebhookRecorder is a background service that processes webhook events asynchronously. It runs as a single goroutine with a buffered channel, ensuring events are processed in order while decoupling the HTTP handler from potentially slow operations like database lookups.
Processing Pipeline:
- Deduplication: Each event has a unique ID. The recorder maintains a map of recently seen IDs (10-minute TTL) and skips duplicates. This handles LiveKit's retry behavior gracefully.
- Health Metrics: Before any business logic, the recorder updates system health metrics - incrementing the event counter and recording the delivery lag (time between event creation and receipt).
- Organization Resolution: The recorder looks up the organization that owns the room. This is necessary because LiveKit doesn't know about our multi-tenant structure - it just sends room IDs. We query the database to map room_id → organization → org_id.
- Metric Updates: Based on the event type, the recorder updates the appropriate Prometheus metrics. Counters are incremented, gauges are adjusted, and histograms observe duration values.
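A condensed sketch of this pipeline is shown below. The function and metric variable names are illustrative assumptions rather than the actual identifiers in recorder.go, and the label sets are simplified to org_id only:
func (r *WebhookRecorder) processEvent(ev *livekit.WebhookEvent) {
	// 1. Deduplication: skip event IDs already seen within the TTL.
	if r.isDuplicate(ev.Id) {
		webhookDuplicatesTotal.WithLabelValues(ev.Event).Inc()
		return
	}
	// 2. Health metrics: count the event and observe delivery lag.
	webhookEventsTotal.WithLabelValues(ev.Event).Inc()
	webhookDeliveryLag.WithLabelValues(ev.Event).
		Observe(time.Since(time.Unix(ev.CreatedAt, 0)).Seconds())
	// 3. Organization resolution: map the room UUID to its owning org_id.
	org, err := r.orgLookup.Resolve(ev.Room.GetName())
	if err != nil {
		orgLookupFailuresTotal.WithLabelValues("room_not_found").Inc()
		return
	}
	// 4. Metric updates: dispatch on the event type (labels simplified).
	switch ev.Event {
	case "participant_joined":
		participantsActive.WithLabelValues(org.OrgID).Inc()
		participantSessionsTotal.WithLabelValues(org.OrgID).Inc()
	case "participant_left":
		participantsActive.WithLabelValues(org.OrgID).Dec()
	}
}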
Ephemeral State for Duration Calculations
Why This State Exists
To compute duration-based metrics (e.g., lk_room_duration_seconds, lk_participant_lifetime_seconds), the recorder needs to know when things started. When a room_finished webhook arrives, the recorder must calculate finished_time - started_time. But the room_finished event doesn't include the start time - that information was only available in the earlier room_started event.
The solution is simple: when room_started arrives, the recorder stores the start timestamp in memory. When room_finished arrives later, it retrieves that timestamp, calculates the duration, and then deletes the entry.
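A minimal sketch of that pairing for rooms (type, field, and function names are hypothetical; the real state lives in the maps described below):
type roomState struct {
	startedAt time.Time
	orgID     string
}

// room_started: remember when the room began and which org owns it.
func (r *WebhookRecorder) onRoomStarted(roomSID, orgID string) {
	r.rooms[roomSID] = roomState{startedAt: time.Now(), orgID: orgID}
}

// room_finished: pair with the stored start time, observe the duration, clean up.
func (r *WebhookRecorder) onRoomFinished(roomSID string) {
	state, ok := r.rooms[roomSID]
	if !ok {
		return // start was never seen (e.g. pod restarted mid-room); skip the histogram
	}
	roomDurationSeconds.WithLabelValues(state.orgID).
		Observe(time.Since(state.startedAt).Seconds())
	delete(r.rooms, roomSID) // keeps the map bounded by currently active rooms
}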
What's Stored
| State Map | Key | Stored Value | Purpose |
|---|---|---|---|
| rooms | room SID | Start timestamp + org_id | Calculate room duration on room_finished |
| participants | participant SID | Join timestamp + room SID + org_id | Calculate session duration on participant_left |
| tracks | room SID + source + type | Integer count | Track active media streams per org_id |
| egress | egress ID | Start timestamp + org_id + request_type + room_name | Calculate recording duration on egress_ended |
| ingress | ingress ID | Start timestamp + org_id + input_type + room_name | Calculate stream duration on ingress_ended |
Key Characteristics
- Ephemeral: This state exists only in memory. If the pod restarts, it's lost. This is acceptable because we only lose in-flight duration calculations - the counters in Prometheus (which are persisted in Grafana Cloud) remain accurate.
- Bounded Size: The state only contains currently active entities. Once a room finishes or a participant leaves, their entry is deleted. The maps cannot grow unbounded - they shrink back down as activity ends.
- Not Critical for Counters: Counters (*_total metrics) don't need this state - they just increment on each event. Only histograms (durations) require start/end pairing.
- Trade-off: We could eliminate this state by storing start times in the database or Redis, but that would add latency and complexity for minimal benefit. The current approach is simple and fits the resource constraints.
Memory Usage (Not a Concern)
A common question is whether these in-memory maps will cause problems as customer usage grows. The answer is no - the memory usage is negligible:
| Scenario | Concurrent Entities | Estimated Memory |
|---|---|---|
| Small scale | 50 participants, 10 rooms | ~15 KB |
| Medium scale | 500 participants, 100 rooms | ~100 KB |
| Large scale | 5,000 participants, 1,000 rooms | ~1 MB |
| Extreme scale | 50,000 participants, 10,000 rooms | ~10 MB |
Each map entry is roughly 100-150 bytes (Go string headers, timestamps, a few pointers). Even at extreme scales that far exceed expected usage, the memory footprint is tiny compared to the pod's 300MB allocation.
Why This Isn't a Scaling Concern
- Maps are bounded by concurrent activity, not total activity: If 1 million participants join and leave throughout the day, but only 500 are ever connected simultaneously, the maps only ever hold ~500 entries.
- Entries are automatically cleaned up: When a participant_left event arrives, the entry is immediately deleted. There's no accumulation over time.
- The data is trivially small: Timestamps and short strings. No large objects, no nested structures, no media data.
📌 Note for Future Scaling
The in-memory maps are not the bottleneck for scaling this system. If scaling becomes a concern, the limiting factor would be the single-goroutine event processing model (see Scaling Considerations below), not memory usage.
Scaling Considerations
This section separates two distinct concerns that are sometimes conflated: memory usage and throughput. They are independent issues with different thresholds and solutions.
Memory Usage: Not a Concern
As explained above, the in-memory state maps use negligible memory (~1-2MB even at thousands of concurrent participants). This is a non-issue for the foreseeable future and doesn't require any action.
Throughput: The Single-Goroutine Model
The WebhookRecorder processes events in a single goroutine, reading from a buffered channel. This design is intentionally simple:
[Webhook Handler] --enqueue--> [Buffered Channel (1000)] --dequeue--> [Single Processing Goroutine]
Why Single-Threaded?
- No synchronization needed: Metric updates and state map operations don't require mutexes because only one goroutine accesses them.
- Predictable ordering: Events for the same room/participant are processed in order.
- Simple debugging: No race conditions, no deadlocks, no concurrent access bugs.
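A sketch of this model under those constraints (names are illustrative, not the actual code):
type WebhookRecorder struct {
	events chan *livekit.WebhookEvent // buffered channel, default size 1000
	rooms  map[string]roomState       // touched only by the worker goroutine, so no mutex
}

// Start launches the single processing goroutine.
func (r *WebhookRecorder) Start(ctx context.Context) {
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case ev := <-r.events:
				r.processEvent(ev) // dedup, health metrics, org lookup, metric updates
			}
		}
	}()
}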
When Would This Become a Problem?
The single goroutine can process roughly 500-2,000 events per second (depending on database latency for org lookups). Each event involves:
- Deduplication check (fast, in-memory)
- Metric updates (fast, in-memory)
- Optional database lookup (slower, ~5-50ms)
To stress this system, you'd need sustained event rates that exceed processing capacity. Here's a rough estimate:
| Concurrent Participants | Estimated Events/Second | System Status |
|---|---|---|
| 100 | ~5-10 | Comfortable |
| 1,000 | ~50-100 | Comfortable |
| 5,000 | ~250-500 | Approaching limit |
| 10,000+ | ~500-1000+ | May need optimization |
The events-per-second figures are estimates based on typical activity: participants joining and leaving, enabling and disabling tracks. Heavy activity (frequent track toggles) increases the rate.
Signs of Throughput Issues
"Webhook recorder: queue full, dropping event"in logs- Growing lag between event timestamps and processing time
lk_webhook_delivery_lag_secondsp99 increasing over time
Future Solutions (If Needed)
If throughput becomes a concern at 10,000+ concurrent participants:
- Shard by room: Multiple processing goroutines, each handling a subset of rooms. Maintains per-room ordering while increasing parallelism.
- Batch database lookups: Cache org_id by room for a short TTL, reducing database round-trips.
- Horizontal scaling: Multiple backend pods behind a load balancer, each handling a portion of webhooks.
These optimizations are not currently implemented because they add complexity that isn't justified at current scale. The single-goroutine model handles Qubital's expected load with significant headroom.
Organization Lookup¶
Location: internal/features/metrics/service/org_lookup.go
Every metric needs an org_id label for multi-tenant filtering. The OrgLookup service resolves room IDs (which LiveKit knows) to organization IDs (which LiveKit doesn't know).
Resolution Process:
- Receive the room name from the webhook (this is the room's UUID in our database)
- Query: SELECT organizations.workos_org_id, organizations.name FROM rooms JOIN organizations ON rooms.organization_id = organizations.id WHERE rooms.id = ?
- Return OrgData{OrgID, OrgName} for metric labeling
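A minimal sketch of that lookup with database/sql (the actual data-access layer in org_lookup.go may use a different query builder or placeholder style):
// Hypothetical sketch of the room → organization resolution.
func (s *OrgLookup) Resolve(roomID string) (OrgData, error) {
	const query = `
		SELECT organizations.workos_org_id, organizations.name
		FROM rooms
		JOIN organizations ON rooms.organization_id = organizations.id
		WHERE rooms.id = $1`
	var data OrgData
	if err := s.db.QueryRow(query, roomID).Scan(&data.OrgID, &data.OrgName); err != nil {
		return OrgData{}, fmt.Errorf("org lookup for room %s: %w", roomID, err)
	}
	return data, nil
}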
Why org_id Instead of org_name?
Organization names are mutable - customers can rename their organization at any time. If we used names as metric labels, renaming would create a new time series and break historical continuity. WorkOS organization IDs are immutable, ensuring metrics remain linked to the same organization forever.
To display friendly names in Grafana, we emit a separate lk_org_info{org_id, org_name} metric that maps IDs to names. Grafana can join this using PromQL's group_left.
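For example, a query along these lines attaches the org_name label to a per-organization series (assuming lk_org_info has a constant value of 1, the usual convention for info-style metrics):
# Active participants per organization, enriched with the human-readable name
sum by (org_id) (lk_participants_active)
  * on (org_id) group_left (org_name)
  lk_org_info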
Metrics Endpoint (GET /metrics)¶
Location: pkg/metrics/handler.go and cmd/main.go
The /metrics endpoint exposes all registered Prometheus metrics in the standard text exposition format. When Grafana Alloy (or any Prometheus-compatible scraper) sends a GET request, it receives output like:
# HELP lk_participants_active active participants
# TYPE lk_participants_active gauge
lk_participants_active{org_id="org_01ABC",project="qubital",region="eu"} 15
lk_participants_active{org_id="org_02XYZ",project="qubital",region="eu"} 8
# HELP lk_participant_sessions_total participant sessions started
# TYPE lk_participant_sessions_total counter
lk_participant_sessions_total{org_id="org_01ABC",project="qubital",region="eu"} 1247
The endpoint is stateless - it simply reads current values from the Prometheus registry and formats them. There's no computation or database access during scraping.
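In practice this is little more than wrapping the Prometheus client's HTTP handler; a sketch assuming a dedicated registry and gin routing (the actual wiring in cmd/main.go and pkg/metrics/handler.go may differ):
// Sketch: exposing a custom Prometheus registry on GET /metrics.
// Uses github.com/prometheus/client_golang/prometheus/promhttp and gin.WrapH.
func registerMetricsRoute(router *gin.Engine, registry *prometheus.Registry) {
	router.GET("/metrics", gin.WrapH(
		promhttp.HandlerFor(registry, promhttp.HandlerOpts{}),
	))
}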
Grafana Alloy¶
Role in the System¶
Grafana Alloy is a vendor-neutral observability collector that bridges the gap between the backend and Grafana Cloud. While we could have the backend push metrics directly to Grafana Cloud, using Alloy provides several benefits:
- Decoupling: The backend doesn't need Grafana Cloud credentials or awareness of the remote write protocol.
- Reliability: Alloy handles retries, backpressure, and temporary network failures gracefully.
- Batching: Instead of sending each metric individually, Alloy batches samples for efficient transmission.
- Label Enrichment: Alloy adds infrastructure-level labels (environment, region) that the backend shouldn't need to know about.
Scraping Configuration¶
Alloy scrapes the backend every 15 seconds. This interval balances freshness (metrics are at most 15 seconds old) against load (200+ scrapes per hour is reasonable for a small pod).
prometheus.scrape "backend" {
job_name = "qubital-backend"
scrape_interval = "15s"
scrape_timeout = "10s"
scheme = "http"
metrics_path = "/metrics"
targets = [
{ "__address__" = env("BACKEND_SCRAPE_TARGET") }, // e.g., "backend-web:3001"
]
forward_to = [prometheus.remote_write.grafanacloud.receiver]
}
Remote Write Configuration¶
After scraping, Alloy forwards metrics to Grafana Cloud's Prometheus endpoint. It adds external labels that apply to all metrics from this Alloy instance:
prometheus.remote_write "grafanacloud" {
external_labels = {
environment = env("ENVIRONMENT"), // "test", "dev", or "prod"
region = env("REGION"), // "eu", "us", etc.
service = "qubital-backend",
}
endpoint {
url = env("GRAFANA_PROM_REMOTE_WRITE_URL")
basic_auth {
username = env("GRAFANA_PROM_USERNAME")
password = env("GRAFANA_API_KEY")
}
queue_config {
max_samples_per_send = 5000
batch_send_deadline = "20s"
retry_on_http_429 = true
}
}
}
Label Strategy¶
Labels are added at two levels, creating a clear separation of concerns:
| Label | Added By | Scope | Purpose |
|---|---|---|---|
| org_id | Backend | Per-metric | Multi-tenant customer identification |
| project | Backend | Per-metric | LiveKit project identifier |
| source, type | Backend | Track metrics only | Media track classification |
| request_type | Backend | Egress metrics only | Recording type classification |
| environment | Alloy | All metrics | Deployment environment (test/dev/prod) |
| region | Alloy | All metrics | Infrastructure region |
| service | Alloy | All metrics | Service identifier for filtering |
This separation means the backend doesn't need to know which environment it's running in - that's infrastructure configuration handled by Alloy.
Grafana Cloud¶
Prometheus Storage¶
Grafana Cloud provides managed Prometheus storage with:
- 13-Month Retention: Metrics are stored for over a year, enabling long-term trend analysis and year-over-year comparisons.
- High Availability: Data is replicated across multiple availability zones.
- Automatic Scaling: Storage and query capacity scale automatically based on usage.
- PromQL Interface: Full PromQL support for complex queries and aggregations.
Querying Metrics¶
Example PromQL queries for common use cases:
# Active participants for a specific organization in production
lk_participants_active{org_id="org_01ABC", environment="prod"}
# Participant session duration percentiles (p50, p90, p99)
histogram_quantile(0.50, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
# Recording success rate over the last hour
sum(rate(lk_egress_ended_total{result="success", org_id="$org_id"}[1h]))
/
sum(rate(lk_egress_ended_total{org_id="$org_id"}[1h])) * 100
# Camera adoption rate (% of participants with camera enabled)
sum(lk_tracks_active{source="camera", org_id="$org_id"})
/
sum(lk_participants_active{org_id="$org_id"}) * 100
Multi-Tenant Dashboard Variables¶
To display organization names while filtering by stable IDs:
- Create org_name variable: label_values(lk_org_info, org_name) - This shows a dropdown of human-readable organization names.
- Create hidden org_id variable: label_values(lk_org_info{org_name="$org_name"}, org_id) - This automatically resolves the selected name to its ID.
- Use in queries: lk_participants_active{org_id="$org_id"} - Filtering happens on the stable ID, but users see friendly names.
Metrics Reference¶
Room Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_room_active | Gauge | project, region, org_id | +1 on room_started, -1 on room_finished | Number of currently active rooms |
| lk_rooms_started_total | Counter | project, region, org_id | room_started | Cumulative count of rooms created |
| lk_rooms_finished_total | Counter | project, region, org_id | room_finished | Cumulative count of rooms closed |
| lk_room_duration_seconds | Histogram | project, region, org_id | room_finished | Distribution of room durations |
Histogram Buckets: Exponential from 30s to ~100 minutes (30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953 seconds)
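These boundaries correspond to exponential buckets with a growth factor of roughly 1.8. As an illustration (an assumption about how the histogram might be defined in pkg/metrics/livekit.go, not a quote from it):
// prometheus.ExponentialBuckets(30, 1.8, 10) reproduces the boundaries above
// to within rounding.
roomDuration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "lk_room_duration_seconds",
	Help:    "Distribution of room durations",
	Buckets: prometheus.ExponentialBuckets(30, 1.8, 10),
}, []string{"project", "region", "org_id"})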
Participant Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_participants_active | Gauge | project, region, org_id | +1 on participant_joined, -1 on participant_left/connection_aborted | Currently connected participants |
| lk_participant_sessions_total | Counter | project, region, org_id | participant_joined | Cumulative participant sessions |
| lk_participant_lifetime_seconds | Histogram | project, region, org_id | participant_left | Distribution of session durations |
Histogram Buckets: Exponential from 15s to ~50 minutes (15, 27, 48, 87, 157, 283, 509, 916, 1649, 2969 seconds)
Track Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_tracks_active | Gauge | project, region, org_id, source, type | +1 on track_published, -1 on track_unpublished | Currently published media tracks |
| lk_tracks_published_total | Counter | project, region, org_id, source, type | track_published | Cumulative track publishes |
Source Values: camera, microphone, screen_share, screen_share_audio, unknown
Type Values: audio, video, unknown
Egress (Recording) Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_egress_active | Gauge | project, region, org_id, request_type | +1 on egress_started, -1 on egress_ended | Currently running recordings |
| lk_egress_started_total | Counter | project, region, org_id, request_type | egress_started | Cumulative recordings started |
| lk_egress_ended_total | Counter | project, region, org_id, request_type, result | egress_ended | Cumulative recordings ended |
| lk_egress_duration_seconds | Histogram | project, region, org_id, request_type, result | egress_ended | Distribution of recording durations |
request_type Values: room_composite, web, track, participant, unknown
result Values: success, failed
Histogram Buckets: 30s, 60s, 120s, 300s, 600s, 1200s, 1800s, 3600s, 7200s, 14400s
Ingress (Stream Import) Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_ingress_active | Gauge | project, region, org_id, input_type | +1 on ingress_started, -1 on ingress_ended | Currently running stream imports |
| lk_ingress_started_total | Counter | project, region, org_id, input_type | ingress_started | Cumulative stream imports started |
| lk_ingress_ended_total | Counter | project, region, org_id, input_type, result | ingress_ended | Cumulative stream imports ended |
| lk_ingress_duration_seconds | Histogram | project, region, org_id, input_type, result | ingress_ended | Distribution of stream durations |
input_type Values: rtmp, whip, url, unknown
result Values: success, failed
Webhook Health Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_webhook_events_total | Counter | event | Total webhook events received, by event type |
| lk_webhook_delivery_lag_seconds | Histogram | event | Time between event creation in LiveKit and receipt by backend |
| lk_webhook_duplicates_total | Counter | event | Duplicate webhook events detected (same event ID seen twice) |
System Health Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_org_lookup_failures_total | Counter | reason | Failed room → organization lookups (indicates data integrity issues) |
| lk_org_info | Gauge | org_id, org_name | Metadata metric mapping org_id to org_name for Grafana joins |
reason Values: not_configured, room_not_found, no_organization, no_org_id
Complete Data Flow¶
Sequence Diagram: Webhook to Dashboard¶
sequenceDiagram
participant LK as LiveKit Cloud
participant BE as Backend /webhook
participant REC as WebhookRecorder
participant DB as PostgreSQL
participant REG as Prometheus Registry
participant ME as Backend /metrics
participant AL as Grafana Alloy
participant GC as Grafana Cloud
Note over LK,GC: 1. Event Occurs (participant joins room)
LK->>BE: POST /webhook<br/>{event: "participant_joined", id: "evt_123", ...}
BE->>BE: Validate webhook signature
BE->>REC: Handle(event) [non-blocking enqueue]
BE-->>LK: 200 OK
Note over REC,REG: 2. Async Processing (in background goroutine)
REC->>REC: Check deduplication (event.id not seen?)
REC->>REC: Record lk_webhook_events_total++
REC->>REC: Record lk_webhook_delivery_lag_seconds
REC->>DB: SELECT org_id FROM rooms JOIN organizations
DB-->>REC: {org_id: "org_01ABC", org_name: "Acme Corp"}
REC->>REC: Store participant start time (for duration calc later)
REC->>REG: lk_participants_active{org_id="org_01ABC"}.Inc()
REC->>REG: lk_participant_sessions_total{org_id="org_01ABC"}.Inc()
Note over AL,GC: 3. Periodic Scrape (every 15 seconds)
AL->>ME: GET /metrics
ME->>REG: Collect all metric values
REG-->>ME: Prometheus text format
ME-->>AL: lk_participants_active{org_id="org_01ABC"} 15\n...
AL->>AL: Add external_labels (environment, region, service)
AL->>GC: remote_write (batched, compressed, with retries)
GC->>GC: Store in Prometheus
Note over GC: 4. Query & Visualize
GC->>GC: Dashboard executes PromQL queries
GC->>GC: Render panels with current and historical data
Configuration Reference¶
Backend Environment Variables¶
| Variable | Required | Description | Example |
|---|---|---|---|
| LIVEKIT_API_KEY | Yes | LiveKit API key for webhook signature validation | APIxxxxxxxx |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret for webhook signature validation | secret-key |
| LIVEKIT_PROJECT | No | Project name for metric labels | qubital |
Grafana Alloy Environment Variables¶
| Variable | Required | Description | Example |
|---|---|---|---|
| ENVIRONMENT | Yes | Environment name for external label | prod |
| REGION | Yes | Deployment region for external label | eu |
| BACKEND_SCRAPE_TARGET | Yes | Backend service DNS and port | backend-web:3001 |
| GRAFANA_PROM_REMOTE_WRITE_URL | Yes | Grafana Cloud Prometheus endpoint | https://prometheus-xxx.grafana.net/api/prom/push |
| GRAFANA_PROM_USERNAME | Yes | Grafana Cloud instance ID | 123456 |
| GRAFANA_API_KEY | Yes | Grafana Cloud API key with write permissions | glc_eyJ... |
Recorder Configuration¶
The WebhookRecorder has two tunable parameters:
type Config struct {
QueueSize int // Default: 1000 events
DedupTTL time.Duration // Default: 10 minutes
}
- QueueSize: Maximum number of events waiting to be processed. If exceeded, new events are dropped (and logged). Increase if you see "queue full" warnings.
- DedupTTL: How long to remember event IDs for deduplication. Must be longer than LiveKit's retry window (typically a few minutes).
Operational Considerations¶
Gauge Drift¶
Gauge metrics (lk_*_active) track current counts by incrementing on "start" events and decrementing on "end" events. This approach can drift from reality if events are lost:
Causes of Drift:
- Network issues between LiveKit and the backend
- Backend pod restarts (in-flight events are lost)
- Queue overflow (events dropped due to backpressure)
Detection:
- Compare lk_rooms_started_total - lk_rooms_finished_total with lk_room_active - they should match
- Monitor lk_webhook_duplicates_total - high rates may indicate delivery issues
- Alert on lk_org_lookup_failures_total > 0 - indicates data integrity problems
Mitigation:
- Gauges self-correct over time as rooms finish and participants leave
- Counter metrics (*_total) remain accurate regardless of drift
- For critical accuracy, use counter differences: increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
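A drift check along these lines can be dropped into a dashboard panel or run ad hoc (metric names as documented above; a persistently non-zero result suggests lost events):
# Gauge drift check: (rooms started - rooms finished) should equal active rooms
(
  sum by (org_id) (lk_rooms_started_total)
  - sum by (org_id) (lk_rooms_finished_total)
)
- sum by (org_id) (lk_room_active)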
Deduplication Behavior¶
LiveKit may retry webhook delivery if it doesn't receive a timely response. The recorder handles this by tracking event IDs for 10 minutes:
- First occurrence: Processed normally
- Subsequent occurrences (same ID within 10 minutes): Skipped, counted in lk_webhook_duplicates_total
- After 10 minutes: ID is forgotten (old events won't be reprocessed anyway)
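A simplified sketch of this check (the real implementation in recorder.go may structure the map and expiry differently):
// Hypothetical dedup check; safe without a mutex because only the single
// processing goroutine touches r.seen.
func (r *WebhookRecorder) isDuplicate(eventID string) bool {
	now := time.Now()
	// Expire remembered IDs older than the TTL so the map stays small.
	for id, seenAt := range r.seen {
		if now.Sub(seenAt) > r.cfg.DedupTTL {
			delete(r.seen, id)
		}
	}
	if _, ok := r.seen[eventID]; ok {
		return true
	}
	r.seen[eventID] = now
	return false
}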
Backpressure Handling¶
If webhook events arrive faster than they can be processed, the queue (1000 events) acts as a buffer. If it fills completely:
- New events are dropped (not queued)
- A warning is logged: "Webhook recorder: queue full, dropping event"
- The event_type and event_id are included in the log for debugging
This is a safety valve to prevent unbounded memory growth. If you see these warnings, investigate why processing is slow (database latency? high event volume?) or increase the queue size.
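The drop-on-full behavior comes from a non-blocking send on the buffered channel; a sketch (using log/slog directly here, whereas the real code goes through the project logger):
// Sketch of the non-blocking enqueue behind Handle(event).
func (r *WebhookRecorder) Handle(ev *livekit.WebhookEvent) {
	select {
	case r.events <- ev: // queued for the single processing goroutine
	default: // queue full: drop instead of blocking the webhook HTTP handler
		slog.Warn("Webhook recorder: queue full, dropping event",
			slog.String("event_type", ev.Event),
			slog.String("event_id", ev.Id))
	}
}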
Alerting Recommendations¶
| Alert Name | PromQL Condition | Severity | Description |
|---|---|---|---|
| WebhookDeliveryLagHigh | histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le)) > 10 | Warning | 99th percentile delivery lag exceeds 10 seconds |
| OrgLookupFailures | increase(lk_org_lookup_failures_total[1h]) > 0 | Critical | Any org lookup failure indicates data integrity issues |
| RecordingFailures | increase(lk_egress_ended_total{result="failed"}[1h]) > 0 | Warning | Recording jobs are failing |
| NoWebhookEvents | rate(lk_webhook_events_total[5m]) == 0 | Critical | No webhook events received (during business hours) |
| HighDuplicateRate | rate(lk_webhook_duplicates_total[5m]) > 1 | Warning | More than 1 duplicate per second indicates delivery issues |
File Reference¶
| File Path | Purpose |
|---|---|
| pkg/metrics/livekit.go | Metric definitions (names, types, labels, help text) and Prometheus registration |
| pkg/metrics/handler.go | HTTP handler wrapper for the /metrics endpoint |
| internal/features/metrics/api/event_listener.go | Webhook HTTP handler - receives and validates LiveKit webhooks |
| internal/features/metrics/service/recorder.go | Background service - processes events and updates metrics |
| internal/features/metrics/service/models.go | Data structures (Config, OrgData, RoomState, etc.) and interfaces |
| internal/features/metrics/service/org_lookup.go | Database queries to resolve room IDs to organization IDs |
| internal/features/metrics/service/presence_manager.go | Single-session enforcement (kicks user from old room when joining new one) |
| cmd/main.go | Application entry point - registers /metrics and /webhook endpoints |
| internal/app/app.go | Dependency injection - creates and wires all metrics components |