Metrics System¶
Overview¶
This document describes the metrics observability system for the Qubital backend. The system is designed to provide real-time visibility into LiveKit-powered virtual office usage, enabling both operational monitoring and business intelligence for a multi-tenant SaaS platform.
Purpose and Goals¶
The metrics system serves three primary objectives:
- Operational Visibility: Monitor system health, detect anomalies, and troubleshoot issues in real time. This includes tracking webhook delivery reliability, recording success rates, and identifying potential data integrity problems.
- Business Intelligence: Understand how customers use the platform - how long they spend in rooms, which features (camera, microphone, screen share) are most adopted, and how usage patterns vary across organizations.
- Multi-Tenant Isolation: Every metric is tagged with an organization identifier (org_id), enabling per-customer dashboards and billing analytics while maintaining strict data isolation between tenants.
Design Principles¶
The system follows several key design principles:
- Event-Driven Architecture: Metrics are derived from LiveKit webhook events, ensuring real-time accuracy without polling overhead.
- Stable Identifiers: Organization IDs (from WorkOS) are used instead of names to ensure time series continuity when customers rename their organizations.
- Lightweight Footprint: The backend runs on resource-constrained pods (0.5 CPU, 300MB RAM), so the metrics system is designed to be stateless where possible and use minimal memory.
- Separation of Concerns: The backend only collects and exposes metrics; storage, querying, and visualization are delegated to Grafana Cloud.
- Hybrid State Model: Most metrics are derived purely from ephemeral in-memory state (webhook events). However, certain business KPIs require durable state in a relational database: WAU/MAU unique user counts (Prometheus cannot count distinct identities over sliding time windows), egress completion stats (in-memory counters reset on pod restart, losing billing-critical data), and connected participant counts (in-memory gauges drift if events are lost and reset to zero on restart).
Architecture¶
High-Level Components¶
The metrics pipeline consists of four distinct components, each with a specific responsibility:
1. LiveKit Cloud (Event Source)
LiveKit Cloud is the real-time video/audio infrastructure that powers Qubital's virtual office rooms. It acts as the authoritative source of truth for all room, participant, track, and recording lifecycle events. When any significant event occurs (a user joins a room, enables their camera, starts a recording, etc.), LiveKit sends an HTTP webhook to our backend within milliseconds.
2. Qubital Backend (Event Processor)
The Go backend receives webhook events, validates their authenticity using cryptographic signatures, and transforms them into Prometheus metrics. It enriches each event with organization context by looking up the room's owner in the database. The backend exposes a /metrics endpoint in Prometheus text format that can be scraped by any compatible collector.
3. Grafana Alloy (Metrics Collector)
Grafana Alloy runs as a sidecar container alongside the backend. Every 15 seconds, it scrapes the /metrics endpoint, adds infrastructure labels (environment, region), batches the samples, and ships them to Grafana Cloud using the Prometheus remote write protocol. This decouples the backend from Grafana Cloud's specifics and provides retry logic, backpressure handling, and efficient batching.
4. Grafana Cloud (Storage and Visualization)
Grafana Cloud provides managed Prometheus storage with 13-month retention, PromQL query capabilities, and the Grafana dashboard/alerting interface. This is where metrics are persisted, queried, and visualized. The backend and Alloy are stateless - if they restart, no historical data is lost because it's already in Grafana Cloud.
Data Flow Diagram¶
The following diagram illustrates how data flows through the system, from a LiveKit event to a Grafana dashboard:
flowchart TD
subgraph External["External Services"]
LK["LiveKit Cloud<br/>(Video Infrastructure)"]
end
subgraph Backend["Qubital Backend Pod"]
WH["/webhook<br/>HTTP POST Handler"]
EL["EventListener<br/>Signature Validation"]
REC["WebhookRecorder<br/>Event Processing"]
subgraph Lookup["Organization Resolution"]
OL["OrgLookup Service"]
DB[(PostgreSQL)]
end
subgraph MetricsPkg["Prometheus Integration"]
LKM["LiveKitMetrics<br/>Gauge/Counter/Histogram"]
REG["Prometheus Registry"]
ME["/metrics<br/>HTTP GET Handler"]
end
end
subgraph Collector["Grafana Alloy Pod"]
SCRAPE["prometheus.scrape<br/>15s interval"]
LABELS["Add external_labels<br/>environment, region, service"]
RW["prometheus.remote_write<br/>Batched shipping"]
end
subgraph Cloud["Grafana Cloud"]
PROM[("Prometheus<br/>13-month retention")]
DASH["Dashboards"]
ALERT["Alerting"]
end
%% Event flow
LK -->|"Webhook Event<br/>(room_started, participant_joined, etc.)"| WH
WH --> EL
EL -->|"Valid event"| REC
%% Processing
REC -->|"room_id lookup"| OL
OL <-->|"SQL query"| DB
REC -->|"Update metrics"| LKM
LKM --> REG
REG --> ME
%% Scraping
SCRAPE -->|"GET /metrics"| ME
SCRAPE --> LABELS
LABELS --> RW
RW -->|"remote_write"| PROM
%% Visualization
PROM --> DASH
PROM --> ALERT
Component Details¶
LiveKit Cloud¶
Role in the System¶
LiveKit Cloud serves as the event source for the entire metrics system. It is a managed WebRTC infrastructure service that handles all the complexity of real-time video/audio - media routing, bandwidth estimation, codec negotiation, etc. From a metrics perspective, LiveKit is valuable because it provides authoritative lifecycle events for everything happening in virtual office rooms.
How Webhooks Work¶
When configured with a webhook URL, LiveKit sends HTTP POST requests to that endpoint whenever significant events occur. Each webhook request includes:
- Authentication: A cryptographic signature in the Authorization header, computed using the shared API secret. The backend validates this signature to ensure the webhook genuinely originated from LiveKit.
- Event Type: A string identifier like room_started, participant_joined, or track_published.
- Event ID: A unique identifier for deduplication. LiveKit may retry failed webhooks, so this ID prevents double-counting.
- Timestamp: The Unix timestamp when the event actually occurred in LiveKit (not when the webhook was sent).
- Payload: Event-specific data such as room information, participant details, or track metadata.
Webhook Events Reference¶
| Event | When It Fires | Payload Includes |
|---|---|---|
| room_started | A room is created (first participant joins or explicit creation) | Room SID, room name (our UUID), metadata |
| room_finished | A room closes (empty timeout or explicit deletion) | Room SID, room name, duration |
| participant_joined | A user successfully connects to a room | Participant SID, identity (internal DB user ID as string), room info |
| participant_left | A user disconnects (intentional or timeout) | Participant SID, room info |
| participant_connection_aborted | Connection failed during setup | Participant SID, error details |
| track_published | A user enables camera/microphone/screen share | Track SID, source type, media type, participant info |
| track_unpublished | A user disables a media track | Track SID, source type, participant info |
| egress_started | A recording job begins | Egress ID, room name, output configuration |
| egress_ended | A recording completes or fails | Egress ID, status, file results, duration |
| ingress_started | An external stream import begins | Ingress ID, input type, room name |
| ingress_ended | A stream import ends | Ingress ID, status, error (if any) |
Qubital Backend¶
Webhook Endpoint (POST /webhook)¶
Location: internal/features/metrics/api/event_listener.go
The webhook endpoint is the entry point for all LiveKit events. It is intentionally simple and fast - the goal is to accept the webhook, validate it, and return a 200 OK as quickly as possible. Any slow processing would cause LiveKit to retry, potentially creating duplicate events.
The handler performs three steps:
- Receive and Parse: The LiveKit SDK's ReceiveWebhookEvent function reads the request body and parses it into a structured event object.
- Validate Signature: The SDK verifies the Authorization header against the shared API secret. Invalid signatures are rejected with an error, protecting against spoofed webhooks.
- Enqueue for Processing: The event is passed to the WebhookRecorder via a non-blocking Handle(event) call. This immediately returns, allowing the HTTP response to be sent while processing happens asynchronously.
```go
func (h *WebhookApiHandler) EventListener(c *gin.Context) {
    ctx := c.Request.Context() // request-scoped context used for logging below

    event, err := h.client.ReceiveWebhookEvent(c, h.livekitService)
    if err != nil {
        logger.ErrorAPICtx(ctx, "Webhook event listener failed", err, nil)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to receive webhook event"})
        return
    }

    // Non-blocking enqueue - returns immediately
    h.recorder.Handle(event)

    logger.InfoCtx(ctx, "Webhook event received", []slog.Attr{
        slog.String("event_type", event.Event),
        slog.String("event_id", event.Id),
    })
}
```
Webhook Recorder Service¶
Location: internal/features/metrics/service/recorder.go
The WebhookRecorder is a background service that processes webhook events asynchronously. It runs as a single goroutine with a buffered channel, ensuring events are processed in order while decoupling the HTTP handler from potentially slow operations like database lookups.
Processing Pipeline:
- Deduplication: Each event has a unique ID. The recorder maintains a map of recently seen IDs (10-minute TTL) and skips duplicates. This handles LiveKit's retry behavior gracefully (see the sketch after this list).
- Health Metrics: Before any business logic, the recorder updates system health metrics - incrementing the event counter and recording the delivery lag (time between event creation and receipt).
- Organization Resolution: The recorder looks up the organization that owns the room. This is necessary because LiveKit doesn't know about our multi-tenant structure - it just sends room IDs. We query the database to map room_id → organization → org_id.
- Metric Updates: Based on the event type, the recorder updates the appropriate Prometheus metrics. Counters are incremented, gauges are adjusted, and histograms observe duration values.
- Participant Activity Persistence: On participant_left, after computing session duration, the recorder persists the completed session to the participant_activity_metrics database table via a fire-and-forget write. This durable state enables WAU/MAU computation that Prometheus alone cannot handle (see Participant Activity Metrics below).
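The first two steps are simple enough to sketch. The snippet below is a minimal illustration of the deduplication and health-metric logic, assuming hypothetical field names (seen, dedupTTL, events, dups, lag) rather than the actual recorder code:

```go
// Illustrative sketch of steps 1-2 (dedup + health metrics); not the actual recorder code.
type recorderSketch struct {
    seen     map[string]time.Time     // event ID -> first-seen time
    dedupTTL time.Duration            // Config.DedupTTL (default 10 minutes)
    events   *prometheus.CounterVec   // lk_webhook_events_total{event}
    dups     *prometheus.CounterVec   // lk_webhook_duplicates_total{event}
    lag      *prometheus.HistogramVec // lk_webhook_delivery_lag_seconds{event}
}

func (r *recorderSketch) process(ev *livekit.WebhookEvent) {
    now := time.Now()

    // Deduplication: skip events whose ID was already seen within the TTL window.
    if seenAt, ok := r.seen[ev.Id]; ok && now.Sub(seenAt) < r.dedupTTL {
        r.dups.WithLabelValues(ev.Event).Inc()
        return
    }
    r.seen[ev.Id] = now

    // Health metrics: count the event and observe delivery lag
    // (receipt time minus LiveKit's CreatedAt, only when CreatedAt is positive).
    r.events.WithLabelValues(ev.Event).Inc()
    if ev.CreatedAt > 0 {
        r.lag.WithLabelValues(ev.Event).Observe(now.Sub(time.Unix(ev.CreatedAt, 0)).Seconds())
    }

    // Steps 3-5 (org resolution, metric updates, activity persistence) follow here.
}
```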
Background Tickers
In addition to event-driven processing, the recorder's goroutine runs several periodic tickers:
| Ticker | Interval | Purpose |
|---|---|---|
| Dedup cleanup | 10 min (configurable) | Removes expired event IDs from the deduplication map to prevent unbounded memory growth |
| License refresh | 1 hour | Re-reads each known organization's monthly_recording_minutes from the org_licenses table and updates the lk_org_recording_limit_minutes gauge. Picks up plan changes without restart. |
| Activity metrics refresh | 5 min | Queries the participant_activity_metrics table to compute per-org WAU (7-day) and MAU (30-day) aggregates, then updates the lk_unique_participants and lk_participant_total_online_seconds gauges. |
| Egress stats refresh | 5 min | Queries the recordings table to compute per-org completed egress count and total duration for sliding windows (1d, 7d, 30d), then updates the lk_egress_completed and lk_egress_total_duration_seconds gauges. Uses the partial index idx_recordings_completed_ended_at (migration 000020). |
| Presence count refresh | 5 min | Queries the room_presence table to count currently connected participants per organization, then updates the lk_participants_connected gauge. |
| Activity purge | 24 hours | Deletes rows older than 90 days from participant_activity_metrics to keep the table bounded. 90 days covers the MAU window (30d) with margin for future quarterly metrics. |
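All of these tickers are serviced by the same goroutine that drains the event channel, typically via a single select loop. The following is a simplified sketch of what that run loop could look like; the helper method names and the consolidation of the three 5-minute refreshes into one ticker are illustrative, not the actual implementation:

```go
// Simplified sketch of the recorder's run loop; ticker intervals match the table above.
func (r *recorderSketch) run() {
    dedupTicker := time.NewTicker(10 * time.Minute)  // dedup cleanup
    licenseTicker := time.NewTicker(time.Hour)       // license refresh
    refreshTicker := time.NewTicker(5 * time.Minute) // activity / egress / presence refresh
    purgeTicker := time.NewTicker(24 * time.Hour)    // activity purge
    defer dedupTicker.Stop()
    defer licenseTicker.Stop()
    defer refreshTicker.Stop()
    defer purgeTicker.Stop()

    for {
        select {
        case ev := <-r.queue: // webhook events, processed in arrival order
            r.process(ev)
        case <-dedupTicker.C:
            r.cleanupDedup() // drop expired event IDs from the dedup map
        case <-licenseTicker.C:
            r.refreshLicenses() // re-read monthly_recording_minutes per org
        case <-refreshTicker.C:
            r.refreshActivityMetrics() // WAU/MAU gauges from participant_activity_metrics
            r.refreshEgressStats()     // DB-backed egress gauges from recordings
            r.refreshPresenceCounts()  // lk_participants_connected from room_presence
        case <-purgeTicker.C:
            r.purgeOldActivity() // delete rows older than 90 days
        }
    }
}
```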
Ephemeral State for Duration Calculations
Why This State Exists
To compute duration-based metrics (e.g., lk_room_duration_seconds, lk_participant_lifetime_seconds), the recorder needs to know when things started. When a room_finished webhook arrives, the recorder must calculate finished_time - started_time. But the room_finished event doesn't include the start time - that information was only available in the earlier room_started event.
The solution is simple: when room_started arrives, the recorder stores the start timestamp in memory. When room_finished arrives later, it retrieves that timestamp, calculates the duration, and then deletes the entry.
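A minimal sketch of this pattern for rooms is shown below; the participants, egress, and ingress maps follow the same store-on-start, delete-on-end shape. Field and variable names are illustrative:

```go
// Illustrative start/end pairing for room durations; not the actual recorder code.
type roomState struct {
    startedAt time.Time
    orgID     string
}

func (r *recorderSketch) onRoomStarted(roomSID, orgID string) {
    r.rooms[roomSID] = roomState{startedAt: time.Now(), orgID: orgID}
}

func (r *recorderSketch) onRoomFinished(roomSID string) {
    state, ok := r.rooms[roomSID]
    if !ok {
        // Start event never seen (e.g., pod restarted mid-room): skip the histogram.
        // Counters are unaffected.
        return
    }
    delete(r.rooms, roomSID) // the map only ever holds currently active rooms

    // Observe the duration; the real metric also carries project/region labels.
    r.roomDuration.WithLabelValues(state.orgID).Observe(time.Since(state.startedAt).Seconds())
}
```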
What's Stored
| State Map | Key | Stored Value | Purpose |
|---|---|---|---|
| rooms | room SID | Start timestamp + org_id | Calculate room duration on room_finished |
| participants | participant SID | Join timestamp + room SID + org_id + identity (internal DB user ID as string) + userID (parsed int64) | Calculate session duration on participant_left and persist session to DB for WAU/MAU |
| tracks | room SID + source + type | Integer count | Track active media streams per org_id |
| egress | egress ID | Start timestamp + org_id + request_type + room_name | Calculate recording duration on egress_ended |
| ingress | ingress ID | Start timestamp + org_id + input_type + room_name | Calculate stream duration on ingress_ended |
Key Characteristics
- Ephemeral: This state exists only in memory. If the pod restarts, it's lost. This is acceptable because we only lose in-flight duration calculations - the counters in Prometheus (which are persisted in Grafana Cloud) remain accurate.
- Bounded Size: The state only contains currently active entities. Once a room finishes or a participant leaves, their entry is deleted. The maps cannot grow unbounded - they shrink back down as activity ends.
- Not Critical for Counters: Counters (*_total metrics) don't need this state - they just increment on each event. Only histograms (durations) require start/end pairing.
- Trade-off: We could eliminate this state by storing start times in the database or Redis, but that would add latency and complexity for minimal benefit. The current approach is simple and fits the resource constraints.
Memory Usage (Not a Concern)
A common question is whether these in-memory maps will cause problems as customer usage grows. The answer is no - the memory usage is negligible:
| Scenario | Concurrent Entities | Estimated Memory |
|---|---|---|
| Small scale | 50 participants, 10 rooms | ~15 KB |
| Medium scale | 500 participants, 100 rooms | ~100 KB |
| Large scale | 5,000 participants, 1,000 rooms | ~1 MB |
| Extreme scale | 50,000 participants, 10,000 rooms | ~10 MB |
Each map entry is roughly 100-150 bytes (Go string headers, timestamps, a few pointers). Even at extreme scales that far exceed expected usage, the memory footprint is tiny compared to the pod's 300MB allocation.
Why This Isn't a Scaling Concern
- Maps are bounded by concurrent activity, not total activity: If 1 million participants join and leave throughout the day, but only 500 are ever connected simultaneously, the maps only ever hold ~500 entries.
- Entries are automatically cleaned up: When a participant_left event arrives, the entry is immediately deleted. There's no accumulation over time.
- The data is trivially small: Timestamps and short strings. No large objects, no nested structures, no media data.
📌 Note for Future Scaling
The in-memory maps are not the bottleneck for scaling this system. If scaling becomes a concern, the limiting factor would be the single-goroutine event processing model (see Scaling Considerations below), not memory usage.
Scaling Considerations
This section separates two distinct concerns that are sometimes conflated: memory usage and throughput. They are independent issues with different thresholds and solutions.
Memory Usage: Not a Concern
As explained above, the in-memory state maps use negligible memory (~1-2MB even at thousands of concurrent participants). This is a non-issue for the foreseeable future and doesn't require any action.
Throughput: The Single-Goroutine Model
The WebhookRecorder processes events in a single goroutine, reading from a buffered channel. This design is intentionally simple:
[Webhook Handler] --enqueue--> [Buffered Channel (1000)] --dequeue--> [Single Processing Goroutine]
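The non-blocking Handle call is the enqueue side of this channel. A sketch of the pattern, assuming a queue field of type chan *livekit.WebhookEvent (the drop branch is what produces the "queue full" warning described under Backpressure Handling):

```go
// Sketch of the non-blocking enqueue; the real method may differ in detail.
func (r *recorderSketch) Handle(ev *livekit.WebhookEvent) {
    select {
    case r.queue <- ev:
        // Enqueued; the single processing goroutine will pick it up in order.
    default:
        // Queue full (QueueSize exceeded): drop instead of blocking the HTTP handler,
        // and log the event_type and event_id for debugging.
        log.Printf("Webhook recorder: queue full, dropping event type=%s id=%s", ev.Event, ev.Id)
    }
}
```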
Why Single-Threaded?
- No synchronization needed: Metric updates and state map operations don't require mutexes because only one goroutine accesses them.
- Predictable ordering: Events for the same room/participant are processed in order.
- Simple debugging: No race conditions, no deadlocks, no concurrent access bugs.
When Would This Become a Problem?
The single goroutine can process roughly 500-2,000 events per second (depending on database latency for org lookups). Each event involves:
- Deduplication check (fast, in-memory)
- Metric updates (fast, in-memory)
- Optional database lookup (slower, ~5-50ms)
To stress this system, you'd need sustained event rates that exceed processing capacity. Here's a rough estimate:
| Concurrent Participants | Estimated Events/Second | System Status |
|---|---|---|
| 100 | ~5-10 | Comfortable |
| 1,000 | ~50-100 | Comfortable |
| 5,000 | ~250-500 | Approaching limit |
| 10,000+ | ~500-1000+ | May need optimization |
Events per second is estimated based on typical activity: participants joining/leaving, enabling/disabling tracks. Heavy activity (frequent track toggles) increases this.
Signs of Throughput Issues
"Webhook recorder: queue full, dropping event"in logs- Growing lag between event timestamps and processing time
lk_webhook_delivery_lag_secondsp99 increasing over time
Future Solutions (If Needed)
If throughput becomes a concern at 10,000+ concurrent participants:
- Shard by room: Multiple processing goroutines, each handling a subset of rooms. Maintains per-room ordering while increasing parallelism.
- Batch database lookups: Cache org_id by room for a short TTL, reducing database round-trips.
- Horizontal scaling: Multiple backend pods behind a load balancer, each handling a portion of webhooks.
These optimizations are not currently implemented because they add complexity that isn't justified at current scale. The single-goroutine model handles Qubital's expected load with significant headroom.
Organization Lookup¶
Location: internal/features/metrics/service/org_lookup.go
Every metric needs an org_id label for multi-tenant filtering. The OrgLookup service resolves room IDs (which LiveKit knows) to organization IDs (which LiveKit doesn't know).
Resolution Process:
- Receive the room name from the webhook (this is the room's UUID in our database)
- Query: SELECT organizations.workos_org_id, organizations.name FROM rooms JOIN organizations ON rooms.organization_id = organizations.id WHERE rooms.id = ?
- Return OrgData{OrgID, OrgName, MonthlyRecordingMinutes} for metric labeling and license tracking (see the sketch below)
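For illustration, the lookup could be written against database/sql roughly as follows. The real service goes through the project's repository layer, so treat this as a sketch of the query above rather than the actual code:

```go
// Illustrative lookup with database/sql; the real code uses the repository layer.
func lookupOrg(ctx context.Context, db *sql.DB, roomID string) (OrgData, error) {
    const q = `
        SELECT organizations.workos_org_id, organizations.name
        FROM rooms
        JOIN organizations ON rooms.organization_id = organizations.id
        WHERE rooms.id = $1`

    var data OrgData
    if err := db.QueryRowContext(ctx, q, roomID).Scan(&data.OrgID, &data.OrgName); err != nil {
        // sql.ErrNoRows here corresponds to an org-lookup failure
        // (room_not_found / no_organization in the real code), which
        // increments lk_org_lookup_failures_total.
        return OrgData{}, err
    }
    return data, nil
}
```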
Why org_id Instead of org_name?
Organization names are mutable - customers can rename their organization at any time. If we used names as metric labels, renaming would create a new time series and break historical continuity. WorkOS organization IDs are immutable, ensuring metrics remain linked to the same organization forever.
To display friendly names in Grafana, we emit a separate lk_org_info{org_id, org_name} metric that maps IDs to names. Grafana can join this using PromQL's group_left.
Recording Limit Resolution
In addition to resolving org identity, OrgLookup also fetches each organization's monthly recording limit from the org_licenses table. This limit (e.g., 1200 minutes for a standard plan) is included in the OrgData struct and emitted as the lk_org_recording_limit_minutes gauge when an organization is first discovered. A background ticker re-reads the limits from the database every hour to pick up plan changes (upgrades or downgrades) without requiring a restart. This enables Grafana to draw dynamic threshold lines per tenant on the "Monthly Recording Minutes" panel, replacing the previous hardcoded vector(1200) approach.
Metrics Endpoint (GET /metrics)¶
Location: pkg/metrics/handler.go and cmd/main.go
The /metrics endpoint exposes all registered Prometheus metrics in the standard text exposition format. When Grafana Alloy (or any Prometheus-compatible scraper) sends a GET request, it receives output like:
# HELP lk_participants_active active participants
# TYPE lk_participants_active gauge
lk_participants_active{org_id="org_01ABC",project="qubital",region="eu"} 15
lk_participants_active{org_id="org_02XYZ",project="qubital",region="eu"} 8
# HELP lk_participant_sessions_total participant sessions started
# TYPE lk_participant_sessions_total counter
lk_participant_sessions_total{org_id="org_01ABC",project="qubital",region="eu"} 1247
The endpoint is stateless - it simply reads current values from the Prometheus registry and formats them. There's no computation or database access during scraping.
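Wiring is minimal: the registry is exposed through the standard promhttp handler. A sketch of how the route can be mounted in Gin (the actual wiring lives in pkg/metrics/handler.go and cmd/main.go and may differ in detail):

```go
// Sketch: exposing the registry via Gin. A scrape only serializes current values
// from the registry - no database access happens while serving /metrics.
registry := prometheus.NewRegistry()
// ... metric vectors are registered on this registry at startup ...

router := gin.New()
router.GET("/metrics", gin.WrapH(promhttp.HandlerFor(registry, promhttp.HandlerOpts{})))
```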
Grafana Alloy¶
Role in the System¶
Grafana Alloy is a vendor-neutral observability collector that bridges the gap between the backend and Grafana Cloud. While we could have the backend push metrics directly to Grafana Cloud, using Alloy provides several benefits:
- Decoupling: The backend doesn't need Grafana Cloud credentials or awareness of the remote write protocol.
- Reliability: Alloy handles retries, backpressure, and temporary network failures gracefully.
- Batching: Instead of sending each metric individually, Alloy batches samples for efficient transmission.
- Label Enrichment: Alloy adds infrastructure-level labels (environment, region) that the backend shouldn't need to know about.
Scraping Configuration¶
Alloy scrapes the backend every 15 seconds. This interval balances freshness (metrics are at most 15 seconds old) against load (200+ scrapes per hour is reasonable for a small pod).
prometheus.scrape "backend" {
job_name = "qubital-backend"
scrape_interval = "15s"
scrape_timeout = "10s"
scheme = "http"
metrics_path = "/metrics"
targets = [
{ "__address__" = env("BACKEND_SCRAPE_TARGET") }, // e.g., "backend-web:3001"
]
forward_to = [prometheus.remote_write.grafanacloud.receiver]
}
Remote Write Configuration¶
After scraping, Alloy forwards metrics to Grafana Cloud's Prometheus endpoint. It adds external labels that apply to all metrics from this Alloy instance:
prometheus.remote_write "grafanacloud" {
external_labels = {
environment = env("ENVIRONMENT"), // "test", "dev", or "prod"
region = env("REGION"), // "eu", "us", etc.
service = "qubital-backend",
}
endpoint {
url = env("GRAFANA_PROM_REMOTE_WRITE_URL")
basic_auth {
username = env("GRAFANA_PROM_USERNAME")
password = env("GRAFANA_API_KEY")
}
queue_config {
max_samples_per_send = 5000
batch_send_deadline = "20s"
retry_on_http_429 = true
}
}
}
Label Strategy¶
Labels are added at two levels, creating a clear separation of concerns:
| Label | Added By | Scope | Purpose |
|---|---|---|---|
| org_id | Backend | Per-metric | Multi-tenant customer identification |
| project | Backend | Per-metric | LiveKit project identifier |
| source, type | Backend | Track metrics only | Media track classification |
| request_type | Backend | Egress metrics only | Recording type classification |
| environment | Alloy | All metrics | Deployment environment (test/dev/prod) |
| region | Alloy | All metrics | Infrastructure region |
| service | Alloy | All metrics | Service identifier for filtering |
This separation means the backend doesn't need to know which environment it's running in - that's infrastructure configuration handled by Alloy.
Grafana Cloud¶
Prometheus Storage¶
Grafana Cloud provides managed Prometheus storage with:
- 13-Month Retention: Metrics are stored for over a year, enabling long-term trend analysis and year-over-year comparisons.
- High Availability: Data is replicated across multiple availability zones.
- Automatic Scaling: Storage and query capacity scale automatically based on usage.
- PromQL Interface: Full PromQL support for complex queries and aggregations.
Querying Metrics¶
Example PromQL queries for common use cases:
# Active participants for a specific organization in production
lk_participants_active{org_id="org_01ABC", environment="prod"}
# Participant session duration percentiles (p50, p90, p99)
histogram_quantile(0.50, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.90, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(lk_participant_lifetime_seconds_bucket{org_id="$org_id"}[5m])) by (le))
# Recording success rate over the last hour
sum(rate(lk_egress_ended_total{result="success", org_id="$org_id"}[1h]))
/
sum(rate(lk_egress_ended_total{org_id="$org_id"}[1h])) * 100
# Camera adoption rate (% of participants with camera enabled)
sum(lk_tracks_active{source="camera", org_id="$org_id"})
/
sum(lk_participants_active{org_id="$org_id"}) * 100
# Weekly Active Users (WAU) per tenant
lk_unique_participants{org_id="$org_id", window="7d"}
# Monthly Active Users (MAU) per tenant
lk_unique_participants{org_id="$org_id", window="30d"}
# Average time online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}
# Recording limit threshold line (dynamic per tenant, read from DB)
lk_org_recording_limit_minutes{org_id="$org_id"}
Multi-Tenant Dashboard Variables¶
To display organization names while filtering by stable IDs:
- Create org_name variable: label_values(lk_org_info, org_name) - This shows a dropdown of human-readable organization names.
- Create hidden org_id variable: label_values(lk_org_info{org_name="$org_name"}, org_id) - This automatically resolves the selected name to its ID.
- Use in queries: lk_participants_active{org_id="$org_id"} - Filtering happens on the stable ID, but users see friendly names.
Metrics Reference¶
Room Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_room_active | Gauge | project, region, org_id | +1 on room_started, -1 on room_finished | Number of currently active rooms. Incremented when a room is created and decremented when the last participant leaves (or the room is explicitly deleted). Subject to gauge drift if events are lost during pod restarts. |
| lk_rooms_started_total | Counter | project, region, org_id | room_started | Cumulative count of rooms created since the last pod restart. Survives Prometheus scrape resets via rate()/increase(). Use increase(lk_rooms_started_total[1h]) for rooms created in the last hour. |
| lk_rooms_finished_total | Counter | project, region, org_id | room_finished | Cumulative count of rooms closed. The difference increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h]) gives a drift-resistant approximation of active rooms. |
| lk_room_duration_seconds | Histogram | project, region, org_id | room_finished | Distribution of room lifetimes in seconds (time between room_started and room_finished). Use histogram_quantile() for percentiles. Requires the in-memory rooms state map to pair start/end events. |
Histogram Buckets: Exponential from 30s to ~3 hours (30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953 seconds)
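For reference, a histogram with those buckets would be declared roughly as follows with the Prometheus client library; the variable name and help text are illustrative, while the bucket boundaries and labels are the ones documented above:

```go
// Sketch of a duration histogram using the bucket boundaries listed above.
roomDuration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "lk_room_duration_seconds",
    Help:    "Distribution of room lifetimes in seconds",
    Buckets: []float64{30, 54, 97, 175, 315, 567, 1020, 1837, 3307, 5953},
}, []string{"project", "region", "org_id"})

registry.MustRegister(roomDuration)
```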
Participant Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_participants_active | Gauge | project, region, org_id | +1 on participant_joined, -1 on participant_left/connection_aborted | Currently connected participants. Only counts standard participants (STANDARD kind) — service participants like egress bots, ingress, SIP bridges, and agents are excluded. |
| lk_participant_sessions_total | Counter | project, region, org_id | participant_joined | Cumulative count of participant sessions started. One session = one join event. If a user disconnects and reconnects, that counts as two sessions. |
| lk_participant_lifetime_seconds | Histogram | project, region, org_id | participant_left | Distribution of individual session durations in seconds (time between participant_joined and participant_left). On participant_left, the completed session is also persisted to the participant_activity_metrics DB table for WAU/MAU computation. |
Histogram Buckets: Exponential from 15s to ~3 hours (15, 27, 48, 87, 157, 283, 509, 916, 1649, 2969 seconds)
Participant Activity Metrics (WAU/MAU)¶
These metrics are not derived from ephemeral in-memory state like the other metrics above. They are computed from the participant_activity_metrics database table, which stores daily pre-aggregated participant sessions. A background ticker queries the table every 5 minutes to refresh these gauges.
Why a database table? Prometheus counters and histograms handle restarts via reset detection, but counting distinct identities over a sliding time window (e.g., "how many unique users were active in the last 7 days?") requires durable state. Prometheus has no COUNT(DISTINCT) equivalent. The database table provides the per-user granularity needed for these business KPIs.
Daily pre-aggregation: Instead of storing one row per session (which would scale to millions of rows), the table stores one row per user per organization per day via UPSERT. On each participant_left webhook, the session duration is added to the existing row for that user+org+day, or a new row is inserted if none exists. This reduces storage by 3-5x while preserving the per-user granularity needed for COUNT(DISTINCT user_id).
Data retention: Rows older than 90 days are purged daily by a background ticker. 90 days covers the current MAU window (30d) and WAU window (7d) with margin for future quarterly metrics. If a quarterly (90d) aggregation window is added, retention should be extended to at least 120 days.
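A sketch of that UPSERT as it might be issued from Go is shown below. The column names org_id, user_id, activity_date, and total_duration_seconds follow the descriptions in this section, but the actual schema (see migration 000010) may differ:

```go
// Sketch of the daily pre-aggregation UPSERT; column names are assumed from the
// description above and may differ from the actual migration.
const upsertActivity = `
    INSERT INTO participant_activity_metrics (org_id, user_id, activity_date, total_duration_seconds)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (org_id, user_id, activity_date)
    DO UPDATE SET total_duration_seconds =
        participant_activity_metrics.total_duration_seconds + EXCLUDED.total_duration_seconds`

// recordSession adds one completed session's duration to that user's row for today.
func recordSession(ctx context.Context, db *sql.DB, orgID string, userID int64, durationSec int64) error {
    today := time.Now().UTC().Truncate(24 * time.Hour)
    _, err := db.ExecContext(ctx, upsertActivity, orgID, userID, today, durationSec)
    return err
}
```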
| Metric | Type | Labels | Data Source | Description |
|---|---|---|---|---|
| lk_unique_participants | Gauge | project, region, org_id, window | DB query (5 min refresh) | Number of distinct users active per organization within the sliding time window. COUNT(DISTINCT user_id) from participant_activity_metrics where activity_date > now() - window. |
| lk_participant_total_online_seconds | Gauge | project, region, org_id, window | DB query (5 min refresh) | Total seconds spent online by all participants per organization within the sliding time window. SUM(total_duration_seconds) from participant_activity_metrics. |
window Values: 7d (WAU - Weekly Active Users), 30d (MAU - Monthly Active Users)
Grafana Usage:
# Weekly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="7d"}
# Monthly Active Users for a specific tenant
lk_unique_participants{org_id="$org_id", window="30d"}
# Average time spent online per unique user (7-day window)
lk_participant_total_online_seconds{org_id="$org_id", window="7d"}
/
lk_unique_participants{org_id="$org_id", window="7d"}
Egress Stats Metrics (DB-Backed)¶
These metrics are not derived from ephemeral in-memory state. They are computed from the recordings database table, which stores one row per completed egress (recording) job. A background ticker queries the table every 5 minutes to refresh these gauges with sliding time windows (1d, 7d, 30d).
Why DB-backed gauges when we already have in-memory egress counters? The in-memory metrics (lk_egress_ended_total, lk_egress_duration_seconds) reset to zero on pod restart. For billing-critical data like "how many recordings did this customer complete this month" or "how many total recording minutes were consumed", losing state on restart is unacceptable. These DB-backed gauges read from the recordings table which persists across restarts, providing durable billing and usage metrics.
Sliding windows: Same pattern as the WAU/MAU gauges above. The ticker queries the recordings table three times per cycle — once for each window (1d, 7d, 30d) — using the GetCompletedStatsSince repository method. Each query filters on status = 'completed' AND ended_at >= now() - window and groups by organization. The partial index idx_recordings_completed_ended_at (migration 000020) ensures these queries use an index scan, not a sequential scan, even as the recordings table grows.
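A sketch of the per-window refresh is shown below. It assumes an organization_id column on recordings and collapses the GetCompletedStatsSince repository method into an inline query; the real gauges also carry project and region labels:

```go
// Illustrative per-window refresh for the DB-backed egress gauges; not the actual code.
func refreshEgressStats(ctx context.Context, db *sql.DB, completed, duration *prometheus.GaugeVec) error {
    windows := map[string]time.Duration{
        "1d":  24 * time.Hour,
        "7d":  7 * 24 * time.Hour,
        "30d": 30 * 24 * time.Hour,
    }
    const q = `
        SELECT organization_id, COUNT(*), COALESCE(SUM(duration_seconds), 0)
        FROM recordings
        WHERE status = 'completed' AND ended_at >= $1
        GROUP BY organization_id` // served by the partial index idx_recordings_completed_ended_at

    for label, window := range windows {
        rows, err := db.QueryContext(ctx, q, time.Now().Add(-window))
        if err != nil {
            return err
        }
        for rows.Next() {
            var orgID string
            var count int64
            var seconds float64
            if err := rows.Scan(&orgID, &count, &seconds); err != nil {
                rows.Close()
                return err
            }
            completed.WithLabelValues(orgID, label).Set(float64(count))
            duration.WithLabelValues(orgID, label).Set(seconds)
        }
        rows.Close()
    }
    return nil
}
```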
| Metric | Type | Labels | Data Source | Description |
|---|---|---|---|---|
| lk_egress_completed | Gauge | project, region, org_id, window | DB query (5 min refresh) | Number of completed egress (recording) jobs per organization within the sliding time window. COUNT(*) from recordings where status = 'completed' AND ended_at >= now() - window. |
| lk_egress_total_duration_seconds | Gauge | project, region, org_id, window | DB query (5 min refresh) | Total recording seconds per organization within the sliding time window. SUM(duration_seconds) from recordings where status = 'completed' AND ended_at >= now() - window. |
window Values: 1d (daily), 7d (weekly), 30d (monthly)
Grafana Usage:
# Monthly completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="30d"}
# Daily completed recordings for a specific tenant
lk_egress_completed{org_id="$org_id", window="1d"}
# Monthly recording minutes used (for billing)
lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60
# Recording usage as percentage of plan limit
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100
Connected Participants Metric (DB-Backed)¶
This metric is not derived from ephemeral in-memory state. It is a point-in-time snapshot computed from the room_presence database table, which stores one row per currently connected participant. A background ticker queries the table every 5 minutes to refresh this gauge.
Why DB-backed when we already have lk_participants_active? The in-memory gauge lk_participants_active tracks join/leave events and is subject to two failure modes: (1) it resets to zero on pod restart, and (2) it can drift from reality if events are lost. The DB-backed lk_participants_connected reads the actual room_presence table state, so it self-corrects every 5 minutes regardless of missed webhooks or restarts.
No time window needed: Unlike the egress stats above, this metric is inherently a snapshot — "how many participants are connected right now." Grafana can apply any time-based function on top of the scraped history: max_over_time for daily peaks, avg_over_time for averages, etc.
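A sketch of that refresh, assuming an organization_id column on room_presence; the real gauge also carries project and region labels:

```go
// Illustrative refresh of the DB-backed connected-participants gauge. A fuller version
// would also reset gauges for organizations that no longer have any presence rows.
func refreshPresence(ctx context.Context, db *sql.DB, connected *prometheus.GaugeVec) error {
    rows, err := db.QueryContext(ctx,
        `SELECT organization_id, COUNT(*) FROM room_presence GROUP BY organization_id`)
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        var orgID string
        var count int64
        if err := rows.Scan(&orgID, &count); err != nil {
            return err
        }
        connected.WithLabelValues(orgID).Set(float64(count))
    }
    return rows.Err()
}
```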
| Metric | Type | Labels | Data Source | Description |
|---|---|---|---|---|
| lk_participants_connected | Gauge | project, region, org_id | DB query (5 min refresh) | Currently connected participants per organization, read from the room_presence table. COUNT(*) grouped by organization. Self-corrects every 5 minutes — immune to gauge drift and pod restarts. |
Grafana Usage:
# Current connected participants for a tenant
lk_participants_connected{org_id="$org_id"}
# Peak connected participants over the last day
max_over_time(lk_participants_connected{org_id="$org_id"}[1d])
# Average connected participants over the last week
avg_over_time(lk_participants_connected{org_id="$org_id"}[7d])
Track Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_tracks_active | Gauge | project, region, org_id, source, type | +1 on track_published, -1 on track_unpublished | Currently published media tracks, broken down by source (camera, microphone, screen share) and type (audio, video). Useful for computing feature adoption rates, e.g., lk_tracks_active{source="camera"} / lk_participants_active gives the camera-on percentage. |
| lk_tracks_published_total | Counter | project, region, org_id, source, type | track_published | Cumulative count of tracks published. Each time a participant enables their camera, microphone, or starts a screen share, this counter increments. |
Source Values: camera, microphone, screen_share, screen_share_audio, unknown
Type Values: audio, video, unknown
Egress (Recording) Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_egress_active | Gauge | project, region, org_id, request_type | +1 on egress_started, -1 on egress_ended | Currently running recording jobs. Broken down by request_type to distinguish room composite recordings from track-level or web recordings. |
| lk_egress_started_total | Counter | project, region, org_id, request_type | egress_started | Cumulative count of recording jobs started. On egress_started, the recorder also stores the start state (timestamp, org, request type) in-memory for duration calculation when the recording ends. |
| lk_egress_ended_total | Counter | project, region, org_id, request_type, result | egress_ended | Cumulative count of recording jobs completed, labeled by result (success or failed). On egress_ended, the recorder also persists the recording completion data (status, duration, expiry) to the recordings database table. |
| lk_egress_duration_seconds | Histogram | project, region, org_id, request_type, result | egress_ended | Distribution of recording durations in seconds. Useful for monitoring typical recording lengths and detecting abnormally short recordings (potential failures). |
request_type Values: room_composite, web, track, participant, unknown
result Values: success, failed
Histogram Buckets: 30s, 60s, 120s, 300s, 600s, 1200s, 1800s, 3600s, 7200s, 14400s
Ingress (Stream Import) Metrics¶
| Metric | Type | Labels | Webhook Trigger | Description |
|---|---|---|---|---|
| lk_ingress_active | Gauge | project, region, org_id, input_type | +1 on ingress_started, -1 on ingress_ended | Currently running stream import jobs, broken down by input protocol (RTMP, WHIP, URL). |
| lk_ingress_started_total | Counter | project, region, org_id, input_type | ingress_started | Cumulative count of stream import jobs started. |
| lk_ingress_ended_total | Counter | project, region, org_id, input_type, result | ingress_ended | Cumulative count of stream import jobs completed, labeled by result (success or failed). Failures are detected from the ingress state's error field or ENDPOINT_ERROR status. |
| lk_ingress_duration_seconds | Histogram | project, region, org_id, input_type, result | ingress_ended | Distribution of stream import durations in seconds. |
input_type Values: rtmp, whip, url, unknown
result Values: success, failed
Organization License Metrics¶
| Metric | Type | Labels | Data Source | Description |
|---|---|---|---|---|
| lk_org_recording_limit_minutes | Gauge | project, region, org_id | DB query (hourly refresh) | Monthly recording limit in minutes for the organization, read from the org_licenses table. Emitted on first org discovery and refreshed hourly. Used in Grafana to draw a dynamic per-tenant threshold line on the "Monthly Recording Minutes" panel. Zero if the organization has no license. |
Grafana Usage:
# Dynamic threshold line per tenant on the recording minutes panel
lk_org_recording_limit_minutes{org_id="$org_id"}
# Recording usage as percentage of plan limit (uses DB-backed gauge)
(lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60)
/
lk_org_recording_limit_minutes{org_id="$org_id"} * 100
Webhook Health Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_webhook_events_total | Counter | event | Total webhook events received, broken down by event type (e.g., room_started, participant_joined). Use rate(lk_webhook_events_total[5m]) to monitor event throughput. A sudden drop to zero during business hours indicates a connectivity issue between LiveKit and the backend. |
| lk_webhook_delivery_lag_seconds | Histogram | event | Time in seconds between when the event actually occurred in LiveKit (CreatedAt timestamp) and when the backend received the webhook. High p99 values indicate network latency or backend processing delays. Only recorded for events with a positive CreatedAt timestamp. |
| lk_webhook_duplicates_total | Counter | event | Count of duplicate webhook events detected via the deduplication map (same event ID seen within the 10-minute TTL window). LiveKit retries webhook delivery on timeout, so some duplicates are expected. A sustained high rate may indicate the backend is responding too slowly, causing LiveKit to retry aggressively. |
System Health Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| lk_org_lookup_failures_total | Counter | reason | Failed room → organization lookups. With proper FK constraints, any increment indicates a data integrity issue requiring immediate investigation. |
| lk_org_info | Gauge | org_id, org_name | Metadata metric mapping org_id to org_name for Grafana group_left joins. Always has value 1. Emitted once per organization on first discovery via a webhook event. |
reason Values: not_configured (OrgLookup interface not provided), room_not_found (room doesn't exist in database), no_organization (room exists but has no organization - FK violation), no_org_id (organization exists but has no WorkOS ID - sync issue)
Complete Data Flow¶
Sequence Diagram: Webhook to Dashboard¶
sequenceDiagram
participant LK as LiveKit Cloud
participant BE as Backend /webhook
participant REC as WebhookRecorder
participant DB as PostgreSQL
participant REG as Prometheus Registry
participant ME as Backend /metrics
participant AL as Grafana Alloy
participant GC as Grafana Cloud
Note over LK,GC: 1. Event Occurs (participant joins room)
LK->>BE: POST /webhook<br/>{event: "participant_joined", id: "evt_123", ...}
BE->>BE: Validate webhook signature
BE->>REC: Handle(event) [non-blocking enqueue]
BE-->>LK: 200 OK
Note over REC,REG: 2. Async Processing (in background goroutine)
REC->>REC: Check deduplication (event.id not seen?)
REC->>REC: Record lk_webhook_events_total++
REC->>REC: Record lk_webhook_delivery_lag_seconds
REC->>DB: SELECT org_id FROM rooms JOIN organizations
DB-->>REC: {org_id: "org_01ABC", org_name: "Acme Corp"}
REC->>REC: Store participant start time (for duration calc later)
REC->>REG: lk_participants_active{org_id="org_01ABC"}.Inc()
REC->>REG: lk_participant_sessions_total{org_id="org_01ABC"}.Inc()
Note over AL,GC: 3. Periodic Scrape (every 15 seconds)
AL->>ME: GET /metrics
ME->>REG: Collect all metric values
REG-->>ME: Prometheus text format
ME-->>AL: lk_participants_active{org_id="org_01ABC"} 15\n...
AL->>AL: Add external_labels (environment, region, service)
AL->>GC: remote_write (batched, compressed, with retries)
GC->>GC: Store in Prometheus
Note over GC: 4. Query & Visualize
GC->>GC: Dashboard executes PromQL queries
GC->>GC: Render panels with current and historical data
Configuration Reference¶
Backend Environment Variables¶
| Variable | Required | Description | Example |
|---|---|---|---|
| LIVEKIT_API_KEY | Yes | LiveKit API key for webhook signature validation | APIxxxxxxxx |
| LIVEKIT_API_SECRET | Yes | LiveKit API secret for webhook signature validation | secret-key |
| LIVEKIT_PROJECT | No | Project name for metric labels | qubital |
Grafana Alloy Environment Variables¶
| Variable | Required | Description | Example |
|---|---|---|---|
| ENVIRONMENT | Yes | Environment name for external label | prod |
| REGION | Yes | Deployment region for external label | eu |
| BACKEND_SCRAPE_TARGET | Yes | Backend service DNS and port | backend-web:3001 |
| GRAFANA_PROM_REMOTE_WRITE_URL | Yes | Grafana Cloud Prometheus endpoint | https://prometheus-xxx.grafana.net/api/prom/push |
| GRAFANA_PROM_USERNAME | Yes | Grafana Cloud instance ID | 123456 |
| GRAFANA_API_KEY | Yes | Grafana Cloud API key with write permissions | glc_eyJ... |
Recorder Configuration¶
The WebhookRecorder has two tunable parameters:
type Config struct {
QueueSize int // Default: 1000 events
DedupTTL time.Duration // Default: 10 minutes
}
- QueueSize: Maximum number of events waiting to be processed. If exceeded, new events are dropped (and logged). Increase if you see "queue full" warnings.
- DedupTTL: How long to remember event IDs for deduplication. Must be longer than LiveKit's retry window (typically a few minutes).
Operational Considerations¶
Gauge Drift¶
Gauge metrics (lk_*_active) track current counts by incrementing on "start" events and decrementing on "end" events. This approach can drift from reality if events are lost:
Causes of Drift:
- Network issues between LiveKit and the backend
- Backend pod restarts (in-flight events are lost)
- Queue overflow (events dropped due to backpressure)
Detection:
- Compare lk_rooms_started_total - lk_rooms_finished_total with lk_room_active - they should match
- Monitor lk_webhook_duplicates_total - high rates may indicate delivery issues
- Alert on lk_org_lookup_failures_total > 0 - indicates data integrity problems
Mitigation:
- Gauges self-correct over time as rooms finish and participants leave
- Counter metrics (*_total) remain accurate regardless of drift
- For critical accuracy, use counter differences: increase(lk_rooms_started_total[1h]) - increase(lk_rooms_finished_total[1h])
Deduplication Behavior¶
LiveKit may retry webhook delivery if it doesn't receive a timely response. The recorder handles this by tracking event IDs for 10 minutes:
- First occurrence: Processed normally
- Subsequent occurrences (same ID within 10 minutes): Skipped, counted in lk_webhook_duplicates_total
- After 10 minutes: ID is forgotten (old events won't be reprocessed anyway)
Backpressure Handling¶
If webhook events arrive faster than they can be processed, the queue (1000 events) acts as a buffer. If it fills completely:
- New events are dropped (not queued)
- A warning is logged: "Webhook recorder: queue full, dropping event"
- The event_type and event_id are included in the log for debugging
This is a safety valve to prevent unbounded memory growth. If you see these warnings, investigate why processing is slow (database latency? high event volume?) or increase the queue size.
Alerting Recommendations¶
| Alert Name | PromQL Condition | Severity | Description |
|---|---|---|---|
| WebhookDeliveryLagHigh | histogram_quantile(0.99, sum(rate(lk_webhook_delivery_lag_seconds_bucket[5m])) by (le)) > 10 | Warning | 99th percentile delivery lag exceeds 10 seconds |
| OrgLookupFailures | increase(lk_org_lookup_failures_total[1h]) > 0 | Critical | Any org lookup failure indicates data integrity issues |
| RecordingFailures | increase(lk_egress_ended_total{result="failed"}[1h]) > 0 | Warning | Recording jobs are failing |
| NoWebhookEvents | rate(lk_webhook_events_total[5m]) == 0 | Critical | No webhook events received (during business hours) |
| HighDuplicateRate | rate(lk_webhook_duplicates_total[5m]) > 1 | Warning | More than 1 duplicate per second indicates delivery issues |
| RecordingLimitApproaching | lk_egress_total_duration_seconds{org_id="$org_id", window="30d"} / 60 > lk_org_recording_limit_minutes{org_id="$org_id"} * 0.8 | Warning | Organization has used 80% of monthly recording limit. Uses the DB-backed gauge which survives pod restarts, replacing the previous increase(lk_egress_duration_seconds_sum[30d]) approach. |
| WAUDropped | lk_unique_participants{window="7d"} == 0 | Warning | No active users in the last 7 days for an organization (during business periods) |
File Reference¶
| File Path | Purpose |
|---|---|
| pkg/metrics/livekit.go | Metric definitions (names, types, labels, help text) and Prometheus registration |
| pkg/metrics/handler.go | HTTP handler wrapper for the /metrics endpoint |
| internal/features/metrics/api/event_listener.go | Webhook HTTP handler - receives and validates LiveKit webhooks |
| internal/features/metrics/service/recorder.go | Background service - processes events and updates metrics |
| internal/features/metrics/service/models.go | Data structures (Config, OrgData, RoomState, etc.) and interfaces |
| internal/features/metrics/service/org_lookup.go | Database queries to resolve room IDs to organization IDs and fetch recording limits |
| internal/features/metrics/service/activity_recorder.go | Adapters bridging repository interfaces to service-layer interfaces: participant activity (ActivityRecorder), egress stats (EgressStatsLoader), and connected participant counts (PresenceCounter). Each adapter defines a narrow repository interface, converts repository types to service-layer types, and provides a constructor for dependency injection. |
| internal/features/metrics/service/presence_manager.go | Single-session enforcement (kicks user from old room when joining new one) |
| internal/repository/participant_activity_metric_repository.go | Database operations for daily pre-aggregated participant activity (UPSERT, aggregate queries, purge) |
| internal/repository/recording_repository.go | Database operations for recordings. Includes GetCompletedStatsSince which powers the lk_egress_completed and lk_egress_total_duration_seconds DB-backed gauges via a GROUP BY query on completed recordings per organization. Also defines the OrgRecordingStats scan target struct. |
| internal/repository/room_presence_repository.go | Database operations for room presence tracking. Includes CountByOrg which powers the lk_participants_connected DB-backed gauge via a GROUP BY query on currently connected participants per organization. Also defines the OrgPresenceCount scan target struct. |
| internal/domain/database/participant_activity_metric.go | GORM model for the participant_activity_metrics table |
| internal/database/migrations/000010_participant_activity_metrics.up.sql | Migration creating the participant_activity_metrics table with daily pre-aggregation schema |
| internal/database/migrations/000020_recordings_completed_index.up.sql | Migration adding a partial index idx_recordings_completed_ended_at on recordings(ended_at) WHERE status = 'completed' to support efficient GetCompletedStatsSince queries for the egress stats DB-backed gauges |
| cmd/main.go | Application entry point - registers /metrics and /webhook endpoints |
| internal/app/app.go | Dependency injection - creates and wires all metrics components, including the egress stats and presence count adapters |