POC — Centrifugo Chat Implementation
Part of the Supabase / DB Restructure Roadmap — the detailed chat phase (Phase 1) of its Event C.
Status: POC / first working version. Not production-hardened. Goal of this POC: a user types a message in the frontend chat → it is stored durably in Postgres → it is delivered live to the other participant(s) via Centrifugo. DMs and group conversations, text only. Everything ephemeral (history, recovery) runs on Centrifugo's memory engine on a single node; Redis is the documented next step, not part of the POC.
This document is the step-by-step build plan. Every decision in it was verified against the backend codebase (
/opt/livekit-backend-go/backend), the actual chat migrations (internal/database/migrations/000011–000031), and the official Centrifugo v6 docs. See §13 Verified facts for the receipts.Migration strategy: EXTEND, not recreate. The existing chat tables (built migrations
000011–000031) are ~90% ready for this design — verified table-by-table in §7. We keep the tables and their data, add two small columns, and decommission the PostgREST-era procedural layer (triggers, RPCs, RLS, grants) as logic moves into Go. We do not drop and rebuild the schema.
Every blue note like this one marks a future / production enhancement that is explicitly out of POC scope. They exist so the POC schema and channel model don't paint us into a corner — each names the column, channel, or mechanism that lands later. Skip them on a first read; they are the prod roadmap, not POC work.
1. Scope
In scope (POC):
- DMs (1:1) and group conversations, text messages only.
- Send → persist → live receive, with optimistic UI and idempotent retries.
- Edit own message, soft-delete own message.
- Read cursors + unread badges.
- Typing indicators.
- Single Centrifugo node, memory engine.
Out of scope (POC) — designed-for, built later (see the blue notes and §14):
- Presence / online status. The app already has a Teams-like status system (online / away / busy / appear-offline, auto-away via heartbeat) on the existing
user_presencetable. The POC does not touch it and does not add Centrifugo presence (which would leak "appear-offline" — see D3). Chat integrates with the existing status system when it merges into the real app. - Attachments / images / files (
chat_attachmentstable already exists; R2 design inCHAT_BACKEND_IMPLEMENTATION.mdis the reference). - Reactions (
chat_reactionstable already exists), threads / replies (parent_message_idreserved, unused). - Org-wide / broadcast conversations (the
type='channel'+visibility/history_modepath — §14). - Multi-node Centrifugo + Redis; transactional outbox; message search, mentions, "last seen".
2. Architecture & topology
2.1 Components
| Component | Role | Hosting |
|---|---|---|
Electron frontend (TS/Vite + centrifuge-js) | Renders chat, holds the WebSocket, optimistic UI | Client |
Backend (Go, Gin, GORM) — cmd/api | Authority: REST API, mints Centrifugo connection and subscription JWTs, persists messages, publishes to Centrifugo | Sevalla pod |
| Centrifugo v6 (OSS) | Transport: holds client WebSockets, fan-out, history/recovery | Sevalla pod |
| PostgreSQL (Supabase-hosted) | Durable source of truth for conversations, members, messages | Supabase |
Guiding principle, applied throughout: durable truth → Postgres; ephemeral state → Centrifugo (memory → Redis). The backend is the authority; Centrifugo is transport. Centrifugo is never the source of truth — if it is fully down, chat still works over REST (sends persist, history loads), only live push is lost.
2.2 Topology — three links, each independently gated
The auth model is token-based (see §3 / §6): the backend authorizes a subscription once, at token-mint time, and Centrifugo enforces the signed token thereafter. There is no subscribe-proxy webhook — Centrifugo never calls back into the backend, so there is no inbound public surface beyond the WebSocket itself.
| # | Link | Exposure | Gate |
|---|---|---|---|
| 1 | FE → Centrifugo (WebSocket) | Public | Connection JWT (HS256), verified by Centrifugo signature check |
| 2 | FE → Backend (REST: messages, connection-token, subscription-token) | Public | Existing WorkOS session (access token in cookie) |
| 3 | Backend → Centrifugo (publish / broadcast / unsubscribe / disconnect API) | Internal port only | Network isolation + API key |
Only Centrifugo's client WebSocket endpoint is public. Its /api, /admin, /metrics, /health live on internal_port and are never publicly reachable (verified: http_server.internal_port).
3. End-to-end flows
Legend — identifiers, and the two layers they live in
The schema has two id layers, and the boundary between them is load-bearing for security. Get this right and the rest follows.
| Field | Layer | Type | What it is |
|---|---|---|---|
users.id, organizations.id, chat_conversations.id, chat_messages.id | Internal | BIGINT (BIGSERIAL) | The real primary keys + foreign keys. Small, sequential, fast for joins/ordering/cursors. Never cross the API or channel boundary (they'd leak/enumerate). This is what the existing tables actually use for org_id, user_id, sender_id, created_by. |
workos_user_id | Boundary | TEXT | WorkOS user id, stable + opaque. The Centrifugo connection sub, and the <user> segment of channel names. Maps to users.id via users.workos_id. |
workos_org_id | Boundary | TEXT | WorkOS org id of the active session. The <org> segment of channel names. Maps to organizations.id via organizations.workos_org_id. |
conversation_public_id | Boundary | UUID | chat_conversations.public_id. The <conv> segment of channel names and what the FE sends. Maps to chat_conversations.id. |
message_public_id | Boundary | UUID (v7) | chat_messages.public_id. The external message id in API responses and publish payloads. Time-ordered. |
client_message_id | FE-generated | UUID | Identifies one compose action (one tap of "send"). Reused only on a retry of that same send. Drives idempotency and optimistic-bubble reconciliation. Stored in chat_messages.client_message_id. |
The boundary rule (security-critical): internal
BIGINTids never appear in a channel name, a JWT, or an API payload; boundary ids (WorkOS ids,public_id) never appear in a foreign key. The backend translates at the edge — it already hasusers.workos_idandorganizations.workos_org_idfor exactly this. Channel names are therefore built only from boundary ids:chat:<workosOrgId>:<conversationPublicId>anduser:<workosOrgId>:<workosUserId>#<workosUserId>.Why org is in every channel name: a user is the same identity across orgs. A personal channel without the org segment would be the same channel in every org the user belongs to → a cross-org leak. Org in the name, verified at token mint, is the core tenancy boundary. (This also answers the "do we need org in channels for the POC" question: yes — for correctness, not just for test convenience.)
The auth model in one paragraph
Two JWT types, both minted by the backend, both HS256:
- Connection token (link 2 → used on link 1): identity only —
sub = workosUserId,exp,meta. In the POC it carries nochannelsclaim. Refreshed silently bycentrifuge-js. - Personal channel (no token): the client subscribes to the user-limited channel
user:<org>:<user>#<user>; Centrifugo enforces that the part after#equalssub— the documented personal-channel pattern, no token and no backend call. - Subscription token (link 2, one per conversation): authorizes one
chat:channel. The backend checks org match + membership once, at mint time, then signs a token bound to the exact channel string with a shortexp, plusjti/iat. Centrifugo verifies the signature,exp, and channel on every (re)subscribe — including after a mass reconnect — without calling the backend.
This is the Centrifugo-idiomatic pattern: subscription tokens "survive a massive reconnect scenario" and keep load off the session backend, unlike a subscribe-proxy webhook that fires an uncached HTTP call on every subscribe (§13).
Flow A — Connect & subscribe
- User is already authenticated to the backend (WorkOS session cookie).
- FE calls
POST /api/v1/realtime/centrifugo-token. Backend (behind existing auth middleware) resolves the session →workosUserId,workosOrgId, internal ids, and mints the connection token:sub=workosUserIdexp= now + 1h (tunable — see D6)meta={ workos_org_id }(server-only; informational/observability — authz no longer depends on it, see C3)- signed HS256 with
CENTRIFUGO_TOKEN_HMAC_SECRET. - (No
channelsclaim in the POC — the personal channel is client-subscribed in step 3; the only future use of the claim is theorg:channel, §14.)
- FE opens WSS to Centrifugo with that token. Centrifugo verifies the signature → connection accepted. The FE then subscribes to its personal channel
user:<workosOrgId>:<workosUserId>#<workosUserId>— a user-limited channel (allow_user_limited_channels: trueon theusernamespace). No token is needed: Centrifugo identifies the user from the connection JWT'ssuband allows the subscribe only if the part after#equalssub(the documented personal-channel pattern). The<org>prefix scopes it to the session's org. - For each conversation the user opens, the FE subscribes to
chat:<workosOrgId>:<conversationPublicId>.centrifuge-jsinvokes that subscription'sgetTokencallback, which callsPOST /api/v1/chat/subscribe-token { conversationPublicId }; the backend checks org + membership and returns a subscription token bound to that channel (Flow A handshake detail in §6). Centrifugo verifies the token → subscribed. - The personal channel stays subscribed for the whole session (subscribed once at connect;
centrifuge-jsauto-resubscribes it on reconnect) and carries inbox pings when a conversation isn't open. - Token refresh is automatic: the connection token refreshes via the SDK's connection
getTokenbeforeexp(~25s grace); each subscription token refreshes via its per-subscriptiongetTokenbefore its ownexp. On org switch or sign-out the backend stops minting and the client disconnects/reconnects with a fresh connection token, then re-subscribes its personal channel under the new org prefix.
When we add broadcast/announcement conversations (type='channel', visibility='org', §14), the connection token gains a channels claim carrying org:<workosOrgId> (and later role:<org>:<role>) as server-side subscriptions, so a single publish reaches the whole org with no member enumeration. Not in the POC: there are no org-wide conversations, so the POC connection token has no channels claim at all.
Flow B — Send a message (hot path)
-
FE generates a
client_message_id(UUID) for this compose action and renders the bubble optimistically, keyed by that id. -
FE calls
POST /api/v1/chat/messages/send{ conversationPublicId, clientMessageId, content }. (User already authenticated.) -
Backend authorizes and validates: resolve
conversationPublicId→ conversation; cross-org check (conversation.org_idmust equal the session's internal org id); membership; rate limit;contentlength ≤ 4000 (app-level — the DB has no length CHECK, see §7). -
In one transaction (
pkg/utils.TransactionManager):INSERTthe message intochat_messageswithclient_message_idset. Idempotency is enforced by the existing unique index(conversation_id, sender_id, client_message_id)(migration 000028, see C1):- Insert succeeds → first time. Continue.
- Unique violation (
23505) → this is a retry.SELECTthe existing row by(conversation_id, sender_id, client_message_id)and return it withisReplay: true. Stop (no second row, no re-publish).
-
Commit.
-
After commit (never inside the tx — publish only once it's committed, so no one is told about a message that could still roll back), publish to Centrifugo via the internal API as one
batchcontaining two commands — two payloads to two audiences:publish("chat:<org>:<conv>", <full message payload>)— the complete message (id, author, content, timestamps) to everyone with that conversation currently open, who renders it immediately.broadcast(["user:<org>:<m1>#<m1>", "user:<org>:<m2>#<m2>", …], <chat.activity ping>)— a thin ping to each member's personal channel. It carries a lightweight preview (conversationPublicId,senderId, ≤200-char snippet, timestamp) — not the full message — so a member with the conversation closed can badge the inbox and update the sidebar row ("Alice: hey…") without a fetch. Full body only ever comes viachat:or the history fetch. (Recipient list comes fromchat_conversation_members— fan-out from the DB, not from Centrifugo connectivity. Building eachuser:<org>:<m>#<m>target requires mapping the member's internaluser_id→users.workos_id, a join on the send path.)
-
Fan-out: recipients draw a new bubble; the sender's app finds its pending "sending…" bubble with this
clientMessageId(step 1) and updates it — attaches the realpublic_id, flips to "sent" — instead of drawing a second. The confirmation arrives twice (REST reply + this publish, both carrying the id); the first updates the bubble, the second is seen as a duplicate and ignored.Example: Alice sends "hi". Her app instantly shows a grey "sending…" bubble (only the
clientMessageId). The backend saves it and fans out. Bob has never seen this id → he draws a fresh "hi" bubble. Alice gets the same fan-out back, recognises the id as her pending bubble → turns it "sent" with the real id. Alice hears the confirmation twice (REST reply + publish) — first turns it "sent", second is ignored as a duplicate. -
Publish failure is tolerated — log a WARN, still return
200. The message is committed to Postgres (the source of truth), so it's durable regardless; the publish is only the live "show it now" transport. It resurfaces on the next REST read — when any client opens/reopens the conversation, or on therecovered: falsefallback after a reconnect (Flow C). A Centrifugo outage costs latency, never correctness, and must never fail a send.Honest limit of the after-commit publish. Narrow window: if the backend crashes between commit (step 5) and publish (step 6), or the publish fails (we only WARN), the message lands in Postgres but never enters Centrifugo — not even its history buffer. A recipient who stays connected with the conversation open won't get it live and won't get it via Flow C recovery — only on the next REST refetch. Accepted for the POC; the documented fix is the transactional outbox (blue note below).
Insert the message and an outbox row (chat_outbox) in the same transaction, and let Centrifugo's PostgreSQL outbox consumer publish committed rows (LISTEN/NOTIFY keeps it near-instant). This makes the publish atomic with the write, closing the crash-between-commit-and-publish window. Runs on a single node + memory engine — independent of the Redis/multi-node upgrade; adopt when that window becomes unacceptable. Ref: https://centrifugal.dev/docs/tutorial/outbox_cdc
Why
publish+broadcastand not one call:broadcastsends one identical payload to many channels. The conversation needs the full message; the personal channels need a different lightweight ping. Two payloads ⇒ two commands ⇒ wrapped inbatchfor a single HTTP round-trip. (Verified: §13.)
Flow C — Recovery (reconnect / missed messages)
- On reconnect,
centrifuge-jsrequests recovery on eachchat:subscription. If the gap is within the channel's history window, Centrifugo replays the missed publications and reportsrecovered: true. - If
recovered: false(gap too large, or Centrifugo restarted — memory-engine history is lost on restart), the FE REST-refetches the conversation tail (chat/messages/list) and reconciles. This is the designed fallback, not an error.
Flow diagram — all three flows, end to end
4. Decisions (with rationale)
| # | Decision | Why |
|---|---|---|
| A1 | chat: channels authorized by a per-channel subscription token, not a subscribe proxy. | Centrifugo best practice. A subscription JWT bakes the channel + a short exp and is verified by signature, so it "survives a massive reconnect scenario" and reduces load on the session backend (§13). A subscribe-proxy webhook fires an uncached HTTP call on every subscribe — a Centrifugo restart stampedes the backend. Org + membership are checked once, at mint. |
| A2 | Personal channel is a user-limited channel user:<org>:<user>#<user> the client subscribes to (no token, no channels claim). | Centrifugo's documented personal-channel pattern: user-limited channels are "ideal for… personal messages to a single user" and work "without requiring a subscription token" — Centrifugo enforces owner = sub from the connection JWT (§13). We don't use the built-in subscribe_to_user_personal_channel feature because its fixed namespace:#userID format can't carry our <org> prefix (mandatory for cross-org isolation). The # gives ownership; the <org> prefix gives tenancy. Revoked via Disconnect (the sub dies with the connection). |
| A3 | Org segment in every channel name, built from boundary ids only. | Cross-org tenancy boundary: a user spans orgs, so an org-less personal channel leaks across them. Org is verified at mint and baked into the channel string, so even a leaked token for org A can't address org B. Internal BIGINT ids never enter channel names (boundary rule, §3). |
| A4 | No subscribe-proxy webhook anywhere in the POC. | Token auth removes the need. Bonus: Centrifugo never calls back into the backend, eliminating an entire inbound public surface (the old shared-secret webhook is gone). The subscribe proxy is reserved for a future high-fan-out ephemeral status: channel only (§14). |
| C1 | Idempotency is enforced by the existing DB unique index (conversation_id, sender_id, client_message_id) (migration 000028) — no app-level store. | The constraint already exists and chat_messages is not partitioned, so the DB can enforce it directly. INSERT; on 23505 treat as a replay and return the existing row. This is simpler and more robust than an in-memory map (survives pod restarts) and matches the colleague's design. (Supersedes the earlier in-memory-store plan, which existed only to work around partitioning we are not doing.) |
| C2 | Centrifugo's API/admin/metrics/health on internal_port; the publish API is reached only over the internal network with an API key. | The genuinely sensitive surface (publish-to-anything, admin UI) gets real network isolation. With no subscribe proxy there is no public webhook to secret-gate at all. |
| C3 | (Superseded by A1/A4.) org_id previously rode in meta for the subscribe-proxy cross-org check. With token auth the cross-org check happens at mint time (backend has the session org) and the org is baked into the channel name. meta.workos_org_id is kept only for logging/observability; authz no longer depends on it. | |
| C4 | Dual delivery = publish + broadcast inside one batch call. | One broadcast carries one payload; we need two (full message vs ping). batch combines both commands in one HTTP request. |
| D1 | DMs + groups, text only. | Grill decision. |
| D2 | Read cursor stays as the existing last_read_message_id BIGINT (already on chat_conversation_members); API speaks public_id. | Unread counts are range counts on the monotonic internal id; storing a public_id cursor would force a resolve-join per inbox row. Count excludes own + soft-deleted messages. Column already exists — no migration. |
| D3 | Presence/status is OUT of POC scope; keep the existing user_presence system; add NO Centrifugo presence. | Centrifugo built-in presence/join-leave reports raw connectivity, which leaks "appear-offline." The app already has a Teams-like status model (online/away/busy/appear-offline, auto-away via heartbeat) on user_presence (migrations 013/016), with a clear-stale-presence cron — untouched by the POC. We do not add an org: presence channel or in-conversation presence. Real-app integration reads the existing system; the future realtime path is interest-based status:<org>:<user> channels (§14). |
| D4 | Clients never publish to Centrifugo. Typing is a client POST → backend publishes to chat:<org>:<conv>. | A single invariant ("all publishes are server-side") is simpler and safer than "clients may publish only to typing:". With no proxy there's nothing to attach identity to a client publish anyway; backend-publish keeps identity authoritative (from the session, not the payload). Typing rides the same chat: channel (no separate typing: channel → half the subscription-token mints). Throttled client-side ≥3s; published directly, never through the outbox (ephemeral). |
| D5 | Memory engine, single node. Redis is the documented next step. | Grill decision. Memory engine supports history + recovery on one node (verified); lost only on restart, which Flow C handles. |
| D6 | New dedicated CENTRIFUGO_TOKEN_HMAC_SECRET (HS256); exp chosen by us, documented as tunable. | Don't reuse SUPABASE_JWT_SECRET. HS256 matches existing token signing; no RSA in the codebase. Centrifugo prescribes no exp value — "choose judiciously / a reasonable expiration time" (illustrative 1h example, 25s grace) — so connection exp ≈ 1h and subscription exp deliberately short (minutes) is our engineering trade-off (shorter = faster revocation, more getToken calls), not a Centrifugo recommendation (§13). |
| D7 | GORM + pkg/utils.TransactionManager. | The codebase data layer is GORM, not pgxpool. Code wins over research docs. |
| D8 | Hybrid ids: internal BIGINT id for keys/joins, external public_id (UUID/v7) + WorkOS TEXT ids at the boundary. | Already the schema's convention (public_id added in 000024/000027/000029; org_id/user_id/sender_id are BIGINT FKs). The POC honors it via the boundary rule (§3). |
| D9 | People listing reuses the existing org-members endpoint. | Not asserting a chat-specific people endpoint exists; add one only if the existing one doesn't fit. |
| D10 | Group admin leaving → auto-transfer to oldest member; last member leaving just leaves. | No destructive delete path in the POC; an empty conversation is invisible (no members ⇒ nobody can list/subscribe) and harmless. Uses the existing role ∈ ('admin','member') values. |
| D11 | One chat_conversations table with type ∈ ('dm','group','channel') (POC uses dm/group); a new dm_key makes DMs unique per pair per org. | Same pair in two shared orgs = two DM threads (org isolation — correct). parent_message_id reserved for future threads; type='channel' reserved for org-wide (§14). |
5. Centrifugo deployment on Sevalla
- Pin a specific v6 image tag (e.g.
centrifugo/centrifugo:v6.x.y— pick the exact patch at deploy time; neverlatest). - Ports:
- Public: client WebSocket (
port, e.g.8000) → exposed via Sevalla ingress as WSS. - Internal:
http_server.internal_port(e.g.9000) for/api,/admin,/metrics,/health→ not exposed publicly; reachable only by the backend over the Sevalla internal network.
- Public: client WebSocket (
- Admin UI password-gated, on the internal port only.
- Health check enabled (
/healthon internal port) for Sevalla. - Env / config (key names verified; confirm exact v6 nesting at M1):
client.token.hmac_secret_key=CENTRIFUGO_TOKEN_HMAC_SECRET— verifies both the connection token and the subscription tokens (same secret; subscription tokens are just JWTs with achannelclaim).client.subscription_token.hmac_secret_key— set if v6 wants the subscription-token secret configured separately; otherwise it falls back to the connection-token secret. Confirm at M1.client.allowed_origins— theOrigin(s) Centrifugo accepts on the WS handshake. ⚠ Electron gotcha: the desktop renderer's origin is not a normalhttps://app.…domain (it'sfile://, a packaged custom scheme, or a dev-serverhttp://localhost:<port>). If this list doesn't match what Electron sends, every connection is rejected. Confirm the realOriginat M1 (step 2) and list exactly those values.channel.namespaces— define thechatandusernamespaces (§5.1).allow_user_limited_channels: trueon theusernamespace (a channel/namespace option, not a top-levelclient.*key) ← required for the personal channel#.- server API key for link 3.
http_server.internal_port,http_server.internal_address.engine.type=memory(POC).
- Backend env additions:
CENTRIFUGO_TOKEN_HMAC_SECRET,CENTRIFUGO_API_ENDPOINT(internal),CENTRIFUGO_API_KEY. (No subscribe-proxy secret — there is no proxy.)
For a 2nd node set engine.type=redis + engine.redis.address (Sevalla Redis) — two env vars, zero code change (Redis through the relevant scaling stage, not NATS — NATS loses history/recovery). The subscription-token model needs no change to scale out (tokens are stateless), which is part of why A1 was chosen.
5.1 Namespaces
| Namespace | Subscribe auth | History | Presence | Client publish | Notes |
|---|---|---|---|---|---|
chat: | subscription token (org + membership checked at mint) | on (e.g. 100 msgs / 24h), recovery on | off (D3) | no (D4) | chat:<org>:<conv>. Carries messages and typing. |
user: | user-limited (#) — client subscribes; Centrifugo enforces owner = sub (no token) | small (e.g. 10 / 1h) | off | no | user:<org>:<user>#<user>; namespace sets allow_user_limited_channels: true. Inbox pings. |
org: (org-wide broadcast, server-side via connection token) and status: (per-user presence, interest-based, the one place a subscribe proxy returns — high-fan-out ephemeral, no token mint per viewer) land with the org-wide and rich-presence features (§14). role: for role-based access (§14). None are in the POC.
6. Backend changes
All new chat code follows the existing feature-module layout (api / service / module / models / mocks) and the existing central route-registration convention.
- Platform client
internal/platform/centrifugo/— thin wrapper over the Centrifugo HTTP API:Publish,Broadcast,Batch,Unsubscribe,Disconnect(the last two for revocation — item 6). Mirrorsinternal/platform/livekit. - Connection-token endpoint —
POST /api/v1/realtime/centrifugo-token(new handler inrealtimeor a newchatfeature). Mints the connection JWT in Flow A step 2. Mirrors the existing LiveKit/Supabase token generators (HS256). - Subscription-token endpoint —
POST /api/v1/chat/subscribe-token { conversationPublicId }. Resolvepublic_id→ conversation; cross-org deny (conversation.org_id != session.org_id → ErrCrossOrgAccess); membership deny (caller not inchat_conversation_members); else mint a subscription JWT{ sub: workosUserId, channel: "chat:<org>:<conv>", exp: short, iat, jti }. This is the only authorization point forchat:subscribes — there is no proxy. - Send service — Flow B. Authz → cross-org → rate-limit → validate (≤4000) →
INSERT(unique index = idempotency; catch23505→ return existing,isReplay) → commit → batchpublish(chat:<org>:<conv>)+broadcast(user pings)after commit → return200(publish failure = WARN only). - Typing publish —
POST /api/v1/chat/typing { conversationPublicId }(D4): membership check (cheap), then backendpublish("chat:<org>:<conv>", <typing event with server-resolved identity>). Ephemeral, never persisted, never via outbox. - Revocation of live access — a subscription token, once minted, is valid until its
exp, and Centrifugo serves re-subscribes from the cached token without calling the backend. So when access is taken away, kill the live subscription explicitly, after committing the change:- On member removal / leave (M3):
Unsubscribe(workosUserId, "chat:<org>:<conv>"). This kills the live sub immediately; the short subscription-tokenexpis what closes the re-subscribe-with-cached-token window (on nextgetTokenthe backend re-checks membership and refuses). - On ban / cross-org change:
Disconnect(workosUserId)— kills the connection and with it the personal-channel subscription.Unsubscribeon the personal channel alone wouldn't stick (the client re-subscribes on reconnect), so a ban must drop the connection and the backend must then refuse to mint a fresh connection token. - Idempotent + best-effort (log on failure); the membership row is the source of truth, so a missed call only delays cutoff to the next token refresh.
- On member removal / leave (M3):
OSS has no revoke-by-jti; PRO adds token revocation by jti/iat and user blocking (instant, enabled when JWTs carry jti/iat — which ours do). Adopt if the short-exp window ever proves too loose. Refs: https://centrifugal.dev/docs/pro/token_revocation , https://centrifugal.dev/docs/pro/user_block
- Endpoints (logical; register per existing convention):
chat/conversations/open-dm,create-group,list,by-id,add-member,remove-member,rename,leavechat/messages/send,list,edit,deletechat/conversations/read-cursorchat/typingrealtime/centrifugo-token,chat/subscribe-token- people listing → reuse existing org-members endpoint (D9).
These replace the PostgREST RPCs being decommissioned (
get_recent_dms,get_conversation_messages,update_read_cursor, …) — same behavior, now in Go (§7).
- Rate limit — in-process token bucket (e.g. burst 10, 1 msg/s) for the POC; swap to Redis rate when pods > 1. Replaces the old
trg_check_message_rate_limit/trg_check_global_rate_limittriggers (§7).
Limit the high-amplification cases (org-wide channels, large groups) harder than a 1:1 DM, token bucket keyed (user_id, scope). Matters once org-wide channels exist; the flat per-user bucket is fine for POC DMs/groups.
- Scheduler — no new chat job. The existing
clear-stale-presencecron stays (it's the presence system, kept as-is). No partition top-up (the table isn't partitioned). No dedup-cleanup (idempotency is the DB unique index). - Observability — Prometheus metrics (existing
pkg/metrics): publish latency, publish failures, subscription-token mints + denials withErrCrossOrgAccessas its own label.
7. Database schema — EXTEND the existing tables
The chat tables already exist (migrations 000011–000031) and are ~90% ready. This section is the verified inventory of what's there, the two columns we add, and the PostgREST-era layer we decommission. We keep all tables and their data.
7.1 What already exists (verified against the migrations)
| Table | Key columns already present | Verdict |
|---|---|---|
private.chat_conversations | id BIGSERIAL PK, org_id BIGINT FK, type CHECK('dm','group','channel'), name, description, created_by BIGINT FK, is_archived, created_at/updated_at, public_id UUID NOT NULL UNIQUE (000024) | Reuse as-is + add dm_key (§7.2). |
private.chat_conversation_members | conversation_id BIGINT FK, user_id BIGINT FK, org_id BIGINT FK, role CHECK('admin','member'), muted_until, last_read_message_id BIGINT FK (read cursor, D2), joined_at, PK (conversation_id,user_id) | Reuse as-is. No change. |
private.chat_messages | id BIGSERIAL PK, conversation_id/org_id/sender_id BIGINT FK, parent_message_id (threads, reserved), content TEXT, reply_count, last_reply_at, is_edited/is_deleted (soft delete), created_at/updated_at, public_id UUID NOT NULL UNIQUE (UUIDv7, 000027/029), client_message_id UUID + unique (conversation_id,sender_id,client_message_id) (000028) | Reuse as-is + add kind (§7.2). Not partitioned — kept plain. |
private.chat_reactions | (message_id,user_id,emoji) PK | Exists; out of POC scope. No change. |
private.chat_attachments | id, message_id, org_id, file_name, file_size, content_type, storage_key | Exists; out of POC scope. No change. |
private.user_presence | user_id PK, org_id, status SMALLINT (0/1/2 + app-side 3=busy), manual_status, last_seen_at, heartbeat_at; clear-stale-presence cron | The existing status system (D3). Keep entirely untouched. |
Key facts that shape the POC, all confirmed in the migrations:
- Idempotency is already DB-enforced (000028) → C1 (no app store).
- Hybrid ids already exist (
public_idon conversations + messages) → D8, the boundary rule. chat_messagesis not partitioned → no partition machinery,public_idis a plain unique key.- Soft delete is
is_edited/is_deletedbooleans (not timestamps) — the send/edit/delete service uses these. org_id/user_id/sender_idare internalBIGINTFKs — channel names/JWTs use boundary ids, mapped at the edge (§3).
7.2 What we add (one small migration 000032)
-- DM-pair uniqueness, org-scoped (D11). Guards the check-then-create race.
ALTER TABLE private.chat_conversations
ADD COLUMN dm_key VARCHAR(255); -- '<orgId>:<minUserId>:<maxUserId>'; NULL for non-DM
CREATE UNIQUE INDEX idx_chat_conversations_dm_key
ON private.chat_conversations(dm_key)
WHERE dm_key IS NOT NULL;
-- System messages for lifecycle events ("X added Y", "renamed to Z") in groups.
ALTER TABLE private.chat_messages
ADD COLUMN kind VARCHAR(10) NOT NULL DEFAULT 'user'
CHECK (kind IN ('user','system')); -- 'system' excluded from unread + rate limit
That's the whole additive change for the POC. Notably not added (deliberate scope cuts, see blue note): visibility, history_mode, first_visible_message_id, chat_outbox. Content length is validated in Go (≤4000); we add no DB CHECK (keeps edits/migrations cheap and matches the app being the authority).
chat_conversations.visibility(private/org) +history_mode(all/since_join), with a CHECK thatsince_joinonly pairs withprivate. Backfilldm→private/all,group→private/since_join,channel→org/all. Drives org-wide channels and "new members don't see old history."chat_conversation_members.first_visible_message_id BIGINT(FK →chat_messages.id, NULL = full history), snapshotted at add-time — makes since-join a pure UPDATE.private.chat_outbox(id,method,payload JSONB,created_at,LISTEN/NOTIFYtrigger) for the transactional-outbox publisher. In the POC, group members see full history and there are no org-wide conversations — these stay unbuilt.
7.3 What we decommission (migration 000033 — the PostgREST → Go cutover)
The chat tables were built for the PostgREST/Supabase-Realtime era: the frontend wrote rows directly and triggers/RPCs/RLS did the work. Go now owns all of it, so that procedural layer is removed. Sequencing is a correctness precondition (§10, M0): this migration must ship before or together with the Go send endpoint — trg_enforce_message_defaults overwrites sender_id with current_user_id() = NULL on a GORM insert → NOT NULL violation; the rate-limit triggers would misfire.
Drop (all on private.chat_messages unless noted):
- Triggers + their functions:
trg_enforce_message_defaults,trg_check_message_rate_limit,trg_check_global_rate_limit,trg_update_parent_reply_count,trg_notify_new_message,trg_touch_conversation. (Defaults, rate-limit, reply-count, therealtime.sendfan-out, and conversationupdated_attouch all move into the Go send service.) - Chat RPCs:
update_read_cursor,get_recent_dms,get_conversation_messages,get_recent_dm_presence(and the 000030/000031 public-id variants, 000015 dashboard reconciliation). Replaced by Go endpoints (§6.7). (get_recent_dm_presenceis a presence-flavored RPC but it joins the chat tables to find DM peers, so it belongs to the chat sidebar — the Go conversation-list endpoint returns each peer's presence inline by readinguser_presencedirectly.) - RLS policies +
authenticatedgrants onchat_conversations,chat_conversation_members,chat_messages. The backend authorizes now; the FE no longer touches Postgres directly. realtime.messagesbroadcast policy — rewrite, do NOT drop. The liveBroadcast channel accesspolicy (migration 000015, which replaced the 000011 §8b one) is shared: it authorizes both the chatuser:<org>:<user>topic and the officeroom-presence:<org>:<room>topic. Remove only the chat (user:) branch; keep theroom-presence:branch and theRoom members can send presence broadcastsINSERT policy — they belong to the out-of-scope office presence.- Supabase Realtime publication: remove
chat_messagesandchat_conversation_members(live push is Centrifugo's job now).
Explicitly KEPT (do not decommission):
- The
user_presencetable, itsupdate_presencewrite RPC, theclear-stale-presencecron, its RLS/grants, and its Supabase Realtime publication membership — this is the existing office-dots presence system, out of POC scope and untouched (D3). Its full architecture, and the constraints it imposes on this teardown, are documented in §7.4. Only the DM-sidebar RPCget_recent_dm_presenceis decommissioned (above); the office-tile dots run on a separate path and are left as-is. - Helper functions
current_user_id()/current_org_id()— definitely kept: used by both the retainedupdate_presenceRPC and the retainedroom-presenceRLS policy (verified).
The matching
dbsecuritytests (internal/database/dbsecurity/— rls/grants/triggers/rpc) assert exactly what's being removed; rework them in the same PR as000033.
7.4 The existing presence system — verified, and why the POC leaves it alone
The away / busy / free / "appear-offline" dot on every office tile is an existing, working feature that predates chat and is independent of it. We mapped exactly how it is built — backend migrations plus the qubital-apps frontend (packages/app-shared/src/supabase/supabasePresenceClient.ts) — because it is the single biggest constraint on the Supabase teardown: chat can move to Centrifugo, but this feature cannot, and naively "removing Supabase" would silently break it. It therefore stays 100% untouched by the POC (D3). Here is the whole picture, so the teardown can be surgical rather than guesswork.
How it works today, end to end:
-
Status truth — the
user_presencetable. One row per user:status(0=offline, 1=online, 2=away, 3=busy),manual_status(the user's explicit override —appear-offlineismanual_status=0),last_seen_at,heartbeat_at. Effective status =COALESCE(manual_status, automatic_status). This is durable truth: it survives reconnects, restarts, and sessions until cleared. -
Status write — the
update_presencePostgREST RPC. All the status logic runs client-side insupabasePresenceClient: a heartbeat plus DOM activity listeners and Electron system-idle time drive auto-away after a timeout;ZoneBusyManagerforces busy when the user is behind a closed door in the office; manual choices (online/away/busy/appear-offline) take precedence, with a grace period so a manual "away" isn't instantly undone by a mouse move. Each resolved status is persisted by calling the Supabase RPCupdate_presence(p_status, p_manual_status, p_last_activity_at), which relies on thecurrent_user_id()/current_org_id()helpers. Aclear-stale-presencepg_cron job flips automatic status to offline whenheartbeat_atgoes stale — and never touchesmanual_status. -
Live distribution — a Supabase Realtime broadcast channel. For people in the same office room,
subscribeToRoomPresence()opens a Supabase Realtime channelroom-presence:<org>:<room>; each member broadcasts its ownstatus_changeevents on it (emitStatusChange()→httpSend), so the tile dots update near-instantly. On (re)subscribe it seeds from a directuser_presenceread (getRoomMembersPresence), and the DM-sidebar dots are polled separately via theget_recent_dm_presenceRPC. Authorization for the broadcast channel is therealtime.messages"Broadcast channel access" RLS policy (migration000015) — which is shared with chat (it also authorizes the chatuser:<org>:<user>notification topic), plus a companionroom-presenceINSERT policy.
Why this constrains the POC — four concrete consequences:
- Supabase Realtime cannot be fully abandoned. It is the live transport for the office dots — and it is Supabase Realtime broadcast, not LiveKit (this resolves the long-standing open question). Pulling it out wholesale would break presence on every tile. So the POC removes Supabase Realtime only for chat; the office presence keeps using it.
- The broadcast RLS policy is rewritten, not dropped (§7.3). Because one
Broadcast channel accesspolicy serves both chat and room-presence, the decommission migration must surgically strip only the chat (user:) branch and keep theroom-presencebranch + theroom-presenceINSERT policy. Dropping the whole policy would deny the office broadcasts. - The helper functions and the Supabase auth surface stay.
current_user_id()/current_org_id()are called by both the retainedupdate_presenceRPC and the retained room-presence RLS policy, so they remain — which is also whySUPABASE_JWT_SECRETand the PostgREST path stay alive for presence even though chat stops using them. - The frontend teardown is surgical (M0).
supabasePresenceClientis currentlybind()-ed bySupabaseChatManagerand shares the samesupabaseConnection;RoomPresenceManager/ZoneBusyManagerdepend on that client. So we decouple presence from the chat manager — re-home ownership of the Supabase connection and thebind()call — rather than deleting the connection along with the chat code. Delete naively and the dots go dark.
What the POC does change about presence: only the DM-sidebar dots (the small status indicators next to conversations) are re-sourced — the new Go conversation-list endpoint returns each peer's status inline by reading user_presence, replacing the get_recent_dm_presence poll. The office-tile dots (the room-presence broadcast path) are a different code path and are left exactly as they are.
If we ever migrate the office presence off Supabase Realtime, the target is the interest-based status:<org>:<user> model (§14): Go heartbeat write → Postgres truth → one publish per change → Centrifugo delivery to only the clients currently rendering that user (sidebar peers, on-screen room avatars, open profile card), via a refcounted subscription registry and a subscribe-proxy same-org check. Built-in Centrifugo presence is still not an option (it reports raw connectivity and leaks appear-offline). Only after that migration could Supabase Realtime be fully retired — out of scope here.
8. Frontend integration (centrifuge-js)
- Connection manager singleton: one Centrifugo connection per app session; the connection
getTokenwired torealtime/centrifugo-tokenfor silent refresh. Lifecycle: connect after auth; on org-switch or sign-out, disconnect and reconnect with a fresh token (the new token'schannelsrebind the session to the new org — never reuse a connection across orgs); on background/foreground, let the SDK auto-reconnect. - Electron token wiring:
centrifuge-jsruns in the renderer, but the WorkOS session cookie lives in the main process (apiClient). So the token-fetchinggetTokencallbacks — the connection token and eachchat:subscription token — must fetch through the existing IPCRequestActionflow: the main process makes the authenticated call (realtime/centrifugo-tokenorchat/subscribe-token), the renderer receives the token over IPC. The renderer must not call these endpoints directly (no cookie). (The personal channel needs no token — it's user-limited — so it has nogetToken.) - Subscriptions (steady state ~1–3): the always-on personal
user:<org>:<user>#<user>(subscribed at connect, user-limited, no token) and the open conversation'schat:<org>:<conv>(subscribed when the conversation opens — itsgetTokenmints the subscription token; unsubscribed when it closes). The personal channel persists for the whole session. - Open-conversation join handshake (avoids a message-loss race): opening a conversation does two things — REST-fetch the tail and subscribe to
chat:<org>:<conv>. Order: (1) subscribe first and buffer incoming events, (2) then REST-fetch the history tail, (3) merge by message id, deduping viaclientMessageId. Fetch-then-subscribe would lose anything arriving in the gap. On reopen of a cached conversation, the fetch in (2) is a delta — only messages after the last cached id (a cheapafter-cursor query); a warm cache with nothing new is a pure cache hit with no backend call. - Optimistic send: render keyed by
clientMessageId; reconcile on the echoedmessage.created(match byclientMessageId); onisReplayjust confirm. Keep a per-conversation mapclientMessageId → bubblefor pending sends so the echo (REST reply or live publish) updates the existing bubble instead of duplicating it; drop the entry once confirmed. - Recovery: on
recovered: false, REST-refetch the tail and reconcile (Flow C). - Sidebar + badges: unread count from the server + live
chat.activitypings on the personal channel re-badge the inbox. The ping carries a lightweight preview (conversationPublicId,senderId, ≤200-char snippet, timestamp) — not the full message — so the sidebar row updates ("Alice: hey are you…") without a fetch. The full body only ever arrives viachat:or the history fetch. - Typing: on local typing,
POST /chat/typing(throttle ≥3s); render the typer's identity from the server-published event, never from local guesses. (No client publish to Centrifugo — D4.) - Presence: the POC renders online/away/busy from the existing app status system (
user_presence), exactly as today — chat adds nothing here (D3). - Security: message content is plain text end-to-end; the renderer must never interpret it as HTML — in Electron, XSS can escalate to local code execution.
9. Security model (summary)
- Connect: connection JWT (HS256), signature-verified by Centrifugo. No valid token → no connection.
- Subscribe:
user:— user-limited (#), client-subscribed with no token; Centrifugo enforces owner =subfrom the connection JWT. The<org>prefix scopes it to the session's org.chat:— a subscription token the backend mints only after a cross-org + membership check; Centrifugo verifies signature,exp, and the exact channel string on every (re)subscribe.- No subscribe proxy → Centrifugo never calls back into the backend.
- Publish: clients never publish; everything (messages, typing) is server-published over the internal API.
- Internal surface: Centrifugo
/api+/admin+/metricsoninternal_port, never public. No public webhook at all. - Cross-org invariant: the org segment is in every channel name and is verified at mint (
conversation.org_id != session.org_id → ErrCrossOrgAccess). A leaked subscription token for org A cannot address org B's channel. Mandatory test: a user who is a member via a different org is denied a token. - Revocation: subscription tokens are bounded by a short
exp; on membership loss the backend callsUnsubscribe(kills the live sub) and the shortexpcloses the cached-token re-subscribe window; bans callDisconnect(also how the personal channel is revoked). - Content rendering (client): message content is rendered as plain text, never HTML — in Electron an XSS can escalate to local code execution (§8.10). This is the one security property that lives entirely on the frontend.
- Token lifetime: short
exp+ silent refresh bounds a leaked token's window; on org-switch/sign-out the backend stops minting and the client reconnects, so access can't outlive the session. - Out of the POC threat model (stated for honesty): no end-to-end encryption — content is plaintext at rest in Postgres and relies on TLS/WSS in transit; per-message identity is trusted from the connection's backend-minted JWT, not re-verified on every publish.
10. Implementation plan (tracer-bullet slices)
Each milestone is a vertical slice — frontend → backend → Centrifugo → DB, ending in something runnable. Ship in order.
| Milestone | Deliverable (vertical slice) |
|---|---|
| M0 | Schema EXTEND (000032) + decommission (000033); old Supabase-Realtime chat code paths removed; app boots on the extended schema with chat write/read still to come. |
| M1 | Tracer bullet: Centrifugo pod on Sevalla + platform client + connection-token endpoint + Electron round-trips a backend publish end-to-end (personal channel). |
| M2 | Schema in use + DM send/receive over the real path (Flow B) + DB idempotency + the real subscription-token endpoint (cross-org + membership). |
| M3 | Groups: create, add/remove member, rename, leave (admin→oldest transfer); membership system messages. |
| M4 | Realtime niceties: typing (backend-published on chat:), read cursors + badges, edit/soft-delete. |
| M5 | Dogfood hardening: recovery drills, restart drill, rate-limit, metrics, runbook. |
M0 — Schema EXTEND + decommission (no teardown)
Goal: the database and codebase stop depending on the PostgREST/Supabase-Realtime chat path, while keeping the tables, their data, and the presence system.
- Inventory confirmed (done in §7 against migrations
000011–000031). The tables are kept. - Migration
000032— addchat_conversations.dm_key(+ partial unique index) andchat_messages.kind(§7.2). Realup/down; runup→down→uplocally. - Migration
000033— decommission the PostgREST-era triggers, chat RPCs, RLS/grants, and the chat-tables Realtime publication (§7.3). Keep the presence path. ⚠ Must ship before/with the Go send endpoint (M2) — §7.3. Reworkinternal/database/dbsecuritytests in the same PR. - Remove old chat code paths — surgically. Remove the Supabase-Realtime chat client (
useSupabaseChat, the chat side ofSupabaseChatManager) and the direct-PostgREST chat read/write paths. ⚠ Critical coupling (verified inqubital-apps):supabasePresenceClientisbind()-ed bySupabaseChatManagerand shares the samesupabaseConnection, andRoomPresenceManager/ZoneBusyManagerdepend on it. So do not delete the shared Supabase connection or the presence client — decouple presence from the chat manager (re-home the connection /bind()ownership) so the office dots keep working once chat moves to Centrifugo. - Verify
SUPABASE_JWT_SECRET/ the Supabase token generator usage — grep the codebase; remove only if chat was the sole consumer, otherwise keep (presence/other features may use it). Done when: CI green, the app boots on the extended schema, no dangling imports.
M1 — Tracer bullet (thinnest end-to-end pipe)
Goal: prove a backend publish reaches the Electron client over the connection token before any chat logic. This is where v6 config is nailed down.
- Deploy Centrifugo on Sevalla per §5: pinned v6 image, public WSS,
internal_portfor/api/admin/metrics/health. Health check on internal/health. - Confirm exact v6 config nesting (the one open M1 risk):
client.token.hmac_secret_key, the subscription-token secret key (separate vs shared — §5),allow_user_limited_channels: trueon theusernamespace, thechat/usernamespaces,http_server.internal_port. Also capture the Electron WSOriginand setclient.allowed_originsto exactly that — a mismatch rejects every connection, so verify this before anything else. - Add backend env (§5):
CENTRIFUGO_TOKEN_HMAC_SECRET,CENTRIFUGO_API_ENDPOINT,CENTRIFUGO_API_KEY. Wire intointernal/app/env-setup.go. - Platform client
internal/platform/centrifugo/:Publish,Broadcast,Batchover the internal API + a test against it. - Connection-token endpoint
realtime/centrifugo-tokenbehind existing auth: mint HS256 JWT withsub=workosUserId,exp,meta.org(identity only — nochannelsclaim). - Electron: connection-manager singleton with
getToken→ token endpoint; connect; subscribe to the user-limited personal channeluser:<org>:<user>#<user>(no token); have the backendPublishto it and see it arrive. Done when: a backend publish shows up live in the Electron client, token refresh works, the user-limited subscribe is accepted, and/apiis unreachable from the public internet.
M2 — Real DM send/receive (the core loop)
Goal: Flow B for real, with the real token security model.
- GORM models for the (now extended)
chat_conversations,chat_conversation_members,chat_messages— internalBIGINTids +public_id+client_message_id+kind. - Subscription-token endpoint (§6.3): resolve
public_id→ conversation → cross-org deny → membership deny → mint{sub, channel: chat:<org>:<conv>, exp short, iat, jti}. - Send service (Flow B, §6.4) via
pkg/utils.TransactionManager: authz → cross-org → rate-limit → validate(≤4000) → INSERT (catch23505→ existing row,isReplay) → commit → batchpublish(chat:<org>:<conv>)+broadcast(user pings)after commit →200. - Open-DM + list + messages/list endpoints;
dm_keymakes the pair unique per org (D11). These replaceget_recent_dms/get_conversation_messages. - Electron DM view: optimistic bubble keyed by
clientMessageId, reconcile on echoedmessage.created, render as plain text (§8.10), subscribe-first join handshake (§8.4), per-subscriptiongetToken→chat/subscribe-tokenover IPC. - Tests (mandatory): cross-org denial — a user in org A is refused a subscription token (and thus a subscribe) for org B's conversation even if listed as a member (the single most important test). Plus idempotent-retry (same
clientMessageId→ one row,isReplay). Done when: Alice DMs Bob, it persists, Bob receives it live, a retry creates no duplicate, cross-org is denied.
M3 — Groups
- Endpoints:
create-group,add-member,remove-member,rename,leave(§6.7). - Admin-leave rule (D10): admin leaving auto-transfers to the oldest member; last member leaving just leaves (empty conversation is harmless/invisible). Uses existing
role ∈ ('admin','member'). - Revoke live access on removal (§6.6): after committing a
remove-member/leave, callUnsubscribe(workosUserId, "chat:<org>:<conv>"). - Membership events + system messages: insert a
kind='system'message ("X added Y", "renamed to Z") in the same transaction andpublishit onchat:<org>:<conv>so open clients update live;conversation.addedon the new member'suser:channel (they aren't onchat:yet) andconversation.removedsymmetrically. - Reuse the existing org-members endpoint for the people picker (D9). Done when: groups create, members add/remove, rename, ownership transfers on admin-leave, and system lines render.
M4 — Realtime niceties
- Typing (D4): client
POST /chat/typing(throttle ≥3s) → backendpublishonchat:<org>:<conv>with server-resolved identity. Notyping:channel, no client publish. - Read cursors + unread badges (D2):
read-cursorendpoint writeslast_read_message_id; unread = range count on the monotonicid(exclude own + soft-deleted +kind='system'); livechat.activitypings badge closed conversations. - Edit / soft-delete own message: set
is_edited/is_deleted(+updated_at); emitmessage.updated/message.deleted. Done when: typing, badges, and edit/delete all work live.
presence/online dots (D3 — kept on the existing system), reactions, attachments, threads, org-wide channels. Their tables/columns already exist or are reserved; they're future features (§14), not POC work.
M5 — Dogfood hardening
- Recovery drill: disconnect mid-conversation, reconnect → verify replay (
recovered: true) and the REST tail-refetch when the gap is too large (Flow C). - Restart drill: restart Centrifugo → memory-engine history loss degrades gracefully to the REST fallback, no message loss (DB is source of truth).
- Rate-limit: in-process token bucket (§6.8) — verify a spammer is throttled.
- Metrics (§6.10): publish latency/failures, subscription-token denials (
ErrCrossOrgAccesslabelled) — confirm they emit. - Runbook: short doc — dashboards, metrics, the Redis cutover trigger (§5 blue note). Done when: all drills pass and the runbook exists.
11. Testing
The security tests are non-negotiable — they guard the cross-org invariant.
Unit (service layer, mocked repos):
- Subscription-token authz matrix for
chat::- member, same org → token minted.
- non-member, same org → denied.
- member-via-a-different-org → denied (the mandatory cross-org case: real member, but
conversation.org_id ≠ session.org_id). The single most important test.
- Idempotency: first
sendinserts; an immediate retry with the sameclientMessageIdhits the unique index → returns the samepublic_idwithisReplay: true, no second row; six sends with different ids → six rows. - Rate limit: burst over the bucket is rejected; refills after the window.
- Read-cursor / unread count: range over
id > last_read_message_id, excludes own, soft-deleted, andkind='system', 0 when caught up. - Admin-leave transfer (D10): admin leaving transfers to the oldest member; last member just leaves.
Repository (real Postgres, testcontainers):
- Idempotency unique index: the same
(conversation_id, sender_id, client_message_id)inserted twice raises23505; differentclient_message_idinserts cleanly. dm_keyuniqueness: opening the same DM pair twice in one org returns the same conversation; the same pair in a different shared org creates a separate one (D11).- Read path: latest-N-messages query orders by
id DESCand uses the existing(conversation_id, created_at DESC)index — sanity-check viaEXPLAIN.
Integration (testcontainers: Centrifugo + Postgres + backend):
- Happy path (Flow B):
send→ row in Postgres →publish+broadcastfire → a subscribed client receivesmessage.createdcarrying theclientMessageId. - Subscription-token verification: Centrifugo accepts the minted token for
chat:<org>:<conv>and rejects a token whosechannelclaim doesn't match the subscribe target. - Publish-failure tolerance: with Centrifugo unreachable,
sendstill returns200and the row is committed (publish logged as WARN). - Recovery (Flow C): drop a subscription, publish in the gap, reconnect →
recovered: true; force a gap beyond the window →recovered: false→ REST tail-refetch reconciles.
Decommission tests (same PR as 000033):
- The reworked
dbsecuritysuite asserts the dropped triggers/RPCs/RLS are gone and that a GORM insert with an explicitsender_idis no longer clobbered to NULL (§7.3).
Manual drills (M5):
- 30s offline → reconnect → messages recovered (replay or REST).
- Centrifugo pod restart → memory history lost → clients refetch, no persisted message lost.
- Two real orgs → cross-org subscribe denied end-to-end (not just the unit mock).
- Public-internet probe of
/apiand/admin→ unreachable.
12. Definition of Done (dogfood bar)
- Team uses it daily for 1 week.
- 30s-offline reconnect recovers messages (replay or REST refetch).
- Centrifugo restart drill: no lost persisted messages; clients refetch.
- Send hot path p95 < 200ms — measured client-side as the send→echo delta during normal dogfood use (not during the drills, which inject latency by design).
- Cross-org denial verified with a real second org.
- Idempotent retry verified (
isReplay). - Token refresh: a session left open past the connection
expkeeps receiving live messages with no manual reconnect; subscription tokens refresh per-channel. - Groups work: create, add/remove member, rename, admin-leave transfer (D10), system lines render.
- Hostile content renders inert:
<img src=x onerror=alert(1)>displays as literal text and executes nothing. - Centrifugo
/api+/adminnot reachable from the public internet. - Old PostgREST chat triggers/RPCs/RLS decommissioned; presence path still works;
ARCHITECTURE.md+ a short runbook updated.
13. Verified facts (receipts)
Codebase (/opt/livekit-backend-go/backend):
- Migrations at
000031; chat schema spans000011–000031(see §7 for the per-table inventory). Next free numbers:000032(extend),000033(decommission). - Real column types:
chat_conversations.org_id/chat_conversation_members.user_id/chat_messages.sender_idareBIGINTFKs toorganizations(id)/users(id)— not WorkOS TEXT ids (000011). WorkOS ids map viausers.workos_id/organizations.workos_org_id. → the boundary rule (§3), D8. - Idempotency already DB-enforced:
chat_messages.client_message_id UUID+ unique(conversation_id, sender_id, client_message_id) WHERE client_message_id IS NOT NULL(000028). → C1. public_idalready present:chat_conversations.public_id(000024,gen_random_uuid),chat_messages.public_id(000027/029, UUIDv7 backfill matchingpkg/uid.New()). → D8.chat_messagesis NOT partitioned (plainBIGSERIAL PK, 000011). → no partition machinery;public_idis a plain unique key.- Soft delete is
is_edited/is_deletedbooleans (000011). → §6.4 / M4. - Presence is
user_presence(renamed fromchat_user_presencein 000013),status SMALLINT(0/1/2) +manual_status+heartbeat_at,clear-stale-presencecron every 5 min (000013/000016). → D3 (kept as-is). - The trigger hazard:
trg_enforce_message_defaultssetsNEW.sender_id := private.current_user_id()(= NULL for a GORM/non-PostgREST insert) → must be dropped before/with the Go send endpoint (000011 §4.1). → §7.3. - Token signing is HS256 (
realtime/service/generatetoken.go); existing secretSUPABASE_JWT_SECRET; no RSA. → D6. pkg/utils.TransactionManagerover*gorm.DB; two binaries (cmd/api,cmd/worker); existing 5-min and hourly tickers.
Frontend (StartingQuoTechDivision/qubital-apps):
- Office status dots run on Supabase Realtime:
packages/app-shared/src/supabase/supabasePresenceClient.tsopens aroom-presence:<org>:<room>broadcast channel and writes status via theupdate_presencePostgREST RPC; reads viauser_presenceselects +get_recent_dm_presence. The sharedsupabaseConnectionisbind()-ed bySupabaseChatManager;RoomPresenceManager/ZoneBusyManagerconsume it. → D3, §7.3, M0 (presence kept; Supabase Realtime not fully removed; surgical teardown). Resolves the room-presence-transport question: Supabase, not LiveKit.
Centrifugo v6 docs:
- Subscription token auth: a per-channel JWT (
channelclaim) verified by signature/exp/channel; with a reasonableexpit "survive[s] a massive reconnect scenario" and reduces session-backend load. → A1. (channel_token_auth) - Subscribe proxy fires an HTTP request per subscribe and does not proxy token/user-limited subscriptions. → A1/A4. (proxy)
- User-limited channels (
#): create a personal channel "without requiring a subscription token"; "ideal for… personal messages to a single user"; Centrifugo identifies the user from the connection JWT and allows the subscribe only if the id after#equalssub(specified for client subscribes). → A2 (personal channel). (channels) - Built-in personal channel (
subscribe_to_user_personal_channel) auto-subscribes a user-limited personal channel, but its fixednamespace:#userIDformat can't carry our<org>prefix — so we DIY an org-scoped user-limited channel instead. → A2. (server_subs) - Server-side subscriptions via the connection token
channelsclaim — ideal for a static channel set known at connect time. → reserved for the futureorg:channel (§14), not used in the POC. (server_subs) - No prescribed
exp: "choose theexpvalue judiciously" / "a reasonable expiration time"; illustrative 1-hour (3600) example; ~25s SDK grace. → D6. (authentication) broadcast= same payload to many channels;batch= many commands in one request. → C4.- Memory engine supports history + recovery on a single node; lost on restart. → D5 / Flow C.
http_server.internal_port/internal_addressmove/api/admin/metrics/healthoff the public port. → C2.- Server API
unsubscribe {user, channel}anddisconnect {user}for revocation. → §6.6. (server_api)
14. Future-stage pointers (production roadmap)
Everything here is out of POC scope; the POC schema/channel model is built so each lands without rework. (These mirror the blue notes inline.)
- Transactional outbox (publish reliability — independent of Redis):
chat_outboxtable + Centrifugo PostgreSQL outbox consumer; the send tx inserts message and outbox row, consumer publishes committed rows. Closes the crash-between-commit-and-publish window (Flow B step 8). Ref: https://centrifugal.dev/docs/tutorial/outbox_cdc - Redis / multi-node:
engine.type=redis+ address; rate-limit → Redis. Subscription tokens need no change. Optional pre-DB dedup via RedisSET NXto short-circuit retries before the INSERT (the DB unique index stays the source of truth). - Org-wide / broadcast conversations:
type='channel',visibility='org',history_mode='all'; anorg:<org>channel server-side-subscribed via the connection token → O(1) fan-out, no member enumeration. Addvisibility/history_modecolumns (§7.2 blue note). - Since-join history:
chat_conversation_members.first_visible_message_id, consulted only whenhistory_mode='since_join'. POC members see full history. - Rich presence (away/busy/appear-offline) over realtime: per-user
status:<org>:<user>channels with interest-based subscriptions and a subscribe proxy (the one place it earns its keep — high-fan-out ephemeral, no per-viewer token mint), status truth in the existinguser_presence. Never Centrifugo built-in presence (leaks appear-offline — D3). Until then, presence stays on the existing app system. - Roles / RBAC: org-scoped
roles/user_roles/conversation_role_grants; access = explicit member or role grant;role:<org>:<role>channels server-side-subscribed for O(members+roles) fan-out. Purely additive — no change to today's tables. - Instant token revocation: Centrifugo PRO (
jti/iatrevocation + user block). Refs: pro/token_revocation, pro/user_block. - Reactions / attachments / threads: tables already exist (
chat_reactions,chat_attachmentswithstorage_key,parent_message_id/reply_count). Attachments follow the R2 design inCHAT_BACKEND_IMPLEMENTATION.md. - Cold storage / infinite scroll: if message volume demands it, (re)introduce monthly partitioning on
chat_messagesand detach old partitions to object storage (R2/Parquet). Deliberately not done now (the table is plain — §7). - Desktop notifications (Electron): when backgrounded, the renderer + WS stay alive so
chat.activitypreviews still arrive — fire an OS notification (existingnotification/API + ElectronNotification) when the window is unfocused and the conversation isn't active. - "Last seen": a single
last_seen_atalready exists onuser_presence.