Skip to main content

POC — Centrifugo Chat Implementation

Part of the Supabase / DB Restructure Roadmap — the detailed chat phase (Phase 1) of its Event C.

Status: POC / first working version. Not production-hardened. Goal of this POC: a user types a message in the frontend chat → it is stored durably in Postgres → it is delivered live to the other participant(s) via Centrifugo. DMs and group conversations, text only. Everything ephemeral (history, recovery) runs on Centrifugo's memory engine on a single node; Redis is the documented next step, not part of the POC.

This document is the step-by-step build plan. Every decision in it was verified against the backend codebase (/opt/livekit-backend-go/backend), the actual chat migrations (internal/database/migrations/000011000031), and the official Centrifugo v6 docs. See §13 Verified facts for the receipts.

Migration strategy: EXTEND, not recreate. The existing chat tables (built migrations 000011000031) are ~90% ready for this design — verified table-by-table in §7. We keep the tables and their data, add two small columns, and decommission the PostgREST-era procedural layer (triggers, RPCs, RLS, grants) as logic moves into Go. We do not drop and rebuild the schema.

Reading the blue notes

Every blue note like this one marks a future / production enhancement that is explicitly out of POC scope. They exist so the POC schema and channel model don't paint us into a corner — each names the column, channel, or mechanism that lands later. Skip them on a first read; they are the prod roadmap, not POC work.


1. Scope

In scope (POC):

  • DMs (1:1) and group conversations, text messages only.
  • Send → persist → live receive, with optimistic UI and idempotent retries.
  • Edit own message, soft-delete own message.
  • Read cursors + unread badges.
  • Typing indicators.
  • Single Centrifugo node, memory engine.

Out of scope (POC) — designed-for, built later (see the blue notes and §14):

  • Presence / online status. The app already has a Teams-like status system (online / away / busy / appear-offline, auto-away via heartbeat) on the existing user_presence table. The POC does not touch it and does not add Centrifugo presence (which would leak "appear-offline" — see D3). Chat integrates with the existing status system when it merges into the real app.
  • Attachments / images / files (chat_attachments table already exists; R2 design in CHAT_BACKEND_IMPLEMENTATION.md is the reference).
  • Reactions (chat_reactions table already exists), threads / replies (parent_message_id reserved, unused).
  • Org-wide / broadcast conversations (the type='channel' + visibility/history_mode path — §14).
  • Multi-node Centrifugo + Redis; transactional outbox; message search, mentions, "last seen".

2. Architecture & topology

2.1 Components

ComponentRoleHosting
Electron frontend (TS/Vite + centrifuge-js)Renders chat, holds the WebSocket, optimistic UIClient
Backend (Go, Gin, GORM) — cmd/apiAuthority: REST API, mints Centrifugo connection and subscription JWTs, persists messages, publishes to CentrifugoSevalla pod
Centrifugo v6 (OSS)Transport: holds client WebSockets, fan-out, history/recoverySevalla pod
PostgreSQL (Supabase-hosted)Durable source of truth for conversations, members, messagesSupabase

Guiding principle, applied throughout: durable truth → Postgres; ephemeral state → Centrifugo (memory → Redis). The backend is the authority; Centrifugo is transport. Centrifugo is never the source of truth — if it is fully down, chat still works over REST (sends persist, history loads), only live push is lost.

2.2 Topology — three links, each independently gated

The auth model is token-based (see §3 / §6): the backend authorizes a subscription once, at token-mint time, and Centrifugo enforces the signed token thereafter. There is no subscribe-proxy webhook — Centrifugo never calls back into the backend, so there is no inbound public surface beyond the WebSocket itself.

#LinkExposureGate
1FE → Centrifugo (WebSocket)PublicConnection JWT (HS256), verified by Centrifugo signature check
2FE → Backend (REST: messages, connection-token, subscription-token)PublicExisting WorkOS session (access token in cookie)
3Backend → Centrifugo (publish / broadcast / unsubscribe / disconnect API)Internal port onlyNetwork isolation + API key

Only Centrifugo's client WebSocket endpoint is public. Its /api, /admin, /metrics, /health live on internal_port and are never publicly reachable (verified: http_server.internal_port).


3. End-to-end flows

Legend — identifiers, and the two layers they live in

The schema has two id layers, and the boundary between them is load-bearing for security. Get this right and the rest follows.

FieldLayerTypeWhat it is
users.id, organizations.id, chat_conversations.id, chat_messages.idInternalBIGINT (BIGSERIAL)The real primary keys + foreign keys. Small, sequential, fast for joins/ordering/cursors. Never cross the API or channel boundary (they'd leak/enumerate). This is what the existing tables actually use for org_id, user_id, sender_id, created_by.
workos_user_idBoundaryTEXTWorkOS user id, stable + opaque. The Centrifugo connection sub, and the <user> segment of channel names. Maps to users.id via users.workos_id.
workos_org_idBoundaryTEXTWorkOS org id of the active session. The <org> segment of channel names. Maps to organizations.id via organizations.workos_org_id.
conversation_public_idBoundaryUUIDchat_conversations.public_id. The <conv> segment of channel names and what the FE sends. Maps to chat_conversations.id.
message_public_idBoundaryUUID (v7)chat_messages.public_id. The external message id in API responses and publish payloads. Time-ordered.
client_message_idFE-generatedUUIDIdentifies one compose action (one tap of "send"). Reused only on a retry of that same send. Drives idempotency and optimistic-bubble reconciliation. Stored in chat_messages.client_message_id.

The boundary rule (security-critical): internal BIGINT ids never appear in a channel name, a JWT, or an API payload; boundary ids (WorkOS ids, public_id) never appear in a foreign key. The backend translates at the edge — it already has users.workos_id and organizations.workos_org_id for exactly this. Channel names are therefore built only from boundary ids: chat:<workosOrgId>:<conversationPublicId> and user:<workosOrgId>:<workosUserId>#<workosUserId>.

Why org is in every channel name: a user is the same identity across orgs. A personal channel without the org segment would be the same channel in every org the user belongs to → a cross-org leak. Org in the name, verified at token mint, is the core tenancy boundary. (This also answers the "do we need org in channels for the POC" question: yes — for correctness, not just for test convenience.)

The auth model in one paragraph

Two JWT types, both minted by the backend, both HS256:

  • Connection token (link 2 → used on link 1): identity onlysub = workosUserId, exp, meta. In the POC it carries no channels claim. Refreshed silently by centrifuge-js.
  • Personal channel (no token): the client subscribes to the user-limited channel user:<org>:<user>#<user>; Centrifugo enforces that the part after # equals sub — the documented personal-channel pattern, no token and no backend call.
  • Subscription token (link 2, one per conversation): authorizes one chat: channel. The backend checks org match + membership once, at mint time, then signs a token bound to the exact channel string with a short exp, plus jti/iat. Centrifugo verifies the signature, exp, and channel on every (re)subscribe — including after a mass reconnect — without calling the backend.

This is the Centrifugo-idiomatic pattern: subscription tokens "survive a massive reconnect scenario" and keep load off the session backend, unlike a subscribe-proxy webhook that fires an uncached HTTP call on every subscribe (§13).

Flow A — Connect & subscribe

  1. User is already authenticated to the backend (WorkOS session cookie).
  2. FE calls POST /api/v1/realtime/centrifugo-token. Backend (behind existing auth middleware) resolves the session → workosUserId, workosOrgId, internal ids, and mints the connection token:
    • sub = workosUserId
    • exp = now + 1h (tunable — see D6)
    • meta = { workos_org_id } (server-only; informational/observability — authz no longer depends on it, see C3)
    • signed HS256 with CENTRIFUGO_TOKEN_HMAC_SECRET.
    • (No channels claim in the POC — the personal channel is client-subscribed in step 3; the only future use of the claim is the org: channel, §14.)
  3. FE opens WSS to Centrifugo with that token. Centrifugo verifies the signature → connection accepted. The FE then subscribes to its personal channel user:<workosOrgId>:<workosUserId>#<workosUserId> — a user-limited channel (allow_user_limited_channels: true on the user namespace). No token is needed: Centrifugo identifies the user from the connection JWT's sub and allows the subscribe only if the part after # equals sub (the documented personal-channel pattern). The <org> prefix scopes it to the session's org.
  4. For each conversation the user opens, the FE subscribes to chat:<workosOrgId>:<conversationPublicId>. centrifuge-js invokes that subscription's getToken callback, which calls POST /api/v1/chat/subscribe-token { conversationPublicId }; the backend checks org + membership and returns a subscription token bound to that channel (Flow A handshake detail in §6). Centrifugo verifies the token → subscribed.
  5. The personal channel stays subscribed for the whole session (subscribed once at connect; centrifuge-js auto-resubscribes it on reconnect) and carries inbox pings when a conversation isn't open.
  6. Token refresh is automatic: the connection token refreshes via the SDK's connection getToken before exp (~25s grace); each subscription token refreshes via its per-subscription getToken before its own exp. On org switch or sign-out the backend stops minting and the client disconnects/reconnects with a fresh connection token, then re-subscribes its personal channel under the new org prefix.
Future — org-wide channels

When we add broadcast/announcement conversations (type='channel', visibility='org', §14), the connection token gains a channels claim carrying org:<workosOrgId> (and later role:<org>:<role>) as server-side subscriptions, so a single publish reaches the whole org with no member enumeration. Not in the POC: there are no org-wide conversations, so the POC connection token has no channels claim at all.

Flow B — Send a message (hot path)

  1. FE generates a client_message_id (UUID) for this compose action and renders the bubble optimistically, keyed by that id.

  2. FE calls POST /api/v1/chat/messages/send { conversationPublicId, clientMessageId, content }. (User already authenticated.)

  3. Backend authorizes and validates: resolve conversationPublicId → conversation; cross-org check (conversation.org_id must equal the session's internal org id); membership; rate limit; content length ≤ 4000 (app-level — the DB has no length CHECK, see §7).

  4. In one transaction (pkg/utils.TransactionManager): INSERT the message into chat_messages with client_message_id set. Idempotency is enforced by the existing unique index (conversation_id, sender_id, client_message_id) (migration 000028, see C1):

    • Insert succeeds → first time. Continue.
    • Unique violation (23505) → this is a retry. SELECT the existing row by (conversation_id, sender_id, client_message_id) and return it with isReplay: true. Stop (no second row, no re-publish).
  5. Commit.

  6. After commit (never inside the tx — publish only once it's committed, so no one is told about a message that could still roll back), publish to Centrifugo via the internal API as one batch containing two commands — two payloads to two audiences:

    • publish("chat:<org>:<conv>", <full message payload>) — the complete message (id, author, content, timestamps) to everyone with that conversation currently open, who renders it immediately.
    • broadcast(["user:<org>:<m1>#<m1>", "user:<org>:<m2>#<m2>", …], <chat.activity ping>) — a thin ping to each member's personal channel. It carries a lightweight preview (conversationPublicId, senderId, ≤200-char snippet, timestamp) — not the full message — so a member with the conversation closed can badge the inbox and update the sidebar row ("Alice: hey…") without a fetch. Full body only ever comes via chat: or the history fetch. (Recipient list comes from chat_conversation_members — fan-out from the DB, not from Centrifugo connectivity. Building each user:<org>:<m>#<m> target requires mapping the member's internal user_idusers.workos_id, a join on the send path.)
  7. Fan-out: recipients draw a new bubble; the sender's app finds its pending "sending…" bubble with this clientMessageId (step 1) and updates it — attaches the real public_id, flips to "sent" — instead of drawing a second. The confirmation arrives twice (REST reply + this publish, both carrying the id); the first updates the bubble, the second is seen as a duplicate and ignored.

    Example: Alice sends "hi". Her app instantly shows a grey "sending…" bubble (only the clientMessageId). The backend saves it and fans out. Bob has never seen this id → he draws a fresh "hi" bubble. Alice gets the same fan-out back, recognises the id as her pending bubble → turns it "sent" with the real id. Alice hears the confirmation twice (REST reply + publish) — first turns it "sent", second is ignored as a duplicate.

  8. Publish failure is tolerated — log a WARN, still return 200. The message is committed to Postgres (the source of truth), so it's durable regardless; the publish is only the live "show it now" transport. It resurfaces on the next REST read — when any client opens/reopens the conversation, or on the recovered: false fallback after a reconnect (Flow C). A Centrifugo outage costs latency, never correctness, and must never fail a send.

    Honest limit of the after-commit publish. Narrow window: if the backend crashes between commit (step 5) and publish (step 6), or the publish fails (we only WARN), the message lands in Postgres but never enters Centrifugo — not even its history buffer. A recipient who stays connected with the conversation open won't get it live and won't get it via Flow C recovery — only on the next REST refetch. Accepted for the POC; the documented fix is the transactional outbox (blue note below).

Future — transactional outbox

Insert the message and an outbox row (chat_outbox) in the same transaction, and let Centrifugo's PostgreSQL outbox consumer publish committed rows (LISTEN/NOTIFY keeps it near-instant). This makes the publish atomic with the write, closing the crash-between-commit-and-publish window. Runs on a single node + memory engine — independent of the Redis/multi-node upgrade; adopt when that window becomes unacceptable. Ref: https://centrifugal.dev/docs/tutorial/outbox_cdc

Why publish + broadcast and not one call: broadcast sends one identical payload to many channels. The conversation needs the full message; the personal channels need a different lightweight ping. Two payloads ⇒ two commands ⇒ wrapped in batch for a single HTTP round-trip. (Verified: §13.)

Flow C — Recovery (reconnect / missed messages)

  1. On reconnect, centrifuge-js requests recovery on each chat: subscription. If the gap is within the channel's history window, Centrifugo replays the missed publications and reports recovered: true.
  2. If recovered: false (gap too large, or Centrifugo restarted — memory-engine history is lost on restart), the FE REST-refetches the conversation tail (chat/messages/list) and reconciles. This is the designed fallback, not an error.

Flow diagram — all three flows, end to end


4. Decisions (with rationale)

#DecisionWhy
A1chat: channels authorized by a per-channel subscription token, not a subscribe proxy.Centrifugo best practice. A subscription JWT bakes the channel + a short exp and is verified by signature, so it "survives a massive reconnect scenario" and reduces load on the session backend (§13). A subscribe-proxy webhook fires an uncached HTTP call on every subscribe — a Centrifugo restart stampedes the backend. Org + membership are checked once, at mint.
A2Personal channel is a user-limited channel user:<org>:<user>#<user> the client subscribes to (no token, no channels claim).Centrifugo's documented personal-channel pattern: user-limited channels are "ideal for… personal messages to a single user" and work "without requiring a subscription token" — Centrifugo enforces owner = sub from the connection JWT (§13). We don't use the built-in subscribe_to_user_personal_channel feature because its fixed namespace:#userID format can't carry our <org> prefix (mandatory for cross-org isolation). The # gives ownership; the <org> prefix gives tenancy. Revoked via Disconnect (the sub dies with the connection).
A3Org segment in every channel name, built from boundary ids only.Cross-org tenancy boundary: a user spans orgs, so an org-less personal channel leaks across them. Org is verified at mint and baked into the channel string, so even a leaked token for org A can't address org B. Internal BIGINT ids never enter channel names (boundary rule, §3).
A4No subscribe-proxy webhook anywhere in the POC.Token auth removes the need. Bonus: Centrifugo never calls back into the backend, eliminating an entire inbound public surface (the old shared-secret webhook is gone). The subscribe proxy is reserved for a future high-fan-out ephemeral status: channel only (§14).
C1Idempotency is enforced by the existing DB unique index (conversation_id, sender_id, client_message_id) (migration 000028) — no app-level store.The constraint already exists and chat_messages is not partitioned, so the DB can enforce it directly. INSERT; on 23505 treat as a replay and return the existing row. This is simpler and more robust than an in-memory map (survives pod restarts) and matches the colleague's design. (Supersedes the earlier in-memory-store plan, which existed only to work around partitioning we are not doing.)
C2Centrifugo's API/admin/metrics/health on internal_port; the publish API is reached only over the internal network with an API key.The genuinely sensitive surface (publish-to-anything, admin UI) gets real network isolation. With no subscribe proxy there is no public webhook to secret-gate at all.
C3(Superseded by A1/A4.) org_id previously rode in meta for the subscribe-proxy cross-org check. With token auth the cross-org check happens at mint time (backend has the session org) and the org is baked into the channel name. meta.workos_org_id is kept only for logging/observability; authz no longer depends on it.
C4Dual delivery = publish + broadcast inside one batch call.One broadcast carries one payload; we need two (full message vs ping). batch combines both commands in one HTTP request.
D1DMs + groups, text only.Grill decision.
D2Read cursor stays as the existing last_read_message_id BIGINT (already on chat_conversation_members); API speaks public_id.Unread counts are range counts on the monotonic internal id; storing a public_id cursor would force a resolve-join per inbox row. Count excludes own + soft-deleted messages. Column already exists — no migration.
D3Presence/status is OUT of POC scope; keep the existing user_presence system; add NO Centrifugo presence.Centrifugo built-in presence/join-leave reports raw connectivity, which leaks "appear-offline." The app already has a Teams-like status model (online/away/busy/appear-offline, auto-away via heartbeat) on user_presence (migrations 013/016), with a clear-stale-presence cron — untouched by the POC. We do not add an org: presence channel or in-conversation presence. Real-app integration reads the existing system; the future realtime path is interest-based status:<org>:<user> channels (§14).
D4Clients never publish to Centrifugo. Typing is a client POST → backend publishes to chat:<org>:<conv>.A single invariant ("all publishes are server-side") is simpler and safer than "clients may publish only to typing:". With no proxy there's nothing to attach identity to a client publish anyway; backend-publish keeps identity authoritative (from the session, not the payload). Typing rides the same chat: channel (no separate typing: channel → half the subscription-token mints). Throttled client-side ≥3s; published directly, never through the outbox (ephemeral).
D5Memory engine, single node. Redis is the documented next step.Grill decision. Memory engine supports history + recovery on one node (verified); lost only on restart, which Flow C handles.
D6New dedicated CENTRIFUGO_TOKEN_HMAC_SECRET (HS256); exp chosen by us, documented as tunable.Don't reuse SUPABASE_JWT_SECRET. HS256 matches existing token signing; no RSA in the codebase. Centrifugo prescribes no exp value — "choose judiciously / a reasonable expiration time" (illustrative 1h example, 25s grace) — so connection exp ≈ 1h and subscription exp deliberately short (minutes) is our engineering trade-off (shorter = faster revocation, more getToken calls), not a Centrifugo recommendation (§13).
D7GORM + pkg/utils.TransactionManager.The codebase data layer is GORM, not pgxpool. Code wins over research docs.
D8Hybrid ids: internal BIGINT id for keys/joins, external public_id (UUID/v7) + WorkOS TEXT ids at the boundary.Already the schema's convention (public_id added in 000024/000027/000029; org_id/user_id/sender_id are BIGINT FKs). The POC honors it via the boundary rule (§3).
D9People listing reuses the existing org-members endpoint.Not asserting a chat-specific people endpoint exists; add one only if the existing one doesn't fit.
D10Group admin leaving → auto-transfer to oldest member; last member leaving just leaves.No destructive delete path in the POC; an empty conversation is invisible (no members ⇒ nobody can list/subscribe) and harmless. Uses the existing role ∈ ('admin','member') values.
D11One chat_conversations table with type ∈ ('dm','group','channel') (POC uses dm/group); a new dm_key makes DMs unique per pair per org.Same pair in two shared orgs = two DM threads (org isolation — correct). parent_message_id reserved for future threads; type='channel' reserved for org-wide (§14).

5. Centrifugo deployment on Sevalla

  1. Pin a specific v6 image tag (e.g. centrifugo/centrifugo:v6.x.y — pick the exact patch at deploy time; never latest).
  2. Ports:
    • Public: client WebSocket (port, e.g. 8000) → exposed via Sevalla ingress as WSS.
    • Internal: http_server.internal_port (e.g. 9000) for /api, /admin, /metrics, /healthnot exposed publicly; reachable only by the backend over the Sevalla internal network.
  3. Admin UI password-gated, on the internal port only.
  4. Health check enabled (/health on internal port) for Sevalla.
  5. Env / config (key names verified; confirm exact v6 nesting at M1):
    • client.token.hmac_secret_key = CENTRIFUGO_TOKEN_HMAC_SECRET — verifies both the connection token and the subscription tokens (same secret; subscription tokens are just JWTs with a channel claim).
    • client.subscription_token.hmac_secret_key — set if v6 wants the subscription-token secret configured separately; otherwise it falls back to the connection-token secret. Confirm at M1.
    • client.allowed_origins — the Origin(s) Centrifugo accepts on the WS handshake. ⚠ Electron gotcha: the desktop renderer's origin is not a normal https://app.… domain (it's file://, a packaged custom scheme, or a dev-server http://localhost:<port>). If this list doesn't match what Electron sends, every connection is rejected. Confirm the real Origin at M1 (step 2) and list exactly those values.
    • channel.namespaces — define the chat and user namespaces (§5.1).
    • allow_user_limited_channels: true on the user namespace (a channel/namespace option, not a top-level client.* key) ← required for the personal channel #.
    • server API key for link 3.
    • http_server.internal_port, http_server.internal_address.
    • engine.type = memory (POC).
  6. Backend env additions: CENTRIFUGO_TOKEN_HMAC_SECRET, CENTRIFUGO_API_ENDPOINT (internal), CENTRIFUGO_API_KEY. (No subscribe-proxy secret — there is no proxy.)
Future — Redis / multi-node

For a 2nd node set engine.type=redis + engine.redis.address (Sevalla Redis) — two env vars, zero code change (Redis through the relevant scaling stage, not NATS — NATS loses history/recovery). The subscription-token model needs no change to scale out (tokens are stateless), which is part of why A1 was chosen.

5.1 Namespaces

NamespaceSubscribe authHistoryPresenceClient publishNotes
chat:subscription token (org + membership checked at mint)on (e.g. 100 msgs / 24h), recovery onoff (D3)no (D4)chat:<org>:<conv>. Carries messages and typing.
user:user-limited (#) — client subscribes; Centrifugo enforces owner = sub (no token)small (e.g. 10 / 1h)offnouser:<org>:<user>#<user>; namespace sets allow_user_limited_channels: true. Inbox pings.
Future namespaces

org: (org-wide broadcast, server-side via connection token) and status: (per-user presence, interest-based, the one place a subscribe proxy returns — high-fan-out ephemeral, no token mint per viewer) land with the org-wide and rich-presence features (§14). role: for role-based access (§14). None are in the POC.


6. Backend changes

All new chat code follows the existing feature-module layout (api / service / module / models / mocks) and the existing central route-registration convention.

  1. Platform client internal/platform/centrifugo/ — thin wrapper over the Centrifugo HTTP API: Publish, Broadcast, Batch, Unsubscribe, Disconnect (the last two for revocation — item 6). Mirrors internal/platform/livekit.
  2. Connection-token endpointPOST /api/v1/realtime/centrifugo-token (new handler in realtime or a new chat feature). Mints the connection JWT in Flow A step 2. Mirrors the existing LiveKit/Supabase token generators (HS256).
  3. Subscription-token endpointPOST /api/v1/chat/subscribe-token { conversationPublicId }. Resolve public_id → conversation; cross-org deny (conversation.org_id != session.org_id → ErrCrossOrgAccess); membership deny (caller not in chat_conversation_members); else mint a subscription JWT { sub: workosUserId, channel: "chat:<org>:<conv>", exp: short, iat, jti }. This is the only authorization point for chat: subscribes — there is no proxy.
  4. Send service — Flow B. Authz → cross-org → rate-limit → validate (≤4000) → INSERT (unique index = idempotency; catch 23505 → return existing, isReplay) → commit → batch publish(chat:<org>:<conv>) + broadcast(user pings) after commit → return 200 (publish failure = WARN only).
  5. Typing publishPOST /api/v1/chat/typing { conversationPublicId } (D4): membership check (cheap), then backend publish("chat:<org>:<conv>", <typing event with server-resolved identity>). Ephemeral, never persisted, never via outbox.
  6. Revocation of live access — a subscription token, once minted, is valid until its exp, and Centrifugo serves re-subscribes from the cached token without calling the backend. So when access is taken away, kill the live subscription explicitly, after committing the change:
    • On member removal / leave (M3): Unsubscribe(workosUserId, "chat:<org>:<conv>"). This kills the live sub immediately; the short subscription-token exp is what closes the re-subscribe-with-cached-token window (on next getToken the backend re-checks membership and refuses).
    • On ban / cross-org change: Disconnect(workosUserId) — kills the connection and with it the personal-channel subscription. Unsubscribe on the personal channel alone wouldn't stick (the client re-subscribes on reconnect), so a ban must drop the connection and the backend must then refuse to mint a fresh connection token.
    • Idempotent + best-effort (log on failure); the membership row is the source of truth, so a missed call only delays cutoff to the next token refresh.
Future — instant token revocation (Centrifugo PRO)

OSS has no revoke-by-jti; PRO adds token revocation by jti/iat and user blocking (instant, enabled when JWTs carry jti/iat — which ours do). Adopt if the short-exp window ever proves too loose. Refs: https://centrifugal.dev/docs/pro/token_revocation , https://centrifugal.dev/docs/pro/user_block

  1. Endpoints (logical; register per existing convention):
    • chat/conversations/open-dm, create-group, list, by-id, add-member, remove-member, rename, leave
    • chat/messages/send, list, edit, delete
    • chat/conversations/read-cursor
    • chat/typing
    • realtime/centrifugo-token, chat/subscribe-token
    • people listing → reuse existing org-members endpoint (D9). These replace the PostgREST RPCs being decommissioned (get_recent_dms, get_conversation_messages, update_read_cursor, …) — same behavior, now in Go (§7).
  2. Rate limit — in-process token bucket (e.g. burst 10, 1 msg/s) for the POC; swap to Redis rate when pods > 1. Replaces the old trg_check_message_rate_limit / trg_check_global_rate_limit triggers (§7).
Future — tiered rate-limit by fan-out

Limit the high-amplification cases (org-wide channels, large groups) harder than a 1:1 DM, token bucket keyed (user_id, scope). Matters once org-wide channels exist; the flat per-user bucket is fine for POC DMs/groups.

  1. Schedulerno new chat job. The existing clear-stale-presence cron stays (it's the presence system, kept as-is). No partition top-up (the table isn't partitioned). No dedup-cleanup (idempotency is the DB unique index).
  2. Observability — Prometheus metrics (existing pkg/metrics): publish latency, publish failures, subscription-token mints + denials with ErrCrossOrgAccess as its own label.

7. Database schema — EXTEND the existing tables

The chat tables already exist (migrations 000011000031) and are ~90% ready. This section is the verified inventory of what's there, the two columns we add, and the PostgREST-era layer we decommission. We keep all tables and their data.

7.1 What already exists (verified against the migrations)

TableKey columns already presentVerdict
private.chat_conversationsid BIGSERIAL PK, org_id BIGINT FK, type CHECK('dm','group','channel'), name, description, created_by BIGINT FK, is_archived, created_at/updated_at, public_id UUID NOT NULL UNIQUE (000024)Reuse as-is + add dm_key (§7.2).
private.chat_conversation_membersconversation_id BIGINT FK, user_id BIGINT FK, org_id BIGINT FK, role CHECK('admin','member'), muted_until, last_read_message_id BIGINT FK (read cursor, D2), joined_at, PK (conversation_id,user_id)Reuse as-is. No change.
private.chat_messagesid BIGSERIAL PK, conversation_id/org_id/sender_id BIGINT FK, parent_message_id (threads, reserved), content TEXT, reply_count, last_reply_at, is_edited/is_deleted (soft delete), created_at/updated_at, public_id UUID NOT NULL UNIQUE (UUIDv7, 000027/029), client_message_id UUID + unique (conversation_id,sender_id,client_message_id) (000028)Reuse as-is + add kind (§7.2). Not partitioned — kept plain.
private.chat_reactions(message_id,user_id,emoji) PKExists; out of POC scope. No change.
private.chat_attachmentsid, message_id, org_id, file_name, file_size, content_type, storage_keyExists; out of POC scope. No change.
private.user_presenceuser_id PK, org_id, status SMALLINT (0/1/2 + app-side 3=busy), manual_status, last_seen_at, heartbeat_at; clear-stale-presence cronThe existing status system (D3). Keep entirely untouched.

Key facts that shape the POC, all confirmed in the migrations:

  • Idempotency is already DB-enforced (000028) → C1 (no app store).
  • Hybrid ids already exist (public_id on conversations + messages) → D8, the boundary rule.
  • chat_messages is not partitioned → no partition machinery, public_id is a plain unique key.
  • Soft delete is is_edited/is_deleted booleans (not timestamps) — the send/edit/delete service uses these.
  • org_id/user_id/sender_id are internal BIGINT FKs — channel names/JWTs use boundary ids, mapped at the edge (§3).

7.2 What we add (one small migration 000032)

-- DM-pair uniqueness, org-scoped (D11). Guards the check-then-create race.
ALTER TABLE private.chat_conversations
ADD COLUMN dm_key VARCHAR(255); -- '<orgId>:<minUserId>:<maxUserId>'; NULL for non-DM
CREATE UNIQUE INDEX idx_chat_conversations_dm_key
ON private.chat_conversations(dm_key)
WHERE dm_key IS NOT NULL;

-- System messages for lifecycle events ("X added Y", "renamed to Z") in groups.
ALTER TABLE private.chat_messages
ADD COLUMN kind VARCHAR(10) NOT NULL DEFAULT 'user'
CHECK (kind IN ('user','system')); -- 'system' excluded from unread + rate limit

That's the whole additive change for the POC. Notably not added (deliberate scope cuts, see blue note): visibility, history_mode, first_visible_message_id, chat_outbox. Content length is validated in Go (≤4000); we add no DB CHECK (keeps edits/migrations cheap and matches the app being the authority).

Future — columns for org-wide + since-join history + outbox
  • chat_conversations.visibility (private/org) + history_mode (all/since_join), with a CHECK that since_join only pairs with private. Backfill dm→private/all, group→private/since_join, channel→org/all. Drives org-wide channels and "new members don't see old history."
  • chat_conversation_members.first_visible_message_id BIGINT (FK → chat_messages.id, NULL = full history), snapshotted at add-time — makes since-join a pure UPDATE.
  • private.chat_outbox (id, method, payload JSONB, created_at, LISTEN/NOTIFY trigger) for the transactional-outbox publisher. In the POC, group members see full history and there are no org-wide conversations — these stay unbuilt.

7.3 What we decommission (migration 000033 — the PostgREST → Go cutover)

The chat tables were built for the PostgREST/Supabase-Realtime era: the frontend wrote rows directly and triggers/RPCs/RLS did the work. Go now owns all of it, so that procedural layer is removed. Sequencing is a correctness precondition (§10, M0): this migration must ship before or together with the Go send endpointtrg_enforce_message_defaults overwrites sender_id with current_user_id() = NULL on a GORM insert → NOT NULL violation; the rate-limit triggers would misfire.

Drop (all on private.chat_messages unless noted):

  • Triggers + their functions: trg_enforce_message_defaults, trg_check_message_rate_limit, trg_check_global_rate_limit, trg_update_parent_reply_count, trg_notify_new_message, trg_touch_conversation. (Defaults, rate-limit, reply-count, the realtime.send fan-out, and conversation updated_at touch all move into the Go send service.)
  • Chat RPCs: update_read_cursor, get_recent_dms, get_conversation_messages, get_recent_dm_presence (and the 000030/000031 public-id variants, 000015 dashboard reconciliation). Replaced by Go endpoints (§6.7). (get_recent_dm_presence is a presence-flavored RPC but it joins the chat tables to find DM peers, so it belongs to the chat sidebar — the Go conversation-list endpoint returns each peer's presence inline by reading user_presence directly.)
  • RLS policies + authenticated grants on chat_conversations, chat_conversation_members, chat_messages. The backend authorizes now; the FE no longer touches Postgres directly.
  • realtime.messages broadcast policy — rewrite, do NOT drop. The live Broadcast channel access policy (migration 000015, which replaced the 000011 §8b one) is shared: it authorizes both the chat user:<org>:<user> topic and the office room-presence:<org>:<room> topic. Remove only the chat (user:) branch; keep the room-presence: branch and the Room members can send presence broadcasts INSERT policy — they belong to the out-of-scope office presence.
  • Supabase Realtime publication: remove chat_messages and chat_conversation_members (live push is Centrifugo's job now).

Explicitly KEPT (do not decommission):

  • The user_presence table, its update_presence write RPC, the clear-stale-presence cron, its RLS/grants, and its Supabase Realtime publication membership — this is the existing office-dots presence system, out of POC scope and untouched (D3). Its full architecture, and the constraints it imposes on this teardown, are documented in §7.4. Only the DM-sidebar RPC get_recent_dm_presence is decommissioned (above); the office-tile dots run on a separate path and are left as-is.
  • Helper functions current_user_id() / current_org_id()definitely kept: used by both the retained update_presence RPC and the retained room-presence RLS policy (verified).

The matching dbsecurity tests (internal/database/dbsecurity/ — rls/grants/triggers/rpc) assert exactly what's being removed; rework them in the same PR as 000033.

7.4 The existing presence system — verified, and why the POC leaves it alone

The away / busy / free / "appear-offline" dot on every office tile is an existing, working feature that predates chat and is independent of it. We mapped exactly how it is built — backend migrations plus the qubital-apps frontend (packages/app-shared/src/supabase/supabasePresenceClient.ts) — because it is the single biggest constraint on the Supabase teardown: chat can move to Centrifugo, but this feature cannot, and naively "removing Supabase" would silently break it. It therefore stays 100% untouched by the POC (D3). Here is the whole picture, so the teardown can be surgical rather than guesswork.

How it works today, end to end:

  1. Status truth — the user_presence table. One row per user: status (0=offline, 1=online, 2=away, 3=busy), manual_status (the user's explicit override — appear-offline is manual_status=0), last_seen_at, heartbeat_at. Effective status = COALESCE(manual_status, automatic_status). This is durable truth: it survives reconnects, restarts, and sessions until cleared.

  2. Status write — the update_presence PostgREST RPC. All the status logic runs client-side in supabasePresenceClient: a heartbeat plus DOM activity listeners and Electron system-idle time drive auto-away after a timeout; ZoneBusyManager forces busy when the user is behind a closed door in the office; manual choices (online/away/busy/appear-offline) take precedence, with a grace period so a manual "away" isn't instantly undone by a mouse move. Each resolved status is persisted by calling the Supabase RPC update_presence(p_status, p_manual_status, p_last_activity_at), which relies on the current_user_id() / current_org_id() helpers. A clear-stale-presence pg_cron job flips automatic status to offline when heartbeat_at goes stale — and never touches manual_status.

  3. Live distribution — a Supabase Realtime broadcast channel. For people in the same office room, subscribeToRoomPresence() opens a Supabase Realtime channel room-presence:<org>:<room>; each member broadcasts its own status_change events on it (emitStatusChange()httpSend), so the tile dots update near-instantly. On (re)subscribe it seeds from a direct user_presence read (getRoomMembersPresence), and the DM-sidebar dots are polled separately via the get_recent_dm_presence RPC. Authorization for the broadcast channel is the realtime.messages "Broadcast channel access" RLS policy (migration 000015) — which is shared with chat (it also authorizes the chat user:<org>:<user> notification topic), plus a companion room-presence INSERT policy.

Why this constrains the POC — four concrete consequences:

  • Supabase Realtime cannot be fully abandoned. It is the live transport for the office dots — and it is Supabase Realtime broadcast, not LiveKit (this resolves the long-standing open question). Pulling it out wholesale would break presence on every tile. So the POC removes Supabase Realtime only for chat; the office presence keeps using it.
  • The broadcast RLS policy is rewritten, not dropped (§7.3). Because one Broadcast channel access policy serves both chat and room-presence, the decommission migration must surgically strip only the chat (user:) branch and keep the room-presence branch + the room-presence INSERT policy. Dropping the whole policy would deny the office broadcasts.
  • The helper functions and the Supabase auth surface stay. current_user_id() / current_org_id() are called by both the retained update_presence RPC and the retained room-presence RLS policy, so they remain — which is also why SUPABASE_JWT_SECRET and the PostgREST path stay alive for presence even though chat stops using them.
  • The frontend teardown is surgical (M0). supabasePresenceClient is currently bind()-ed by SupabaseChatManager and shares the same supabaseConnection; RoomPresenceManager / ZoneBusyManager depend on that client. So we decouple presence from the chat manager — re-home ownership of the Supabase connection and the bind() call — rather than deleting the connection along with the chat code. Delete naively and the dots go dark.

What the POC does change about presence: only the DM-sidebar dots (the small status indicators next to conversations) are re-sourced — the new Go conversation-list endpoint returns each peer's status inline by reading user_presence, replacing the get_recent_dm_presence poll. The office-tile dots (the room-presence broadcast path) are a different code path and are left exactly as they are.

Future — unify presence on Centrifugo

If we ever migrate the office presence off Supabase Realtime, the target is the interest-based status:<org>:<user> model (§14): Go heartbeat write → Postgres truth → one publish per change → Centrifugo delivery to only the clients currently rendering that user (sidebar peers, on-screen room avatars, open profile card), via a refcounted subscription registry and a subscribe-proxy same-org check. Built-in Centrifugo presence is still not an option (it reports raw connectivity and leaks appear-offline). Only after that migration could Supabase Realtime be fully retired — out of scope here.


8. Frontend integration (centrifuge-js)

  1. Connection manager singleton: one Centrifugo connection per app session; the connection getToken wired to realtime/centrifugo-token for silent refresh. Lifecycle: connect after auth; on org-switch or sign-out, disconnect and reconnect with a fresh token (the new token's channels rebind the session to the new org — never reuse a connection across orgs); on background/foreground, let the SDK auto-reconnect.
  2. Electron token wiring: centrifuge-js runs in the renderer, but the WorkOS session cookie lives in the main process (apiClient). So the token-fetching getToken callbacks — the connection token and each chat: subscription token — must fetch through the existing IPC RequestAction flow: the main process makes the authenticated call (realtime/centrifugo-token or chat/subscribe-token), the renderer receives the token over IPC. The renderer must not call these endpoints directly (no cookie). (The personal channel needs no token — it's user-limited — so it has no getToken.)
  3. Subscriptions (steady state ~1–3): the always-on personal user:<org>:<user>#<user> (subscribed at connect, user-limited, no token) and the open conversation's chat:<org>:<conv> (subscribed when the conversation opens — its getToken mints the subscription token; unsubscribed when it closes). The personal channel persists for the whole session.
  4. Open-conversation join handshake (avoids a message-loss race): opening a conversation does two things — REST-fetch the tail and subscribe to chat:<org>:<conv>. Order: (1) subscribe first and buffer incoming events, (2) then REST-fetch the history tail, (3) merge by message id, deduping via clientMessageId. Fetch-then-subscribe would lose anything arriving in the gap. On reopen of a cached conversation, the fetch in (2) is a delta — only messages after the last cached id (a cheap after-cursor query); a warm cache with nothing new is a pure cache hit with no backend call.
  5. Optimistic send: render keyed by clientMessageId; reconcile on the echoed message.created (match by clientMessageId); on isReplay just confirm. Keep a per-conversation map clientMessageId → bubble for pending sends so the echo (REST reply or live publish) updates the existing bubble instead of duplicating it; drop the entry once confirmed.
  6. Recovery: on recovered: false, REST-refetch the tail and reconcile (Flow C).
  7. Sidebar + badges: unread count from the server + live chat.activity pings on the personal channel re-badge the inbox. The ping carries a lightweight preview (conversationPublicId, senderId, ≤200-char snippet, timestamp) — not the full message — so the sidebar row updates ("Alice: hey are you…") without a fetch. The full body only ever arrives via chat: or the history fetch.
  8. Typing: on local typing, POST /chat/typing (throttle ≥3s); render the typer's identity from the server-published event, never from local guesses. (No client publish to Centrifugo — D4.)
  9. Presence: the POC renders online/away/busy from the existing app status system (user_presence), exactly as today — chat adds nothing here (D3).
  10. Security: message content is plain text end-to-end; the renderer must never interpret it as HTML — in Electron, XSS can escalate to local code execution.

9. Security model (summary)

  • Connect: connection JWT (HS256), signature-verified by Centrifugo. No valid token → no connection.
  • Subscribe:
    • user: — user-limited (#), client-subscribed with no token; Centrifugo enforces owner = sub from the connection JWT. The <org> prefix scopes it to the session's org.
    • chat: — a subscription token the backend mints only after a cross-org + membership check; Centrifugo verifies signature, exp, and the exact channel string on every (re)subscribe.
    • No subscribe proxy → Centrifugo never calls back into the backend.
  • Publish: clients never publish; everything (messages, typing) is server-published over the internal API.
  • Internal surface: Centrifugo /api + /admin + /metrics on internal_port, never public. No public webhook at all.
  • Cross-org invariant: the org segment is in every channel name and is verified at mint (conversation.org_id != session.org_id → ErrCrossOrgAccess). A leaked subscription token for org A cannot address org B's channel. Mandatory test: a user who is a member via a different org is denied a token.
  • Revocation: subscription tokens are bounded by a short exp; on membership loss the backend calls Unsubscribe (kills the live sub) and the short exp closes the cached-token re-subscribe window; bans call Disconnect (also how the personal channel is revoked).
  • Content rendering (client): message content is rendered as plain text, never HTML — in Electron an XSS can escalate to local code execution (§8.10). This is the one security property that lives entirely on the frontend.
  • Token lifetime: short exp + silent refresh bounds a leaked token's window; on org-switch/sign-out the backend stops minting and the client reconnects, so access can't outlive the session.
  • Out of the POC threat model (stated for honesty): no end-to-end encryption — content is plaintext at rest in Postgres and relies on TLS/WSS in transit; per-message identity is trusted from the connection's backend-minted JWT, not re-verified on every publish.

10. Implementation plan (tracer-bullet slices)

Each milestone is a vertical slice — frontend → backend → Centrifugo → DB, ending in something runnable. Ship in order.

MilestoneDeliverable (vertical slice)
M0Schema EXTEND (000032) + decommission (000033); old Supabase-Realtime chat code paths removed; app boots on the extended schema with chat write/read still to come.
M1Tracer bullet: Centrifugo pod on Sevalla + platform client + connection-token endpoint + Electron round-trips a backend publish end-to-end (personal channel).
M2Schema in use + DM send/receive over the real path (Flow B) + DB idempotency + the real subscription-token endpoint (cross-org + membership).
M3Groups: create, add/remove member, rename, leave (admin→oldest transfer); membership system messages.
M4Realtime niceties: typing (backend-published on chat:), read cursors + badges, edit/soft-delete.
M5Dogfood hardening: recovery drills, restart drill, rate-limit, metrics, runbook.

M0 — Schema EXTEND + decommission (no teardown)

Goal: the database and codebase stop depending on the PostgREST/Supabase-Realtime chat path, while keeping the tables, their data, and the presence system.

  1. Inventory confirmed (done in §7 against migrations 000011000031). The tables are kept.
  2. Migration 000032 — add chat_conversations.dm_key (+ partial unique index) and chat_messages.kind (§7.2). Real up/down; run updownup locally.
  3. Migration 000033 — decommission the PostgREST-era triggers, chat RPCs, RLS/grants, and the chat-tables Realtime publication (§7.3). Keep the presence path. ⚠ Must ship before/with the Go send endpoint (M2)§7.3. Rework internal/database/dbsecurity tests in the same PR.
  4. Remove old chat code paths — surgically. Remove the Supabase-Realtime chat client (useSupabaseChat, the chat side of SupabaseChatManager) and the direct-PostgREST chat read/write paths. ⚠ Critical coupling (verified in qubital-apps): supabasePresenceClient is bind()-ed by SupabaseChatManager and shares the same supabaseConnection, and RoomPresenceManager/ZoneBusyManager depend on it. So do not delete the shared Supabase connection or the presence client — decouple presence from the chat manager (re-home the connection / bind() ownership) so the office dots keep working once chat moves to Centrifugo.
  5. Verify SUPABASE_JWT_SECRET / the Supabase token generator usage — grep the codebase; remove only if chat was the sole consumer, otherwise keep (presence/other features may use it). Done when: CI green, the app boots on the extended schema, no dangling imports.

M1 — Tracer bullet (thinnest end-to-end pipe)

Goal: prove a backend publish reaches the Electron client over the connection token before any chat logic. This is where v6 config is nailed down.

  1. Deploy Centrifugo on Sevalla per §5: pinned v6 image, public WSS, internal_port for /api /admin /metrics /health. Health check on internal /health.
  2. Confirm exact v6 config nesting (the one open M1 risk): client.token.hmac_secret_key, the subscription-token secret key (separate vs shared — §5), allow_user_limited_channels: true on the user namespace, the chat/user namespaces, http_server.internal_port. Also capture the Electron WS Origin and set client.allowed_origins to exactly that — a mismatch rejects every connection, so verify this before anything else.
  3. Add backend env (§5): CENTRIFUGO_TOKEN_HMAC_SECRET, CENTRIFUGO_API_ENDPOINT, CENTRIFUGO_API_KEY. Wire into internal/app/env-setup.go.
  4. Platform client internal/platform/centrifugo/: Publish, Broadcast, Batch over the internal API + a test against it.
  5. Connection-token endpoint realtime/centrifugo-token behind existing auth: mint HS256 JWT with sub=workosUserId, exp, meta.org (identity only — no channels claim).
  6. Electron: connection-manager singleton with getToken → token endpoint; connect; subscribe to the user-limited personal channel user:<org>:<user>#<user> (no token); have the backend Publish to it and see it arrive. Done when: a backend publish shows up live in the Electron client, token refresh works, the user-limited subscribe is accepted, and /api is unreachable from the public internet.

M2 — Real DM send/receive (the core loop)

Goal: Flow B for real, with the real token security model.

  1. GORM models for the (now extended) chat_conversations, chat_conversation_members, chat_messages — internal BIGINT ids + public_id + client_message_id + kind.
  2. Subscription-token endpoint (§6.3): resolve public_id → conversation → cross-org deny → membership deny → mint {sub, channel: chat:<org>:<conv>, exp short, iat, jti}.
  3. Send service (Flow B, §6.4) via pkg/utils.TransactionManager: authz → cross-org → rate-limit → validate(≤4000) → INSERT (catch 23505 → existing row, isReplay) → commit → batch publish(chat:<org>:<conv>) + broadcast(user pings) after commit200.
  4. Open-DM + list + messages/list endpoints; dm_key makes the pair unique per org (D11). These replace get_recent_dms / get_conversation_messages.
  5. Electron DM view: optimistic bubble keyed by clientMessageId, reconcile on echoed message.created, render as plain text (§8.10), subscribe-first join handshake (§8.4), per-subscription getTokenchat/subscribe-token over IPC.
  6. Tests (mandatory): cross-org denial — a user in org A is refused a subscription token (and thus a subscribe) for org B's conversation even if listed as a member (the single most important test). Plus idempotent-retry (same clientMessageId → one row, isReplay). Done when: Alice DMs Bob, it persists, Bob receives it live, a retry creates no duplicate, cross-org is denied.

M3 — Groups

  1. Endpoints: create-group, add-member, remove-member, rename, leave (§6.7).
  2. Admin-leave rule (D10): admin leaving auto-transfers to the oldest member; last member leaving just leaves (empty conversation is harmless/invisible). Uses existing role ∈ ('admin','member').
  3. Revoke live access on removal (§6.6): after committing a remove-member/leave, call Unsubscribe(workosUserId, "chat:<org>:<conv>").
  4. Membership events + system messages: insert a kind='system' message ("X added Y", "renamed to Z") in the same transaction and publish it on chat:<org>:<conv> so open clients update live; conversation.added on the new member's user: channel (they aren't on chat: yet) and conversation.removed symmetrically.
  5. Reuse the existing org-members endpoint for the people picker (D9). Done when: groups create, members add/remove, rename, ownership transfers on admin-leave, and system lines render.

M4 — Realtime niceties

  1. Typing (D4): client POST /chat/typing (throttle ≥3s) → backend publish on chat:<org>:<conv> with server-resolved identity. No typing: channel, no client publish.
  2. Read cursors + unread badges (D2): read-cursor endpoint writes last_read_message_id; unread = range count on the monotonic id (exclude own + soft-deleted + kind='system'); live chat.activity pings badge closed conversations.
  3. Edit / soft-delete own message: set is_edited / is_deleted (+ updated_at); emit message.updated / message.deleted. Done when: typing, badges, and edit/delete all work live.
Not an M-milestone:

presence/online dots (D3 — kept on the existing system), reactions, attachments, threads, org-wide channels. Their tables/columns already exist or are reserved; they're future features (§14), not POC work.

M5 — Dogfood hardening

  1. Recovery drill: disconnect mid-conversation, reconnect → verify replay (recovered: true) and the REST tail-refetch when the gap is too large (Flow C).
  2. Restart drill: restart Centrifugo → memory-engine history loss degrades gracefully to the REST fallback, no message loss (DB is source of truth).
  3. Rate-limit: in-process token bucket (§6.8) — verify a spammer is throttled.
  4. Metrics (§6.10): publish latency/failures, subscription-token denials (ErrCrossOrgAccess labelled) — confirm they emit.
  5. Runbook: short doc — dashboards, metrics, the Redis cutover trigger (§5 blue note). Done when: all drills pass and the runbook exists.

11. Testing

The security tests are non-negotiable — they guard the cross-org invariant.

Unit (service layer, mocked repos):

  • Subscription-token authz matrix for chat::
    • member, same org → token minted.
    • non-member, same org → denied.
    • member-via-a-different-org → denied (the mandatory cross-org case: real member, but conversation.org_id ≠ session.org_id). The single most important test.
  • Idempotency: first send inserts; an immediate retry with the same clientMessageId hits the unique index → returns the same public_id with isReplay: true, no second row; six sends with different ids → six rows.
  • Rate limit: burst over the bucket is rejected; refills after the window.
  • Read-cursor / unread count: range over id > last_read_message_id, excludes own, soft-deleted, and kind='system', 0 when caught up.
  • Admin-leave transfer (D10): admin leaving transfers to the oldest member; last member just leaves.

Repository (real Postgres, testcontainers):

  • Idempotency unique index: the same (conversation_id, sender_id, client_message_id) inserted twice raises 23505; different client_message_id inserts cleanly.
  • dm_key uniqueness: opening the same DM pair twice in one org returns the same conversation; the same pair in a different shared org creates a separate one (D11).
  • Read path: latest-N-messages query orders by id DESC and uses the existing (conversation_id, created_at DESC) index — sanity-check via EXPLAIN.

Integration (testcontainers: Centrifugo + Postgres + backend):

  • Happy path (Flow B): send → row in Postgres → publish+broadcast fire → a subscribed client receives message.created carrying the clientMessageId.
  • Subscription-token verification: Centrifugo accepts the minted token for chat:<org>:<conv> and rejects a token whose channel claim doesn't match the subscribe target.
  • Publish-failure tolerance: with Centrifugo unreachable, send still returns 200 and the row is committed (publish logged as WARN).
  • Recovery (Flow C): drop a subscription, publish in the gap, reconnect → recovered: true; force a gap beyond the window → recovered: false → REST tail-refetch reconciles.

Decommission tests (same PR as 000033):

  • The reworked dbsecurity suite asserts the dropped triggers/RPCs/RLS are gone and that a GORM insert with an explicit sender_id is no longer clobbered to NULL (§7.3).

Manual drills (M5):

  • 30s offline → reconnect → messages recovered (replay or REST).
  • Centrifugo pod restart → memory history lost → clients refetch, no persisted message lost.
  • Two real orgs → cross-org subscribe denied end-to-end (not just the unit mock).
  • Public-internet probe of /api and /admin → unreachable.

12. Definition of Done (dogfood bar)

  • Team uses it daily for 1 week.
  • 30s-offline reconnect recovers messages (replay or REST refetch).
  • Centrifugo restart drill: no lost persisted messages; clients refetch.
  • Send hot path p95 < 200ms — measured client-side as the send→echo delta during normal dogfood use (not during the drills, which inject latency by design).
  • Cross-org denial verified with a real second org.
  • Idempotent retry verified (isReplay).
  • Token refresh: a session left open past the connection exp keeps receiving live messages with no manual reconnect; subscription tokens refresh per-channel.
  • Groups work: create, add/remove member, rename, admin-leave transfer (D10), system lines render.
  • Hostile content renders inert: <img src=x onerror=alert(1)> displays as literal text and executes nothing.
  • Centrifugo /api + /admin not reachable from the public internet.
  • Old PostgREST chat triggers/RPCs/RLS decommissioned; presence path still works; ARCHITECTURE.md + a short runbook updated.

13. Verified facts (receipts)

Codebase (/opt/livekit-backend-go/backend):

  • Migrations at 000031; chat schema spans 000011000031 (see §7 for the per-table inventory). Next free numbers: 000032 (extend), 000033 (decommission).
  • Real column types: chat_conversations.org_id / chat_conversation_members.user_id / chat_messages.sender_id are BIGINT FKs to organizations(id) / users(id)not WorkOS TEXT ids (000011). WorkOS ids map via users.workos_id / organizations.workos_org_id. → the boundary rule (§3), D8.
  • Idempotency already DB-enforced: chat_messages.client_message_id UUID + unique (conversation_id, sender_id, client_message_id) WHERE client_message_id IS NOT NULL (000028). → C1.
  • public_id already present: chat_conversations.public_id (000024, gen_random_uuid), chat_messages.public_id (000027/029, UUIDv7 backfill matching pkg/uid.New()). → D8.
  • chat_messages is NOT partitioned (plain BIGSERIAL PK, 000011). → no partition machinery; public_id is a plain unique key.
  • Soft delete is is_edited/is_deleted booleans (000011). → §6.4 / M4.
  • Presence is user_presence (renamed from chat_user_presence in 000013), status SMALLINT (0/1/2) + manual_status + heartbeat_at, clear-stale-presence cron every 5 min (000013/000016). → D3 (kept as-is).
  • The trigger hazard: trg_enforce_message_defaults sets NEW.sender_id := private.current_user_id() (= NULL for a GORM/non-PostgREST insert) → must be dropped before/with the Go send endpoint (000011 §4.1). → §7.3.
  • Token signing is HS256 (realtime/service/generatetoken.go); existing secret SUPABASE_JWT_SECRET; no RSA. → D6.
  • pkg/utils.TransactionManager over *gorm.DB; two binaries (cmd/api, cmd/worker); existing 5-min and hourly tickers.

Frontend (StartingQuoTechDivision/qubital-apps):

  • Office status dots run on Supabase Realtime: packages/app-shared/src/supabase/supabasePresenceClient.ts opens a room-presence:<org>:<room> broadcast channel and writes status via the update_presence PostgREST RPC; reads via user_presence selects + get_recent_dm_presence. The shared supabaseConnection is bind()-ed by SupabaseChatManager; RoomPresenceManager/ZoneBusyManager consume it. → D3, §7.3, M0 (presence kept; Supabase Realtime not fully removed; surgical teardown). Resolves the room-presence-transport question: Supabase, not LiveKit.

Centrifugo v6 docs:

  • Subscription token auth: a per-channel JWT (channel claim) verified by signature/exp/channel; with a reasonable exp it "survive[s] a massive reconnect scenario" and reduces session-backend load. → A1. (channel_token_auth)
  • Subscribe proxy fires an HTTP request per subscribe and does not proxy token/user-limited subscriptions. → A1/A4. (proxy)
  • User-limited channels (#): create a personal channel "without requiring a subscription token"; "ideal for… personal messages to a single user"; Centrifugo identifies the user from the connection JWT and allows the subscribe only if the id after # equals sub (specified for client subscribes). → A2 (personal channel). (channels)
  • Built-in personal channel (subscribe_to_user_personal_channel) auto-subscribes a user-limited personal channel, but its fixed namespace:#userID format can't carry our <org> prefix — so we DIY an org-scoped user-limited channel instead. → A2. (server_subs)
  • Server-side subscriptions via the connection token channels claim — ideal for a static channel set known at connect time. → reserved for the future org: channel (§14), not used in the POC. (server_subs)
  • No prescribed exp: "choose the exp value judiciously" / "a reasonable expiration time"; illustrative 1-hour (3600) example; ~25s SDK grace. → D6. (authentication)
  • broadcast = same payload to many channels; batch = many commands in one request. → C4.
  • Memory engine supports history + recovery on a single node; lost on restart. → D5 / Flow C.
  • http_server.internal_port / internal_address move /api /admin /metrics /health off the public port. → C2.
  • Server API unsubscribe {user, channel} and disconnect {user} for revocation. → §6.6. (server_api)

14. Future-stage pointers (production roadmap)

Everything here is out of POC scope; the POC schema/channel model is built so each lands without rework. (These mirror the blue notes inline.)

  • Transactional outbox (publish reliability — independent of Redis): chat_outbox table + Centrifugo PostgreSQL outbox consumer; the send tx inserts message and outbox row, consumer publishes committed rows. Closes the crash-between-commit-and-publish window (Flow B step 8). Ref: https://centrifugal.dev/docs/tutorial/outbox_cdc
  • Redis / multi-node: engine.type=redis + address; rate-limit → Redis. Subscription tokens need no change. Optional pre-DB dedup via Redis SET NX to short-circuit retries before the INSERT (the DB unique index stays the source of truth).
  • Org-wide / broadcast conversations: type='channel', visibility='org', history_mode='all'; an org:<org> channel server-side-subscribed via the connection token → O(1) fan-out, no member enumeration. Add visibility/history_mode columns (§7.2 blue note).
  • Since-join history: chat_conversation_members.first_visible_message_id, consulted only when history_mode='since_join'. POC members see full history.
  • Rich presence (away/busy/appear-offline) over realtime: per-user status:<org>:<user> channels with interest-based subscriptions and a subscribe proxy (the one place it earns its keep — high-fan-out ephemeral, no per-viewer token mint), status truth in the existing user_presence. Never Centrifugo built-in presence (leaks appear-offline — D3). Until then, presence stays on the existing app system.
  • Roles / RBAC: org-scoped roles/user_roles/conversation_role_grants; access = explicit member or role grant; role:<org>:<role> channels server-side-subscribed for O(members+roles) fan-out. Purely additive — no change to today's tables.
  • Instant token revocation: Centrifugo PRO (jti/iat revocation + user block). Refs: pro/token_revocation, pro/user_block.
  • Reactions / attachments / threads: tables already exist (chat_reactions, chat_attachments with storage_key, parent_message_id/reply_count). Attachments follow the R2 design in CHAT_BACKEND_IMPLEMENTATION.md.
  • Cold storage / infinite scroll: if message volume demands it, (re)introduce monthly partitioning on chat_messages and detach old partitions to object storage (R2/Parquet). Deliberately not done now (the table is plain — §7).
  • Desktop notifications (Electron): when backgrounded, the renderer + WS stay alive so chat.activity previews still arrive — fire an OS notification (existing notification/ API + Electron Notification) when the window is unfocused and the conversation isn't active.
  • "Last seen": a single last_seen_at already exists on user_presence.