Skip to main content

Supabase / DB Restructure Roadmap

Date: 2026-06-13 Status: Plan of record. Organized by event, not by calendar week — we are doing a big restructure now but deliberately not covering everything in one shot.

This document is self-contained: it consolidates the findings of a DB/Supabase research spike (schema audit, platform assessment, connection-config review, Stripe/Centrifugo target design) into a single standalone plan. You do not need to read any other document to act on it. The one external dependency it refers to operationally is the in-progress frontend PostgREST → Go-endpoint cutover ("the frontend cutover"), because two migrations are sequenced against it.


How this roadmap is structured

Work is bucketed by the event that unblocks it, not by a date. Three kinds of bucket:

  1. NOW — the current restructure. Gated only on capturing a prod ground-truth snapshot first.
  2. Event-gated — happens "when X lands" (X = repo refactor, Centrifugo, Stripe). No date; the event is the trigger.
  3. Loose horizon (≤ ~6 months, not urgent) — anything not tied to a specific event. "6 months" = our runway to pilots; these can slip to month 4, 5, or 6, they just need to be done before pilot hardening.

Strategic verdict (the frame for everything below)

We keep Supabase, used as bare managed Postgres. Rationale:

  • Cheapest managed Postgres provider for our scale.
  • SOC 2 / ISO compliant — relevant because we are about to work toward our own certifications, and the DB sitting on a compliant host helps that.
  • After this restructure we use ~none of Supabase's value-add (Auth → WorkOS, Storage → R2, Realtime → Centrifugo, RLS → dropped), so the platform is interchangeable — but the migration cost of moving isn't worth paying now.

Consequence that simplifies the whole plan: because we stay on Supabase, pg_cron stays available — which is why we do not need a Go worker/scheduler (see the Go-worker note below). The Sevalla-Postgres consolidation question from the assessment is parked, not killed: revisit only if/when switching cost is near zero.


Gate 0 — capture prod ground truth (do this BEFORE any purge migration)

The schema dump we reviewed does not show triggers, functions, RLS policies, cron jobs, FK actions, or ~50 indexes, and migration 000015 admits to syncing dashboard-made changes out-of-band. So the live function/trigger/policy/cron state is unverified. Before writing a single drop migration, snapshot prod (read-only):

pg_dump --schema-only --schema=private --schema=public --schema=realtime
SELECT jobname, schedule, command FROM cron.job;
SELECT * FROM pg_policies;
SELECT tgname, tgrelid::regclass FROM pg_trigger WHERE NOT tgisinternal;
SELECT pubname, schemaname, tablename FROM pg_publication_tables;

This is the one hard prerequisite. Everything in "NOW" assumes we have this snapshot.


NOW — the current restructure

These are the migrations and code changes we do in this pass. Order matters where noted.

N1 · Purge the billing/licensing surface (one migration + backend cleanup)

Drop and remove all logic for:

  • org_licenses
  • user_licenses
  • subscription_link (the special-link gating table)
  • the check-expired-licenses pg_cron job + the /webhook/license-expired endpoint + its Vault secret + pg_net plumbing for that job
  • all backend code that fetches/writes the above (repositories, the AdminSignup reserve/consume dance, the hardcoded now()+3 months expiry in createorganization.go and eventsync/handlers.go)
  • the existing LicenseChecker in AuthMiddleware (pkg/middleware/auth.go:33,103-128) — it reads user_licenses only on the session-refresh path, so once the table is gone it has nothing to query and must come out in this same purge. It's already a weak gate (a valid access token is never re-checked until it expires); removing it now is consistent with the ungated-interim trade-off below, and the real billing gate arrives as a dedicated middleware at Event S.

Bonus: this fixes a real prod bug — today the only access gate is user_licenses.expires_at, a value copied once from org_licenses at membership creation, hardcoded to now()+3 months, with no renewal path anywhere (OrgLicenseRepository.Update has zero callers). So every org bricks 3 months after creation, and even a manual UPDATE can't restore access because the per-membership copies stay expired. Removing the broken gate is strictly better than leaving it.

⚠️ Conscious trade-off: purging these removes the only current access gate. Between now and Stripe landing, access gating is auth → (empty Stripe mw) → RBAC with no billing enforcement. That is intended (pilots are months out, no real billing yet) — but state it so nobody is surprised that orgs are ungated in the interim.

Feeds the next step — OpenAPI contracts. The OpenAPI contract definition is the keystone pulled forward right after this work. N1 removes endpoints (/webhook/license-expired) and license-related fields from request/response payloads, so whatever surface N1 deletes must be absent from the contracts. Sequence N1 before (or alongside) the contract authoring so the contracts describe the post-purge API, not the legacy one.

N2 · Fix the room → recordings cascade (prod bug)

fk_recordings_room is ON DELETE CASCADE (000007). Deleting a room — a routine RBAC-gated endpoint — silently wipes its recordings, which resets consumed recording-minutes quota and destroys billing/audit history. Recordings are billing/quota objects → they belong to the org, not the room's lifecycle.

Fix: re-point so the org is the durable owner — nullable room_id with ON DELETE SET NULL, keeping org_id as the billing key (or RESTRICT + service-level archival). Same migration should settle recordings.initiated_by / user-deletion FK policy so deleting a user who ever recorded isn't impossible while deleting a user who started a conversation nukes other people's messages.

N3 · Create the Stripe middleware — empty, contracts only

Stand up the middleware in the gating chain now, with no logic:

auth mw → stripe mw (contract only, calls c.Next()) → rbac mw

It defines the position and the interface it will eventually enforce, but for now just passes through. Real logic arrives when Stripe lands (event S). This lets the chain shape stabilize without blocking on billing.


Event-gated work

Event R — "when the repository / DB-access layer is reworked" (the weeks 3–5 backend tidy)

The roadmap's backend-tidy phase reworks folder structure, repo pattern, and service pattern. Tied to that event:

R1 · Re-point the data-access layer at the new (post-purge) schema.

R2 · Delete the dead repo methods the review found (AddUserToOrganization, GetUserOrganizations, GetOrganizationUsers, UserBelongsToOrganization, SubscriptionLinkRepository.*, OrgLicenseRepository.Update, etc.) — less surface to port.

R3 · Centralize tenant scoping (org_id/user_id filtering) in the repository layer — this becomes the single enforcement layer once RLS is gone.

R4 · Add the Stripe scaffolding column so the model is ready: organizations.stripe_customer_id (UNIQUE, NULL until first billing touch). Heavier Stripe tables come with Event S, not here.

Event C — "when Centrifugo lands" (realtime cutover)

This event tears down the entire Supabase realtime + RLS surface and replaces it with Centrifugo. Also gated on the frontend PostgREST → Go cutover: the pieces below are load-bearing for the legacy frontend until it cuts over — drop them at cutover, not before, or you break the live frontend. If the FE has already cut over, this is safe to start.

C1 · Disable RLS + Supabase realtime machinery (migration "032"):

  • DISABLE ROW LEVEL SECURITY on all tables; drop the 10 policies on private tables before the JWT helper functions they reference (pg_depend ordering).
  • Drop the JWT-claim helpers (current_user_id(), current_org_id(), user_conversation_ids()) and the 9 frontend-facing RPC functions (DROP FUNCTION IF EXISTS with historical signatures — some arg lists changed).
  • REVOKE ALL … FROM authenticated (the roles are Supabase-managed — revoke, don't drop).
  • Delete internal/database/dbsecurity/ (its only purpose is verifying this surface).
  • Triggers — do NOT drop all six blindly (all on chat_messages):
TriggerAction
trg_enforce_message_defaultsDrop. Reads request.jwt.claims (PostgREST-only) → makes the Go chat-send path fail 100%. The SendMessage CTE already resolves sender/org.
trg_check_message_rate_limit / trg_check_global_rate_limitRewrite to NEW.sender_id/NEW.org_id or drop + rate-limit in Go. They never fire for Go writes today → the Go path currently has zero rate limiting.
trg_notify_new_messageDrop — replaced by backend publish (C3).
trg_update_parent_reply_count / trg_touch_conversationKeep, or fold into the SendMessage CTE — do not just drop. No Supabase dependency; dropping them silently loses reply counts and DM-inbox ordering.

C2 · Remove user_presence; review room_presence:

  • user_presence: remove entirely — replaced by Centrifugo. Drop the clear-stale-presence cron with it. Clean up the two Go readers that use it as an org fence (user_repository.go:266, user_organization_repository.go:413) so multi-org users don't read as offline.
  • room_presence: keep it (LiveKit-derived authoritative state; Centrifugo presence is ephemeral per-channel UX only). Two things attach:
    • 🐞 Broken in prod right now: a sentinel-error refactor made RoomPresenceRepository.GetByWorkosUserID return ErrNotFound on no row, but presence_manager.HandleParticipantJoined still does if err != nil { return err } and aborts before the only upsert that creates rows — so the table drains permanently empty (single-session enforcement never fires, lk_participants_connected reads ~0). One-line fix: if err != nil && !errors.Is(err, apperr.ErrNotFound). Ship this regardless of the restructure — it needn't wait for Centrifugo.
    • Presence model decision (per-org (user_id, org_id) vs global (user_id)) — settle before building Centrifugo presence.
    • ⚠️ Open: who reaps stale room_presence rows? The clear-stale-presence cron is dropped here (it was heartbeat-based, tied to user_presence), DeleteStaleEntries is dead code, and there is no Go worker. Options: reconcile against LiveKit ListParticipants on a ticker, a thin pg_cron job scoped to room_presence, or fold cleanup into the Centrifugo presence flow.

C3 · Centrifugo cutover proper:

  • Drop both realtime.messages policies and all 8 supabase_realtime publication entries.
  • Delete internal/features/realtime/, the /api/v1/realtime/token route, and SUPABASE_JWT_SECRET.
  • Replace with backend publish post-commit (Centrifugo server HTTP API, idempotency_key = message public_id). Channels: personal:#<user_id>, chat:<conversation_public_id>, presence:<room_uuid>.

Event S — "when Stripe is adapted"

Billing model (decided): per-organization subscription with nominative seats — each org pays one subscription; each member occupies a named seat. Mid-cycle joins are prorated (a mid-month join ≈ half a month on the next invoice); leaves credit symmetrically (or pay-to-period-end — open decision). This model raises several findings from "later" to "billing-critical now" — they're captured below.

The full standalone design (mental model, DDL, webhook flow, seat sync, dunning, step-by-step adaptation) lives in "How to adapt Stripe" below. This section is the roadmap-level summary of what lands and in what order.

This is the event that fills the empty middleware from N3:

S1 · Replace the contract-only stripe mw with a dedicated RequireActiveSubscription middleware (not an extension of AuthMiddleware): chain is AuthMiddleware → RequireActiveSubscription → RequirePermission. Gate rule: membership row exists AND org_subscriptions.status IN ('active','trialing') + explicit past_due grace. On failure return a 402-style httperror so the FE routes to billing. Mount on the protected group minus a billingGroup (checkout, customer-portal, subscription-status, sign-out) that carries only AuthMiddleware so unpaid orgs can still pay. Optional 30–60s in-process TTL cache.

  • The old refresh-only LicenseChecker is already gone (removed in N1 with the user_licenses purge). RequireActiveSubscription is its replacement — and deliberately not an extension of AuthMiddleware, to avoid the stale-gate pattern (refresh-only re-check) the old checker had.

S2 · Add Stripe endpoints: POST /webhooks/stripe (stripe-go SDK, signature verify → event.id dedup via stripe_event_inbox → 2xx fast → async). Handle webhooks in Go, not Supabase Edge Functions. New code homes: internal/platform/stripe/ (client wrapper), internal/features/billing/ (webhook handler, sync service, gate). Same shape as the existing LiveKit /webhook + WorkOS event_inbox pattern — Stripe is the third sibling.

S3 · DB migration adding the tables in the design section below:

  • org_subscriptions (status CHECK, quantity/seats, period fields from the subscription item, pinned API version). max_seats nullable — under pure pay-per-seat the quantity is elastic; it's a plan-tier cap from Product metadata, not a hard schema fact.
  • stripe_event_inbox (default 'pending').
  • organizations.stripe_customer_id (added as scaffolding back in Event R).
  • user_organizations status + timestamps: add a WorkOS-synced status/active column and created_at / deactivated_at, and stop hard-deleting on leave. This (plus Stripe's own subscription-item change log) is the primary seat-history mechanism. A dedicated seat_events ledger is deferred — add only if a concrete per-period audit need appears (see the design section).

S4 · One idempotent SyncStripeDataForOrg(customer_id) called from every webhook and the post-checkout path — never partial-apply payloads (events arrive out of order). Stripe is source of truth; DB is a thin cache; per-request gating reads only the local table.

S5 · Seat quantity sync — event-driven, single write path, with proration:

  • The membership mutation (user_organizations ±, including the status flip) is committed before the Stripe call. Then update the subscription-item quantity with proration_behavior: 'create_prorations'. Stripe computes the proration — no custom math. (Decide leave-behavior explicitly: create_prorations credits on decrease; for pay-to-period-end, decrease with proration_behavior: 'none' at the period boundary.)
  • Failure ordering is local-first, Stripe-second, never the reverse — billing for a member who failed to be created is worse than briefly under-billing. A reconciliation sweep (count local allocated memberships ↔ Stripe subscription-item quantity) heals any mid-flight gap.

S6 · Membership lifecycle via WorkOS activate/deactivate (decided): treat a member's active/inactive state through the WorkOS deactivate / reactivate organization-membership API rather than hard-deleting user_organizations rows. When an org's payment expires → deactivate; when they pay → reactivate. Implications to design around:

  • WorkOS stays the source of truth for membership state, consistent with "WorkOS is identity-only / source of truth." Status changes arrive as organization_membership.updated events through eventsync → which is exactly why the single write path (blocker #1 below) must own deactivate/reactivate too, not just create/delete.
  • This resolves the "no membership history" blocker (#2 below): a status flip preserves the row, so current occupancy is no longer destroyed on leave. With WorkOS-synced status + timestamps on user_organizations, plus Stripe's own log of every subscription-item quantity change, we retain enough history without a custom ledger. (A dedicated seat_events table only adds value for repeated in-period transitions — join/leave/rejoin in one cycle — which a single mutable row can't capture; defer it until that's a real need.)
  • ⚠️ Separate the two causes of deactivation — they affect billing oppositely:
    • Member-level (member left, or an admin removed them) → reduces the billable seat quantity (prorated credit/charge per the model above).
    • Org-level payment lapse → an access suspension, and it must NOT zero the billable quantity — otherwise you'd credit the org for not paying, which is backwards. Payment-lapse gating is better enforced at the RequireActiveSubscription middleware (reading org_subscriptions.status); if you also deactivate the memberships on lapse, record the cause so reactivation restores exactly the previously-active set and doesn't accidentally revive a member an admin had individually deactivated.
  • Seat quantity = count of member-level allocated memberships (active + pending — a seat is billed once allocated to a person, whether or not they've claimed it), computed independently of org-level payment suspension (see above). This is the number synced to the Stripe subscription item.
  • Replaces the legacy room_presence-based force-disconnect: on lapse, deny at room-auth token mint; deactivated members fail the gate.

Blockers that must be resolved BEFORE the Stripe PR:

  1. Membership writes are split across three paths — eventsync (WorkOS-driven), the membership feature endpoints, and admin/member signup all touch user_organizations independently. With nominative seats, every one of them changes the billable quantity. They MUST be funneled through a single service method (membership mutation + Stripe quantity sync) or invoices drift. Inventory all writers first.
  2. user_organizations membership history. Originally a hard blocker (hard-delete, PK (user_id, org_id), no history). The WorkOS deactivate/reactivate decision above resolves it: status-flips preserve the row, so current occupancy survives a leave. Remaining work — add a status/active column synced from WorkOS membership state plus created_at / deactivated_at, and stop hard-deleting. No longer a blocker; the schema change ships in the billing PR. (No seat_events ledger needed unless per-period audit becomes a real requirement.)
  3. eventsync cursor advances past hard-failed events — re-graded High → Critical-for-billing. WorkOS organization_membership.created/deleted events now drive seat quantity; a silently-skipped event means Stripe quantity diverges from reality → wrong invoices. Fix: stop advancing past poison events (or retry sweep + attempts counter) plus the periodic full reconciliation above. (This is the same item as L6's eventsync fix, but it is not "hygiene" once billing depends on it — it gates the Stripe PR.)
  4. Seat-limit race is billing-critical now, not "mandatory later": today member-signup counts WorkOS memberships while invitation-acceptance counts local rows, both unserialized → concurrent joins exceed the seat cap. Fix: one source of truth (local user_organizations), SELECT … FOR UPDATE on the subscription row, count, insert, commit, then create the WorkOS membership. A phantom 11th member on a 10-seat sync is an invoice dispute, not a soft overage.

How to adapt Stripe

This is the full, self-contained design for billing. It expands Event S above into something you can implement from directly. Read it top-to-bottom: mental model → data model → webhook flow → seat sync → membership lifecycle → gating → dunning → adaptation steps.

1. Mental model — who owns what

Three rules decide every detail that follows:

  1. Stripe is the source of truth for billing state (what plan, what status, how many seats billed, when the period ends). We never compute billing state ourselves; we cache Stripe's.
  2. The Go backend is the only thing that writes Postgres. Stripe never touches the database. Stripe talks to us via webhooks; we talk to Stripe via the API. The frontend never talks to Stripe for state — it reads our DB through our endpoints.
  3. The per-request hot path never calls Stripe. Gating reads one local cached row (org_subscriptions). Stripe calls happen only in webhooks, at checkout, and on membership changes — never inside RequireActiveSubscription.
┌────────────────────────────────────────────────────┐
│ STRIPE │
│ Customer ─ Subscription(item: qty = seats) ─ Price │
└───────┬───────────────────────────▲────────────────┘
webhooks (signed) │ │ API calls (stripe-go)
checkout.session.completed │ │ - update item quantity
customer.subscription.* │ │ (proration_behavior)
invoice.paid / payment_failed │ │ - fetch subscription (sync)
▼ │
┌─────────────────────────────────────────────────────────┴───────────────┐
│ GO BACKEND │
│ │
│ POST /webhooks/stripe (public, signature-verified, no cookie auth) │
│ 1. webhook.ConstructEvent → verify Stripe-Signature │
│ 2. INSERT stripe_event_inbox (event_id UNIQUE → dedupe) → 200 fast │
│ 3. SyncStripeDataForOrg(customer_id) ◄── also called from the │
│ fetch fresh subscription from Stripe checkout-success path │
│ └─▶ upsert org_subscriptions (never apply the payload directly: │
│ events can arrive out of order) │
│ │
│ Membership change (join / leave / activate / deactivate — ONE path): │
│ tx { mutate user_organizations (+status flip) } → commit │
│ then Stripe: update subscription-item quantity (prorated) │
│ reconciliation sweep heals any Stripe-call failure │
│ │
│ Request gating: RequireActiveSubscription reads org_subscriptions only │
└───────────────┬──────────────────────────────────────────────────────────┘
│ GORM

┌──────────────────────────────────────────────┐
│ POSTGRES │
│ org_subscriptions ← cache of Stripe truth │
│ stripe_event_inbox ← dedupe + audit │
│ user_organizations ← who occupies seats │
│ (+ WorkOS-synced status)│
└──────────────────────────────────────────────┘

This is the same shape as the two webhook integrations already in the codebase — the LiveKit /webhook endpoint and the WorkOS event_inbox poller. Stripe is the third sibling, not a new pattern. New code homes: internal/platform/stripe/ (thin client wrapper over stripe-go) and internal/features/billing/ (webhook handler + sync service + gate middleware).

Why not Supabase Edge Functions (which Stripe's own quickstarts push): those quickstarts assume a backendless app where the only server-side compute is on Supabase. We are not that — we have a Go backend that already receives webhooks. Edge Functions would add a Deno/TypeScript runtime to a Go shop, split billing logic away from the repositories that own the data, increase Supabase lock-in (against the "use Supabase as bare Postgres" verdict), and run outside our compliance boundary. Handle Stripe in Go.

2. Billing model

  • Per-organization subscription. One Stripe Customer per WorkOS org, one Subscription with a single licensed-type subscription item whose quantity = number of occupied seats.
  • Nominative seats. Who occupies each seat is our concern, not Stripe's — Stripe only bills a count. The mapping (which user holds which seat, and when) lives in our DB.
  • Proration on change. A member added mid-cycle is billed for the remaining fraction of the period; Stripe computes this for us when we change the item quantity with proration_behavior: 'create_prorations'. We never do proration math.
  • Numeric limits live in our DB, not Stripe. Stripe Entitlements are boolean-only, so caps like max_seats and monthly_recording_minutes are columns we denormalize from Product metadata at sync time and enforce in-app.

3. Data model

All new objects; nothing here mutates an existing table except the one organizations column.

-- 3.1 Link an org to its Stripe customer. Added early (event R) as scaffolding.
ALTER TABLE private.organizations
ADD COLUMN stripe_customer_id text UNIQUE; -- NULL until the org's first billing touch

-- 3.2 The local cache of Stripe's subscription truth. One row per org.
CREATE TABLE private.org_subscriptions (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
org_id bigint NOT NULL UNIQUE REFERENCES private.organizations(id),
stripe_customer_id text NOT NULL,
stripe_subscription_id text NOT NULL UNIQUE,
stripe_price_id text NOT NULL,
status text NOT NULL
CHECK (status IN ('trialing','active','incomplete','incomplete_expired',
'past_due','canceled','unpaid','paused')),
quantity integer NOT NULL DEFAULT 1, -- seats billed
max_seats integer, -- NULLABLE: plan-tier cap, may not exist
monthly_recording_minutes integer, -- numeric entitlement, enforced in-app
cancel_at_period_end boolean NOT NULL DEFAULT false,
current_period_start timestamptz, -- read from the subscription ITEM (see §7)
current_period_end timestamptz,
trial_end timestamptz,
synced_at timestamptz NOT NULL DEFAULT now(),
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);

-- 3.3 Webhook idempotency + audit. Sibling of the WorkOS event_inbox,
-- but defaults to 'pending' (a real inbox processes, then marks processed).
CREATE TABLE private.stripe_event_inbox (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
event_id text NOT NULL UNIQUE, -- Stripe event.id → dedupe key
event_type text NOT NULL,
stripe_customer_id text,
status text NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','processed','failed')),
error_message text,
event_created_at timestamptz NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
processed_at timestamptz
);

-- 3.4 Seat occupancy + history. user_organizations stays the live "who is in the org"
-- table; the change below gives it history without a separate ledger.
ALTER TABLE private.user_organizations
ADD COLUMN status text NOT NULL DEFAULT 'active' -- synced from WorkOS membership state
CHECK (status IN ('active','inactive','pending')),
ADD COLUMN deactivated_at timestamptz; -- created_at already exists / add if not
-- and: stop hard-deleting rows on leave — flip status instead.

Seat history, and why there is no seat_events table (deferred): nominative billing needs to answer "who occupied which seats during invoice period X". Two sources already cover this without a custom ledger:

  1. user_organizations with WorkOS-synced status + created_at/deactivated_at (above) — current occupancy and the last transition per member, retained because we stop hard-deleting.
  2. Stripe's own subscription-item history — every quantity change, with its proration, is logged on Stripe's side and is the billing source of truth anyway.

A dedicated append-only seat_events ledger would only add value for repeated in-period transitions (a member who joins, leaves, and rejoins inside one cycle — which a single mutable status/deactivated_at overwrites). That's unlikely at pilot churn, easy to add later, and annoying to keep correct if it's written-but-never-read. Defer it; add only if a concrete per-period audit or dispute need appears. If/when added, it's an append-only log (org_id, user_id with no FK CASCADE so history outlives the user, action, quantity_after, occurred_at).

4. Webhook handling

Endpoint: POST /webhooks/stripe — public (no cookie auth), authenticated by the Stripe signature instead.

  1. Verify the Stripe-Signature header with webhook.ConstructEvent and the endpoint signing secret. Reject unverified payloads.
  2. Dedupe + record: INSERT INTO stripe_event_inbox (event_id, ...). The UNIQUE on event_id makes redelivery a no-op (Stripe retries, and may deliver more than once). Respond 200 immediately — Stripe times out slow endpoints and retries, which just amplifies load.
  3. Process async (after the 200): call the one sync function below. On success mark the inbox row processed; on failure mark failed with error_message and let the reconciliation sweep / retry handle it.

Events we care about: checkout.session.completed, customer.subscription.created|updated|deleted, invoice.paid, invoice.payment_failed.

Never apply the webhook payload directly. Stripe does not guarantee delivery order — a stale updated can arrive after a newer one. Every handler does the same thing: extract the customer id, then call the sync function, which re-fetches the current truth from Stripe.

5. The one sync function

SyncStripeDataForOrg(stripe_customer_id):
sub = stripe.Subscriptions.List(customer = stripe_customer_id, limit 1, expand items)
upsert org_subscriptions row from `sub`:
status, quantity, price_id, period start/end (from the ITEM), trial_end,
cancel_at_period_end, max_seats + monthly_recording_minutes (from Product metadata),
synced_at = now()

This is the only writer of org_subscriptions. It is called from every webhook handler and from the post-checkout success path (so the UI reflects the new subscription without waiting for the webhook). Idempotent: running it twice yields the same row. This single-writer discipline is what keeps out-of-order events from corrupting state.

6. Seat quantity sync (the nominative part)

Seat count changes whenever a membership changes. The hard rule: all membership mutation goes through one service method, because today three independent code paths touch user_organizations (the WorkOS event-sync poller, the membership feature endpoints, and admin/member signup) and every one of them now moves money. If they each call Stripe independently, invoices drift.

The single path does, in order:

  1. One DB transaction: mutate user_organizations (add / remove / activate / deactivate — flip status, never hard-delete). Commit.
  2. Then call Stripe to set the subscription-item quantity to the new allocated-seat count, with proration_behavior: 'create_prorations'.
  3. A periodic reconciliation sweep compares the local allocated-seat count against the Stripe subscription-item quantity and corrects drift.

Ordering is local-first, Stripe-second — never the reverse. If the Stripe call fails after the local commit, we have briefly under-billed and the sweep fixes it. The reverse (Stripe succeeds, local fails) would bill for a member who doesn't exist — worse.

Leave behavior is a product decision: create_prorations also credits on a decrease (member leaves mid-cycle → org gets credit). If instead leavers should pay through the period end, apply the decrease with proration_behavior: 'none' at the period boundary rather than immediately.

7. Membership lifecycle via WorkOS activate/deactivate

Member active/inactive state is driven through the WorkOS deactivate / reactivate organization-membership API, not by hard-deleting user_organizations rows. Payment lapse → deactivate; payment resumes → reactivate. WorkOS stays the source of truth for membership state and emits organization_membership.updated, which flows back through the event-sync poller — which is why the single write path (§6) must own activate/deactivate too, not just create/delete.

The one trap — two causes of deactivation that affect billing oppositely:

CauseMeaningEffect on billable quantity
Member-level (left, or admin removed)seat genuinely vacatedreduces quantity (prorated)
Org-level payment lapseaccess suspended, seats still "owned"must NOT reduce quantity — you'd credit a non-payer

So: payment-lapse suspension is enforced at the gate (§8) reading org_subscriptions.status, not by zeroing seats. If you also deactivate the org's memberships on lapse, record the cause so that on reactivation you restore exactly the previously-active set and don't accidentally revive a member an admin had individually removed. Billable seat quantity = count of member-level allocated memberships (active + pending), computed independently of payment suspension.

pending memberships still occupy a billable seat (allocated-seat model). A seat is billed the moment it's allocated to a person — inviting a member is quantity +1 immediately (prorated), whether or not they ever log in. (Decided: the org chose to allocate the seat, so the org pays; whether the invitee claims it is their concern. The alternative — bill only on actual join/login — is a deliberate non-choice.) So pending and active both count toward quantity; only a genuinely vacated membership (left / admin-removed) decrements it.

pending memberships are not deactivatable — skip them in the lapse sweep. WorkOS rejects deactivate on a pending membership (invited, not yet accepted). When the lapse sweep iterates an org's memberships, guard with if status == pending { continue } so one un-accepted invite doesn't abort the batch. This is harmless: a pending member has no session to gate, and the seat stays billed (correct — the org allocated and paid for it). Skipping deactivation does not change quantity.

This also cleanly replaces the legacy room_presence-based force-disconnect: on lapse, deny at the LiveKit room-auth token mint — a deactivated/un-gated member simply fails the gate.

8. Where the gate lives

A dedicated middleware, not an extension of AuthMiddleware:

Request → AuthMiddleware → RequireActiveSubscription → RequirePermission → handler
(who are you?) (did the org pay?) (may you do this?)
  • After auth — the gate needs the validated org_id from the token; there's nothing to gate on before auth.
  • Before RBAC — payment status is org-level and binary; no point evaluating per-route permissions for an unpaid org.
  • Its own middleware, not inside auth — the old LicenseChecker (removed in N1) only checked on the token-refresh path, so a valid token sailed through until expiry: a stale gate. A separate middleware re-checks every request, and lets specific routes opt out by simply not mounting it.
  • Opt-out group (billingGroup) carries only AuthMiddleware: checkout, customer-portal redirect, subscription-status read, sign-out — an unpaid org must still be able to pay and leave.
  • Gate rule: status IN ('active','trialing') plus the chosen past_due grace policy. On failure, return a 402-style domain error via httperror so the frontend routes to the billing page.
  • Cost: one indexed read of org_subscriptions by org_id per request (the UNIQUE index covers it). If it ever matters, add a 30–60s in-process TTL cache — webhook-driven sync keeps the table fresh, so the staleness window is bounded.

9. Lifecycle & dunning

Stripe stateAccessNotes
trialing✅ fullPilots become real subscriptions with trial_period_days.
active✅ fullSteady state.
past_due✅ graceFirst failed payment — don't revoke immediately; Stripe retries (Smart Retries). Keep access through the grace window.
unpaid / canceled❌ revokeRetries exhausted or cancelled. Gate denies.
paused❌ (per policy)Trial ended with no payment method. Set via trial_settings.end_behavior.missing_payment_method = 'pause'. Distinct from past_due.

Pilots: create a real Stripe subscription with trial_period_days and trial_settings.end_behavior.missing_payment_method = 'pause'. The legacy subscription_link invite mechanism either shrinks to an invite token carrying stripe_price_id + trial length, or is replaced outright by Stripe Checkout links.

10. Stripe API gotchas to pin down

  • Pin the API version by pinning the stripe-go module version in go.modnot via a config string. The Go SDK hard-codes its API version per release and Stripe explicitly advises against overriding it in strongly-typed languages (response objects wouldn't match the SDK structs). So "pin the API version" = freeze the stripe-go major version. The Basil API (2025-03-31.basil) ships in stripe-go v82.0.0+ (v82.0.0, 2025-04-01). Bump deliberately, read the migration guide, re-test.
  • Billing periods moved to the subscription item (Basil API, 2025-03-31, confirmed breaking change). Pre-Basil, current_period_start/end lived on the Subscription object; Basil removed them from the Subscription and put them on each subscription item (a sub can have multiple items on different cadences, so the period is per-item). Read from the item — Go path: sub.Items.Data[0].CurrentPeriodStart / .CurrentPeriodEnd, not sub.CurrentPeriodStart (gone → zero value on Basil). The DDL in §3 and the sync in §5 reflect this.
  • Entitlements are boolean-only — numeric caps (max_seats, monthly_recording_minutes) stay in our DB (§2), reported to Stripe only if you bill overage.

11. Adaptation steps (the order to actually do it in)

  1. Prereqs (the blockers in Event S) first — funnel membership writes through one service method; add the WorkOS-synced status column + stop hard-deleting; fix the event-sync cursor so it can't skip membership events; serialize seat counting. Billing on top of these unfixed produces wrong invoices.
  2. internal/platform/stripe/ — client wrapper (pinned version), customer + subscription + checkout helpers.
  3. Migration — the schema changes in §3 (org_subscriptions, stripe_event_inbox, organizations.stripe_customer_id, user_organizations status + timestamps).
  4. internal/features/billing/ — the POST /webhooks/stripe handler (§4) and SyncStripeDataForOrg (§5).
  5. Checkout + customer-portal endpoints; the post-checkout success path also calls the sync function.
  6. Wire seat-quantity sync into the single membership write path (§6) + the reconciliation sweep.
  7. RequireActiveSubscription middleware (§8); mount on the protected group minus billingGroup.
  8. Backfill existing orgs into Stripe (create customers/subscriptions), then drop the legacy user_licenses / user_organizations.license_id / org_licenses remnants if any survived N1, plus the subscription_link reserve/consume dance.
  9. Dunning/lifecycle policy wiring (§9) + the past_due grace decision.

Loose horizon (≤ ~6 months to pilot, not event-bound, not urgent)

L1 · Understand / confirm max pool & connection values

Not urgent — can be month 4/5/6. Confirm the Supabase compute tier (→ exact direct-connection limit), test whether Sevalla can do IPv6 egress (→ Direct mode bypasses the pooler), and size MaxOpenConns × pods headroom for the eventual many-pod k8s setup. Decide the per-pod pool size for horizontal scale (likely < 15). No action forced until we actually scale out.

L2 · Connection-config hygiene (small, can be done whenever database.go is touched)

  • Remove hardcoded PreferSimpleProtocol: true; let the POSTGRES_URL default_query_exec_mode param be the single source of truth. Flip prod from the transaction pooler (:6543) to session mode (:5432) to re-enable prepared statements — test on staging first.
  • Set ConnMaxLifetime ≈ 270s and ConnMaxIdleTime (currently unset → latent stale-connection / EOF bug when the pooler drops idle backends ~300s).
  • Set MaxIdleConns = MaxOpenConns (currently 5 vs 15 → churn).
  • Add statement_timeout (~20s via ALTER ROLE/DSN — we connect as postgres, whose default is 2 minutes), idle_in_transaction_session_timeout (~60s, currently unbounded), application_name, slow-query logging.

L3 · Index / constraint hygiene (mostly enabled by the RLS drop)

  • users.email: unique index on lower(email) — today it's an un-indexed seq scan and duplicate emails resolve to the oldest row (stale-user shadowing bug).
  • chat_messages (conversation_id, id DESC) for keyset pagination; rewrite unread count to m.id > COALESCE(last_read, 0).
  • Composite tenant FKs — once RLS is gone these are the only DB-level cross-tenant guard (nothing currently stops chat_messages.org_id differing from its conversation's org).
  • Partial unique for "one active recording per zone"; status CHECKs on recordings/event_inbox (but not on user_organizations.role — eventsync writes WorkOS slugs verbatim).
  • fillfactor on the churn hotspots event_cursors and user_presence (the latter dies with the presence removal at Event C).

L4 · Refresh-token rotation — review ⚠️ likely an active bug, not just a TODO

You flagged this as "not a problem because WorkOS gives us the same refresh token, worth a TODO later." The schema audit contradicts that: it found WorkOS refresh tokens are single-use, and AuthMiddleware refreshes via WorkOS but never persists refreshResp.RefreshToken (SessionRefreshToken.Set is called only in the finalize/sign-in flows, not the middleware refresh path). So user_sessions.refresh_token holds a consumed token → the session dies on the second refresh cycle, and concurrent expired-cookie requests race the single-use token. If that's right, this is a High-severity prod bug that bricks sessions, not a someday-TODO. Action: verify WorkOS's actual rotation behavior first — if tokens are single-use, fix now (persist the rotated token + SELECT … FOR UPDATE per session); if WorkOS genuinely returns the same token, downgrade to the TODO you described. Don't leave it ambiguous.

L5 · Migration-chain cleanup / portability

  • Squash to a vanilla-Postgres baseline 000001 (force-version existing DBs); delete supabase_stubs.sql, dev/db-stubs.sql, the 42-file eventsync migrations fork, internal/database/migrations/old/, and the stale db_schema.md.
  • Create a least-privilege qubital_app role (vs the current table-owning postgres), reset its search_path, fix CLAUDE.md (it wrongly says GORM "automatic migrations" — it's golang-migrate — and references a non-existent user_sessions.LastLoggedOrg).
  • Lower priority given we're staying on Supabase, but it's what makes the platform genuinely portable and removes the "third source of schema truth" maintenance tax.

L6 · Hygiene sweeps (cheap, do alongside L3/L5)

  • eventsync: stop advancing the cursor past hard-failed events (the comment claims they're retried — they aren't); add a retry sweep + attempts counter.
  • Retention: event_inbox (processed AND created_at < NOW()-90d) and user_sessions (TTL/cleanup — currently encrypted refresh tokens for abandoned sessions accumulate forever).
  • Concurrency fixes: FindOrCreateDM race (advisory lock + dm_key partial unique), FirstOrCreateINSERT … ON CONFLICT on users.workos_id / user_sessions.session_id. Seat-limit serialization becomes mandatory at event S (Stripe seats).
  • Decide chat_reactions / chat_attachments (provably empty, RLS-locked, no Go code): build or drop in the realtime-teardown migration.

⚠️ Go worker / schedulers — correction to "remove entirely"

Your plan says: "Go worker schedulers are not needed; our current jobs are user presence (purged by Centrifugo) and license expiry (purged by Stripe), so remove that part entirely."

There are four pg_cron jobs, not two:

JobPurged by Stripe/Centrifugo?Disposition
check-expired-licenses✅ yes (Stripe)Drop in N1
clear-stale-presence✅ yes (presence removal)Drop at Event C
reconcile-stale-recordingsnoKEEP — without it, recordings stuck after a lost egress webhook sit in finalizing forever
purge-expired-recordingsnoKEEP — daily retention delete of expired recordings

The conclusion still holds — no Go worker is needed — but the reason is different and the action changes: because we're staying on Supabase, pg_cron remains available, so the two recording jobs just stay on pg_cron. Do not drop them in the purge. (A Go worker would only be needed if we left Supabase, which we've parked.) The trade is that these two jobs remain Supabase-coupled — acceptable under the "stay" verdict, and the thing to revisit first if the Sevalla question ever reopens.


Review verdict — is the roadmap fine?

Yes, the spine is right and matches the verified findings: purge billing/licensing → strip RLS/realtime → empty Stripe mw → fix recordings cascade → drop user_presence → fill Stripe later. The event-based framing fits the work cleanly. The corrections to fold in:

  1. Gate 0 first — snapshot prod triggers/functions/policies/cron before any drop migration. The dump is not trustworthy.
  2. RLS/trigger removal is gated on the frontend PostgREST cutover — not unconditionally "now." Confirm the FE has cut over.
  3. Don't drop triggers blindlytrg_update_parent_reply_count + trg_touch_conversation must be kept or folded into the CTE; trg_notify_new_message stays until Centrifugo.
  4. Four cron jobs, not two — keep the two recording jobs on pg_cron. (No Go worker, but for the "we stay on Supabase" reason.)
  5. Refresh-token rotation is probably an active prod bug, not a future TODO — verify WorkOS single-use behavior; the review says sessions die on the 2nd refresh.
  6. room_presence is broken in prod right now — ship the one-line ErrNotFound fix regardless of the restructure.
  7. Note the ungated interim — purging licenses leaves orgs with no billing gate until Stripe; intended, but state it.