Supabase / DB Restructure Roadmap
Date: 2026-06-13 Status: Plan of record. Organized by event, not by calendar week — we are doing a big restructure now but deliberately not covering everything in one shot.
This document is self-contained: it consolidates the findings of a DB/Supabase research spike (schema audit, platform assessment, connection-config review, Stripe/Centrifugo target design) into a single standalone plan. You do not need to read any other document to act on it. The one external dependency it refers to operationally is the in-progress frontend PostgREST → Go-endpoint cutover ("the frontend cutover"), because two migrations are sequenced against it.
How this roadmap is structured
Work is bucketed by the event that unblocks it, not by a date. Three kinds of bucket:
- NOW — the current restructure. Gated only on capturing a prod ground-truth snapshot first.
- Event-gated — happens "when X lands" (X = repo refactor, Centrifugo, Stripe). No date; the event is the trigger.
- Loose horizon (≤ ~6 months, not urgent) — anything not tied to a specific event. "6 months" = our runway to pilots; these can slip to month 4, 5, or 6, they just need to be done before pilot hardening.
Strategic verdict (the frame for everything below)
We keep Supabase, used as bare managed Postgres. Rationale:
- Cheapest managed Postgres provider for our scale.
- SOC 2 / ISO compliant — relevant because we are about to work toward our own certifications, and the DB sitting on a compliant host helps that.
- After this restructure we use ~none of Supabase's value-add (Auth → WorkOS, Storage → R2, Realtime → Centrifugo, RLS → dropped), so the platform is interchangeable — but the migration cost of moving isn't worth paying now.
Consequence that simplifies the whole plan: because we stay on Supabase, pg_cron stays available — which is why we do not need a Go worker/scheduler (see the Go-worker note below). The Sevalla-Postgres consolidation question from the assessment is parked, not killed: revisit only if/when switching cost is near zero.
Gate 0 — capture prod ground truth (do this BEFORE any purge migration)
The schema dump we reviewed does not show triggers, functions, RLS policies, cron jobs, FK actions, or ~50 indexes, and migration 000015 admits to syncing dashboard-made changes out-of-band. So the live function/trigger/policy/cron state is unverified. Before writing a single drop migration, snapshot prod (read-only):
pg_dump --schema-only --schema=private --schema=public --schema=realtime
SELECT jobname, schedule, command FROM cron.job;
SELECT * FROM pg_policies;
SELECT tgname, tgrelid::regclass FROM pg_trigger WHERE NOT tgisinternal;
SELECT pubname, schemaname, tablename FROM pg_publication_tables;
This is the one hard prerequisite. Everything in "NOW" assumes we have this snapshot.
NOW — the current restructure
These are the migrations and code changes we do in this pass. Order matters where noted.
N1 · Purge the billing/licensing surface (one migration + backend cleanup)
Drop and remove all logic for:
org_licensesuser_licensessubscription_link(the special-link gating table)- the
check-expired-licensespg_cron job + the/webhook/license-expiredendpoint + its Vault secret +pg_netplumbing for that job - all backend code that fetches/writes the above (repositories, the AdminSignup reserve/consume dance, the hardcoded
now()+3 monthsexpiry increateorganization.goandeventsync/handlers.go) - the existing
LicenseCheckerinAuthMiddleware(pkg/middleware/auth.go:33,103-128) — it readsuser_licensesonly on the session-refresh path, so once the table is gone it has nothing to query and must come out in this same purge. It's already a weak gate (a valid access token is never re-checked until it expires); removing it now is consistent with the ungated-interim trade-off below, and the real billing gate arrives as a dedicated middleware at Event S.
Bonus: this fixes a real prod bug — today the only access gate is user_licenses.expires_at, a value copied once from org_licenses at membership creation, hardcoded to now()+3 months, with no renewal path anywhere (OrgLicenseRepository.Update has zero callers). So every org bricks 3 months after creation, and even a manual UPDATE can't restore access because the per-membership copies stay expired. Removing the broken gate is strictly better than leaving it.
⚠️ Conscious trade-off: purging these removes the only current access gate. Between now and Stripe landing, access gating is auth → (empty Stripe mw) → RBAC with no billing enforcement. That is intended (pilots are months out, no real billing yet) — but state it so nobody is surprised that orgs are ungated in the interim.
Feeds the next step — OpenAPI contracts. The OpenAPI contract definition is the keystone pulled forward right after this work. N1 removes endpoints (
/webhook/license-expired) and license-related fields from request/response payloads, so whatever surface N1 deletes must be absent from the contracts. Sequence N1 before (or alongside) the contract authoring so the contracts describe the post-purge API, not the legacy one.
N2 · Fix the room → recordings cascade (prod bug)
fk_recordings_room is ON DELETE CASCADE (000007). Deleting a room — a routine RBAC-gated endpoint — silently wipes its recordings, which resets consumed recording-minutes quota and destroys billing/audit history. Recordings are billing/quota objects → they belong to the org, not the room's lifecycle.
Fix: re-point so the org is the durable owner — nullable room_id with ON DELETE SET NULL, keeping org_id as the billing key (or RESTRICT + service-level archival). Same migration should settle recordings.initiated_by / user-deletion FK policy so deleting a user who ever recorded isn't impossible while deleting a user who started a conversation nukes other people's messages.
N3 · Create the Stripe middleware — empty, contracts only
Stand up the middleware in the gating chain now, with no logic:
auth mw → stripe mw (contract only, calls c.Next()) → rbac mw
It defines the position and the interface it will eventually enforce, but for now just passes through. Real logic arrives when Stripe lands (event S). This lets the chain shape stabilize without blocking on billing.
Event-gated work
Event R — "when the repository / DB-access layer is reworked" (the weeks 3–5 backend tidy)
The roadmap's backend-tidy phase reworks folder structure, repo pattern, and service pattern. Tied to that event:
R1 · Re-point the data-access layer at the new (post-purge) schema.
R2 · Delete the dead repo methods the review found (AddUserToOrganization, GetUserOrganizations, GetOrganizationUsers, UserBelongsToOrganization, SubscriptionLinkRepository.*, OrgLicenseRepository.Update, etc.) — less surface to port.
R3 · Centralize tenant scoping (org_id/user_id filtering) in the repository layer — this becomes the single enforcement layer once RLS is gone.
R4 · Add the Stripe scaffolding column so the model is ready: organizations.stripe_customer_id (UNIQUE, NULL until first billing touch). Heavier Stripe tables come with Event S, not here.
Event C — "when Centrifugo lands" (realtime cutover)
This event tears down the entire Supabase realtime + RLS surface and replaces it with Centrifugo. Also gated on the frontend PostgREST → Go cutover: the pieces below are load-bearing for the legacy frontend until it cuts over — drop them at cutover, not before, or you break the live frontend. If the FE has already cut over, this is safe to start.
C1 · Disable RLS + Supabase realtime machinery (migration "032"):
DISABLE ROW LEVEL SECURITYon all tables; drop the 10 policies onprivatetables before the JWT helper functions they reference (pg_dependordering).- Drop the JWT-claim helpers (
current_user_id(),current_org_id(),user_conversation_ids()) and the 9 frontend-facing RPC functions (DROP FUNCTION IF EXISTSwith historical signatures — some arg lists changed). REVOKE ALL … FROM authenticated(the roles are Supabase-managed — revoke, don't drop).- Delete
internal/database/dbsecurity/(its only purpose is verifying this surface). - Triggers — do NOT drop all six blindly (all on
chat_messages):
| Trigger | Action |
|---|---|
trg_enforce_message_defaults | Drop. Reads request.jwt.claims (PostgREST-only) → makes the Go chat-send path fail 100%. The SendMessage CTE already resolves sender/org. |
trg_check_message_rate_limit / trg_check_global_rate_limit | Rewrite to NEW.sender_id/NEW.org_id or drop + rate-limit in Go. They never fire for Go writes today → the Go path currently has zero rate limiting. |
trg_notify_new_message | Drop — replaced by backend publish (C3). |
trg_update_parent_reply_count / trg_touch_conversation | Keep, or fold into the SendMessage CTE — do not just drop. No Supabase dependency; dropping them silently loses reply counts and DM-inbox ordering. |
C2 · Remove user_presence; review room_presence:
user_presence: remove entirely — replaced by Centrifugo. Drop theclear-stale-presencecron with it. Clean up the two Go readers that use it as an org fence (user_repository.go:266,user_organization_repository.go:413) so multi-org users don't read as offline.room_presence: keep it (LiveKit-derived authoritative state; Centrifugo presence is ephemeral per-channel UX only). Two things attach:- 🐞 Broken in prod right now: a sentinel-error refactor made
RoomPresenceRepository.GetByWorkosUserIDreturnErrNotFoundon no row, butpresence_manager.HandleParticipantJoinedstill doesif err != nil { return err }and aborts before the only upsert that creates rows — so the table drains permanently empty (single-session enforcement never fires,lk_participants_connectedreads ~0). One-line fix:if err != nil && !errors.Is(err, apperr.ErrNotFound). Ship this regardless of the restructure — it needn't wait for Centrifugo. - Presence model decision (per-org
(user_id, org_id)vs global(user_id)) — settle before building Centrifugo presence. - ⚠️ Open: who reaps stale
room_presencerows? Theclear-stale-presencecron is dropped here (it was heartbeat-based, tied touser_presence),DeleteStaleEntriesis dead code, and there is no Go worker. Options: reconcile against LiveKitListParticipantson a ticker, a thin pg_cron job scoped toroom_presence, or fold cleanup into the Centrifugo presence flow.
- 🐞 Broken in prod right now: a sentinel-error refactor made
C3 · Centrifugo cutover proper:
- Drop both
realtime.messagespolicies and all 8supabase_realtimepublication entries. - Delete
internal/features/realtime/, the/api/v1/realtime/tokenroute, andSUPABASE_JWT_SECRET. - Replace with backend publish post-commit (Centrifugo server HTTP API,
idempotency_key = message public_id). Channels:personal:#<user_id>,chat:<conversation_public_id>,presence:<room_uuid>.
Event S — "when Stripe is adapted"
Billing model (decided): per-organization subscription with nominative seats — each org pays one subscription; each member occupies a named seat. Mid-cycle joins are prorated (a mid-month join ≈ half a month on the next invoice); leaves credit symmetrically (or pay-to-period-end — open decision). This model raises several findings from "later" to "billing-critical now" — they're captured below.
The full standalone design (mental model, DDL, webhook flow, seat sync, dunning, step-by-step adaptation) lives in "How to adapt Stripe" below. This section is the roadmap-level summary of what lands and in what order.
This is the event that fills the empty middleware from N3:
S1 · Replace the contract-only stripe mw with a dedicated RequireActiveSubscription middleware (not an extension of AuthMiddleware): chain is AuthMiddleware → RequireActiveSubscription → RequirePermission. Gate rule: membership row exists AND org_subscriptions.status IN ('active','trialing') + explicit past_due grace. On failure return a 402-style httperror so the FE routes to billing. Mount on the protected group minus a billingGroup (checkout, customer-portal, subscription-status, sign-out) that carries only AuthMiddleware so unpaid orgs can still pay. Optional 30–60s in-process TTL cache.
- The old refresh-only
LicenseCheckeris already gone (removed in N1 with theuser_licensespurge).RequireActiveSubscriptionis its replacement — and deliberately not an extension ofAuthMiddleware, to avoid the stale-gate pattern (refresh-only re-check) the old checker had.
S2 · Add Stripe endpoints: POST /webhooks/stripe (stripe-go SDK, signature verify → event.id dedup via stripe_event_inbox → 2xx fast → async). Handle webhooks in Go, not Supabase Edge Functions. New code homes: internal/platform/stripe/ (client wrapper), internal/features/billing/ (webhook handler, sync service, gate). Same shape as the existing LiveKit /webhook + WorkOS event_inbox pattern — Stripe is the third sibling.
S3 · DB migration adding the tables in the design section below:
org_subscriptions(status CHECK, quantity/seats, period fields from the subscription item, pinned API version).max_seatsnullable — under pure pay-per-seat the quantity is elastic; it's a plan-tier cap from Product metadata, not a hard schema fact.stripe_event_inbox(default'pending').organizations.stripe_customer_id(added as scaffolding back in Event R).user_organizationsstatus + timestamps: add a WorkOS-syncedstatus/activecolumn andcreated_at/deactivated_at, and stop hard-deleting on leave. This (plus Stripe's own subscription-item change log) is the primary seat-history mechanism. A dedicatedseat_eventsledger is deferred — add only if a concrete per-period audit need appears (see the design section).
S4 · One idempotent SyncStripeDataForOrg(customer_id) called from every webhook and the post-checkout path — never partial-apply payloads (events arrive out of order). Stripe is source of truth; DB is a thin cache; per-request gating reads only the local table.
S5 · Seat quantity sync — event-driven, single write path, with proration:
- The membership mutation (
user_organizations±, including the status flip) is committed before the Stripe call. Then update the subscription-item quantity withproration_behavior: 'create_prorations'. Stripe computes the proration — no custom math. (Decide leave-behavior explicitly:create_prorationscredits on decrease; for pay-to-period-end, decrease withproration_behavior: 'none'at the period boundary.) - Failure ordering is local-first, Stripe-second, never the reverse — billing for a member who failed to be created is worse than briefly under-billing. A reconciliation sweep (count local allocated memberships ↔ Stripe subscription-item quantity) heals any mid-flight gap.
S6 · Membership lifecycle via WorkOS activate/deactivate (decided): treat a member's active/inactive state through the WorkOS deactivate / reactivate organization-membership API rather than hard-deleting user_organizations rows. When an org's payment expires → deactivate; when they pay → reactivate. Implications to design around:
- WorkOS stays the source of truth for membership state, consistent with "WorkOS is identity-only / source of truth." Status changes arrive as
organization_membership.updatedevents through eventsync → which is exactly why the single write path (blocker #1 below) must own deactivate/reactivate too, not just create/delete. - This resolves the "no membership history" blocker (#2 below): a status flip preserves the row, so current occupancy is no longer destroyed on leave. With WorkOS-synced status + timestamps on
user_organizations, plus Stripe's own log of every subscription-item quantity change, we retain enough history without a custom ledger. (A dedicatedseat_eventstable only adds value for repeated in-period transitions — join/leave/rejoin in one cycle — which a single mutable row can't capture; defer it until that's a real need.) - ⚠️ Separate the two causes of deactivation — they affect billing oppositely:
- Member-level (member left, or an admin removed them) → reduces the billable seat quantity (prorated credit/charge per the model above).
- Org-level payment lapse → an access suspension, and it must NOT zero the billable quantity — otherwise you'd credit the org for not paying, which is backwards. Payment-lapse gating is better enforced at the
RequireActiveSubscriptionmiddleware (readingorg_subscriptions.status); if you also deactivate the memberships on lapse, record the cause so reactivation restores exactly the previously-active set and doesn't accidentally revive a member an admin had individually deactivated.
- Seat quantity = count of member-level allocated memberships (active + pending — a seat is billed once allocated to a person, whether or not they've claimed it), computed independently of org-level payment suspension (see above). This is the number synced to the Stripe subscription item.
- Replaces the legacy room_presence-based force-disconnect: on lapse, deny at
room-authtoken mint; deactivated members fail the gate.
Blockers that must be resolved BEFORE the Stripe PR:
- Membership writes are split across three paths — eventsync (WorkOS-driven), the membership feature endpoints, and admin/member signup all touch
user_organizationsindependently. With nominative seats, every one of them changes the billable quantity. They MUST be funneled through a single service method (membership mutation + Stripe quantity sync) or invoices drift. Inventory all writers first. user_organizationsmembership history. Originally a hard blocker (hard-delete, PK(user_id, org_id), no history). The WorkOS deactivate/reactivate decision above resolves it: status-flips preserve the row, so current occupancy survives a leave. Remaining work — add astatus/activecolumn synced from WorkOS membership state pluscreated_at/deactivated_at, and stop hard-deleting. No longer a blocker; the schema change ships in the billing PR. (Noseat_eventsledger needed unless per-period audit becomes a real requirement.)- eventsync cursor advances past hard-failed events — re-graded High → Critical-for-billing. WorkOS
organization_membership.created/deletedevents now drive seat quantity; a silently-skipped event means Stripe quantity diverges from reality → wrong invoices. Fix: stop advancing past poison events (or retry sweep + attempts counter) plus the periodic full reconciliation above. (This is the same item as L6's eventsync fix, but it is not "hygiene" once billing depends on it — it gates the Stripe PR.) - Seat-limit race is billing-critical now, not "mandatory later": today member-signup counts WorkOS memberships while invitation-acceptance counts local rows, both unserialized → concurrent joins exceed the seat cap. Fix: one source of truth (local
user_organizations),SELECT … FOR UPDATEon the subscription row, count, insert, commit, then create the WorkOS membership. A phantom 11th member on a 10-seat sync is an invoice dispute, not a soft overage.
How to adapt Stripe
This is the full, self-contained design for billing. It expands Event S above into something you can implement from directly. Read it top-to-bottom: mental model → data model → webhook flow → seat sync → membership lifecycle → gating → dunning → adaptation steps.
1. Mental model — who owns what
Three rules decide every detail that follows:
- Stripe is the source of truth for billing state (what plan, what status, how many seats billed, when the period ends). We never compute billing state ourselves; we cache Stripe's.
- The Go backend is the only thing that writes Postgres. Stripe never touches the database. Stripe talks to us via webhooks; we talk to Stripe via the API. The frontend never talks to Stripe for state — it reads our DB through our endpoints.
- The per-request hot path never calls Stripe. Gating reads one local cached row (
org_subscriptions). Stripe calls happen only in webhooks, at checkout, and on membership changes — never insideRequireActiveSubscription.
┌────────────────────────────────────────────────────┐
│ STRIPE │
│ Customer ─ Subscription(item: qty = seats) ─ Price │
└───────┬───────────────────────────▲────────────────┘
webhooks (signed) │ │ API calls (stripe-go)
checkout.session.completed │ │ - update item quantity
customer.subscription.* │ │ (proration_behavior)
invoice.paid / payment_failed │ │ - fetch subscription (sync)
▼ │
┌─────────────────────────────────────────────────────────┴───────────────┐
│ GO BACKEND │
│ │
│ POST /webhooks/stripe (public, signature-verified, no cookie auth) │
│ 1. webhook.ConstructEvent → verify Stripe-Signature │
│ 2. INSERT stripe_event_inbox (event_id UNIQUE → dedupe) → 200 fast │
│ 3. SyncStripeDataForOrg(customer_id) ◄── also called from the │
│ fetch fresh subscription from Stripe checkout-success path │
│ └─▶ upsert org_subscriptions (never apply the payload directly: │
│ events can arrive out of order) │
│ │
│ Membership change (join / leave / activate / deactivate — ONE path): │
│ tx { mutate user_organizations (+status flip) } → commit │
│ then Stripe: update subscription-item quantity (prorated) │
│ reconciliation sweep heals any Stripe-call failure │
│ │
│ Request gating: RequireActiveSubscription reads org_subscriptions only │
└───────────────┬──────────────────────────────────────────────────────────┘
│ GORM
▼
┌──────────────────────────────────────────────┐
│ POSTGRES │
│ org_subscriptions ← cache of Stripe truth │
│ stripe_event_inbox ← dedupe + audit │
│ user_organizations ← who occupies seats │
│ (+ WorkOS-synced status)│
└──────────────────────────────────────────────┘
This is the same shape as the two webhook integrations already in the codebase — the LiveKit /webhook endpoint and the WorkOS event_inbox poller. Stripe is the third sibling, not a new pattern. New code homes: internal/platform/stripe/ (thin client wrapper over stripe-go) and internal/features/billing/ (webhook handler + sync service + gate middleware).
Why not Supabase Edge Functions (which Stripe's own quickstarts push): those quickstarts assume a backendless app where the only server-side compute is on Supabase. We are not that — we have a Go backend that already receives webhooks. Edge Functions would add a Deno/TypeScript runtime to a Go shop, split billing logic away from the repositories that own the data, increase Supabase lock-in (against the "use Supabase as bare Postgres" verdict), and run outside our compliance boundary. Handle Stripe in Go.
2. Billing model
- Per-organization subscription. One Stripe
Customerper WorkOS org, oneSubscriptionwith a singlelicensed-type subscription item whosequantity= number of occupied seats. - Nominative seats. Who occupies each seat is our concern, not Stripe's — Stripe only bills a count. The mapping (which user holds which seat, and when) lives in our DB.
- Proration on change. A member added mid-cycle is billed for the remaining fraction of the period; Stripe computes this for us when we change the item quantity with
proration_behavior: 'create_prorations'. We never do proration math. - Numeric limits live in our DB, not Stripe. Stripe Entitlements are boolean-only, so caps like
max_seatsandmonthly_recording_minutesare columns we denormalize from Product metadata at sync time and enforce in-app.
3. Data model
All new objects; nothing here mutates an existing table except the one organizations column.
-- 3.1 Link an org to its Stripe customer. Added early (event R) as scaffolding.
ALTER TABLE private.organizations
ADD COLUMN stripe_customer_id text UNIQUE; -- NULL until the org's first billing touch
-- 3.2 The local cache of Stripe's subscription truth. One row per org.
CREATE TABLE private.org_subscriptions (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
org_id bigint NOT NULL UNIQUE REFERENCES private.organizations(id),
stripe_customer_id text NOT NULL,
stripe_subscription_id text NOT NULL UNIQUE,
stripe_price_id text NOT NULL,
status text NOT NULL
CHECK (status IN ('trialing','active','incomplete','incomplete_expired',
'past_due','canceled','unpaid','paused')),
quantity integer NOT NULL DEFAULT 1, -- seats billed
max_seats integer, -- NULLABLE: plan-tier cap, may not exist
monthly_recording_minutes integer, -- numeric entitlement, enforced in-app
cancel_at_period_end boolean NOT NULL DEFAULT false,
current_period_start timestamptz, -- read from the subscription ITEM (see §7)
current_period_end timestamptz,
trial_end timestamptz,
synced_at timestamptz NOT NULL DEFAULT now(),
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
-- 3.3 Webhook idempotency + audit. Sibling of the WorkOS event_inbox,
-- but defaults to 'pending' (a real inbox processes, then marks processed).
CREATE TABLE private.stripe_event_inbox (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
event_id text NOT NULL UNIQUE, -- Stripe event.id → dedupe key
event_type text NOT NULL,
stripe_customer_id text,
status text NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','processed','failed')),
error_message text,
event_created_at timestamptz NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
processed_at timestamptz
);
-- 3.4 Seat occupancy + history. user_organizations stays the live "who is in the org"
-- table; the change below gives it history without a separate ledger.
ALTER TABLE private.user_organizations
ADD COLUMN status text NOT NULL DEFAULT 'active' -- synced from WorkOS membership state
CHECK (status IN ('active','inactive','pending')),
ADD COLUMN deactivated_at timestamptz; -- created_at already exists / add if not
-- and: stop hard-deleting rows on leave — flip status instead.
Seat history, and why there is no seat_events table (deferred): nominative billing needs to answer "who occupied which seats during invoice period X". Two sources already cover this without a custom ledger:
user_organizationswith WorkOS-syncedstatus+created_at/deactivated_at(above) — current occupancy and the last transition per member, retained because we stop hard-deleting.- Stripe's own subscription-item history — every quantity change, with its proration, is logged on Stripe's side and is the billing source of truth anyway.
A dedicated append-only seat_events ledger would only add value for repeated in-period transitions (a member who joins, leaves, and rejoins inside one cycle — which a single mutable status/deactivated_at overwrites). That's unlikely at pilot churn, easy to add later, and annoying to keep correct if it's written-but-never-read. Defer it; add only if a concrete per-period audit or dispute need appears. If/when added, it's an append-only log (org_id, user_id with no FK CASCADE so history outlives the user, action, quantity_after, occurred_at).
4. Webhook handling
Endpoint: POST /webhooks/stripe — public (no cookie auth), authenticated by the Stripe signature instead.
- Verify the
Stripe-Signatureheader withwebhook.ConstructEventand the endpoint signing secret. Reject unverified payloads. - Dedupe + record:
INSERT INTO stripe_event_inbox (event_id, ...). TheUNIQUEonevent_idmakes redelivery a no-op (Stripe retries, and may deliver more than once). Respond200immediately — Stripe times out slow endpoints and retries, which just amplifies load. - Process async (after the 200): call the one sync function below. On success mark the inbox row
processed; on failure markfailedwitherror_messageand let the reconciliation sweep / retry handle it.
Events we care about: checkout.session.completed, customer.subscription.created|updated|deleted, invoice.paid, invoice.payment_failed.
Never apply the webhook payload directly. Stripe does not guarantee delivery order — a stale updated can arrive after a newer one. Every handler does the same thing: extract the customer id, then call the sync function, which re-fetches the current truth from Stripe.
5. The one sync function
SyncStripeDataForOrg(stripe_customer_id):
sub = stripe.Subscriptions.List(customer = stripe_customer_id, limit 1, expand items)
upsert org_subscriptions row from `sub`:
status, quantity, price_id, period start/end (from the ITEM), trial_end,
cancel_at_period_end, max_seats + monthly_recording_minutes (from Product metadata),
synced_at = now()
This is the only writer of org_subscriptions. It is called from every webhook handler and from the post-checkout success path (so the UI reflects the new subscription without waiting for the webhook). Idempotent: running it twice yields the same row. This single-writer discipline is what keeps out-of-order events from corrupting state.
6. Seat quantity sync (the nominative part)
Seat count changes whenever a membership changes. The hard rule: all membership mutation goes through one service method, because today three independent code paths touch user_organizations (the WorkOS event-sync poller, the membership feature endpoints, and admin/member signup) and every one of them now moves money. If they each call Stripe independently, invoices drift.
The single path does, in order:
- One DB transaction: mutate
user_organizations(add / remove / activate / deactivate — flipstatus, never hard-delete). Commit. - Then call Stripe to set the subscription-item quantity to the new allocated-seat count, with
proration_behavior: 'create_prorations'. - A periodic reconciliation sweep compares the local allocated-seat count against the Stripe subscription-item quantity and corrects drift.
Ordering is local-first, Stripe-second — never the reverse. If the Stripe call fails after the local commit, we have briefly under-billed and the sweep fixes it. The reverse (Stripe succeeds, local fails) would bill for a member who doesn't exist — worse.
Leave behavior is a product decision: create_prorations also credits on a decrease (member leaves mid-cycle → org gets credit). If instead leavers should pay through the period end, apply the decrease with proration_behavior: 'none' at the period boundary rather than immediately.
7. Membership lifecycle via WorkOS activate/deactivate
Member active/inactive state is driven through the WorkOS deactivate / reactivate organization-membership API, not by hard-deleting user_organizations rows. Payment lapse → deactivate; payment resumes → reactivate. WorkOS stays the source of truth for membership state and emits organization_membership.updated, which flows back through the event-sync poller — which is why the single write path (§6) must own activate/deactivate too, not just create/delete.
The one trap — two causes of deactivation that affect billing oppositely:
| Cause | Meaning | Effect on billable quantity |
|---|---|---|
| Member-level (left, or admin removed) | seat genuinely vacated | reduces quantity (prorated) |
| Org-level payment lapse | access suspended, seats still "owned" | must NOT reduce quantity — you'd credit a non-payer |
So: payment-lapse suspension is enforced at the gate (§8) reading org_subscriptions.status, not by zeroing seats. If you also deactivate the org's memberships on lapse, record the cause so that on reactivation you restore exactly the previously-active set and don't accidentally revive a member an admin had individually removed. Billable seat quantity = count of member-level allocated memberships (active + pending), computed independently of payment suspension.
pending memberships still occupy a billable seat (allocated-seat model). A seat is billed the moment it's allocated to a person — inviting a member is quantity +1 immediately (prorated), whether or not they ever log in. (Decided: the org chose to allocate the seat, so the org pays; whether the invitee claims it is their concern. The alternative — bill only on actual join/login — is a deliberate non-choice.) So pending and active both count toward quantity; only a genuinely vacated membership (left / admin-removed) decrements it.
pending memberships are not deactivatable — skip them in the lapse sweep. WorkOS rejects deactivate on a pending membership (invited, not yet accepted). When the lapse sweep iterates an org's memberships, guard with if status == pending { continue } so one un-accepted invite doesn't abort the batch. This is harmless: a pending member has no session to gate, and the seat stays billed (correct — the org allocated and paid for it). Skipping deactivation does not change quantity.
This also cleanly replaces the legacy room_presence-based force-disconnect: on lapse, deny at the LiveKit room-auth token mint — a deactivated/un-gated member simply fails the gate.
8. Where the gate lives
A dedicated middleware, not an extension of AuthMiddleware:
Request → AuthMiddleware → RequireActiveSubscription → RequirePermission → handler
(who are you?) (did the org pay?) (may you do this?)
- After auth — the gate needs the validated
org_idfrom the token; there's nothing to gate on before auth. - Before RBAC — payment status is org-level and binary; no point evaluating per-route permissions for an unpaid org.
- Its own middleware, not inside auth — the old
LicenseChecker(removed in N1) only checked on the token-refresh path, so a valid token sailed through until expiry: a stale gate. A separate middleware re-checks every request, and lets specific routes opt out by simply not mounting it. - Opt-out group (
billingGroup) carries onlyAuthMiddleware: checkout, customer-portal redirect, subscription-status read, sign-out — an unpaid org must still be able to pay and leave. - Gate rule:
status IN ('active','trialing')plus the chosenpast_duegrace policy. On failure, return a 402-style domain error viahttperrorso the frontend routes to the billing page. - Cost: one indexed read of
org_subscriptionsbyorg_idper request (theUNIQUEindex covers it). If it ever matters, add a 30–60s in-process TTL cache — webhook-driven sync keeps the table fresh, so the staleness window is bounded.
9. Lifecycle & dunning
| Stripe state | Access | Notes |
|---|---|---|
trialing | ✅ full | Pilots become real subscriptions with trial_period_days. |
active | ✅ full | Steady state. |
past_due | ✅ grace | First failed payment — don't revoke immediately; Stripe retries (Smart Retries). Keep access through the grace window. |
unpaid / canceled | ❌ revoke | Retries exhausted or cancelled. Gate denies. |
paused | ❌ (per policy) | Trial ended with no payment method. Set via trial_settings.end_behavior.missing_payment_method = 'pause'. Distinct from past_due. |
Pilots: create a real Stripe subscription with trial_period_days and trial_settings.end_behavior.missing_payment_method = 'pause'. The legacy subscription_link invite mechanism either shrinks to an invite token carrying stripe_price_id + trial length, or is replaced outright by Stripe Checkout links.
10. Stripe API gotchas to pin down
- Pin the API version by pinning the
stripe-gomodule version ingo.mod— not via a config string. The Go SDK hard-codes its API version per release and Stripe explicitly advises against overriding it in strongly-typed languages (response objects wouldn't match the SDK structs). So "pin the API version" = freeze thestripe-gomajor version. The Basil API (2025-03-31.basil) ships instripe-gov82.0.0+ (v82.0.0, 2025-04-01). Bump deliberately, read the migration guide, re-test. - Billing periods moved to the subscription item (Basil API, 2025-03-31, confirmed breaking change). Pre-Basil,
current_period_start/endlived on the Subscription object; Basil removed them from the Subscription and put them on each subscription item (a sub can have multiple items on different cadences, so the period is per-item). Read from the item — Go path:sub.Items.Data[0].CurrentPeriodStart/.CurrentPeriodEnd, notsub.CurrentPeriodStart(gone → zero value on Basil). The DDL in §3 and the sync in §5 reflect this. - Entitlements are boolean-only — numeric caps (
max_seats,monthly_recording_minutes) stay in our DB (§2), reported to Stripe only if you bill overage.
11. Adaptation steps (the order to actually do it in)
- Prereqs (the blockers in Event S) first — funnel membership writes through one service method; add the WorkOS-synced
statuscolumn + stop hard-deleting; fix the event-sync cursor so it can't skip membership events; serialize seat counting. Billing on top of these unfixed produces wrong invoices. internal/platform/stripe/— client wrapper (pinned version), customer + subscription + checkout helpers.- Migration — the schema changes in §3 (
org_subscriptions,stripe_event_inbox,organizations.stripe_customer_id,user_organizationsstatus + timestamps). internal/features/billing/— thePOST /webhooks/stripehandler (§4) andSyncStripeDataForOrg(§5).- Checkout + customer-portal endpoints; the post-checkout success path also calls the sync function.
- Wire seat-quantity sync into the single membership write path (§6) + the reconciliation sweep.
RequireActiveSubscriptionmiddleware (§8); mount on the protected group minusbillingGroup.- Backfill existing orgs into Stripe (create customers/subscriptions), then drop the legacy
user_licenses/user_organizations.license_id/org_licensesremnants if any survived N1, plus thesubscription_linkreserve/consume dance. - Dunning/lifecycle policy wiring (§9) + the
past_duegrace decision.
Loose horizon (≤ ~6 months to pilot, not event-bound, not urgent)
L1 · Understand / confirm max pool & connection values
Not urgent — can be month 4/5/6. Confirm the Supabase compute tier (→ exact direct-connection limit), test whether Sevalla can do IPv6 egress (→ Direct mode bypasses the pooler), and size MaxOpenConns × pods headroom for the eventual many-pod k8s setup. Decide the per-pod pool size for horizontal scale (likely < 15). No action forced until we actually scale out.
L2 · Connection-config hygiene (small, can be done whenever database.go is touched)
- Remove hardcoded
PreferSimpleProtocol: true; let thePOSTGRES_URLdefault_query_exec_modeparam be the single source of truth. Flip prod from the transaction pooler (:6543) to session mode (:5432) to re-enable prepared statements — test on staging first. - Set
ConnMaxLifetime ≈ 270sandConnMaxIdleTime(currently unset → latent stale-connection / EOF bug when the pooler drops idle backends ~300s). - Set
MaxIdleConns = MaxOpenConns(currently 5 vs 15 → churn). - Add
statement_timeout(~20s viaALTER ROLE/DSN — we connect aspostgres, whose default is 2 minutes),idle_in_transaction_session_timeout(~60s, currently unbounded),application_name, slow-query logging.
L3 · Index / constraint hygiene (mostly enabled by the RLS drop)
users.email: unique index onlower(email)— today it's an un-indexed seq scan and duplicate emails resolve to the oldest row (stale-user shadowing bug).chat_messages (conversation_id, id DESC)for keyset pagination; rewrite unread count tom.id > COALESCE(last_read, 0).- Composite tenant FKs — once RLS is gone these are the only DB-level cross-tenant guard (nothing currently stops
chat_messages.org_iddiffering from its conversation's org). - Partial unique for "one active recording per zone"; status
CHECKs onrecordings/event_inbox(but not onuser_organizations.role— eventsync writes WorkOS slugs verbatim). fillfactoron the churn hotspotsevent_cursorsanduser_presence(the latter dies with the presence removal at Event C).
L4 · Refresh-token rotation — review ⚠️ likely an active bug, not just a TODO
You flagged this as "not a problem because WorkOS gives us the same refresh token, worth a TODO later." The schema audit contradicts that: it found WorkOS refresh tokens are single-use, and AuthMiddleware refreshes via WorkOS but never persists refreshResp.RefreshToken (SessionRefreshToken.Set is called only in the finalize/sign-in flows, not the middleware refresh path). So user_sessions.refresh_token holds a consumed token → the session dies on the second refresh cycle, and concurrent expired-cookie requests race the single-use token. If that's right, this is a High-severity prod bug that bricks sessions, not a someday-TODO. Action: verify WorkOS's actual rotation behavior first — if tokens are single-use, fix now (persist the rotated token + SELECT … FOR UPDATE per session); if WorkOS genuinely returns the same token, downgrade to the TODO you described. Don't leave it ambiguous.
L5 · Migration-chain cleanup / portability
- Squash to a vanilla-Postgres baseline
000001(force-version existing DBs); deletesupabase_stubs.sql,dev/db-stubs.sql, the 42-file eventsync migrations fork,internal/database/migrations/old/, and the staledb_schema.md. - Create a least-privilege
qubital_approle (vs the current table-owningpostgres), reset itssearch_path, fix CLAUDE.md (it wrongly says GORM "automatic migrations" — it's golang-migrate — and references a non-existentuser_sessions.LastLoggedOrg). - Lower priority given we're staying on Supabase, but it's what makes the platform genuinely portable and removes the "third source of schema truth" maintenance tax.
L6 · Hygiene sweeps (cheap, do alongside L3/L5)
- eventsync: stop advancing the cursor past hard-failed events (the comment claims they're retried — they aren't); add a retry sweep + attempts counter.
- Retention:
event_inbox(processed AND created_at < NOW()-90d) anduser_sessions(TTL/cleanup — currently encrypted refresh tokens for abandoned sessions accumulate forever). - Concurrency fixes:
FindOrCreateDMrace (advisory lock +dm_keypartial unique),FirstOrCreate→INSERT … ON CONFLICTonusers.workos_id/user_sessions.session_id. Seat-limit serialization becomes mandatory at event S (Stripe seats). - Decide
chat_reactions/chat_attachments(provably empty, RLS-locked, no Go code): build or drop in the realtime-teardown migration.
⚠️ Go worker / schedulers — correction to "remove entirely"
Your plan says: "Go worker schedulers are not needed; our current jobs are user presence (purged by Centrifugo) and license expiry (purged by Stripe), so remove that part entirely."
There are four pg_cron jobs, not two:
| Job | Purged by Stripe/Centrifugo? | Disposition |
|---|---|---|
check-expired-licenses | ✅ yes (Stripe) | Drop in N1 |
clear-stale-presence | ✅ yes (presence removal) | Drop at Event C |
reconcile-stale-recordings | ❌ no | KEEP — without it, recordings stuck after a lost egress webhook sit in finalizing forever |
purge-expired-recordings | ❌ no | KEEP — daily retention delete of expired recordings |
The conclusion still holds — no Go worker is needed — but the reason is different and the action changes: because we're staying on Supabase, pg_cron remains available, so the two recording jobs just stay on pg_cron. Do not drop them in the purge. (A Go worker would only be needed if we left Supabase, which we've parked.) The trade is that these two jobs remain Supabase-coupled — acceptable under the "stay" verdict, and the thing to revisit first if the Sevalla question ever reopens.
Review verdict — is the roadmap fine?
Yes, the spine is right and matches the verified findings: purge billing/licensing → strip RLS/realtime → empty Stripe mw → fix recordings cascade → drop user_presence → fill Stripe later. The event-based framing fits the work cleanly. The corrections to fold in:
- Gate 0 first — snapshot prod triggers/functions/policies/cron before any drop migration. The dump is not trustworthy.
- RLS/trigger removal is gated on the frontend PostgREST cutover — not unconditionally "now." Confirm the FE has cut over.
- Don't drop triggers blindly —
trg_update_parent_reply_count+trg_touch_conversationmust be kept or folded into the CTE;trg_notify_new_messagestays until Centrifugo. - Four cron jobs, not two — keep the two recording jobs on pg_cron. (No Go worker, but for the "we stay on Supabase" reason.)
- Refresh-token rotation is probably an active prod bug, not a future TODO — verify WorkOS single-use behavior; the review says sessions die on the 2nd refresh.
room_presenceis broken in prod right now — ship the one-lineErrNotFoundfix regardless of the restructure.- Note the ungated interim — purging licenses leaves orgs with no billing gate until Stripe; intended, but state it.