Runbooks

Step-by-step operational procedures for handling incidents, performing maintenance, and recovering from failures. Runbooks are written for engineers who are under pressure and may not have full context — they should be executable without having to read reference docs first.

What belongs here

  • Incident response procedures (service down, high error rate, database issue)
  • Maintenance procedures (rotating secrets, deploying a hotfix, scaling a service)
  • Recovery procedures (restoring from backup, re-running a failed job)

What does not belong here

  • Background explanation of how a system works → ../reference/
  • How to do a development task → ../how-to/
  • Why a decision was made → ../adr/

Format conventions

Each runbook should follow this structure:

  1. Symptoms — how you know you need this runbook
  2. Impact — what is broken and who is affected
  3. Steps — numbered, imperative, executable commands and actions
  4. Verification — how to confirm the issue is resolved
  5. Escalation — who to contact if the runbook does not resolve it
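
As a rough illustration of the structure above, the following sketch checks that a runbook's text contains the five sections in the listed order. It assumes each section name appears on its own line as a heading, optionally prefixed with `#` markers. That heading convention, and the names `REQUIRED_SECTIONS` and `check_runbook_structure`, are hypothetical and not part of any existing tooling for this repository.

```python
# Section names taken from the format conventions list, in the required order.
REQUIRED_SECTIONS = ["Symptoms", "Impact", "Steps", "Verification", "Escalation"]


def check_runbook_structure(runbook_text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the runbook conforms.

    Assumption: a section heading is a line consisting only of the section
    name, optionally preceded by '#' characters (e.g. "## Symptoms").
    """
    # Collect recognised headings in the order of their first appearance.
    seen: list[str] = []
    for line in runbook_text.splitlines():
        name = line.strip().lstrip("#").strip()
        if name in REQUIRED_SECTIONS and name not in seen:
            seen.append(name)

    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in seen]

    # The sections that are present must appear in the same relative order
    # as in REQUIRED_SECTIONS.
    expected_order = [s for s in REQUIRED_SECTIONS if s in seen]
    if seen != expected_order:
        problems.append("sections out of order")

    return problems


if __name__ == "__main__":
    example = "\n".join(f"## {s}\n(placeholder text)" for s in REQUIRED_SECTIONS)
    print(check_runbook_structure(example))  # an empty list for a conforming runbook
```

A check like this could run in review or CI so that a runbook missing its Verification or Escalation section is caught before someone needs it during an incident. Whether to enforce it automatically is a team decision, not something this page prescribes.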

Phase 0 note: individual runbooks will be created during baseline generation.