Runbooks
Step-by-step operational procedures for handling incidents, performing maintenance, and recovering from failures. Runbooks are written for engineers who are under pressure and may not have full context, so each one should be executable without reading reference docs first.
What belongs here
- Incident response procedures (service down, high error rate, database issue)
- Maintenance procedures (rotating secrets, deploying a hotfix, scaling a service)
- Recovery procedures (restoring from backup, re-running a failed job)
What does not belong here
- Background explanation of how a system works → ../reference/
- How to do a development task → ../how-to/
- Why a decision was made → ../adr/
Format conventions
Each runbook should follow this structure:
- Symptoms — how you know you need this runbook
- Impact — what is broken and who is affected
- Steps — numbered, imperative, executable commands and actions
- Verification — how to confirm the issue is resolved
- Escalation — who to contact if the runbook does not resolve it
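As an illustration only, the structure above could be checked mechanically before a runbook is merged. The sketch below is not part of any existing tooling in this repository. The function name `missing_sections` and the heading convention it assumes (a line holding only the section name, optionally prefixed with Markdown `#` markers) are hypothetical and should be adapted to however runbooks are actually written.

```python
import re

# Section names taken from the format conventions above.
REQUIRED_SECTIONS = ["Symptoms", "Impact", "Steps", "Verification", "Escalation"]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section names that never appear as a heading."""
    # Assumption: a heading is a line consisting only of one word,
    # optionally preceded by Markdown '#' markers (e.g. "## Steps").
    found = set()
    for line in runbook_text.splitlines():
        match = re.fullmatch(r"#*\s*(\w+)\s*", line)
        if match:
            found.add(match.group(1))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical draft runbook that omits two sections.
draft = """Symptoms
API returns 503 for all requests.
Steps
1. Restart the service.
Verification
Health check returns 200.
"""

print(missing_sections(draft))  # ['Impact', 'Escalation']
```

A check like this only confirms that the headings exist. It says nothing about whether the steps are actually executable, which still needs a human review.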
Phase 0 note: individual runbooks to be created during baseline generation.