
Runbooks

Step-by-step operational procedures for handling incidents, performing maintenance, and recovering from failures. Runbooks are written for engineers who are under pressure and may not have full context, so each one should be executable without reading reference docs first.

What belongs here

Incident response procedures (service down, high error rate, database issue)
Maintenance procedures (rotating secrets, deploying a hotfix, scaling a service)
Recovery procedures (restoring from backup, re-running a failed job)

What does not belong here

Background explanation of how a system works → ../reference/
How to do a development task → ../how-to/
Why a decision was made → ../adr/

Format conventions

Each runbook should follow this structure:

Symptoms — how you know you need this runbook
Impact — what is broken and who is affected
Steps — numbered, imperative, executable commands and actions
Verification — how to confirm the issue is resolved
Escalation — who to contact if the runbook does not resolve it
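
The structure above can be checked mechanically. The following is a minimal sketch, not an established tool in this repository: it assumes each runbook is a text or Markdown file in which every section starts with a heading line containing only the section name (optionally prefixed with `#`). The names `REQUIRED_SECTIONS`, `find_sections`, and `check_runbook` are hypothetical.

```python
import re

# Section names taken from the structure listed above, in the expected order.
REQUIRED_SECTIONS = ["Symptoms", "Impact", "Steps", "Verification", "Escalation"]


def find_sections(text: str) -> list[str]:
    """Return the required section names found as headings, in document order."""
    found = []
    for line in text.splitlines():
        # Assumption: a heading is a line holding only the section name,
        # optionally prefixed with Markdown '#' characters.
        match = re.fullmatch(r"\s*#*\s*(\w+)\s*", line)
        if match and match.group(1) in REQUIRED_SECTIONS:
            found.append(match.group(1))
    return found


def check_runbook(text: str) -> list[str]:
    """Return a list of problems; an empty list means the structure looks right."""
    found = find_sections(text)
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in found
    ]
    # Only report ordering once every section is present.
    if not problems and found != REQUIRED_SECTIONS:
        problems.append("sections are out of order or duplicated")
    return problems
```

A check like this could run in review or CI so that a runbook missing its Verification or Escalation section is caught before someone needs it during an incident. It only validates the headings, not whether the steps themselves are executable.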

Phase 0 note: individual runbooks will be created during baseline generation.
