runbook
A runbook is a documented, step-by-step procedure for carrying out a specific operational task or resolving a known production issue, written so that anyone on the team can follow it without improvising. It turns the knowledge in one engineer’s head into a checklist the whole rotation can run, most often triggered by an alert from the team’s observability stack.
A runbook hangs off a single trigger, such as that alert or a routine maintenance window, and reads as an ordered list of checks and actions:
RUNBOOK: API latency alert (p99 over 500ms)
1. Open the latency dashboard and confirm the alert is real.
2. Check the release channel for a deploy in the last 30 minutes.
3. If a deploy lines up, roll it back and re-check p99.
4. If not, inspect the database connection pool for saturation.
5. Escalate to the database on-call if the pool is exhausted.
6. Once p99 recovers, record the cause in the incident channel.
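The same shape, one trigger plus an ordered list of steps, can be sketched as plain data. This is a minimal illustration only: the `Runbook` class and its `render()` method are hypothetical names, not part of any runbook tool, and only the first two steps of the example above are included for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical container: one trigger and an ordered list of steps."""

    trigger: str
    steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Number the steps from 1 so the output reads like a checklist.
        lines = [f"RUNBOOK: {self.trigger}"]
        lines += [f"{n}. {step}" for n, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


latency = Runbook(
    trigger="API latency alert (p99 over 500ms)",
    steps=[
        "Open the latency dashboard and confirm the alert is real.",
        "Check the release channel for a deploy in the last 30 minutes.",
    ],
)
print(latency.render())
```

Keeping the steps in a list preserves their order, which matters because each check in a runbook usually depends on the outcome of the one before it.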
How It Shows Up in Practice
A Python developer meets runbooks on a first on-call shift, when an alert fires and its notification links straight to the page that explains what to do about it. The runbook itself lives wherever the team keeps operational docs, such as a docs/runbooks/ folder in the service repository, a shared wiki, or a file attached to the alert in tools like PagerDuty or Opsgenie.
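As a rough sketch of how an alert can point at its runbook, the helper below maps an alert name to a file under docs/runbooks/. The one-file-per-alert layout, the Markdown extension, the slug rule, and the `runbook_path()` name are all assumptions made for illustration. Real alerting tools have their own ways of attaching a link.

```python
from pathlib import PurePosixPath

# Assumed layout: one Markdown file per alert under docs/runbooks/.
RUNBOOK_DIR = PurePosixPath("docs/runbooks")


def runbook_path(alert_name: str) -> PurePosixPath:
    """Map an alert name to the file that documents its response."""
    # Lowercase the name and join its words with hyphens to form a slug.
    slug = "-".join(alert_name.lower().split())
    return RUNBOOK_DIR / f"{slug}.md"


print(runbook_path("API Latency High"))  # docs/runbooks/api-latency-high.md
```

A predictable naming rule like this lets a team check that every alert has a matching page, for example in a test that fails when the file is missing.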
Google’s site reliability engineering practice pairs every alert with a matching entry and reports that writing the response down ahead of time yields roughly a threefold improvement in mean time to repair over improvising under pressure.
Runbooks fall on a spectrum from manual to automated. A manual runbook is prose a human reads and executes one line at a time, while an automated runbook encodes the same steps as code a machine can run, on platforms such as AWS Systems Manager, Azure Automation, or Rundeck. Even a plain function captures the shape of one automated step, a decision followed by an action:
def remediate(queue_depth, threshold=1000):
    if queue_depth < threshold:
        return "ok: queue within limits"
    return f"action: queue at {queue_depth}, scaling out a worker"


for depth in (120, 1500):
    print(remediate(depth))
ok: queue within limits
action: queue at 1500, scaling out a worker
The verification steps a runbook ends with, such as a smoke test against the recovered service, are usually the first parts a team automates.
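A smoke test of that kind might look like the sketch below, which treats an HTTP 200 response as "recovered." The health-check URL, the `http_status()` and `smoke_test()` names, and the 200-only success rule are assumptions for illustration, and a real service may need a richer check. The `fetch` parameter exists so the check can be exercised without a network call.

```python
import urllib.request
from urllib.error import URLError


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_test(url: str, fetch=http_status) -> bool:
    """Report whether the service answers with HTTP 200."""
    try:
        return fetch(url) == 200
    except (URLError, OSError):
        # Connection failures, timeouts, and HTTP error statuses
        # all count as "not recovered".
        return False


# Hypothetical endpoint, with a stub in place of a real request.
print(smoke_test("https://api.example.com/health", fetch=lambda url: 200))
```

In a real run, dropping the `fetch` argument makes `smoke_test()` use `http_status()` and contact the service directly.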
Runbook vs. Playbook
Teams often use the terms runbook and playbook interchangeably, but the two usually split along scope. A runbook is tactical and narrow, covering the exact steps for one task, such as rotating a credential or clearing a stuck queue.
A playbook is strategic and broad, describing how a team coordinates a whole class of situation, like declaring an incident, assigning a commander, and keeping customers informed. A playbook typically points at several runbooks for the hands-on work.
The naming isn’t universal. Google’s SRE books call the per-alert tactical guide a playbook, the reverse of the split above, so the safer move is to read how a given team defines the word rather than assume. When the procedure a runbook points to is a one-line emergency fix, the steps for shipping that fix belong to the hotfix process.
Related Resources
Tutorial
Logging in Python
If you use Python's print() function to get information about the flow of your programs, logging is the natural next step. Create your first logs and curate them to grow with your projects.
For additional information on related topics, take a look at the following resources:
- Continuous Integration With Python: An Introduction (Tutorial)
- Continuous Integration and Deployment for Python With GitHub Actions (Tutorial)
- Build Robust Continuous Integration With Docker and Friends (Tutorial)
- Add Logging and Notification Messages to Flask Web Projects (Tutorial)
- Continuous Integration With Python (Course)
- Logging Inside Python (Course)
- Logging in Python (Quiz)
- Python Continuous Integration and Deployment Using GitHub Actions (Course)
- GitHub Actions for Python (Quiz)
By Martin Breuss • Updated June 22, 2026