Incident response runbook.

What to check, in what order, when something is wrong. Skim it once now so you remember where to look at 3am.

Status page reads "incident"

Your dashboard or our public status page shows a monitor as down or degraded.

Open the affected monitor's detail page. Look at "Recent checks" — are all recent checks failing the same way? Same status code? Same error?
If so, the issue is likely on the target service, not on our checking. Verify the URL responds correctly from your own machine.
If verification from your own machine succeeds, your service may be partitioned away from our checking infrastructure. Try reaching it from a different network.
Acknowledge the incident in the dashboard so other team members know it's being looked at. Post an update with what you've found.
Resolve the incident when the service is back. The next check tick will mark the monitor as up automatically.
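
The check from your own machine can be scripted so it runs the same way every time. The sketch below is a minimal example that assumes curl is installed. The helper names classify_http_code and probe_url are hypothetical, and the rule that any 2xx or 3xx counts as "up" is an assumption, so adjust it to the status your monitor actually expects.

```shell
#!/bin/sh
# Classify an HTTP status code roughly the way a basic uptime check might.
# Assumption: 2xx/3xx means "up"; curl reports 000 when it got no response.
classify_http_code() {
  case "$1" in
    000)     echo "no response" ;;
    2??|3??) echo "up" ;;
    *)       echo "down ($1)" ;;
  esac
}

# Probe a URL once with a 10 second limit and print the classification.
probe_url() {
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" 2>/dev/null) || true
  classify_http_code "${code:-000}"
}
```

Run probe_url with the monitored URL from your machine, then again from a different network such as a phone hotspot. If both report "up" while the monitor stays down, a partition between your service and the checking infrastructure becomes the more likely explanation.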

You have an alert webhook / Slack / Discord configured but no message arrived for a known-down monitor.

Open the monitor → "Send test alert to all configured channels". Each channel reports pass or fail with the exact error.
If a channel is "not configured", check the monitor's edit page — the URL field may be blank.
If a channel reports "failed: HTTP 401" (or similar), the URL or token is wrong or revoked. Re-copy it from the upstream provider (Slack, Discord, etc.) and save.
If every channel is configured and the test alert passes but real alerts still don't arrive, check the monitor's alert threshold. The default is 1; if you've raised it, the monitor must fail that many consecutive checks before it alerts.
Check our public status page — if our alert pipeline itself is degraded, you'll see it there.
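
To see why a raised threshold can delay or suppress an alert, it helps to replay a sequence of check results by hand. The function below is only a model of the consecutive-failure rule described above, not the service's implementation; first_alert_index and the ok/fail labels are hypothetical.

```shell
#!/bin/sh
# Print the 1-based position of the check at which an alert would first fire,
# or "never" if the failure streak never reaches the threshold.
# Usage: first_alert_index THRESHOLD RESULT...   (each RESULT is "ok" or "fail")
first_alert_index() {
  threshold=$1
  shift
  streak=0
  i=0
  for result in "$@"; do
    i=$((i + 1))
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any passing check resets the streak
    fi
    if [ "$streak" -ge "$threshold" ]; then
      echo "$i"
      return 0
    fi
  done
  echo "never"
}
```

For example, first_alert_index 3 fail fail ok fail prints "never": a flapping service that recovers every third check would not alert at threshold 3, although it would at the default of 1.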

Your cron job runs but the heartbeat monitor still goes down intermittently.

Check the timing: if your job takes longer than intervalSec + heartbeatGraceSec, its heartbeat arrives after the deadline and the run is recorded as a miss. Increase the grace seconds on the monitor.
Check the receive URL is correct — log it from your job and compare to the dashboard.
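Verify the POST returns 204. curl -fsS -X POST <url> exits non-zero on a 4xx or 5xx response; add -o /dev/null -w '%{http_code}\n' to print the status code itself. A 4xx means the token has been rotated or is invalid.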
Heartbeat URL leaked? Open the monitor → "Rotate". The old URL is invalidated immediately.
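
The timing and status checks above can be folded into a small wrapper around the cron job. This is a sketch under assumptions: the function names are hypothetical, the interval and grace values must be copied from the monitor's settings, and skipping the heartbeat when the job fails is a design choice rather than documented behaviour.

```shell
#!/bin/sh
# Succeeds when a run of $1 seconds exceeds interval ($2) + grace ($3).
exceeds_window() {
  [ "$1" -gt $(( $2 + $3 )) ]
}

# Map the heartbeat POST status to the interpretation used in this runbook.
explain_heartbeat_status() {
  case "$1" in
    204) echo "ok" ;;
    4??) echo "token rotated or invalid" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage: run_with_heartbeat URL INTERVAL_SEC GRACE_SEC command [args...]
run_with_heartbeat() {
  heartbeat_url=$1
  interval_sec=$2
  grace_sec=$3
  shift 3
  started=$(date +%s)
  "$@" || return $?   # assumption: no heartbeat is sent when the job fails
  elapsed=$(( $(date +%s) - started ))
  if exceeds_window "$elapsed" "$interval_sec" "$grace_sec"; then
    echo "warning: job took ${elapsed}s, longer than interval + grace" >&2
  fi
  code=$(curl -sS -o /dev/null -w '%{http_code}' -X POST "$heartbeat_url") || true
  echo "heartbeat: $(explain_heartbeat_status "${code:-000}")" >&2
}
```

A crontab entry could then call run_with_heartbeat with the monitor's URL, its interval and grace in seconds, and the job command. The warning and status lines go to stderr, so they show up wherever cron output is already collected.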

You ran docker compose up -d and one or more services are crashlooping.

docker compose logs <service> — read the last error.
The most common cause: missing or weak env vars. JWT_SECRET must be ≥64 chars, ENCRYPTION_KEY must be ≥32 chars. Check your .env.
If the API can't reach Postgres: check Postgres is healthy with docker compose ps. The API depends on it.
If migrations haven't run, the API will crash on first DB call. Run docker compose run --rm --entrypoint "" -w /app/packages/db api node dist/migrate.js.
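
Since weak secrets are the most common cause, a quick length check before starting the stack can save a crashloop. The sketch below applies only the two minimum lengths stated above; check_secret_lengths is a hypothetical helper, and it assumes you pass the values in yourself rather than parsing .env.

```shell
#!/bin/sh
# Print one line per problem; exit status is 0 only when both values pass.
# Usage: check_secret_lengths "$JWT_SECRET" "$ENCRYPTION_KEY"
check_secret_lengths() {
  jwt_secret=$1
  encryption_key=$2
  rc=0
  if [ "${#jwt_secret}" -lt 64 ]; then
    echo "JWT_SECRET too short: ${#jwt_secret} chars, need >= 64"
    rc=1
  fi
  if [ "${#encryption_key}" -lt 32 ]; then
    echo "ENCRYPTION_KEY too short: ${#encryption_key} chars, need >= 32"
    rc=1
  fi
  return "$rc"
}
```

If your .env happens to be plain KEY=value lines that a shell can source, running set -a; . ./.env; set +a first makes both variables available to pass in. That is an assumption about your file, since Compose accepts some .env syntax that a shell does not.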

The check-results table grows over time. Check disk usage with docker system df -v.
The 90-day TTL on the time-series table prunes automatically. If you need it shorter, edit the TTL in the migration and re-apply.
Log volume can also grow. Loki retention can be tightened in infra/loki/config.yaml.

Treat a leaked credential as a security incident. Time matters.

If a personal access token leaked: Settings → API tokens → Revoke. Anything using it stops working immediately.
If your webhook signing secret leaked: Settings → Webhook signing secret → Rotate. Future webhooks are signed with the new secret, so receivers must update.
If a heartbeat URL leaked: open the monitor → Rotate. The old URL is invalidated immediately.
If your password leaked: Settings → Change password. You must enter your current password to confirm the change.
If you suspect the entire DB is compromised: rotate ENCRYPTION_KEY (re-encrypts headers on next monitor PATCH), rotate JWT_SECRET (invalidates all sessions), and rotate every PAT.
Email [email protected] if you suspect the issue is in our software.
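
When rotating JWT_SECRET or ENCRYPTION_KEY on a self-hosted install, the replacement values need to be random and long enough. The sketch below generates hex strings from /dev/urandom; random_hex is a hypothetical helper, and it is an assumption that both variables accept arbitrary hex text, since the only requirement stated in this runbook is the minimum length.

```shell
#!/bin/sh
# Print N random bytes as lowercase hex (2 characters per byte).
# Usage: random_hex N_BYTES
random_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# 32 random bytes give 64 hex characters, which meets both minimum lengths.
# Example (prints new values; paste them into .env yourself):
#   echo "JWT_SECRET=$(random_hex 32)"
#   echo "ENCRYPTION_KEY=$(random_hex 32)"
```

Printing the values instead of editing .env in place keeps the old secrets visible until you have confirmed the new ones work.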

GitHub Issues — bugs, feature requests, public discussion
[email protected] — hosted-account questions
[email protected] — vulnerability disclosures (PGP preferred; key available on request)