Rollback & Incidents

TL;DR: Rollback is a manual process with two options: revert the code and redeploy, or ask DevOps to re-tag an older :{commit-hash} image as :latest in ECR. No formal incident management process is currently documented.

Overview

Rollback is not automated: every rollback is a manual operator action. There is no one-click rollback, no automated canary analysis, and no documented incident runbook.

Rollback options

Option A: Revert code + redeploy

1. Open a revert PR on the affected service repo.
2. Merge it; the Jenkins pipeline starts automatically (see jenkins-k8s-jobs.md).
3. Deploy to the target environment.
4. Wait for the rollout to complete.

Pros: clean git history, easy to understand. Cons: takes as long as a full CI + CD cycle (minutes).
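
As a rough illustration of Option A, the sketch below builds the git commands for a revert branch and the kubectl command that waits for the rollout. It is a minimal sketch, not the team's tooling: the commit hash, branch name, remote name (origin), deployment name, namespace, and the 300s timeout are all hypothetical, and opening and merging the PR still happens in the repo host's UI. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def revert_branch_commands(bad_commit, branch, is_merge=False):
    """Build the git commands that prepare a revert branch for a PR."""
    revert = ["git", "revert", "--no-edit"]
    if is_merge:
        # Reverting a merge commit requires -m to pick the mainline parent.
        revert += ["-m", "1"]
    revert.append(bad_commit)
    return [
        ["git", "switch", "-c", branch],          # new branch for the revert PR
        revert,                                   # commit that undoes bad_commit
        ["git", "push", "-u", "origin", branch],  # assumes the remote is "origin"
    ]


def rollout_status_command(deployment, namespace, timeout="300s"):
    """Build the kubectl command that blocks until the rollout finishes."""
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "-n", namespace, f"--timeout={timeout}",
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    run_all(revert_branch_commands("abc1234", "revert-abc1234"))
    # After the PR is merged and the pipeline has deployed:
    run_all([rollout_status_command("my-service", "prod")])
```

Keeping the command builders separate from run_all makes it possible to review the exact commands before anything touches the repo or the cluster.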

Option B: DevOps re-tags an older ECR image

1. Identify the last known-good commit hash (visible in ECR as the :{commit-hash} tag; see jenkins-k8s-jobs.md).
2. Ask DevOps to re-tag that :{commit-hash} image as :latest in ECR.
3. Run kubectl rollout restart on the deployment so that new pods pull the re-tagged image.

Pros: faster than a code revert. Cons: leaves no git trace of the rollback, and it is only a temporary fix; the code still has to be reverted afterwards.
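
For orientation, here is what the re-tag and restart might look like on the DevOps side. This is a sketch under assumptions, not the documented procedure: it assumes the AWS CLI and kubectl are installed and authenticated, and the repository, commit hash, deployment, and namespace names are hypothetical. It follows the AWS CLI pattern of fetching the image manifest with batch-get-image and pushing it back under a new tag with put-image. By default it runs nothing and uses a placeholder manifest.

```python
import subprocess


def get_manifest_command(repository, commit_hash):
    """Build the AWS CLI call that fetches the manifest of :{commit-hash}."""
    return [
        "aws", "ecr", "batch-get-image",
        "--repository-name", repository,
        "--image-ids", f"imageTag={commit_hash}",
        "--query", "images[].imageManifest",
        "--output", "text",
    ]


def put_latest_command(repository, manifest):
    """Build the AWS CLI call that attaches :latest to that manifest."""
    return [
        "aws", "ecr", "put-image",
        "--repository-name", repository,
        "--image-tag", "latest",
        "--image-manifest", manifest,
    ]


def rollout_restart_command(deployment, namespace):
    """Build the kubectl call that restarts the deployment's pods."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def rollback_via_retag(repository, commit_hash, deployment, namespace, dry_run=True):
    """Re-tag :{commit-hash} as :latest, then restart the deployment."""
    get_cmd = get_manifest_command(repository, commit_hash)
    print("+", " ".join(get_cmd[:3]), "...")
    if dry_run:
        manifest = "<manifest>"  # placeholder; nothing is fetched in a dry run
    else:
        result = subprocess.run(get_cmd, check=True, capture_output=True, text=True)
        manifest = result.stdout.rstrip("\n")
    put_cmd = put_latest_command(repository, manifest)
    restart_cmd = rollout_restart_command(deployment, namespace)
    for cmd in (put_cmd, restart_cmd):
        print("+", " ".join(cmd[:3]), "...")
        if not dry_run:
            subprocess.run(cmd, check=True)
    return [get_cmd, put_cmd, restart_cmd]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    rollback_via_retag("my-service", "abc1234", "my-service", "prod")
```

Two caveats are worth checking before relying on this. The restart only picks up the re-tagged image if pods pull on start, which is the Kubernetes default for a :latest tag when no imagePullPolicy is set. The put-image call is also expected to be rejected if the repository has tag immutability enabled.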

Use Option B for urgent prod incidents

If prod is down right now and CI/CD will take too long, escalate to DevOps for an ECR re-tag. Follow up with a proper revert PR within the hour.

Incident management

No formal process is currently documented

A formal incident management process has not been documented. Alerts are configured in Logz.io (see observability.md); no runbook, on-call rotation, or post-mortem template is published in the config repo.

Current practice:

Detection: Logz.io alerts → Slack (see observability.md).
Response: handled by whichever team member is available.
Communication: Slack threads.
Post-mortem: not formalized.

Tags

#rollback #incidents #on-call #ecr #oncall