Rollback & Incidents¶
TL;DR: Rollback is a manual process. Two options: revert code + redeploy, or ask DevOps to re-tag an older :{commit-hash} image as :latest in ECR. A formal incident management process is not currently documented.
Overview¶
Rollback is not automated: every rollback is a manual operator action. There is no one-click rollback, no automated canary analysis, and no documented incident runbook.
Rollback options¶
Option A — Revert code + redeploy¶
- Open a revert PR on the affected service repo.
- Merge → Jenkins pipeline kicks in automatically (see jenkins-k8s-jobs.md).
- Deploy to the target environment.
- Wait for rollout to complete.
Pros: clean git history, easy to understand. Cons: takes as long as a full CI + CD cycle (minutes).
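The first step of Option A can be sketched as follows. This is a minimal illustration, not the team's documented procedure: the base branch name "main", the remote name "origin", the branch naming scheme revert/<sha>, and the helper revert_commands are all assumptions. The function only builds the git commands so they can be reviewed before anyone runs them; opening the PR itself is left to the Git hosting UI.

```python
from typing import List


def revert_commands(bad_commit: str, base_branch: str = "main") -> List[List[str]]:
    """Build the git commands that prepare a revert branch for a PR.

    Assumptions (not from the docs): the remote is "origin", the base
    branch defaults to "main", and the bad change is a single non-merge commit.
    """
    branch = f"revert/{bad_commit[:8]}"  # hypothetical naming scheme
    return [
        ["git", "fetch", "origin", base_branch],
        # Start the revert branch from the current tip of the base branch.
        ["git", "switch", "-c", branch, f"origin/{base_branch}"],
        # Create a new commit that undoes the bad one; history stays intact.
        ["git", "revert", "--no-edit", bad_commit],
        # Publish the branch so a revert PR can be opened from it.
        ["git", "push", "--set-upstream", "origin", branch],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in revert_commands("abc1234def"):
        print(" ".join(argv))
```

If the bad change landed as a merge commit, git revert additionally needs -m 1 to select the mainline parent; the sketch above does not handle that case.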
Option B — DevOps re-tags an older ECR image¶
- Identify the previous known-good commit hash (visible in ECR as :{commit-hash}, see jenkins-k8s-jobs.md).
- Ask DevOps to re-tag that older :{commit-hash} as :latest on ECR.
- Run kubectl rollout restart on the deployment to pull the re-tagged image.
Pros: faster than a code revert. Cons: leaves no git trace of the rollback and is only a temporary fix; the code still needs to be reverted afterwards.
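The re-tag and restart steps of Option B could look roughly like the sketch below, which drives the AWS CLI and kubectl from Python. It assumes both tools are installed and authenticated. The helper names (run_cli, retag_as_latest, restart_and_wait), the 5-minute timeout, and any repository, deployment, or namespace names are hypothetical, and the actual DevOps procedure may differ. The re-tag uses the common ECR pattern of fetching the image manifest for the old tag and pushing it again under the new tag.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_cli(argv: List[str]) -> str:
    """Run a command, raise on a non-zero exit code, and return its stdout."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def retag_as_latest(repository: str, commit_hash: str, run: Runner = run_cli) -> None:
    """Point :latest at the image currently tagged :{commit_hash}."""
    # Fetch the manifest of the known-good image.
    manifest = run([
        "aws", "ecr", "batch-get-image",
        "--repository-name", repository,
        "--image-ids", f"imageTag={commit_hash}",
        "--query", "images[0].imageManifest",
        "--output", "text",
    ]).rstrip("\n")
    # The CLI prints "None" in text mode when the query matches nothing.
    if not manifest or manifest == "None":
        raise ValueError(f"no image tagged {commit_hash} in {repository}")
    # Push the same manifest back under the :latest tag.
    run([
        "aws", "ecr", "put-image",
        "--repository-name", repository,
        "--image-tag", "latest",
        "--image-manifest", manifest,
    ])


def restart_and_wait(deployment: str, namespace: str, run: Runner = run_cli) -> None:
    """Restart the deployment and block until the rollout finishes."""
    target = f"deployment/{deployment}"
    run(["kubectl", "rollout", "restart", target, "--namespace", namespace])
    # Raises via run_cli if the rollout does not complete within the timeout.
    run(["kubectl", "rollout", "status", target, "--namespace", namespace, "--timeout=5m"])
```

A call might look like retag_as_latest("my-service", "abc1234") followed by restart_and_wait("my-service", "production"), where both names are placeholders. The restart only picks up the re-tagged image if the pods actually pull on start, which is the Kubernetes default for :latest when imagePullPolicy is not set explicitly.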
Use Option B for urgent prod incidents
If prod is down right now and CI/CD will take too long, escalate to DevOps for an ECR re-tag. Follow up with a proper revert PR within the hour.
Incident management¶
No formal process is currently documented
A formal incident management process has not been documented. Alerts are configured in Logz.io (see observability.md); no runbook, on-call rotation, or post-mortem template is published in the config repo.
Current practice:
- Detection: Logz.io alerts → Slack (see observability.md).
- Response: handled by whichever team member is available.
- Communication: Slack threads.
- Post-mortem: not formalized.
See also¶
- Jenkins K8s Jobs — Understand the CD flow you may need to trigger.
- Observability Stack — How errors surface in the first place (Logz.io + Slack).