Rollback & Incidents

TL;DR: Rollback is a manual process with two options: revert the code and redeploy, or ask DevOps to re-tag an older :{commit-hash} image as :latest in ECR. No formal incident management process is currently documented.

Overview

Rollback is not automated: every rollback is a manual operator action. There is no one-click rollback, no automated canary analysis, and no documented incident runbook.

Rollback options

Option A — Revert code + redeploy

  1. Open a revert PR on the affected service repo.
  2. Merge → Jenkins pipeline kicks in automatically (see jenkins-k8s-jobs.md).
  3. Deploy to the target environment.
  4. Wait for rollout to complete.

Pros: clean git history, easy to understand. Cons: takes as long as a full CI + CD cycle (minutes).
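
The Option A steps can be sketched as a small dry-run helper. This is a minimal sketch, not the team's actual tooling: the function name, the `main` branch, the `revert-<hash>` branch naming, and the deployment and namespace values are all hypothetical, and it only builds the git and kubectl command lines rather than running them. Opening and merging the PR, and the Jenkins deploy itself, still happen as described in jenkins-k8s-jobs.md.

```python
import shlex


def revert_and_wait_commands(bad_commit, deployment, namespace,
                             branch="main", timeout="5m"):
    """Build (but do not run) the commands for a revert-and-redeploy rollback.

    All argument values are placeholders; adjust them to the affected service.
    """
    revert_branch = f"revert-{bad_commit[:7]}"  # hypothetical naming scheme
    return [
        # Start from an up-to-date copy of the branch the pipeline builds from.
        ["git", "fetch", "origin", branch],
        ["git", "switch", "-c", revert_branch, f"origin/{branch}"],
        # Add a new commit that undoes the bad one; history is preserved.
        ["git", "revert", "--no-edit", bad_commit],
        # Push the branch so the revert PR can be opened and merged.
        ["git", "push", "-u", "origin", revert_branch],
        # After merge and deploy: block until the rollout finishes or times out.
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "-n", namespace, f"--timeout={timeout}"],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in revert_and_wait_commands("abc1234", "my-service", "staging"):
        print(shlex.join(cmd))
```

If the bad commit is a merge commit, `git revert` additionally needs `-m 1` (or whichever parent number is the mainline).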

Option B — DevOps re-tags an older ECR image

  1. Identify the previous known-good commit hash (visible in ECR as :{commit-hash}, see jenkins-k8s-jobs.md).
  2. Ask DevOps to re-tag that older :{commit-hash} as :latest on ECR.
  3. Run kubectl rollout restart on the deployment so that new pods pull the re-tagged image.

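Pros: faster than a code revert. Cons: leaves no git trace of the rollback and is only a temporary fix; the code still needs to be reverted afterwards, because the next build from the unreverted branch will ship the bad change again.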
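
For reference, the re-tag and restart in Option B can be sketched with the manifest-copy approach described in the AWS ECR documentation. This is a minimal sketch that only prints the shell commands: the function name and the repository, deployment, and namespace values are hypothetical placeholders, and whether DevOps uses exactly these commands is an assumption.

```python
import shlex


def retag_commands(repository, good_hash, deployment, namespace):
    """Build (but do not run) shell commands for the ECR re-tag rollback.

    All argument values are placeholders for the affected service.
    """
    repo = shlex.quote(repository)
    tag = shlex.quote(good_hash)
    deploy = shlex.quote(f"deployment/{deployment}")
    ns = shlex.quote(namespace)
    return [
        # Fetch the manifest of the known-good image by its commit-hash tag.
        "MANIFEST=$(aws ecr batch-get-image"
        f" --repository-name {repo}"
        f" --image-ids imageTag={tag}"
        " --query 'images[].imageManifest' --output text)",
        # Push the same manifest back under the tag "latest".
        f"aws ecr put-image --repository-name {repo}"
        ' --image-tag latest --image-manifest "$MANIFEST"',
        # Recreate the pods so they pull the image now tagged "latest".
        f"kubectl rollout restart {deploy} -n {ns}",
        # Wait for the restart to finish.
        f"kubectl rollout status {deploy} -n {ns}",
    ]


if __name__ == "__main__":
    for line in retag_commands("my-service", "abc1234", "my-service", "prod"):
        print(line)
```

Two caveats apply to this sketch. Moving an existing tag with `put-image` only works when the repository allows mutable tags. The restart only picks up the re-tagged image if the deployment references :latest and pulls on every start, which is the Kubernetes default pull policy (Always) for :latest when none is set explicitly.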

Use Option B for urgent prod incidents

If prod is down right now and CI/CD will take too long, escalate to DevOps for an ECR re-tag. Follow up with a proper revert PR within the hour.

Incident management

No formal process is currently documented

A formal incident management process has not been documented. Alerts are configured in Logz.io (see observability.md); no runbook, on-call rotation, or post-mortem template is published in the config repo.

Current practice:

  • Detection: Logz.io alerts → Slack (see observability.md).
  • Response: handled by whichever team member is available.
  • Communication: Slack threads.
  • Post-mortem: not formalized.

Tags

#rollback #incidents #on-call #ecr #oncall