Checkout Meltdown

Tags: advanced, Multiple Failures, BOSS

Incident Report

INCIDENT: Checkout Meltdown (Boss Room)

Severity: P1 - Revenue Impact
Reported: 14:32 UTC
Status: OPEN - Awaiting remediation

Incident Summary

Customers cannot complete purchases. The checkout API was deployed 15 minutes ago and has been returning errors ever since. Cart abandonment is spiking.

Initial Report

"We deployed the new checkout-api and it shows as running in the dashboard, but customers are getting 503 errors. I checked and the pods are up. No idea what's happening." — On-call engineer

What We Know

The checkout-api deployment was applied successfully
Pods show status Running in kubectl get pods
The checkout-service Service exists
Customers hitting the service get 503 Service Unavailable
Multiple teams have looked at this and fixed "their part" but it's still broken

Triage Checklist

Start your investigation here:

# 1. Get overall status
kubectl get all -n escape-boss-checkout-meltdown

# 2. Check pod readiness (READY column)
kubectl get pods -n escape-boss-checkout-meltdown

# 3. Check service endpoints
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown

# 4. Check events for clues
kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp'

# 5. Describe the service and pods
kubectl describe svc checkout-service -n escape-boss-checkout-meltdown
kubectl describe pod -l app=checkout-api -n escape-boss-checkout-meltdown

Success Criteria

All checkout-api pods are in Running state AND show 1/1 Ready
The checkout-service has endpoints (not <none>)
Curling the service returns HTTP 200
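
The first criterion can be checked mechanically by parsing the READY column. The sketch below is a hypothetical helper (the name unready_pods is not part of the room's tooling) and assumes the default column layout of kubectl get pods, with READY as the second column.

```shell
# Hypothetical helper: print the name of every pod whose READY column
# is not N/N (for example 0/1). Prints nothing when all pods are ready.
# Assumes the default "kubectl get pods" columns: NAME READY STATUS ...
unready_pods() {
  local ns="$1"
  kubectl get pods -n "$ns" --no-headers |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage (requires cluster access):
#   unready_pods escape-boss-checkout-meltdown
```

Once the room is fixed, the call should print nothing. A finished one-off pod (shown as 0/1 Completed) would also be listed, so clean up any test pods first.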

Namespace

All resources are in the escape-boss-checkout-meltdown namespace.

On-call engineer, there's more than one thing broken here. Find them all, or customers keep seeing errors.

Quick Start

Run this command in your terminal to set up the room:

$ make room-apply ROOM=boss-checkout-meltdown

This creates the namespace escape-boss-checkout-meltdown with the broken resources.

Other useful commands:

$ make room-test ROOM=boss-checkout-meltdown

Verify the room is in the expected broken state

$ make room-escape-test ROOM=boss-checkout-meltdown

Test if you have successfully fixed all issues

$ make room-reset ROOM=boss-checkout-meltdown

Reset the room to try again

Useful Commands

Check pod status

$ kubectl get pods -n escape-boss-checkout-meltdown

See the current state of pods in the namespace

View events

$ kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp'

Check recent events for error details

Describe pods

$ kubectl describe pods -n escape-boss-checkout-meltdown

Get detailed information about pods

Check logs

$ kubectl logs -l app.kubernetes.io/part-of=K8sEscapeRoom -n escape-boss-checkout-meltdown

View the application logs

Run this to see the full solution:

$ make room-solution ROOM=boss-checkout-meltdown

Solution: Checkout Meltdown

Root Causes (MULTIPLE)

This incident has two independent failures that must both be fixed:

Failure #1: Service Selector Mismatch

# Service selector (WRONG):
selector:
  app: checkout      # Looking for "checkout"

# Pod labels (ACTUAL):
labels:
  app: checkout-api  # Pods have "checkout-api"

Result: The Service has 0 endpoints, so all traffic returns 503.
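
One way to confirm a mismatch like this is to ask the cluster how many pods the Service's selector actually matches. The following is a minimal sketch, assuming the selector consists of a single app key as in the snippet above; the function name count_selected_pods is made up for illustration.

```shell
# Hypothetical helper: print how many pods carry the label that the
# Service selects on. Only handles a selector with a single "app" key
# (an assumption based on the manifest shown in this room).
count_selected_pods() {
  local ns="$1" svc="$2" app
  app="$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector.app}')"
  kubectl get pods -n "$ns" -l "app=${app}" --no-headers 2>/dev/null |
    wc -l | tr -d ' '
}

# Usage (requires cluster access):
#   count_selected_pods escape-boss-checkout-meltdown checkout-service
```

A result of 0 while the Deployment's pods exist suggests the selector and the pod labels disagree.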

Failure #2: Readiness Probe Misconfigured

readinessProbe:
  httpGet:
    path: /
    port: 8080       # nginx listens on 80, not 8080

Result: Pods are Running but never become Ready (0/1).
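
A quick heuristic for spotting this kind of mistake is to compare the probe port with the ports the container declares. The sketch below assumes the Deployment declares containerPort entries and that the probe uses a numeric port (neither is confirmed by the manifest excerpt above). A container can also listen on a port it does not declare, so treat a failure as a hint, not proof. The name probe_port_is_declared is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the readiness probe's port is
# among the first container's declared containerPorts, fail otherwise.
# Returns 1 as well when no containerPort is declared at all.
probe_port_is_declared() {
  local ns="$1" deploy="$2" probe_port ports p
  probe_port="$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.port}')"
  ports="$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].ports[*].containerPort}')"
  for p in $ports; do
    [ "$p" = "$probe_port" ] && return 0
  done
  return 1
}

# Usage (requires cluster access):
#   probe_port_is_declared escape-boss-checkout-meltdown checkout-api \
#     || echo "probe port is not a declared container port"
```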

Why this is tricky: Fixing EITHER problem alone doesn't restore service:

Fix selector only → endpoints still empty (pods not ready)
Fix probe only → endpoints still empty (selector wrong)
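
The two partial fixes look identical in kubectl get endpoints, but the Endpoints object itself can tell them apart: pods that match the selector but are not Ready are recorded under notReadyAddresses. The helper below is an illustrative sketch (endpoint_state is a made-up name) that classifies the Service into one of three states.

```shell
# Hypothetical helper: classify a Service's Endpoints object.
#   ready      at least one Ready pod backs the Service
#   not-ready  the selector matches pods, but none of them is Ready
#   none       the selector matches no pods at all
endpoint_state() {
  local ns="$1" svc="$2" ready not_ready
  ready="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  not_ready="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}')"
  if [ -n "$ready" ]; then
    echo "ready"
  elif [ -n "$not_ready" ]; then
    echo "not-ready"
  else
    echo "none"
  fi
}

# Usage (requires cluster access):
#   endpoint_state escape-boss-checkout-meltdown checkout-service
```

In this room, "none" would point at the selector and "not-ready" at the readiness probe.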

Diagnosis Steps

# Step 1: Check pod status - notice 0/1 Ready
kubectl get pods -n escape-boss-checkout-meltdown
# NAME                            READY   STATUS    RESTARTS   AGE
# checkout-api-xxxxx              0/1     Running   0          5m
# checkout-api-yyyyy              0/1     Running   0          5m

# Step 2: Check endpoints - notice <none>
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown
# NAME               ENDPOINTS   AGE
# checkout-service   <none>      5m

# Step 3: Compare selector vs labels
kubectl get svc checkout-service -n escape-boss-checkout-meltdown -o jsonpath='{.spec.selector}'
# {"app":"checkout"}

kubectl get pods -n escape-boss-checkout-meltdown --show-labels
# app=checkout-api  ← MISMATCH!

# Step 4: Check probe configuration
kubectl get deployment checkout-api -n escape-boss-checkout-meltdown -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
# Output includes "port":8080  ← WRONG! nginx listens on 80

# Step 5: Check events for probe failures
kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp' | grep -i readiness
# Warning  Unhealthy  Readiness probe failed: dial tcp ...:8080: connect: connection refused

The Fixes

Fix #1: Correct Service Selector

Edit the service to fix the selector:

kubectl edit svc checkout-service -n escape-boss-checkout-meltdown
# Change: app: checkout
# To:     app: checkout-api

Alternative — use a JSON patch to make the change non-interactively. This is useful in scripts or CI/CD pipelines where kubectl edit isn't practical:

kubectl patch svc checkout-service -n escape-boss-checkout-meltdown \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/selector/app", "value": "checkout-api"}]'

Fix #2: Correct Readiness Probe

Edit the deployment to fix the probe port:

kubectl edit deployment checkout-api -n escape-boss-checkout-meltdown
# Change readinessProbe port from 8080 to 80

Alternative — use a JSON patch for non-interactive environments (scripts, CI/CD):

kubectl patch deployment checkout-api -n escape-boss-checkout-meltdown \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/httpGet/port", "value": 80}]'

Verification

# Wait for new pods to roll out
kubectl rollout status deployment/checkout-api -n escape-boss-checkout-meltdown

# Check pods are now 1/1 Ready
kubectl get pods -n escape-boss-checkout-meltdown
# NAME                            READY   STATUS    RESTARTS   AGE
# checkout-api-xxxxx              1/1     Running   0          30s

# Check endpoints exist
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown
# NAME               ENDPOINTS           AGE
# checkout-service   10.x.x.x:80,...     5m

# Test the service
kubectl run test --rm -it --image=curlimages/curl --restart=Never \
  -n escape-boss-checkout-meltdown -- curl -s http://checkout-service
# Should return the nginx welcome page
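
The success criteria ask for HTTP 200 specifically, so it can help to print only the status code instead of the page body. This is a sketch under a few assumptions: the pod name curl-check and the function name service_http_code are hypothetical, and the grep step is there because kubectl may append its own "pod deleted" message to the output.

```shell
# Hypothetical helper: print the HTTP status code returned by a Service,
# using a throwaway curl pod in the same namespace.
service_http_code() {
  local ns="$1" svc="$2"
  kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl \
    -n "$ns" -- curl -s -o /dev/null -w '%{http_code}' "http://${svc}" |
    grep -oE '^[0-9]{3}' | head -n 1
}

# Usage (requires cluster access):
#   service_http_code escape-boss-checkout-meltdown checkout-service
```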

Lessons Learned

Multiple failures can mask each other: while the pods are not Ready, fixing the selector alone still leaves the Service without endpoints
Check both labels AND readiness when debugging service connectivity
Running ≠ Ready: a pod can be Running and still receive no traffic
Always verify endpoints as part of service debugging
Read the full probe config: the port must match what the container actually listens on

Real-World Considerations

This pattern often occurs when:

Different engineers set up the Deployment and the Service separately
Manifests are copy-pasted between environments that use different naming
Probe configuration is copied from another app without adjustment
Rushed deployments skip validation steps

Prevention:

Use Helm charts or Kustomize for consistent naming
Include health endpoints in all applications
Add pre-deploy validation for selector/label matching
Use admission controllers to validate probe configurations