Checkout Meltdown
Incident Report
INCIDENT: Checkout Meltdown (Boss Room)
Severity: P1 - Revenue Impact
Reported: 14:32 UTC
Status: OPEN - Awaiting remediation
Incident Summary
Customers cannot complete purchases. The checkout API was deployed 15 minutes ago and has been returning errors ever since. Cart abandonment is spiking.
Initial Report
"We deployed the new checkout-api and it shows as running in the dashboard, but customers are getting 503 errors. I checked and the pods are up. No idea what's happening." — On-call engineer
What We Know
- The checkout-api deployment was applied successfully
- Pods show status Running in kubectl get pods
- The checkout-service Service exists
- Customers hitting the service get 503 Service Unavailable
- Multiple teams have looked at this and fixed "their part" but it's still broken
Triage Checklist
Start your investigation here:
# 1. Get overall status
kubectl get all -n escape-boss-checkout-meltdown
# 2. Check pod readiness (READY column)
kubectl get pods -n escape-boss-checkout-meltdown
# 3. Check service endpoints
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown
# 4. Check events for clues
kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp'
# 5. Describe the service and pods
kubectl describe svc checkout-service -n escape-boss-checkout-meltdown
kubectl describe pod -l app=checkout-api -n escape-boss-checkout-meltdown
Success Criteria
- All checkout-api pods are in Running state AND show 1/1 Ready
- The checkout-service has endpoints (not <none>)
- Curling the service returns HTTP 200
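The first criterion can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name all_pods_ready is hypothetical and not part of the room's tooling.

```shell
# Reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column reads N/N, N > 0).
all_pods_ready() {
  awk '
    { split($2, r, "/"); seen++ }
    $3 != "Running" || r[1] != r[2] || r[2] == 0 { bad++ }
    END { ok = (seen > 0 && bad == 0); exit !ok }
  '
}

# Example use (not executed here):
#   kubectl get pods -n escape-boss-checkout-meltdown -l app=checkout-api --no-headers \
#     | all_pods_ready && echo "pods ready"
```

An empty pod list is treated as a failure on purpose, so a wrong label filter does not read as success.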
Namespace
All resources are in the escape-boss-checkout-meltdown namespace.
On-call engineer, there's more than one thing broken here. Find them all, or customers keep seeing errors.
Quick Start
Run this command in your terminal to set up the room:
$ make room-apply ROOM=boss-checkout-meltdown
This creates the namespace escape-boss-checkout-meltdown with the broken resources.
Other useful commands:
$ make room-test ROOM=boss-checkout-meltdown
Verify the room is in the expected broken state
$ make room-escape-test ROOM=boss-checkout-meltdown
Test if you have successfully fixed all issues
$ make room-reset ROOM=boss-checkout-meltdown
Reset the room to try again
Useful Commands
Check pod status
$ kubectl get pods -n escape-boss-checkout-meltdown
See the current state of pods in the namespace
View events
$ kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp'
Check recent events for error details
Describe pods
$ kubectl describe pods -n escape-boss-checkout-meltdown
Get detailed information about pods
Check logs
$ kubectl logs -l app.kubernetes.io/part-of=K8sEscapeRoom -n escape-boss-checkout-meltdown
View the application logs
Run this to see the full solution:
$ make room-solution ROOM=boss-checkout-meltdown
Solution: Checkout Meltdown
Root Causes (MULTIPLE)
This incident has two independent failures that must both be fixed:
Failure #1: Service Selector Mismatch
# Service selector (WRONG):
selector:
  app: checkout # Looking for "checkout"
# Pod labels (ACTUAL):
labels:
  app: checkout-api # Pods have "checkout-api"
Result: The Service has 0 endpoints, so all traffic returns 503.
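One way to confirm a selector mismatch on a live cluster is to ask how many pods the Service's own selector actually matches. The sketch below assumes that kubectl get svc -o wide prints the selector as the last column in key=value form; the helper names service_selector and count_selected_pods are hypothetical.

```shell
# Prints the Service's selector as shown in the SELECTOR column of `-o wide`.
# $1 = namespace, $2 = service name
service_selector() {
  kubectl get svc "$2" -n "$1" -o wide --no-headers | awk '{print $NF}'
}

# Prints how many pods in the namespace match that selector.
count_selected_pods() {
  selector=$(service_selector "$1" "$2")
  [ -n "$selector" ] || return 1
  # "No resources found" goes to stderr, so it does not affect the count.
  kubectl get pods -n "$1" -l "$selector" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Example use (not executed here):
#   count_selected_pods escape-boss-checkout-meltdown checkout-service
```

A count of 0 while pods exist in the namespace points at the selector rather than at the pods.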
Failure #2: Readiness Probe Misconfigured
readinessProbe:
  httpGet:
    path: /
    port: 8080 # nginx listens on 80, not 8080
Result: Pods are Running but never become Ready (0/1).
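A rough cross-check for this kind of mistake is to compare the probe port with the ports the container declares. This is only a sketch under two assumptions that the source does not confirm: the Deployment declares containerPort entries, and the probe uses a numeric port rather than a named one. The function names are hypothetical.

```shell
# Prints the readiness probe port of the first container.
# $1 = namespace, $2 = deployment name
probe_port() {
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.port}'
}

# Prints the containerPort values declared by the first container, space-separated.
declared_ports() {
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[0].ports[*].containerPort}'
}

# Succeeds when the probe port is among the declared ports.
probe_port_declared() {
  port=$(probe_port "$1" "$2")
  for p in $(declared_ports "$1" "$2"); do
    [ "$p" = "$port" ] && return 0
  done
  return 1
}

# Example use (not executed here):
#   probe_port_declared escape-boss-checkout-meltdown checkout-api || echo "probe port not declared"
```

Declared ports are informational in Kubernetes, so a failure here is a hint to investigate, not proof that the probe is wrong.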
Why this is tricky: Fixing EITHER problem alone doesn't restore service:
- Fix selector only → endpoints still empty (pods not ready)
- Fix probe only → endpoints still empty (selector wrong)
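Although the ENDPOINTS column reads <none> in both half-fixed states, the Endpoints object itself can tell them apart: pods that match the selector but fail readiness are listed under notReadyAddresses. The sketch below uses that field to classify the state; the function names and the three state labels are hypothetical.

```shell
# Classifies the state of a Service's endpoints.
# $1 = ready IPs, $2 = not-ready IPs (space-separated, either may be empty)
classify_endpoints() {
  if [ -n "$1" ]; then
    echo "serving"
  elif [ -n "$2" ]; then
    echo "selector-ok-pods-not-ready"
  else
    echo "no-pods-selected"
  fi
}

# Looks up both address lists for a Service and classifies them.
# $1 = namespace, $2 = service name
endpoint_state() {
  ready=$(kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  not_ready=$(kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}')
  classify_endpoints "$ready" "$not_ready"
}

# Example use (not executed here):
#   endpoint_state escape-boss-checkout-meltdown checkout-service
```

In this room, fixing only the selector would be expected to move the result from no-pods-selected to selector-ok-pods-not-ready, which shows progress that the plain endpoints listing hides.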
Diagnosis Steps
# Step 1: Check pod status - notice 0/1 Ready
kubectl get pods -n escape-boss-checkout-meltdown
# NAME READY STATUS RESTARTS AGE
# checkout-api-xxxxx 0/1 Running 0 5m
# checkout-api-yyyyy 0/1 Running 0 5m
# Step 2: Check endpoints - notice <none>
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown
# NAME ENDPOINTS AGE
# checkout-service <none> 5m
# Step 3: Compare selector vs labels
kubectl get svc checkout-service -n escape-boss-checkout-meltdown -o jsonpath='{.spec.selector}'
# {"app":"checkout"}
kubectl get pods -n escape-boss-checkout-meltdown --show-labels
# app=checkout-api ← MISMATCH!
# Step 4: Check probe configuration
kubectl get deployment checkout-api -n escape-boss-checkout-meltdown -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
# port: 8080 ← WRONG! nginx listens on 80
# Step 5: Check events for probe failures
kubectl get events -n escape-boss-checkout-meltdown --sort-by='.lastTimestamp' | grep -i readiness
# Warning Unhealthy Readiness probe failed: dial tcp ...:8080: connect: connection refused
The Fixes
Fix #1: Correct Service Selector
Edit the service to fix the selector:
kubectl edit svc checkout-service -n escape-boss-checkout-meltdown
# Change: app: checkout
# To: app: checkout-api
Alternative — use a JSON patch to make the change non-interactively. This is useful in scripts or CI/CD pipelines where kubectl edit isn't practical:
kubectl patch svc checkout-service -n escape-boss-checkout-meltdown \
--type='json' \
-p='[{"op": "replace", "path": "/spec/selector/app", "value": "checkout-api"}]'
Fix #2: Correct Readiness Probe
Edit the deployment to fix the probe port:
kubectl edit deployment checkout-api -n escape-boss-checkout-meltdown
# Change readinessProbe port from 8080 to 80
Alternative — use a JSON patch for non-interactive environments (scripts, CI/CD):
kubectl patch deployment checkout-api -n escape-boss-checkout-meltdown \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/httpGet/port", "value": 80}]'
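Both JSON patches above replace a value blindly, and the Deployment patch depends on the container sitting at index 0. A possible safeguard is to prepend a JSON Patch test operation (RFC 6902) so the patch is rejected if the current value is not the one expected. This sketch only builds the patch document; it assumes the API server honors the test operation, and the helper name guarded_replace_patch is hypothetical.

```shell
# Builds a JSON Patch that first asserts the current value, then replaces it.
# $1 = JSON pointer path, $2 = expected current value, $3 = new value.
# Values are raw JSON, so string values need their own double quotes.
guarded_replace_patch() {
  printf '[{"op":"test","path":"%s","value":%s},{"op":"replace","path":"%s","value":%s}]' \
    "$1" "$2" "$1" "$3"
}

# Example use (not executed here):
#   kubectl patch svc checkout-service -n escape-boss-checkout-meltdown --type='json' \
#     -p="$(guarded_replace_patch /spec/selector/app '"checkout"' '"checkout-api"')"
#   kubectl patch deployment checkout-api -n escape-boss-checkout-meltdown --type='json' \
#     -p="$(guarded_replace_patch /spec/template/spec/containers/0/readinessProbe/httpGet/port 8080 80)"
```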
Verification
# Wait for new pods to roll out
kubectl rollout status deployment/checkout-api -n escape-boss-checkout-meltdown
# Check pods are now 1/1 Ready
kubectl get pods -n escape-boss-checkout-meltdown
# NAME READY STATUS RESTARTS AGE
# checkout-api-xxxxx 1/1 Running 0 30s
# Check endpoints exist
kubectl get endpoints checkout-service -n escape-boss-checkout-meltdown
# NAME ENDPOINTS AGE
# checkout-service 10.x.x.x:80,... 5m
# Test the service
kubectl run test --rm -it --image=curlimages/curl --restart=Never \
-n escape-boss-checkout-meltdown -- curl -s http://checkout-service
# Should return nginx welcome page
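The success criteria ask for HTTP 200 specifically, which the welcome page alone does not prove. A variation on the test above prints only the status code using curl's -w '%{http_code}' option. This is a sketch: the pod name curl-check and the function name service_http_status are hypothetical, and the grep filter is there because kubectl may print its own "pod deleted" message on the same stream.

```shell
# Prints the HTTP status code returned by a Service, as seen from inside the cluster.
# $1 = namespace, $2 = service name
service_http_status() {
  kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl \
    -n "$1" -- curl -s -o /dev/null -w '%{http_code}\n' "http://$2" |
    grep -x '[0-9][0-9][0-9]' | head -n 1
}

# Example use (not executed here):
#   [ "$(service_http_status escape-boss-checkout-meltdown checkout-service)" = "200" ] \
#     && echo "service healthy"
```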
Lessons Learned
- Multiple failures can mask each other - while pods are not Ready, fixing the selector alone won't populate endpoints
- Check both labels AND readiness when debugging service connectivity
- Running ≠ Ready - a pod can be Running but not receiving traffic
- Always verify endpoints as part of service debugging
- Read the full probe config - the port must match what the container actually listens on
Real-World Considerations
This pattern often occurs when:
- Different engineers set up the Deployment and the Service separately
- Manifests are copy-pasted from environments with different naming
- Probe configuration is copied from another app without adjustment
- Rushed deployments skip validation steps
Prevention:
- Use Helm charts or Kustomize for consistent naming
- Include health endpoints in all applications
- Add pre-deploy validation for selector/label matching
- Use admission controllers to validate probe configurations
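As an illustration of the selector/label validation item, the core check is small: every key=value pair in the Service selector must appear among the pod template's labels. The sketch below implements just that comparison on comma-separated strings; how a pipeline extracts those strings from its manifests is left open, and the function name selector_matches is hypothetical.

```shell
# Succeeds when every key=value pair in the selector appears in the labels.
# $1 = selector, $2 = labels; both are comma-separated key=value lists.
selector_matches() {
  # An empty selector is treated as a failure here rather than a match.
  [ -n "$1" ] || return 1
  old_ifs=$IFS
  IFS=,
  for pair in $1; do
    case ",$2," in
      *",$pair,"*) ;;                     # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;        # missing or different value
    esac
  done
  IFS=$old_ifs
  return 0
}

# Example use (not executed here):
#   selector_matches "app=checkout" "app=checkout-api,pod-template-hash=abc" \
#     || echo "Service selector does not match pod labels"
```

Wrapping the label list in commas before matching keeps app=checkout from being accepted as a prefix of app=checkout-api, which is exactly the mismatch in this incident.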