Overzealous Warden

advanced | Multiple Failures | BOSS

Incident Report

INCIDENT: Overzealous Warden (Boss Room)

Severity: P1 - Application Down
Reported: 09:15 UTC
Status: OPEN - Awaiting remediation

Incident Summary

The security team hardened the production namespace overnight. This morning the escape-app deployment won't start. Pods are stuck and never become Ready.

Initial Report

"Security pushed new pod security policies last night. Now our nginx pods won't even start. We've been told we can't just remove the security settings — we need to make the app work with them." — On-call engineer

What We Know

The escape-app Deployment was redeployed with new security context settings
Pods are NOT in Running state
The security team requires runAsNonRoot and readOnlyRootFilesystem to stay enabled
There may be more than one issue — fixing the first problem could reveal another

Triage Checklist

Start your investigation here:

# 1. Get overall status
kubectl get all -n escape-boss-overzealous-warden

# 2. Check pod status and events
kubectl get pods -n escape-boss-overzealous-warden
kubectl describe pod -l app=escape-app -n escape-boss-overzealous-warden

# 3. Check the security context
kubectl get deployment escape-app -n escape-boss-overzealous-warden \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq .

# 4. Check events for clues
kubectl get events -n escape-boss-overzealous-warden --sort-by='.lastTimestamp'
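
The STATUS column is derived from each container's waiting reason, so reading that reason directly helps tell the failure stages apart. This is a minimal sketch that assumes one container per pod; the helper name waiting_reasons is made up for illustration.

```shell
# Print "<pod-name> <waiting-reason>" for each escape-app pod.
# Assumes a single container per pod (index 0); waiting_reasons is a hypothetical name.
waiting_reasons() {
  kubectl get pods -l app=escape-app -n escape-boss-overzealous-warden \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}'
}
```

Call it as waiting_reasons. A reason of CreateContainerConfigError means the container was never created, while CrashLoopBackOff means it started and then exited, so the logs become the next place to look.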

Success Criteria

All escape-app pods are in Running state AND show 1/1 Ready
runAsNonRoot: true is still set (don't just remove security)
readOnlyRootFilesystem: true is still set (don't just remove security)
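
These criteria can be checked mechanically. The sketch below is one possible approach, not the room's official test: check_success is a hypothetical helper, it assumes nginx is the first container, and it treats a Ready container as evidence that the pod is Running.

```shell
# Check the success criteria; returns non-zero on the first one that fails.
# check_success is a hypothetical helper name, not part of the room's tooling.
NS=escape-boss-overzealous-warden

check_success() {
  # kubectl prints the securityContext map as compact JSON.
  sc=$(kubectl get deployment escape-app -n "$NS" \
    -o jsonpath='{.spec.template.spec.containers[0].securityContext}')
  case "$sc" in *'"runAsNonRoot":true'*) ;; *) echo "runAsNonRoot missing"; return 1 ;; esac
  case "$sc" in *'"readOnlyRootFilesystem":true'*) ;; *) echo "readOnlyRootFilesystem missing"; return 1 ;; esac

  # One line per pod: "true" when its first container is ready.
  ready=$(kubectl get pods -l app=escape-app -n "$NS" \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}')
  [ -n "$ready" ] || { echo "no ready status reported"; return 1; }
  if printf '%s\n' "$ready" | grep -qv '^true$'; then
    echo "some pods are not Ready"; return 1
  fi
  echo "all criteria met"
}
```

Run check_success after each change; it stays silent about which layer is broken, so pair it with the triage commands above.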

Namespace

All resources are in the escape-boss-overzealous-warden namespace.

On-call engineer, the security settings must stay. Make the app work within the constraints.

Quick Start

Run this command in your terminal to set up the room:

$ make room-apply ROOM=boss-overzealous-warden

This creates the namespace escape-boss-overzealous-warden with the broken resources.

Other useful commands:

$ make room-test ROOM=boss-overzealous-warden

Verify the room is in the expected broken state

$ make room-escape-test ROOM=boss-overzealous-warden

Test if you have successfully fixed all issues

$ make room-reset ROOM=boss-overzealous-warden

Reset the room to try again

Useful Commands

Check pod status

$ kubectl get pods -n escape-boss-overzealous-warden

See the current state of pods in the namespace

View events

$ kubectl get events -n escape-boss-overzealous-warden --sort-by='.lastTimestamp'

Check recent events for error details

Describe pods

$ kubectl describe pods -n escape-boss-overzealous-warden

Get detailed information about pods

Check logs

$ kubectl logs -l app.kubernetes.io/part-of=K8sEscapeRoom -n escape-boss-overzealous-warden

View the application logs

View Solution (Spoiler)

Run this to see the full solution:

$ make room-solution ROOM=boss-overzealous-warden

Solution: Security Lockdown

Root Causes (MULTIPLE)

This incident has two layered failures — the second is invisible until the first is fixed:

Failure #1: runAsNonRoot Without runAsUser

securityContext:
  runAsNonRoot: true    # Requires non-root user
  # runAsUser: ???      # But no user is specified!

The nginx:1.25-alpine image runs as root (UID 0) by default. When runAsNonRoot: true is set without a runAsUser, the kubelet checks the image's configured user, sees that it is root, and refuses to start the container.

Result: CreateContainerConfigError — container never starts.
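
The image's configured user can be inspected outside the cluster. This is a rough sketch that assumes a local Docker CLI with the image already pulled; image_user is a hypothetical helper name.

```shell
# Print the user an image is configured to run as.
# An empty User field in the image config means the container starts as root.
# image_user is a hypothetical helper; it needs a local Docker CLI and the image pulled.
image_user() {
  user=$(docker image inspect --format '{{.Config.User}}' "$1") || return 1
  case "$user" in
    ""|0|0:*|root|root:*) echo "root" ;;
    *) echo "$user" ;;
  esac
}
```

For example, image_user nginx:1.25-alpine reports the user that runAsNonRoot is evaluated against when no runAsUser is set.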

Failure #2: Read-Only Filesystem Without Writable /tmp

securityContext:
  readOnlyRootFilesystem: true  # Container's root filesystem is read-only
  # No emptyDir volume for /tmp!

The nginx.conf is already configured to write its PID file, cache, and all temp files to /tmp. But readOnlyRootFilesystem: true makes the container's whole root filesystem read-only, /tmp included, unless a writable volume is mounted there. nginx exits on startup as soon as it tries to create its PID file.

Result: CrashLoopBackOff — container starts but crashes on first write.

Why this is tricky: Bug #2 is completely hidden while Bug #1 is active. The container never starts, so you never see the filesystem error.
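
Because the second failure is hidden, it can help to list where nginx intends to write before the container ever starts. The sketch below reads the mounted configuration from the nginx-config ConfigMap; the key name nginx.conf is an assumption inferred from the volume's subPath, and nginx_write_directives is a hypothetical helper name.

```shell
# Show the nginx directives that name files or directories nginx writes to.
# Assumes the ConfigMap stores the file under the key "nginx.conf".
nginx_write_directives() {
  kubectl get configmap nginx-config -n escape-boss-overzealous-warden \
    -o jsonpath='{.data.nginx\.conf}' |
    grep -E '^[[:space:]]*(pid|error_log|access_log|client_body_temp_path|proxy_temp_path|fastcgi_temp_path|uwsgi_temp_path|scgi_temp_path)[[:space:]]'
}
```

Every path it prints needs to sit on a writable volume once readOnlyRootFilesystem is on, unless it points at a device such as /dev/stdout.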

Diagnosis Steps

# Step 1: Check pod status — notice CreateContainerConfigError
kubectl get pods -n escape-boss-overzealous-warden
# NAME                          READY   STATUS                       RESTARTS   AGE
# escape-app-xxxxx              0/1     CreateContainerConfigError   0          5m

# Step 2: Describe pod for the error message
kubectl describe pod -l app=escape-app -n escape-boss-overzealous-warden
# Events:
#   Warning  Failed  container has runAsNonRoot and image will run as root

# Step 3: Check the security context
kubectl get deployment escape-app -n escape-boss-overzealous-warden \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
# {"readOnlyRootFilesystem":true,"runAsNonRoot":true}
# Notice: no runAsUser!

# Step 4: After fixing runAsUser, pod crashes — check logs
kubectl logs -l app=escape-app -n escape-boss-overzealous-warden --previous
# nginx: [emerg] open() "/tmp/nginx.pid" failed (30: Read-only file system)

The Fix

Open the deployment in your editor:

kubectl edit deployment escape-app -n escape-boss-overzealous-warden

You can fix both bugs in one edit. Here's what to change — lines marked with # <-- ADD are the only additions:

    spec:
      containers:
        - name: nginx
          # ...
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf   # already exists
              name: nginx-config                  # already exists
              subPath: nginx.conf                 # already exists
              readOnly: true                      # already exists
            - mountPath: /tmp                     # <-- ADD
              name: tmp                           # <-- ADD
          securityContext:
            runAsNonRoot: true
            runAsUser: 101                        # <-- ADD (nginx user in alpine)
            readOnlyRootFilesystem: true
      volumes:
        - configMap:                              # already exists
            name: nginx-config                    # already exists
          name: nginx-config                      # already exists
        - emptyDir: {}                            # <-- ADD
          name: tmp                               # <-- ADD

Save and close — Kubernetes rolls out a new pod automatically.

What each change does:

runAsUser: 101 — tells Kubernetes to run the container as the nginx user (UID 101) instead of root, satisfying runAsNonRoot
emptyDir at /tmp — provides a writable directory for nginx's PID file, cache, and temp files, while the rest of the filesystem stays read-only
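
The same three additions can also be applied without an interactive editor. This is a sketch using a JSON patch, assuming nginx is the first container (index 0) and that the volumeMounts and volumes lists already exist, as they do in the manifest above; apply_fix is a hypothetical helper name.

```shell
# Apply the fix as a JSON patch instead of an interactive edit.
# "/-" appends to an existing list; index 0 assumes nginx is the first container.
NS=escape-boss-overzealous-warden

apply_fix() {
  kubectl patch deployment escape-app -n "$NS" --type=json -p '[
    {"op": "add", "path": "/spec/template/spec/containers/0/securityContext/runAsUser", "value": 101},
    {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-", "value": {"name": "tmp", "mountPath": "/tmp"}},
    {"op": "add", "path": "/spec/template/spec/volumes/-", "value": {"name": "tmp", "emptyDir": {}}}
  ]'
}
```

A patch is easier to review and repeat than an edit session, but it is not idempotent: running apply_fix twice would try to append a second tmp volume, which the API server rejects as a duplicate.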

Verification

# Wait for rollout
kubectl rollout status deployment/escape-app -n escape-boss-overzealous-warden

# Check pods are Running and Ready
kubectl get pods -n escape-boss-overzealous-warden
# NAME                          READY   STATUS    RESTARTS   AGE
# escape-app-xxxxx              1/1     Running   0          30s

# Verify security context is still enforced
kubectl get deployment escape-app -n escape-boss-overzealous-warden \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'
# Should still have runAsNonRoot: true AND readOnlyRootFilesystem: true
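
The checks above read the Deployment spec. To confirm the effect inside a running container, something like the following sketch can be used. It assumes the image provides id, cat, and touch (BusyBox in alpine images does); verify_runtime is a hypothetical helper name.

```shell
# Confirm at runtime: non-root UID, read-only root mount, writable /tmp.
# verify_runtime is a hypothetical helper; it execs into one pod of the Deployment.
NS=escape-boss-overzealous-warden

verify_runtime() {
  uid=$(kubectl exec deploy/escape-app -n "$NS" -- id -u) || return 1
  [ "$uid" != "0" ] || { echo "container runs as root"; return 1; }

  # Field 4 of /proc/mounts holds the mount options; "/" should start with "ro".
  root_opts=$(kubectl exec deploy/escape-app -n "$NS" -- cat /proc/mounts |
    awk '$2 == "/" { print $4; exit }')
  case "$root_opts" in
    ro|ro,*) ;;
    *) echo "root filesystem is not read-only"; return 1 ;;
  esac

  # /tmp is backed by the emptyDir, so this write should succeed.
  kubectl exec deploy/escape-app -n "$NS" -- touch /tmp/.probe ||
    { echo "/tmp is not writable"; return 1; }
  echo "uid=$uid, root filesystem read-only, /tmp writable"
}
```

Run verify_runtime once the rollout has finished; before that, exec may land on a pod that is still crashing.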

Lessons Learned

Layered failures hide each other — the container must start before filesystem errors appear
runAsNonRoot requires an explicit non-zero runAsUser when the image defaults to root
readOnlyRootFilesystem requires writable volumes for any directory the app writes to
Consolidate writable paths to /tmp — a single emptyDir is simpler than many
Don't remove security to fix issues — work within the constraints using volumes and user settings

Real-World Considerations

This pattern is common in production:

The Restricted profile of the Pod Security Standards (PSS) requires runAsNonRoot, and Pod Security Admission enforces it at the namespace level
CIS benchmarks recommend readOnlyRootFilesystem for all containers
Many popular images (nginx, redis, postgres) default to running as root
Teams often enable security policies without testing existing deployments

Prevention:

Use distroless or non-root base images
Always specify runAsUser alongside runAsNonRoot
Test with readOnlyRootFilesystem: true during development
Configure apps to write all temp/cache/pid files under /tmp
Use Pod Security Admission in warn or audit mode to surface violations before enforcing a policy
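
One way to preview the impact of Pod Security Admission on workloads that already exist is a server-side dry run of the namespace label, which returns warnings without changing anything. This is a sketch only; psa_preview is a hypothetical helper name, and the restricted level is used as a default for illustration.

```shell
# Preview which existing pods would violate a Pod Security Standard level.
# --dry-run=server evaluates the label change on the API server without persisting it.
# psa_preview is a hypothetical helper name.
psa_preview() {
  ns=$1
  level=${2:-restricted}
  kubectl label --dry-run=server --overwrite namespace "$ns" \
    "pod-security.kubernetes.io/enforce=$level"
}
```

For example, psa_preview escape-boss-overzealous-warden. The restricted level checks runAsNonRoot but does not cover readOnlyRootFilesystem, so this catches only part of what this room exercises.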