Building a Deployment Health Validator

5 Subtle Bugs That Break Production (and how to find them)


A versatile DevSecOps Engineer specialized in creating secure, scalable, and efficient systems that bridge development and operations. My expertise lies in automating complex processes, integrating AI-driven solutions, and ensuring seamless, secure delivery pipelines. With a deep understanding of cloud infrastructure, CI/CD, and cybersecurity, I thrive on solving challenges at the intersection of innovation and security, driving continuous improvement in both technology and team dynamics.

A deep-dive into microservice health checking, topological startup ordering and why HTTP 200 does not mean a service is healthy.


The Incident

Picture this: your on-call rotation fires a PagerDuty alert at 3:07 AM. The deployment pipeline says green. Every service returned HTTP 200. The readiness check passed. But your notification pipeline is completely silent, job workers are backed up with 1,482 queued tasks, and customers are already filing tickets.

You pull up the deployment validator logs. Everything looks fine on the surface. The tool reported overall_status: healthy. It was wrong on every count.

This exact scenario is what the Deployment Health Validator task is built around. Five real-world bugs, carefully placed, each one plausible enough that most automated agents (and plenty of human engineers) miss at least one. This post walks through each bug, the architecture behind the validator, and how to build one correctly from scratch.


What Are We Building?

A deployment health validator does four things:

  1. Reads a manifest that describes your microservices (ports, health endpoints, dependencies, criticality)

  2. Probes each service's health endpoint and parses the response body, not just the HTTP status

  3. Computes a weighted readiness score based on how critical each service is

  4. Runs a topological sort (Kahn's algorithm) to produce a correct startup order where every dependency starts before the services that depend on it

The output is a structured JSON report that your deployment pipeline can parse and gate on.
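As a rough illustration of what "gate on" could mean, here is a minimal sketch of a pipeline gate that turns the report into an exit code. The function name `gate_exit_code` and the policy that "degraded" may proceed while "critical" and "not_ready" block are assumptions made for this example; the four status values come from the validator described in this post.

```python
import json

# Status values the validator can emit (from this post).
KNOWN_STATUSES = {"healthy", "degraded", "critical", "not_ready"}

# Hypothetical gate policy: which statuses block the pipeline.
BLOCKING_STATUSES = {"critical", "not_ready"}


def gate_exit_code(report: dict) -> int:
    """Return 0 to let the pipeline continue, 1 to block it."""
    status = report.get("overall_status")
    # A missing or unrecognised status blocks by default (fail closed).
    if status not in KNOWN_STATUSES:
        return 1
    return 1 if status in BLOCKING_STATUSES else 0


report = json.loads('{"overall_status": "degraded", "readiness_score": 0.9}')
print(gate_exit_code(report))  # 0
```

Failing closed on an unknown status is a deliberate choice in this sketch: a report the gate cannot interpret should not be treated as a pass.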


Architecture Overview

Production Stack (mock services)

Service               Port   Endpoint  Response
auth-service          :8081  /health   {"status": "ok"}
api-gateway           :8082  /health   {"status": "healthy"}
cache-service         :8083  /ping     "pong" (plain text!)
worker-service        :8084  /status   {"status": "degraded", "queue_depth": 1482}
notification-service  :8085  /health   {"status": "ok"}

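Dependency Graph

Each arrow reads "must start before" and follows the dependencies declared in the manifest:

auth-service → api-gateway
cache-service → api-gateway
api-gateway → worker-service
cache-service → worker-service
worker-service → notification-service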


Real-World Parallel: Netflix's Hystrix, AWS Health Dashboards and Kubernetes Readiness Probes

Before diving into code, here's why this problem matters in production systems.

AWS Elastic Load Balancer will route traffic to a backend if its health check returns HTTP 200 on /health. That's it. If your app returns {"status": "starting_up"} with a 200, ELB thinks it's healthy and sends it live traffic. This exact failure mode took down a major e-commerce platform in 2019 during a Black Friday deploy.

Kubernetes readiness probes are a direct response to this problem. A pod's readiness probe checks whether it should receive traffic, separate from the liveness probe, which checks if it should be restarted. The probe can check an HTTP endpoint, but Kubernetes does not parse the body. You have to do that yourself, in your validator.

Netflix's Chaos Engineering toolkit specifically tests whether services correctly report unhealthy states under load. worker-service in this task is modelled on exactly this pattern: the service is technically "up" (HTTP 200) but degraded under load (queue_depth: 1482). A naive health checker marks it healthy. A correct one reads the body.
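The difference between a status-code-only probe and a body-aware one can be sketched without any network access. `FakeResponse` is a hypothetical stand-in for an HTTP response, and the two function names are made up for this example; the accepted values ("ok", "up", "healthy") mirror the validator shown later in the post.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response (no network needed)."""
    status_code: int
    text: str


def naive_is_healthy(resp: FakeResponse) -> bool:
    # What a status-code-only probe effectively decides.
    return resp.status_code == 200


def body_aware_is_healthy(resp: FakeResponse) -> bool:
    if resp.status_code != 200:
        return False
    try:
        body = json.loads(resp.text)
    except ValueError:
        return True  # plain-text replies such as "pong": fall back to the status code
    if not isinstance(body, dict):
        return True  # JSON that is not an object carries no status field to inspect
    return body.get("status", "ok") in ("ok", "up", "healthy")


worker = FakeResponse(200, '{"status": "degraded", "queue_depth": 1482}')
print(naive_is_healthy(worker), body_aware_is_healthy(worker))  # True False
```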


Project Structure

deployment-health-validator/
├── task.toml                         # Task metadata, timeouts, resource limits
├── instruction.md                    # What an agent (or engineer) must fix
├── environment/
│   ├── Dockerfile                    # Container definition
│   ├── deployment_manifest.yaml      # Service definitions (with a decoy key)
│   ├── mock_services.py              # Five Flask servers simulating endpoints
│   └── validator.py                  # THE BROKEN FILE
├── solution/
│   └── solve.sh                      # Oracle: patches all 5 bugs and runs
└── tests/
    └── test_outputs.py               # 19 pytest assertions

GitHub: terminal-bench-2-hard-devops-diagnostics


Step-by-Step Implementation Guide

Step 1: Set Up the Environment

git clone https://github.com/SubhanshuMG/terminal-bench-2-hard-devops-diagnostics.git
cd terminal-bench-2-hard-devops-diagnostics

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install flask requests pyyaml pytest

Step 2: Write the Deployment Manifest

The manifest has one deliberate trap: a top-level services: key that contains only a legacy monitoring entry. The real services live under deployment.services. This mirrors real-world YAML configs where legacy keys accumulate over time and create ambiguity.

# deployment_manifest.yaml

# Legacy monitoring registry (NOT the authoritative list)
services:
  - name: "metrics-collector"
    port: 9091
    health_endpoint: "/metrics"
    dependencies: []
    criticality: "low"

# Authoritative deployment configuration
deployment:
  name: "production-stack"

  services:
    - name: "auth-service"
      port: 8081
      health_endpoint: "/health"
      dependencies: []
      criticality: "high"

    - name: "api-gateway"
      port: 8082
      health_endpoint: "/health"
      dependencies:
        - "auth-service"
        - "cache-service"
      criticality: "high"

    - name: "cache-service"
      port: 8083
      health_endpoint: "/ping"       # note: NOT /health
      dependencies: []
      criticality: "medium"

    - name: "worker-service"
      port: 8084
      health_endpoint: "/status"     # returns HTTP 200 but body says degraded
      dependencies:
        - "api-gateway"
        - "cache-service"
      criticality: "low"

    - name: "notification-service"
      port: 8085
      health_endpoint: "/health"
      dependencies:
        - "worker-service"
      criticality: "low"

Step 3: Build the Mock Services

These five Flask servers simulate real microservice health endpoints. The key design choice: worker-service returns HTTP 200 with {"status": "degraded"}. Any validator that only checks the status code will silently miss this.

# mock_services.py

#!/usr/bin/env python3
import threading
from flask import Flask, jsonify

import logging
log = logging.getLogger("werkzeug")
log.setLevel(logging.ERROR)

auth_app = Flask("auth-service")

@auth_app.route("/health")
def auth_health():
    return jsonify({"status": "ok", "version": "2.1.0"}), 200


gateway_app = Flask("api-gateway")

@gateway_app.route("/health")
def gateway_health():
    return jsonify({"status": "healthy", "uptime_seconds": 3601}), 200


cache_app = Flask("cache-service")

@cache_app.route("/ping")
def cache_ping():
    return "pong", 200          # plain text, not JSON


worker_app = Flask("worker-service")

@worker_app.route("/status")
def worker_status():
    # HTTP 200 but the service is overloaded
    return jsonify({"status": "degraded", "queue_depth": 1482}), 200


notif_app = Flask("notification-service")

@notif_app.route("/health")
def notif_health():
    return jsonify({"status": "ok", "pending_notifications": 0}), 200


def _run(app, port):
    app.run(host="0.0.0.0", port=port, debug=False, use_reloader=False)


if __name__ == "__main__":
    specs = [
        (auth_app,    8081, "auth-service         /health"),
        (gateway_app, 8082, "api-gateway          /health"),
        (cache_app,   8083, "cache-service        /ping  "),
        (worker_app,  8084, "worker-service       /status"),
        (notif_app,   8085, "notification-service /health"),
    ]

    threads = []
    for app, port, label in specs:
        t = threading.Thread(target=_run, args=(app, port), daemon=True)
        t.start()
        threads.append(t)
        print(f"  started  {label}  -> http://0.0.0.0:{port}")

    print("All mock services running. Ctrl-C to stop.")
    for t in threads:
        t.join()

Start them in one terminal: python mock_services.py

Verify manually:

curl http://localhost:8081/health    # {"status": "ok", "version": "2.1.0"}
curl http://localhost:8082/health    # {"status": "healthy", "uptime_seconds": 3601}
curl http://localhost:8083/ping      # pong
curl http://localhost:8084/status    # {"status": "degraded", "queue_depth": 1482}
curl http://localhost:8085/health    # {"status": "ok", "pending_notifications": 0}

First discipline: always curl your endpoints before writing a health checker. The field names, the endpoint paths, and the semantic meaning of a 200 response all matter.


Step 4: The Broken Validator (And All Five Bugs)

Here is validator.py as it ships in the repository, with each bug annotated:

# validator.py (BROKEN)

#!/usr/bin/env python3
import json
import yaml
import requests
from datetime import datetime, timezone
from collections import deque


def load_services(manifest_path: str) -> list:
    with open(manifest_path) as f:
        config = yaml.safe_load(f)
    # BUG 1: reads config["services"] which is the legacy monitoring entry
    # only "metrics-collector" on port 9091 is returned; none of the real services
    return config["services"]


def check_health(service: dict) -> dict:
    url = f"http://127.0.0.1:{service['port']}{service['health_endpoint']}"
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            return {"status": "unhealthy", "http_status": resp.status_code,
                    "criticality": service["criticality"]}
        try:
            body = resp.json()
            # BUG 2: reads "health_status" key; no service ever sends this field
            # body.get("health_status", "ok") always returns default "ok"
            # worker-service sends {"status": "degraded"} but is reported healthy
            state = body.get("health_status", "ok")
        except ValueError:
            state = "ok"
        healthy = state in ("ok", "up", "healthy")
        return {"status": "healthy" if healthy else "unhealthy",
                "http_status": resp.status_code,
                "criticality": service["criticality"]}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy", "http_status": 0,
                "criticality": service["criticality"]}


def compute_startup_order(services: list) -> list:
    names = [s["name"] for s in services]
    deps_map = {s["name"]: s.get("dependencies", []) for s in services}

    graph     = {n: [] for n in names}
    in_degree = {n: 0 for n in names}

    for svc, deps in deps_map.items():
        for dep in deps:
            # BUG 3: edges are reversed; leaf nodes get scheduled first
            # should be graph[dep].append(svc) and in_degree[svc] += 1
            graph[svc].append(dep)
            in_degree[dep] += 1

    queue = deque(n for n in names if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return order


def compute_readiness_score(services: list, statuses: dict) -> float:
    # BUG 4: all weights equal 1; criticality is ignored entirely
    # spec says high=3, medium=2, low=1
    weight_map = {"high": 1, "medium": 1, "low": 1}

    total   = sum(weight_map[s["criticality"]] for s in services)
    healthy = sum(weight_map[s["criticality"]]
                  for s in services
                  if statuses[s["name"]]["status"] == "healthy")
    return round(healthy / total, 4) if total else 0.0


def determine_status(services: list, statuses: dict, score: float):
    # BUG 5: checks ALL services instead of only high-criticality ones
    # worker-service (low criticality) being unhealthy triggers "critical"
    critical_ok = all(
        statuses[s["name"]]["status"] == "healthy"
        for s in services    # should be: for s in services if s["criticality"] == "high"
    )

    if not critical_ok:
        return "critical", critical_ok
    if score >= 0.95:
        return "healthy", critical_ok
    if score >= 0.70:
        return "degraded", critical_ok
    return "not_ready", critical_ok

Breaking Down Each Bug

Bug 1: Wrong YAML Key Path

# BROKEN:
return config["services"]

# FIXED:
return config["deployment"]["services"]

The manifest has two services keys at different nesting levels. The top-level one is explicitly labelled as a "legacy monitoring registry." The validator reads the wrong one, picks up only metrics-collector, and probes port 9091 instead of the real five services.

Why it's hard to catch: The broken code runs without errors. It probes an endpoint (9091) that isn't listening, gets a connection refused, marks metrics-collector as unhealthy, and writes a report that looks structurally valid. No stack trace — just wrong data.
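One way to make this class of mistake louder is to validate the key path instead of indexing blindly. The sketch below is an assumption-laden illustration, not the repository's code: `extract_services` is a hypothetical helper, and the `config` dict is a trimmed version of the manifest as `yaml.safe_load` would return it.

```python
def extract_services(config: dict) -> list:
    """Return the authoritative service list, or raise if it is missing."""
    deployment = config.get("deployment")
    if not isinstance(deployment, dict) or "services" not in deployment:
        raise KeyError("manifest has no deployment.services section")
    services = deployment["services"]
    if not services:
        raise ValueError("deployment.services is empty")
    return services


# Trimmed manifest, shaped like the output of yaml.safe_load.
config = {
    "services": [{"name": "metrics-collector", "port": 9091}],  # legacy decoy
    "deployment": {
        "name": "production-stack",
        "services": [
            {"name": "auth-service", "port": 8081},
            {"name": "api-gateway", "port": 8082},
        ],
    },
}

print([s["name"] for s in config["services"]])        # ['metrics-collector']
print([s["name"] for s in extract_services(config)])  # ['auth-service', 'api-gateway']
```

A manifest that contains only the legacy top-level key now raises instead of quietly producing a one-service report.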

Bug 2: Wrong JSON Body Field Name (the hardest one)

# BROKEN:
state = body.get("health_status", "ok")

# FIXED:
state = body.get("status", "ok")

This is the most insidious bug in the set. worker-service returns {"status": "degraded", "queue_depth": 1482} with HTTP 200. The broken code reads the health_status field, which doesn't exist in any response. dict.get() returns the default "ok". The validator marks worker-service as healthy.

Real-world equivalent: An AWS Lambda function returns {"statusCode": 200, "body": "{\"error\": \"DB_TIMEOUT\"}"}. If you check response.statusCode == 200 and call it done, you miss the error in the body entirely.

Why agents miss this: The output looks correct. All five services are reported healthy and no exceptions are thrown. You would only catch it by curling the endpoint yourself and tracing exactly which field the code reads from the response.

Service               Body                    Field read (broken)      Result
auth-service          {"status": "ok"}        health_status (missing)  default "ok" → healthy
api-gateway           {"status": "healthy"}   health_status (missing)  default "ok" → healthy
cache-service         "pong" (not JSON)       ValueError caught        "ok" → healthy
worker-service        {"status": "degraded"}  health_status (missing)  default "ok" → wrongly healthy
notification-service  {"status": "ok"}        health_status (missing)  default "ok" → healthy
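The table shows that the permissive default is what hides the wrong field name. As a possible hardening beyond the fix above (an assumption of this sketch, not something the task requires), a parser could report "unknown" when the expected field is absent. `classify_body` and the "unknown" state are hypothetical names introduced here.

```python
import json

HEALTHY_VALUES = ("ok", "up", "healthy")


def classify_body(text: str, field: str = "status") -> str:
    """Classify a 200 response body as 'healthy', 'unhealthy', or 'unknown'."""
    try:
        body = json.loads(text)
    except ValueError:
        return "healthy"  # non-JSON such as "pong": same fallback as the validator
    if not isinstance(body, dict) or field not in body:
        return "unknown"  # surface the missing field instead of defaulting to "ok"
    return "healthy" if body[field] in HEALTHY_VALUES else "unhealthy"


worker_body = '{"status": "degraded", "queue_depth": 1482}'
print(classify_body(worker_body, field="health_status"))  # unknown
print(classify_body(worker_body))                         # unhealthy
print(classify_body("pong"))                              # healthy
```

With this variant, reading the wrong field yields "unknown" for every JSON service, which is much harder to overlook than five green checks.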

Bug 3: Reversed Topological Sort

# BROKEN (edges reversed):
graph[svc].append(dep)
in_degree[dep] += 1

# FIXED (dep must start before svc):
graph[dep].append(svc)
in_degree[svc] += 1

Kahn's algorithm itself is structurally correct. The bug is in how the graph is constructed. The broken code points from a service back to its dependencies, inverting the dependency flow. Nodes with no outgoing edges (the real leaf nodes like notification-service) end up with zero in-degree and get scheduled first.

Correct startup order:

auth-service → cache-service → api-gateway → worker-service → notification-service

Broken output:

notification-service → worker-service → api-gateway → auth-service → cache-service

Real-world consequence: You start notification-service before worker-service is ready. It tries to connect, fails, and crashes. Your deployment fails not because of a bug in a service, but because of a bug in the tool that decides the order to start services.
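To see both edge directions side by side, the sketch below runs the same Kahn loop over the manifest's dependencies with a flag that flips the edges. The `reversed_edges` flag, the `respects_dependencies` checker, and the cycle check are additions for illustration and are not part of the validator shown above.

```python
from collections import deque

# Dependencies in manifest order.
DEPENDENCIES = {
    "auth-service": [],
    "api-gateway": ["auth-service", "cache-service"],
    "cache-service": [],
    "worker-service": ["api-gateway", "cache-service"],
    "notification-service": ["worker-service"],
}


def kahn_order(deps_map: dict, reversed_edges: bool = False) -> list:
    graph = {n: [] for n in deps_map}
    in_degree = {n: 0 for n in deps_map}
    for svc, deps in deps_map.items():
        for dep in deps:
            # Correct direction is dep -> svc; reversed_edges reproduces Bug 3.
            src, dst = (svc, dep) if reversed_edges else (dep, svc)
            graph[src].append(dst)
            in_degree[dst] += 1
    queue = deque(n for n in deps_map if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    if len(order) != len(deps_map):
        raise ValueError("dependency cycle detected")
    return order


def respects_dependencies(order: list, deps_map: dict) -> bool:
    """True if every dependency appears before the service that needs it."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[d] < pos[s] for s, deps in deps_map.items() for d in deps)


good = kahn_order(DEPENDENCIES)
bad = kahn_order(DEPENDENCIES, reversed_edges=True)
print(respects_dependencies(good, DEPENDENCIES))  # True
print(respects_dependencies(bad, DEPENDENCIES))   # False
```

A checker like `respects_dependencies` is useful in tests because several valid orders can exist; asserting the property is more robust than asserting one exact list.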

Bug 4: All Criticality Weights Equal 1

# BROKEN:
weight_map = {"high": 1, "medium": 1, "low": 1}

# FIXED:
weight_map = {"high": 3, "medium": 2, "low": 1}

The task specification defines weights of 3/2/1 for high/medium/low criticality. With equal weights:

Broken score:  4 healthy out of 5 total = 0.8
Correct score: (3 + 3 + 2 + 1) / (3 + 3 + 2 + 1 + 1) = 9/10 = 0.9

The broken score still exceeds the 0.70 threshold and stays below 0.95, so the overall status computation might accidentally give the right answer for the wrong reason. But downstream systems relying on an accurate score for SLA calculations will get wrong numbers.
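The arithmetic can be checked with a few lines. The sketch below recomputes both scores for the scenario in this post and then for a second, hypothetical scenario (both low-criticality services down) in which equal weights push the score across the 0.70 threshold while the 3/2/1 weights do not. The helper name `readiness_score` and its `unhealthy` parameter are assumptions of this example.

```python
CRITICALITY = {
    "auth-service": "high",
    "api-gateway": "high",
    "cache-service": "medium",
    "worker-service": "low",
    "notification-service": "low",
}

EQUAL = {"high": 1, "medium": 1, "low": 1}     # Bug 4
WEIGHTED = {"high": 3, "medium": 2, "low": 1}  # spec


def readiness_score(unhealthy: set, weights: dict) -> float:
    total = sum(weights[c] for c in CRITICALITY.values())
    healthy = sum(weights[c] for name, c in CRITICALITY.items()
                  if name not in unhealthy)
    return round(healthy / total, 4) if total else 0.0


# Scenario from this post: only worker-service is unhealthy.
print(readiness_score({"worker-service"}, EQUAL))     # 0.8
print(readiness_score({"worker-service"}, WEIGHTED))  # 0.9

# Hypothetical scenario: both low-criticality services are unhealthy.
both_low = {"worker-service", "notification-service"}
print(readiness_score(both_low, EQUAL))     # 0.6 (below 0.70)
print(readiness_score(both_low, WEIGHTED))  # 0.8 (still at or above 0.70)
```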

Bug 5: Critical Services Check Ignores Criticality

# BROKEN:
critical_ok = all(
    statuses[s["name"]]["status"] == "healthy"
    for s in services
)

# FIXED:
critical_ok = all(
    statuses[s["name"]]["status"] == "healthy"
    for s in services
    if s["criticality"] == "high"
)

With worker-service (low criticality) being unhealthy, the broken code sets critical_services_healthy = False and returns overall_status: "critical". The correct answer is "degraded": all high-criticality services are healthy, but the readiness score (0.9) falls below the 0.95 threshold for a fully healthy deployment.

Real-world consequence: A "critical" status might trigger an automated rollback, page every on-call engineer in the org, or block a release. Triggering a critical alarm because a low-priority background worker is degraded is exactly the kind of alert fatigue that causes engineers to start ignoring pages.
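The effect of the missing filter is easy to reproduce in isolation. This sketch applies the same thresholds as the validator; the `high_only` flag and the `unhealthy` set are hypothetical parameters added so that one function can show both the broken and the fixed behaviour.

```python
SERVICES = [
    {"name": "auth-service", "criticality": "high"},
    {"name": "api-gateway", "criticality": "high"},
    {"name": "cache-service", "criticality": "medium"},
    {"name": "worker-service", "criticality": "low"},
    {"name": "notification-service", "criticality": "low"},
]


def overall_status(services, unhealthy, score, high_only=True):
    """Apply the post's thresholds; high_only=False reproduces Bug 5."""
    critical_ok = all(
        s["name"] not in unhealthy
        for s in services
        if not high_only or s["criticality"] == "high"
    )
    if not critical_ok:
        return "critical"
    if score >= 0.95:
        return "healthy"
    if score >= 0.70:
        return "degraded"
    return "not_ready"


print(overall_status(SERVICES, {"worker-service"}, 0.9, high_only=False))  # critical
print(overall_status(SERVICES, {"worker-service"}, 0.9))                   # degraded
print(overall_status(SERVICES, {"auth-service"}, 0.7))                     # critical
```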


Step 5: The Fixed Validator

# validator_fixed.py

#!/usr/bin/env python3
import json
import yaml
import requests
from datetime import datetime, timezone
from collections import deque


def load_services(manifest_path: str) -> list:
    with open(manifest_path) as f:
        config = yaml.safe_load(f)
    # FIX 1: correct key path
    return config["deployment"]["services"]


def check_health(service: dict) -> dict:
    url = f"http://127.0.0.1:{service['port']}{service['health_endpoint']}"
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            return {"status": "unhealthy", "http_status": resp.status_code,
                    "criticality": service["criticality"]}
        try:
            body = resp.json()
            # FIX 2: read the correct field name "status"
            status_ok = body.get("status", "ok") in ("ok", "up", "healthy")
        except ValueError:
            status_ok = True  # non-JSON (e.g. "pong"): HTTP 200 is sufficient
        return {"status": "healthy" if status_ok else "unhealthy",
                "http_status": resp.status_code,
                "criticality": service["criticality"]}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy", "http_status": 0,
                "criticality": service["criticality"]}


def compute_startup_order(services: list) -> list:
    names = [s["name"] for s in services]
    deps_map = {s["name"]: s.get("dependencies", []) for s in services}

    graph     = {n: [] for n in names}
    in_degree = {n: 0 for n in names}

    for svc, deps in deps_map.items():
        for dep in deps:
            # FIX 3: correct edge direction; dep starts before svc
            graph[dep].append(svc)
            in_degree[svc] += 1

    queue = deque(n for n in names if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return order


def compute_readiness_score(services: list, statuses: dict) -> float:
    # FIX 4: correct criticality weights
    weight_map = {"high": 3, "medium": 2, "low": 1}

    total   = sum(weight_map[s["criticality"]] for s in services)
    healthy = sum(weight_map[s["criticality"]]
                  for s in services
                  if statuses[s["name"]]["status"] == "healthy")
    return round(healthy / total, 4) if total else 0.0


def determine_status(services: list, statuses: dict, score: float):
    # FIX 5: only check high-criticality services
    critical_ok = all(
        statuses[s["name"]]["status"] == "healthy"
        for s in services
        if s["criticality"] == "high"
    )

    if not critical_ok:
        return "critical", critical_ok
    if score >= 0.95:
        return "healthy", critical_ok
    if score >= 0.70:
        return "degraded", critical_ok
    return "not_ready", critical_ok


def main():
    manifest_path = "/app/deployment_manifest.yaml"
    output_path   = "/app/deployment_report.json"

    services = load_services(manifest_path)

    statuses = {}
    for svc in services:
        result = check_health(svc)
        statuses[svc["name"]] = result
        print(f"  {svc['name']:25s} {result['status']:10s} (HTTP {result['http_status']})")

    startup_order = compute_startup_order(services)
    score = compute_readiness_score(services, statuses)
    overall_status, critical_ok = determine_status(services, statuses, score)

    report = {
        "deployment_name":           "production-stack",
        "overall_status":            overall_status,
        "readiness_score":           score,
        "service_statuses":          statuses,
        "startup_order":             startup_order,
        "critical_services_healthy": critical_ok,
        "timestamp":                 datetime.now(timezone.utc).isoformat(),
    }

    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)

    print(f"\nReport written to {output_path}")
    print(f"Overall status : {overall_status}")
    print(f"Readiness score: {score}")


if __name__ == "__main__":
    main()

Step 6: Expected Output

{
  "deployment_name": "production-stack",
  "overall_status": "degraded",
  "readiness_score": 0.9,
  "service_statuses": {
    "auth-service":         { "status": "healthy",   "http_status": 200, "criticality": "high" },
    "api-gateway":          { "status": "healthy",   "http_status": 200, "criticality": "high" },
    "cache-service":        { "status": "healthy",   "http_status": 200, "criticality": "medium" },
    "worker-service":       { "status": "unhealthy", "http_status": 200, "criticality": "low" },
    "notification-service": { "status": "healthy",   "http_status": 200, "criticality": "low" }
  },
  "startup_order": [
    "auth-service",
    "cache-service",
    "api-gateway",
    "worker-service",
    "notification-service"
  ],
  "critical_services_healthy": true,
  "timestamp": "2024-01-01T00:00:00+00:00"
}

Score derivation (high=3, medium=2, low=1):

Healthy weight: auth(3) + api-gateway(3) + cache(2) + notification(1) = 9
Total weight:   9 + worker(1) = 10
Readiness score: 9/10 = 0.9

Status logic:
  critical_services_healthy = true   (both high services are healthy)
  score = 0.9, which is < 0.95
  result: "degraded"

Step 7: The Dockerfile

FROM python:3.12-slim

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Dependencies are pinned inline, so no separate requirements.txt is needed
RUN pip install --no-cache-dir flask==3.0.3 requests==2.32.3 pyyaml==6.0.2

COPY * /app/

# Start mock services in background, wait for auth-service to be ready,
# then hold the container open for the agent or oracle to exec into.
CMD ["/bin/bash", "-c", \
     "python /app/mock_services.py > /tmp/mock_services.log 2>&1 & \
      until curl -sf http://localhost:8081/health > /dev/null 2>&1; do sleep 1; done && \
      sleep infinity"]

Build and run:

docker build -t deployment-validator ./deployment-health-validator/environment/
docker run -it --rm deployment-validator bash

# Inside the container:
python /app/mock_services.py &
sleep 2
python /app/validator.py     # or run the fixed version
cat /app/deployment_report.json
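Note that the container's CMD waits only for auth-service on port 8081, and the manual run relies on a fixed `sleep 2`. A possible alternative, sketched here with only the standard library, is to wait until every mock port accepts TCP connections. `wait_for_port` and `wait_for_all` are hypothetical helpers, and the timeouts are arbitrary values chosen for the example.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.2) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


def wait_for_all(ports, host: str = "127.0.0.1", timeout: float = 10.0) -> list:
    """Return the ports that never came up (an empty list means all are ready)."""
    return [p for p in ports if not wait_for_port(host, p, timeout=timeout)]
```

Calling `wait_for_all([8081, 8082, 8083, 8084, 8085])` before running the validator would return an empty list once all five mock services are listening. A successful TCP connect only proves the port is open, so the validator's body-level checks are still needed afterwards.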

Step 8: Run the Test Suite

The 19 tests break down as follows:

Test                                           Catches
test_report_file_exists                        Validator ran at all
test_report_top_level_keys                     Schema correctness
test_deployment_name                           Bug 1 (wrong key path returns wrong name)
test_timestamp_format                          ISO 8601 UTC format
test_all_five_services_present                 Bug 1 (only 1 service if wrong key)
test_auth_service_healthy                      Service probing works
test_api_gateway_healthy                       Service probing works
test_cache_service_healthy                     Agent read the manifest (endpoint is /ping)
test_worker_service_unhealthy                  Bug 2 (wrong field name)
test_notification_service_healthy              Service probing works
test_service_criticality_values                Manifest parsing
test_readiness_score                           Bug 4 (weights)
test_critical_services_healthy                 Bug 5 (criticality filter)
test_overall_status_degraded                   All bugs combined
test_startup_order_has_all_services            Bug 3 (topo sort)
test_startup_order_auth_before_gateway         Bug 3
test_startup_order_cache_before_gateway        Bug 3
test_startup_order_gateway_before_worker       Bug 3
test_startup_order_worker_before_notification  Bug 3

Run them:

# Inside the container, after running the validator
pytest /app/tests/ -v

A passing run looks like:

PASSED tests/test_outputs.py::test_report_file_exists
PASSED tests/test_outputs.py::test_report_top_level_keys
PASSED tests/test_outputs.py::test_deployment_name
PASSED tests/test_outputs.py::test_timestamp_format
PASSED tests/test_outputs.py::test_all_five_services_present
PASSED tests/test_outputs.py::test_auth_service_healthy
PASSED tests/test_outputs.py::test_api_gateway_healthy
PASSED tests/test_outputs.py::test_cache_service_healthy
PASSED tests/test_outputs.py::test_worker_service_unhealthy
PASSED tests/test_outputs.py::test_notification_service_healthy
PASSED tests/test_outputs.py::test_service_criticality_values
PASSED tests/test_outputs.py::test_readiness_score
PASSED tests/test_outputs.py::test_critical_services_healthy
PASSED tests/test_outputs.py::test_overall_status_degraded
PASSED tests/test_outputs.py::test_startup_order_has_all_services
PASSED tests/test_outputs.py::test_startup_order_auth_before_gateway
PASSED tests/test_outputs.py::test_startup_order_cache_before_gateway
PASSED tests/test_outputs.py::test_startup_order_gateway_before_worker
PASSED tests/test_outputs.py::test_startup_order_worker_before_notification

19 passed in 0.12s

Step 9: Running as a Terminal Bench 2.0 Task

This repository is also a valid Terminal Bench 2.0 task submission. You can run it against an AI agent to measure how reliably the agent finds all five bugs.

pip install bespokelabs-harbor

export GROQ_API_KEY=<your-key>

# Verify the oracle (the task must be solvable before you can score agents against it)
harbor run -p ./deployment-health-validator -a oracle -q

# Run an agent trial; k=10 gives a statistically meaningful success rate
harbor run -p ./deployment-health-validator \
    -a terminus-2 \
    -m groq/moonshotai/kimi-k2-instruct-0905 \
    -k 10
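When reading a success rate from 10 trials, it can help to look at the uncertainty around it. The sketch below computes a Wilson score interval at roughly 95% confidence; it is a general statistics helper added for illustration and is not part of Harbor or the task. The function name `wilson_interval` is an assumption of this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(5, 10)
print(f"{low:.2f} to {high:.2f}")  # 0.24 to 0.76
```

With 5 successes out of 10, the interval spans roughly 24% to 76%, so a handful of extra trials may be worthwhile before comparing agents or tuning difficulty.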

The Difficulty Calibration Story

Getting the task into the "hard" range (between 0% and 70% agent success) required five iterations:

Iteration  Bug 2 Design                                          Agent Success
1          "healthy" missing from accepted values                100% (too easy)
2          HTTP-only check, no body parsing                      0% (too hard)
3          No body check, but explicit hint about valid values   90% (still too easy)
4          No body check, vague hint about inspecting body       0% (agents ignore it)
5 (final)  Wrong field name with plausible default               ~40–60% (correct range)

The key insight: a bug must produce plausible-looking output without throwing exceptions. A validator that crashes is trivially easy to fix. A validator that silently produces wrong answers is genuinely hard, because you have to know what the correct answer should be before you can spot the discrepancy.

This is the same principle behind production incident post-mortems. The hardest outages aren't the ones where something crashes; they're the ones where something runs successfully while doing the wrong thing.


Key Takeaways

  1. HTTP status codes are not health signals. Always parse the response body and check the semantic status field. A 200 OK with {"status": "degraded"} means the service is degraded.

  2. Read your YAML carefully. Manifest files accumulate legacy keys over time. Know which section of the config is authoritative. Comment it explicitly.

  3. Topological sort edge direction matters. In Kahn's algorithm, an edge from A to B means A comes before B. If your dependency means "A must exist before B starts," the edge is A → B, in_degree[B] += 1. It's easy to get this exactly backwards.

  4. Weighted scoring should reflect business priority. Equal weights mean a low-priority background worker failing counts the same as your authentication service being down. That's not true in production.

  5. Critical service filtering should be explicit. If your alerting fires "critical" when any service is unhealthy, you'll desensitize your on-call team within a week. Be precise about which services actually matter for the critical threshold.

  6. Test with assertions about semantic values, not just structure. Checking that deployment_report.json exists is not enough. Check that worker-service.status == "unhealthy", that readiness_score == 0.9, and that startup_order.index("auth-service") < startup_order.index("api-gateway").
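The last takeaway can be sketched as a small checker over the report. This is an illustration in the spirit of the test suite, not the repository's test_outputs.py; `check_report` and `SAMPLE_REPORT` are hypothetical names, and the sample values are taken from the expected output shown earlier.

```python
SAMPLE_REPORT = {
    "overall_status": "degraded",
    "readiness_score": 0.9,
    "service_statuses": {
        "auth-service": {"status": "healthy"},
        "api-gateway": {"status": "healthy"},
        "cache-service": {"status": "healthy"},
        "worker-service": {"status": "unhealthy"},
        "notification-service": {"status": "healthy"},
    },
    "startup_order": [
        "auth-service", "cache-service", "api-gateway",
        "worker-service", "notification-service",
    ],
    "critical_services_healthy": True,
}


def check_report(report: dict) -> None:
    """Assert semantic values, not just that the keys exist."""
    statuses = report["service_statuses"]
    order = report["startup_order"]
    assert statuses["worker-service"]["status"] == "unhealthy"
    assert report["readiness_score"] == 0.9
    assert report["critical_services_healthy"] is True
    assert report["overall_status"] == "degraded"
    # Ordering is checked as a property, since more than one valid order can exist.
    assert order.index("auth-service") < order.index("api-gateway")
    assert order.index("worker-service") < order.index("notification-service")


check_report(SAMPLE_REPORT)
print("report passes semantic checks")
```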

AI-Native Infrastructure & Security Architecture Research | Subhanshu Mohan Gupta

Part 2 of 50

Independent research and deep technical exploration of AI-driven DevSecOps, resilient cloud architecture, cross-chain systems and large-scale distributed architecture.
