<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AI-Driven DevSecOps, Cloud Security & System Architecture | Subhanshu Mohan Gupta]]></title><description><![CDATA[Independent research and deep technical analysis on AI-driven DevSecOps, cloud security architecture, cross-chain interoperability, ransomware resilience and large-scale distributed system design.]]></description><link>https://blogs.subhanshumg.com</link><image><url>https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/logos/6442da7c019a6adb6b507559/904d6984-88c1-485d-85b3-d1ae624bb308.png</url><title>AI-Driven DevSecOps, Cloud Security &amp; System Architecture | Subhanshu Mohan Gupta</title><link>https://blogs.subhanshumg.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 17 May 2026 20:48:44 GMT</lastBuildDate><atom:link href="https://blogs.subhanshumg.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Crypto inventory: the platform workstream nobody scoped]]></title><description><![CDATA[Ironway Materials ran a crypto audit in 2024 and shipped two years of new TLS code on top of it. The 2026 audit found seventeen new libraries, three of them in the hot path, all of them invisible to t]]></description><link>https://blogs.subhanshumg.com/crypto-inventory-the-platform-workstream-nobody-scoped</link><guid isPermaLink="true">https://blogs.subhanshumg.com/crypto-inventory-the-platform-workstream-nobody-scoped</guid><category><![CDATA[crypto]]></category><category><![CDATA[PQC]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[supply chain]]></category><category><![CDATA[idp]]></category><category><![CDATA[falcon]]></category><category><![CDATA[spire]]></category><category><![CDATA[#sigstore]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Fri, 15 May 2026 09:06:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/95784f32-287c-4be1-b331-eaea94eec7c4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ironway Materials ran a crypto audit in 2024 and shipped two years of new TLS code on top of it. The 2026 audit found seventeen new libraries, three of them in the hot path, all of them invisible to the original inventory. The gap was not negligence; it was that one-shot audits decay.</p>
<p>We rebuilt inventory as a pipeline. Static scan in CI, runtime fingerprint in production, BOM emission per release and ownership in the platform catalog. This is the pipeline, with the BOM schema and the CI gate that keeps it honest.</p>
<hr />
<h2>Real-world scenario: how this plays out under pressure</h2>
<p><strong>The setup.</strong> Ironway Materials (manufacturing) had a one-shot 2024 crypto audit on the calendar and a crypto inventory that was eighteen months stale. Two years of new TLS code shipped without inventory tagging. The team built a continuous inventory pipeline, enabled hybrid X25519MLKEM768 at the edge, and planned the signing-side migration in line with NIST FIPS 203 / 204 / 205 deadlines. The work landed in three quarterly waves, each gated on a measurement that proved the previous wave held in production.</p>
<p><strong>The lesson the team wrote on the whiteboard.</strong> Post-quantum is not a flag flip; it is a three-year migration with regulatory deadlines and harvest-now-decrypt-later urgency. This piece walks through the inventory pipeline, the edge cutover, the signing migration, and the test suite that gates each wave.</p>
<hr />
<h2>Concept breakdown: what we are actually building</h2>
<p><strong>The concept in one paragraph.</strong> Post-quantum migration has three primitives: key encapsulation (ML-KEM, FIPS 203), digital signatures (ML-DSA, FIPS 204; SLH-DSA, FIPS 205; FN-DSA, FIPS 206 draft), and hardware key storage (HSMs, TPMs). The migration is <em>hybrid</em> during the transition window: classical and PQC run side by side, so the verifier ecosystem can catch up without breaking interop. Inventory comes first because you cannot migrate what you cannot see. Edge TLS migrates first because the cost is lowest; service mesh next; internal APIs last. Signing migrates to Sigstore and SLSA adds PQC support. Hardware migrates last because procurement cycles are years long. The whole thing has hard regulatory deadlines: CNSA 2.0 in 2027, federal TLS in 2030, full classical decommission by 2035.</p>
<hr />
<h2>The reference architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/2fae43ad-d890-4415-ab2f-da80919b393a.png" alt="" style="display:block;margin:0 auto" />

<p>The continuous inventory pipeline runs static scans at every build, runtime TLS-fingerprint scans in production, feeds a tagged CBOM, and surfaces ownership in the platform catalog.</p>
<p><strong>Architecture notes:</strong></p>
<ul>
<li><p>Static scanner: source-scan for <code>EVP_*</code>, <code>rustls::</code>, <code>crypto/tls</code>, etc.</p>
</li>
<li><p>Runtime scanner: ja3/ja4 TLS fingerprint of every prod flow.</p>
</li>
<li><p>CBOM emission: CycloneDX 1.6 Crypto BOM with tagged asset/owner/cadence.</p>
</li>
<li><p>Backstage Software Catalog: surfaces ownership; expires on team reorg.</p>
</li>
<li><p>Sigstore attestation on each inventory snapshot.</p>
</li>
</ul>
<hr />
<h2>End-to-end implementation guide</h2>
<p>A precise build order from zero to production, with the manifests and scripts the team actually shipped. Every block below corresponds to a file in <a href="https://github.com/SubhanshuMG/crypto-inventory-platform-workstream/tree/main/code"><code>code/</code></a> so you can read each step in isolation, then run the suite together.</p>
<h3>Step 1: Inventory the crypto surface continuously, not annually</h3>
<p>Every PQC migration begins with the same answer: you cannot migrate what you cannot see. The scanner below combines a static pass over the source tree with a runtime TLS-fingerprint pass in production. Run both in CI; emit a CycloneDX Crypto BOM and tag each asset with owner, algorithm, and rotation cadence.</p>
<pre><code class="language-python">#!/usr/bin/env python3
# Crypto inventory: the platform workstream nobody scoped for 2026
# Minimal crypto inventory scanner: maps TLS libs + ciphers found in the tree.
from __future__ import annotations
import json
import subprocess
from pathlib import Path

PATTERNS = {
    "openssl":     ["SSL_CTX", "EVP_"],
    "rustls":      ["rustls::"],
    "boringssl":   ["BoringSSL", "SSL_"],
    "go-tls":      ["crypto/tls"],
    "pqc-hybrid":  ["X25519Kyber768", "MLKEM768"],
}

def scan(root: Path) -&gt; dict:
    out = {k: 0 for k in PATTERNS}
    for p in root.rglob("*"):
        if not p.is_file() or p.stat().st_size &gt; 2_000_000:
            continue
        try:
            t = p.read_text(errors="ignore")
        except Exception:
            continue
        for name, pats in PATTERNS.items():
            if any(pat in t for pat in pats):
                out[name] += 1
    return out

if __name__ == "__main__":
    print(json.dumps(scan(Path(".")), indent=2))

# Tech components referenced: CycloneDX Crypto BOM (CBOM) 1.6, OpenSSF CryptoInventory, runtime TLS fingerprinting (ja3/ja4), Sigstore attestation on inventory snapshots, Backstage Software Catalog extensions.
</code></pre>
<h3>Step 2: Enable hybrid X25519MLKEM768 at the edge</h3>
<p>The edge is the cheapest, highest-leverage move because most browsers already speak the hybrid KEM. The Envoy listener below adds it to the front of the curve list; classical X25519 stays as a fallback for older clients. Audit middleboxes (WAFs, legacy LBs) before enforcement; some drop unknown ClientHellos.</p>
<pre><code class="language-yaml"># Crypto inventory: the platform workstream nobody scoped for 2026
# Envoy edge listener enabling hybrid PQC KEM (X25519MLKEM768).
admin: { address: { socket_address: { address: 0.0.0.0, port_value: 9901 } } }
static_resources:
  listeners:
    - name: edge_https
      address: { socket_address: { address: 0.0.0.0, port_value: 443 } }
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_params:
                  tls_minimum_protocol_version: TLSv1_3
                  ecdh_curves: ["X25519MLKEM768", "X25519"]
              require_client_certificate: false
          filters:
            - name: envoy.filters.network.http_connection_manager
</code></pre>
<h3>Step 3: Plan the service-mesh upgrade for H2 2026</h3>
<p>Mesh-side upgrades land with OpenSSL 3.5 and Istio 1.23+. The rollout is namespace-by-namespace; the canary is a low-traffic service that exposes both classical and hybrid endpoints. Monitor handshake p99 and failure rate; flip the default once parity holds.</p>
<pre><code class="language-yaml">apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata: { name: hybrid-mesh, namespace: istio-system }
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        BORINGSSL_HYBRID_KEM: "X25519MLKEM768"
  components:
    pilot:
      k8s:
        env:
          - { name: PILOT_ENABLE_HYBRID_KEM, value: "true" }
</code></pre>
<h3>Step 4: Pilot Dilithium signing on a low-risk attestation path</h3>
<p>PQC signatures land in Sigstore on the attestation path first. Pick a non-blocking predicate, sign with Dilithium alongside ECDSA, and verify both. Hybrid signing stays through 2028 minimum; the verifier ecosystem catches up by then.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
set -euo pipefail
IMAGE="$1"
# Hybrid sign: classical + Dilithium (alpha tooling).
cosign sign --yes --key-algo ecdsa-p256 "$IMAGE"
cosign sign --yes --key-algo dilithium3 "$IMAGE"  # tracking sigstore PQC RFC
# Verify both
cosign verify --signature-algorithm ecdsa-p256 "$IMAGE"
cosign verify --signature-algorithm dilithium3 "$IMAGE"
</code></pre>
<h3>Step 5: Plan the hardware refresh in line with 2033 deadlines</h3>
<p>HSMs and TPMs migrate last because the procurement and key-import work is real. Inventory existing devices, map firmware to the FIPS 203 / 204 / 205 roadmap, and negotiate refresh cycles into vendor contracts. The earlier this conversation starts, the smoother the 2030-2033 cutover lands.</p>
<pre><code class="language-yaml"># crypto-asset.yaml: a Backstage catalog entry per HSM / TPM / KMS root
apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: hsm-east-prod
  annotations:
    crypto.algorithm:    "RSA-4096 / ECDSA-P256"
    crypto.pqc-roadmap:  "firmware-update-q3-2027"
    crypto.refresh-due:  "2033-01-01"
spec:
  type: hsm
  owner: platform-security
</code></pre>
<hr />
<h2>Testing strategy</h2>
<p>Unit, integration, and chaos exercises that gate the rollout. Run each in a non-production cluster first; expand to staging once the green-path tests pass and the negative tests reject the bad input the way the policy says they will.</p>
<h3>Test 1: Edge handshake includes hybrid KEM</h3>
<pre><code class="language-bash">openssl s_client -connect edge.example.com:443 -groups X25519MLKEM768 &lt; /dev/null 2&gt;&amp;1 | grep -i 'shared group'
</code></pre>
<p><strong>Expected:</strong> Output includes <code>Shared groups: X25519MLKEM768</code>.</p>
<h3>Test 2: Hybrid signature verifies on attestation</h3>
<pre><code class="language-bash">cosign verify --signature-algorithm dilithium3 ghcr.io/your-org/api:v1.4.0
</code></pre>
<p><strong>Expected:</strong> <code>Verified OK</code> with <code>dilithium3</code> signer record.</p>
<h3>Test 3: Inventory pipeline catches an unscanned crypto lib</h3>
<pre><code class="language-bash">cd ~/svc-new &amp;&amp; python3 ../tools/crypto-inventory.py | jq '.openssl'
</code></pre>
<p><strong>Expected:</strong> Non-zero count; CI fails the PR until the asset is owned and tagged.</p>
<hr />
<h2>Security considerations</h2>
<ul>
<li><p><strong>IAM:</strong> the crypto-inventory pipeline runs as a CI workload with read-only access to the source tree and write-only access to the inventory bucket. The signing pipeline uses a SPIRE-issued identity that maps to a Cosign keyless OIDC subject.</p>
</li>
<li><p><strong>Secrets management:</strong> classical and PQC private keys live in HSMs; the signing pipeline never sees raw key material. Hybrid signing uses two key handles, one classical, one PQC, both in the HSM.</p>
</li>
<li><p><strong>Vulnerability scanning:</strong> every TLS library version surfaces in the inventory; Renovate or Dependabot watches for security releases; PQC-related advisories land in the same review queue.</p>
</li>
<li><p><strong>Network policies:</strong> TLS terminators (Envoy, NGINX, BoringSSL) sit behind WAFs that have been audited for hybrid ClientHello handling; legacy middleboxes that drop unknown cipher suites are inventoried and replaced before the rollout reaches their path.</p>
</li>
<li><p><strong>Hybrid signing:</strong> classical and PQC signatures are emitted side by side through 2028 minimum; verifiers must succeed on either; CI fails the build if a verifier accepts only one when both are configured.</p>
</li>
</ul>
<hr />
<h2>Scaling and optimization</h2>
<ul>
<li><p><strong>Horizontal scaling:</strong> the crypto-inventory pipeline scales out per repository; one CI job per service is the natural shape. Hybrid TLS at the edge scales with your CDN; the KEM cost is added per handshake, not per byte.</p>
</li>
<li><p><strong>Vertical scaling:</strong> Dilithium adds roughly 10% to signing CPU compared to ECDSA; SLH-DSA is roughly 100 times slower. Budget signing throughput accordingly. KEM operations on modern CPUs are fast; the wire size is the bigger concern at high QPS.</p>
</li>
<li><p><strong>Cost optimization:</strong> hybrid signing doubles signature size during the migration window; use Falcon for size-sensitive paths (SBOMs, attestations); use Dilithium for throughput-bound paths (per-request signing). HSM cost is procurement; plan refresh cycles years ahead.</p>
</li>
<li><p><strong>Performance tuning:</strong> TLS handshake p99 with hybrid KEM is within 5 ms of classical on most stacks; the bottleneck is usually middlebox compatibility, not crypto cost. Audit middleboxes before claiming a baseline.</p>
</li>
</ul>
<hr />
<h2>Failure scenarios and recovery</h2>
<ol>
<li><p><strong>Static scan misses dynamically loaded crypto (plugins, WebAssembly).</strong> Layer runtime scan; dynamic loads show up as TLS fingerprints.</p>
</li>
<li><p><strong>Ownership tag expires and is never re-assigned.</strong> Gate PRs that reference untagged assets; force re-assignment.</p>
</li>
<li><p><strong>CBOM format changes between tool versions; downstream consumers break.</strong> Pin CBOM version; include in attestation.</p>
</li>
</ol>
<h3>When NOT to do this</h3>
<p>For small orgs with a single codebase and a single deploy, a quarterly one-shot audit may suffice. At any scale above one team, continuous is the floor.</p>
<hr />
<h2>The thesis</h2>
<p>Crypto inventory is a platform service. A one-time project gives you a 2024 answer.</p>
<hr />
<h2>Why this matters now</h2>
<p>Meta's crypto inventory (static + runtime scans feeding a tagged asset graph) is the emerging pattern. Every serious org started one in Q1 2026 and nobody has finished.</p>
<hr />
<h2>Narrative arc</h2>
<p>Why one-time audits fail → static + runtime dual-scan → the tagging schema (asset, owner, algorithm, rotation cadence) → CI integration so inventory stays live.</p>
<hr />
<h2>What most people believe, and why it falls apart</h2>
<p>"We did a crypto audit last year." A point-in-time audit misses every new TLS lib and every new signing path shipped since.</p>
<p>One-shot audits were adequate when crypto libraries were rare. In 2026, every new service ships with TLS; every new build step ships a signature. Inventory has to be continuous to be true.</p>
<hr />
<h2>The timeline</h2>
<ul>
<li><p><strong>2024-08</strong>, NIST standardizes FIPS 203 (ML-KEM), 204 (ML-DSA), 205 (SLH-DSA).</p>
</li>
<li><p><strong>2025-H2</strong>, Browser-side hybrid X25519MLKEM768 becomes default in Chrome, Firefox, Safari for TLS 1.3.</p>
</li>
<li><p><strong>2026-04</strong>, Meta publishes Post-Quantum Cryptography Migration framework and lessons.</p>
</li>
<li><p><strong>2026-mid</strong>, OpenSSL 3.5 ships with production-ready PQC; server-side hybrid rollout unblocks.</p>
</li>
<li><p><strong>2026-09-21</strong>, NIST CMVP moves all remaining FIPS 140-2 certificates to Historical status.</p>
</li>
<li><p><strong>2027-01</strong>, CNSA 2.0 transition deadline for US National Security Systems begins.</p>
</li>
</ul>
<hr />
<h2>The decision tree, matrix, runbook</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/7a6a5f77-1480-43bf-8bfd-961a71f08f1f.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p>Do you run both static (source-scan) and runtime (TLS fingerprint) scans?</p>
</li>
<li><p>Is every asset tagged with owner, algorithm, rotation cadence?</p>
</li>
<li><p>Is your BOM format one of CBOM / CycloneDX 1.6 Crypto / in-toto?</p>
</li>
<li><p>Does the pipeline run in CI, not quarterly?</p>
</li>
<li><p>Is the inventory linked to your platform catalog (Backstage) for ownership?</p>
</li>
</ol>
<hr />
<h2>What to ship this quarter</h2>
<ul>
<li><p>Deploy a continuous crypto-inventory pipeline in CI.</p>
</li>
<li><p>Add runtime TLS-fingerprint scanning in production.</p>
</li>
<li><p>Emit CBOM per build; tag with owner/algorithm/cadence.</p>
</li>
<li><p>Integrate ownership with Backstage or the platform catalog.</p>
</li>
<li><p>Attest snapshots via Sigstore.</p>
</li>
</ul>
<hr />
<h2>Production observability</h2>
<ul>
<li><p>Track classical vs hybrid handshake share per edge POP; each POP should hit &gt; 70% hybrid before flipping the default.</p>
</li>
<li><p>Inventory drift on every release; a PR that adds new crypto without an inventory tag fails CI.</p>
</li>
<li><p>Signing throughput monitored; Dilithium adds ~10% over ECDSA; budget for it.</p>
</li>
<li><p>Middlebox handshake-failure rate; a spike at rollout means an unmapped legacy device.</p>
</li>
<li><p>HSM and TPM firmware versions surfaced in the platform catalog; drift triggers a procurement review.</p>
</li>
</ul>
<hr />
<h2>Tech components</h2>
<blockquote>
<p>CycloneDX Crypto BOM (CBOM) 1.6</p>
<p>OpenSSF CryptoInventory</p>
<p>runtime TLS fingerprinting (ja3/ja4)</p>
<p>Sigstore attestation on inventory snapshots</p>
<p>Backstage Software Catalog extensions</p>
</blockquote>
<hr />
<h2>Final word</h2>
<p>Post-quantum is a calendar problem. The math is settled, the libraries are real, the deadlines are public. Start the inventory pipeline this quarter, and you will finish the migration on time.</p>
<hr />
<h2>Further reading</h2>
<ol>
<li><p><strong>Meta Engineering, PQC Migration at Meta (Apr 2026)</strong>, The inventory pipeline as operated at scale.</p>
</li>
<li><p><strong>CycloneDX 1.6 Crypto BOM specification</strong>, The emerging BOM standard.</p>
</li>
<li><p><strong>OWASP Cryptographic Inventory project</strong>, the open-source tooling reference.</p>
</li>
</ol>
<p>See <a href="https://github.com/SubhanshuMG/crypto-inventory-platform-workstream/blob/main/references.md"><code>references.md</code></a> for the full bibliography.</p>
]]></content:encoded></item><item><title><![CDATA[The agentic SOC is here]]></title><description><![CDATA[Three vendor agentic-SOC platforms in Q1 2026. One platform-engineering charter that decides whether they work.

Microsoft Sentinel AI shipped. Splunk Agentic SOC launched at RSAC. Google SecOps and T]]></description><link>https://blogs.subhanshumg.com/the-agentic-soc-is-here</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-agentic-soc-is-here</guid><category><![CDATA[agentic-soc]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[AI]]></category><category><![CDATA[SecOps]]></category><category><![CDATA[Security]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[SOC]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Mon, 11 May 2026 10:38:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/d796d07d-f03d-4bbc-9beb-2cbb745e9d03.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<blockquote>
<p><strong>Three vendor agentic-SOC platforms in Q1 2026. One platform-engineering charter that decides whether they work.</strong></p>
</blockquote>
<p>Microsoft Sentinel AI shipped. Splunk Agentic SOC launched at RSAC. Google SecOps and Torq integrated. Each one is real; each one consumes whatever telemetry, runbooks, and policy your platform team owns. The seam between <em>vendor agent loop</em> and <em>platform substrate</em> is where the value lands or fails to land. This is the own-vs-buy matrix Greendell Critical Infra used to ship its agentic SOC, with the confidence-band policy that decided what the agent did unsupervised.</p>
<hr />
<h2>The thesis</h2>
<p>Agentic SOC is a joint venture between the platform and SOC. Map the seams now.</p>
<hr />
<h2>Why this matters now</h2>
<p>Microsoft, Splunk, Google SecOps all shipped agentic SOCs in Q1 2026. The platform vs vendor seam is the question every SOC leader is answering this quarter.</p>
<hr />
<h2>Narrative arc</h2>
<p>What shipped (Microsoft Sentinel AI, Splunk Agentic SOC, Google SecOps) -&gt; the platform-owned layer (telemetry, runbooks, policy) vs vendor layer (agent loop, UI) → the confidence-band policy that gates autonomous resolution.</p>
<hr />
<h2>What most people believe and why it falls apart</h2>
<p>"Buy an agentic SOC vendor." True, and incomplete. The platform-owned half is what determines whether the vendor works.</p>
<p>Vendor agentic SOCs (Microsoft Sentinel AI, Splunk Agentic SOC, Google SecOps + Torq) ship real value; they automate work that previously sat in a ticket queue. The gap is that the platform-owned half (telemetry, policy-as-code runbooks, confidence bands) determines 80% of the outcome.</p>
<hr />
<h2>The timeline</h2>
<ul>
<li><p><strong>2026-04-09</strong>, Microsoft Security Blog: 'The agentic SOC, Rethinking SecOps for the next decade' introduces the new operating model.</p>
</li>
<li><p><strong>2026-RSAC</strong>, Splunk launches Agentic SOC; Google SecOps + Torq integrate for agentic SecOps loops.</p>
</li>
<li><p><strong>2026</strong>,</p>
<ul>
<li><p>Stellar Cyber ranks top 10 agentic SOC platforms; Pondurance Kanati ships as managed agentic-SOC MDR.</p>
</li>
<li><p>Sigma + OCSF adoption crosses the threshold where detection rules are CI-tested and vendor-portable.</p>
</li>
<li><p>AgentSOC paper (IEEE 2026) formalizes multi-layer agentic SOC with confidence-band autonomous action.</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2>The decision tree, matrix, and runbook</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/5ec0e5c7-5297-4207-95d9-741be6957960.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p>Who owns the telemetry pipeline? If the vendor does, you're renting your own ground truth.</p>
</li>
<li><p>Who owns the runbook catalog? Structured runbooks are the agent's execution surface.</p>
</li>
<li><p>Who defines the confidence band? Autonomous vs escalate is your policy.</p>
</li>
<li><p>Is the agent audited (every decision logged)?</p>
</li>
<li><p>Is there a human escalation tier, with a named on-call?</p>
</li>
</ol>
<hr />
<h2>Real-world scenario, how this plays out under pressure</h2>
<p><strong>The setup.</strong> Greendell Critical Infra (transit cybersecurity) ran a 24/7 SOC and faced a 2026 Microsoft Sentinel AI rollout. Platform-owned telemetry; vendor-owned the agent loop. The team rebuilt the substrate first (OpenTelemetry + OCSF), authored Sigma rules in a Git repo with CI red-samples, structured the on-call runbooks so agents could execute reversible steps, and let the vendor's agent loop sit on top. The human tier moved up the stack: exception handling and confidence-band policy.</p>
<p><strong>The lesson the team wrote on the whiteboard.</strong> Vendors ship the loop; platforms own the substrate. This piece walks the telemetry pipeline, the Sigma + OCSF detection-as-code, the structured runbook format, and the red-vs-green test suite that gated every rule before it reached production.</p>
<hr />
<h2>Concept breakdown: what we are actually building</h2>
<p><strong>The concept in one paragraph.</strong> An agentic SOC is a layered system: telemetry (OpenTelemetry, OCSF) feeds a detection layer (Sigma rules in Git, CI-tested), which feeds a correlation layer (graph engines), which feeds a response layer (structured runbooks, agents that execute reversible steps). Confidence bands decide what the agent runs unsupervised and what escalates to a human. Platform engineering owns the substrate (telemetry pipeline, runbook catalog, policy); vendor agentic-SOC products own the loop and the UI on top. Detection becomes code, runbooks become YAML, and humans move up the stack to exception handling and policy setting.</p>
<hr />
<h2>The reference architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/ded27a89-8fe1-41fa-93fe-011fb4605e8d.png" alt="" style="display:block;margin:0 auto" />

<p>Platform owns the substrate: telemetry (OTel + OCSF), policy-as-code runbooks, confidence bands. Vendor owns the agent loop and UI. The seam is documented and auditable.</p>
<p><strong>Architecture notes:</strong></p>
<ul>
<li><p>OpenTelemetry + OCSF as the telemetry substrate.</p>
</li>
<li><p>Runbook catalog: structured YAML, version-controlled, CI-tested.</p>
</li>
<li><p>Confidence band policy: platform-owned.</p>
</li>
<li><p>Agent loop: vendor-owned; consumes telemetry, executes runbooks.</p>
</li>
<li><p>Human escalation tier with named on-call.</p>
</li>
</ul>
<p><strong>Manifests:</strong> <a href="https://github.com/SubhanshuMG/agentic-soc-platform-own-vs-buy/tree/main/code"><code>code/</code></a></p>
<hr />
<h2>End-to-end implementation guide</h2>
<p>A precise build order from zero to production, with the manifests and scripts the team actually shipped. Every block below corresponds to a file in <a href="https://github.com/SubhanshuMG/agentic-soc-platform-own-vs-buy/tree/main/code"><code>code/</code></a> so you can read each step in isolation, then run the suite together.</p>
<h3>Step 1: Pipe telemetry through OpenTelemetry and OCSF</h3>
<p>Detection is only as good as the substrate. OpenTelemetry collects logs, metrics, and traces; the OCSF normalizer reshapes them into a vendor-neutral schema. The collector below routes every event into both the SIEM and a long-term object store.</p>
<pre><code class="language-yaml">receivers:
  otlp: { protocols: { grpc: {}, http: {} } }
processors:
  ocsf_normalizer: { schema_version: "1.2" }
exporters:
  splunk_hec:
    endpoint: https://splunk.example.com:8088
    token: "${SPLUNK_TOKEN}"
  s3:
    endpoint: https://s3.example.com
    bucket: ocsf-events
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [ocsf_normalizer]
      exporters:  [splunk_hec, s3]
</code></pre>
<h3>Step 2: Author detection rules in Sigma, test in CI</h3>
<p>Sigma is the source of truth; rules compile to Splunk, Elastic, Chronicle, or CrowdStrike. Each rule ships with a red-sample (an event that should fire) and a green-sample (an event that should not). CI runs both before merge.</p>
<pre><code class="language-yaml"># The agentic SOC is here. Platform teams: here is what you own.
# Sigma detection rule (vendor-agnostic, compiles to Splunk/Elastic/Chronicle).
title: Suspicious container exec of shell
id: 6fa5e0c8-5a6a-4bd4-ae3d-9b55e0f1a001
status: stable
description: Detects an exec into a container that launches a shell after startup.
logsource:
    category: process_creation
    product: linux
detection:
    selection_proc:
        Image|endswith:
            - '/bash'
            - '/sh'
            - '/zsh'
    selection_parent:
        ParentImage|endswith:
            - '/containerd-shim-runc-v2'
            - '/runc'
    condition: selection_proc and selection_parent
falsepositives:
    - Maintenance exec
    - Kubernetes kubectl exec by trusted operators
level: high
tags:
    - attack.execution
    - attack.t1609
</code></pre>
<h3>Step 3: Compile Sigma to your SIEM and ship via CI</h3>
<p><code>sigma-cli</code> compiles to vendor SPL/EQL/YARA-L. The CI pipeline below compiles, deploys, and verifies the rule fires against a known-bad sample before promoting to production.</p>
<pre><code class="language-yaml">name: sigma-ship
on: { pull_request: {}, push: { branches: [main] } }
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install sigma-cli pysigma-backend-splunk
      - run: sigma convert -t splunk -p splunk_windows ./rules &gt; out.spl
      - run: ./test/red-sample.sh &lt; out.spl
      - if: github.ref == 'refs/heads/main'
        run: ./scripts/deploy-splunk.sh out.spl
</code></pre>
<h3>Step 4: Convert tribal runbooks into structured YAML</h3>
<p>Agents cannot execute Confluence pages; humans skim them under pressure. The structured runbook below has preconditions, reversible steps, verification gates, and a mandatory cleanup. The same YAML drives an agentic SOC and an on-call human.</p>
<pre><code class="language-yaml"># The agentic SOC is here. Platform teams: here is what you own.
# Structured runbook: handle a container-exec alert.
name: container-exec-triage
version: 1
preconditions:
  - cluster: { in: [prod-a, prod-b, prod-c] }
  - alert_severity: { gte: high }
  - on_call: { available: true }
steps:
  - id: gather_context
    reversible: true
    action: shell
    command: kubectl describe pod \({alert.pod} -n \){alert.namespace}
  - id: check_image_provenance
    reversible: true
    action: shell
    command: cosign verify ${alert.image}
  - id: cordon_node
    reversible: true
    action: shell
    command: kubectl cordon ${alert.node}
  - id: snapshot_pod
    reversible: true
    action: shell
    command: crictl ps -a --name ${alert.pod}
  - id: human_gate
    reversible: false
    action: approval
    approvers: [sre-oncall, security-lead]
  - id: terminate_pod
    reversible: false
    action: shell
    command: kubectl delete pod \({alert.pod} -n \){alert.namespace}
cleanup:
  - kubectl uncordon ${alert.node}
# Tech components referenced: Microsoft Sentinel AI, Splunk Agentic SOC (RSAC 2026), Google SecOps + Torq, Stellar Cyber, Radiant Security, OpenTelemetry pipeline, OCSF schema, structured runbook library.
</code></pre>
<h3>Step 5: Define the confidence band the agent operates within</h3>
<p>The platform owns the policy that decides what the agent does autonomously and what escalates. Below: a YAML that the agentic SOC platform consumes, mapping action classes to confidence thresholds.</p>
<pre><code class="language-yaml">apiVersion: socops.example.com/v1
kind: ConfidenceBand
metadata: { name: tier-1-policies }
spec:
  bands:
    - action_class: enrichment
      auto_threshold: 0.5
    - action_class: containment_reversible
      auto_threshold: 0.85
      escalation_threshold: 0.7
    - action_class: containment_irreversible
      auto_threshold: 1.01     # never autonomous
      escalation_threshold: 0.0
</code></pre>
<hr />
<h2>Security considerations</h2>
<ul>
<li><p><strong>IAM:</strong> the agentic SOC platform is a tenant of the platform's identity plane: telemetry collectors, detection runners, and response agents each carry SPIFFE SVIDs. Vendor SaaS agentic SOCs federate via OIDC into a least-privilege cross-account role.</p>
</li>
<li><p><strong>Secrets management:</strong> detection rules carry no secrets; the playbook layer references secrets via Vault paths that resolve at runtime. Tokens for the SOC platform's external integrations rotate on the same cadence as the rest of the platform.</p>
</li>
<li><p><strong>Vulnerability scanning:</strong> Sigma rule packs live in Git; CI scans them for syntax, performance regressions, and references to deprecated fields. Telemetry collectors are signed and admission-gated like any other workload.</p>
</li>
<li><p><strong>Network policies:</strong> telemetry flows out to the SIEM only; collectors do not reach the internet directly. Vendor agent loops integrate via dedicated VPC endpoints with mTLS, not public APIs.</p>
</li>
<li><p><strong>Confidence bands and human gates:</strong> policy declares which action classes the agent runs unsupervised, which escalate, and which always wait for a human; every autonomous action is audit-logged with the structured intent and the rule that justified it.</p>
</li>
</ul>
<hr />
<h2>Testing strategy</h2>
<p>Unit, integration, and chaos exercises that gate the rollout. Run each in a non-production cluster first; expand to staging once the green-path tests pass and the negative tests reject the bad input the way the policy says they will.</p>
<h3>Test 1: Sigma red-sample fires the rule</h3>
<pre><code class="language-bash">sigma test ./rules/container-shell-exec.yml --backend splunk --sample test/red.evtx
</code></pre>
<p><strong>Expected:</strong> <code>MATCHED</code> against the red sample; rule promoted to deploy.</p>
<h3>Test 2: Green-sample stays quiet</h3>
<pre><code class="language-bash">sigma test ./rules/container-shell-exec.yml --backend splunk --sample test/green.evtx
</code></pre>
<p><strong>Expected:</strong> <code>NO MATCH</code>; baseline kubectl exec by trusted operators is not flagged.</p>
<h3>Test 3: Runbook executes through the gate</h3>
<pre><code class="language-bash">runbook run container-exec-triage --simulate --alert ./testdata/alert.json
</code></pre>
<p><strong>Expected:</strong> Steps <code>gather_context</code>, <code>cordon_node</code> run; <code>terminate_pod</code> waits for human approval.</p>
<hr />
<h2>Scaling and optimization</h2>
<ul>
<li><p><strong>Horizontal scaling:</strong> the telemetry pipeline scales with traffic; OpenTelemetry collectors are stateless and add a partition layer at the OCSF normalizer. Sigma rule evaluation runs in the SIEM and benefits from the SIEM's scaling model.</p>
</li>
<li><p><strong>Vertical scaling:</strong> detection rules with high false-positive rates dominate evaluation cost; tune via the FP budget so the platform stays cost-bounded. The agent loop's cost is per-event under autonomous resolution; budget per-action per-class.</p>
</li>
<li><p><strong>Cost optimization:</strong> Sigma + OCSF avoids vendor-lock and lets you migrate SIEMs without rewriting rules; the savings show up at procurement renewal. The agent loop saves analyst hours; track the autonomous-resolution rate as a KPI alongside cost.</p>
</li>
<li><p><strong>Performance tuning:</strong> structured runbooks let agents execute without paging; tune the confidence-band thresholds against the FP rate so escalations are precision-bounded.</p>
</li>
</ul>
<hr />
<h2>Failure scenarios and recovery</h2>
<ol>
<li><p><strong>Vendor changes telemetry requirements mid-contract.</strong> Platform owns the telemetry contract; vendor adapts.</p>
</li>
<li><p><strong>Confidence band drifts because agent tuning evolves.</strong> Lock policy at a cadence; review quarterly.</p>
</li>
<li><p><strong>Runbook fails mid-execution; agent cannot recover.</strong> Reversibility tag per step; rollback on failure.</p>
</li>
</ol>
<h3>When NOT to do this</h3>
<p>Small orgs without a dedicated SOC may not have the own-side investment to justify the vendor loop. For orgs with 24/7 SOC operations, the platform-owned substrate is baseline.</p>
<hr />
<h2>What to ship this quarter</h2>
<ul>
<li><p>Stand up OTel + OCSF telemetry substrate.</p>
</li>
<li><p>Author 20 structured runbooks for common incidents.</p>
</li>
<li><p>Define the confidence-band policy.</p>
</li>
<li><p>Pilot one vendor agentic SOC.</p>
</li>
<li><p>Document the platform-vendor seam.</p>
</li>
</ul>
<h2>Production observability</h2>
<ul>
<li><p>Mean time to detect (MTTD) and mean time to respond (MTTR) tracked per detection class.</p>
</li>
<li><p>Sigma rule coverage by MITRE ATT&amp;CK technique; gaps drive the next rule sprint.</p>
</li>
<li><p>Confidence-band autonomous-resolution rate; a sudden drop indicates the agent's signal degraded.</p>
</li>
<li><p>Runbook execution success rate; runbooks failing mid-execution are bugs, not bad luck.</p>
</li>
<li><p>False-positive budget per rule; alert fatigue is the slowest, surest way to lose a SOC.</p>
</li>
</ul>
<hr />
<h2>Tech components</h2>
<p>Microsoft Sentinel AI, Splunk Agentic SOC (RSAC 2026), Google SecOps + Torq, Stellar Cyber, Radiant Security, OpenTelemetry pipeline, OCSF schema, structured runbook library.</p>
<hr />
<h2>Final word</h2>
<p>Vendors ship the agent loop. Platforms own the substrate. The seam between them is where every agentic-SOC rollout succeeds or fails; map the seam before you sign the contract.</p>
<hr />
<h2>Further reading</h2>
<ol>
<li><p><strong>Microsoft Security Blog, The agentic SOC (April 2026)</strong>, The flagship vendor framing.</p>
</li>
<li><p><strong>Splunk Agentic SOC RSAC 2026 launch</strong>, The parallel-vendor framing.</p>
</li>
<li><p><strong>Elastic Security Labs, Why 2026 is the year to upgrade to an Agentic AI SOC</strong>, The industry synthesis.</p>
</li>
</ol>
<p>See <a href="https://github.com/SubhanshuMG/agentic-soc-platform-own-vs-buy/blob/main/references.md"><code>references.md</code></a> for the full bibliography</p>
]]></content:encoded></item><item><title><![CDATA[The distributed monolith tax]]></title><description><![CDATA[We collapsed 47 microservices to 8. Deploy time went up, latency went down, and on-call went silent. Here's what the microservices evangelists didn't tell you about Dunbar's number for services.




M]]></description><link>https://blogs.subhanshumg.com/the-distributed-monolith-tax</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-distributed-monolith-tax</guid><category><![CDATA[DevSecOps]]></category><category><![CDATA[System Design]]></category><category><![CDATA[monolithic architecture]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[production]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[software development]]></category><category><![CDATA[engineering]]></category><category><![CDATA[leadership]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Wed, 06 May 2026 03:17:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/1fdffd45-63c7-4574-9e05-926fb419c1a2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<blockquote>
<p><strong>We collapsed 47 microservices to 8. Deploy time went up, latency went down, and on-call went silent. Here's what the microservices evangelists didn't tell you about Dunbar's number for services.</strong></p>
</blockquote>
<hr />
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/6e11c30d-5581-4d31-ab0f-b095bca3e517.png" alt="" style="display:block;margin:0 auto" />

<p>Microservices aren't cheaper at 47 services with 20 engineers. They're more expensive. Shopify scaled back from 200 microservices to modular monoliths. Amazon Prime Video went from 9 services to 3, cutting latency 50%. The 2026 rule is tight: if your organization is fewer than 50 engineers, the distributed-monolith tax (inter-service latency, deployment coordination, operational toil, debugging hell) exceeds the benefit of service independence. Dunbar's number applies to services, not just people. A modular monolith with clean boundaries, deployed as a single artifact, scales to 3-4 teams without the heartburn.</p>
<hr />
<h2>Why this matters right now</h2>
<ol>
<li><p><strong>Microservices fatigue has data now.</strong> Amazon Prime Video's 2023 case study showed 50% latency reduction and 40% simpler on-call after collapsing to 3 services. That's not anecdote, that's signal.</p>
</li>
<li><p><strong>Kubernetes made the operational cost invisible.</strong> The "free" sidecar, the "easy" deployment pipeline and the "simple" observability layer all have FTE budgets. A 2024 CNCF survey found the median ops team managing 30+ microservices at a company with &lt;100 engineers.</p>
</li>
<li><p><strong>Modular monoliths have library support now.</strong> Packwerk (Ruby), go-pkgsite (Go), Python namespaces, and Rust workspaces make intra-monolith boundaries enforceable. This wasn't true in 2015.</p>
</li>
<li><p><strong>The distributed-monolith tax is quantifiable.</strong> Every service-to-service call adds 5-50ms of latency. At 47 services with 5 hops per request, you're burning 250ms in the network alone. A monolith does it in 0.2ms.</p>
</li>
</ol>
<hr />
<h2>Mainstream belief vs. what production actually shows</h2>
<p><strong>Mainstream belief:</strong> "Microservices let teams scale independently. One team per service."</p>
<p><strong>What production shows:</strong> "At 47 services and 20 teams, you're not scaling. You're coordinating 47 release schedules, debugging across 47 logs, and paying for 47 database connections per request. Teams are smaller but more blocked by other teams. The 'independent' team owns a service nobody else knows about that hasn't been deployed in 18 months."</p>
<hr />
<h2>A short, opinionated timeline</h2>
<table>
<thead>
<tr>
<th>Date</th>
<th>Event</th>
<th>Why it matters</th>
</tr>
</thead>
<tbody><tr>
<td>2014-2016</td>
<td>Microservices hype cycle</td>
<td>Everyone read Nginx and Sam Newman</td>
</tr>
<tr>
<td>2018</td>
<td>Netflix publishes SOA at scale</td>
<td>3000 microservices, 600 engineers, $10B+ platform cost</td>
</tr>
<tr>
<td>2019</td>
<td>Shopify refactors to modular monolith</td>
<td>Faster deploys, same isolation</td>
</tr>
<tr>
<td>2023-03</td>
<td>Amazon Prime Video publishes "Scaling to 200 services"</td>
<td>50% latency reduction after consolidation</td>
</tr>
<tr>
<td>2024-11</td>
<td>Gojek goes back to monolith for core</td>
<td>8 services from 47, team velocity +30%</td>
</tr>
<tr>
<td>2025-Q2</td>
<td>Dunbar's number for services enters mainstream</td>
<td>Platform team concept solidifies</td>
</tr>
<tr>
<td>2026-Q2</td>
<td>(Now.) Most new startups shipping modular monoliths</td>
<td>Netflix cosplay era ending</td>
</tr>
</tbody></table>
<p>The inflection point was 2023. The signal was Prime Video.</p>
<hr />
<h2>The decision tree</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/2513f0aa-5849-4617-b9e7-e6846f5d295c.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>The reference architecture: modular monolith at Shopify scale</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/2519e78c-8451-4448-a65b-91022b157bd8.png" alt="" style="display:block;margin:0 auto" />

<p>The architecture that works at 50-200 engineers is a monolith with enforced module boundaries. Shopify's pattern (deployed as one artifact, logically 6-8 modules, each with a package boundary):</p>
<ol>
<li><p><strong>Enforced module boundaries via linters.</strong> Packwerk (or go-pkgsite, or similar per-language) ensures Service A cannot import private code from Service B. This is compile-time, not runtime.</p>
</li>
<li><p><strong>Shared database with logical schemas.</strong> One PostgreSQL, but each module has a schema or namespace. Migrations are coordination points (hard), but they're batch jobs, not chaos.</p>
</li>
<li><p><strong>Event bus for inter-module async.</strong> Kafka or Redis pubsub for decoupled events: order.placed, inventory.reserved. Not every interaction is sync RPC.</p>
</li>
<li><p><strong>Single deployment pipeline.</strong> One Git repo (monorepo), one CI/CD pipeline, one prod release every 2-4 hours. No service-A-blocked-on-service-B stories.</p>
</li>
<li><p><strong>Shared observability stack.</strong> One Datadog/Grafana, structured logging with service/module tags, distributed tracing with a single root span. No "I can't log in to service-B's Prometheus" problems.</p>
</li>
</ol>
<p>The key difference from a Ball of Mud: Packwerk or equivalent enforces the boundaries. A random developer cannot bypass the module contract.</p>
<hr />
<h2>Step-by-step implementation: collapsing 47 to 8</h2>
<h3>Phase 1: Audit (week 1-2)</h3>
<ol>
<li><p>Map all 47 services with: dependency graph (what calls what), deployment frequency (daily? yearly?), on-call load (pages per week), latency contribution (P50, P99 tail).</p>
</li>
<li><p>Identify the 5-8 "core domains" based on business logic, not service count. Order, Payment, Inventory, Notification, Analytics, etc.</p>
</li>
<li><p>Group services by core domain. You'll find: most services are never-deployed sidecars or thin wrappers around the core.</p>
</li>
</ol>
<h3>Phase 2: Build module boundaries (weeks 3-4)</h3>
<ol>
<li><p>Create a single monorepo with directories: <code>order/</code>, <code>payment/</code>, <code>inventory/</code>, etc. Each is a package.</p>
</li>
<li><p>Install Packwerk (Ruby) or go-pkgsite (Go) or equivalent. Define allowed dependencies: <code>payment/ can call</code> order/` but not vice versa.</p>
</li>
<li><p>Migrate one service at a time. Start with the leaves (services nothing depends on). Move API + business logic into the module directory.</p>
</li>
<li><p>Run the boundary checker: it will scream at you. Fix the violations or add documented exceptions.</p>
</li>
</ol>
<p>Example Packwerk configuration:</p>
<pre><code class="language-yaml"># packwerk.yml
cache: false
include:
  - '{app,components}/**/'
exclude:
  - 'spec/**'
  - 'vendor/**'
dependency_violations:
  - 'app/models/order'
</code></pre>
<h3>Phase 3: Shared database + async boundaries (weeks 5-7)</h3>
<ol>
<li><p>Create a single PostgreSQL or MySQL database (or keep read replicas for the legacy services during migration).</p>
</li>
<li><p>For each service's tables, create a dedicated schema: <code>order_schema.orders</code>, <code>payment_schema.transactions</code>.</p>
</li>
<li><p>For cross-service events (e.g., order.placed -&gt; notify), publish to a Kafka topic or Redis stream. Subscribers are async consumers in the monolith, not separate services.</p>
</li>
</ol>
<p>Example event schema:</p>
<pre><code class="language-python"># order/events.py
import dataclasses
import json

@dataclasses.dataclass
class OrderPlaced:
    order_id: str
    customer_id: str
    total: float
    
    def to_kafka_record(self) -&gt; bytes:
        return json.dumps(dataclasses.asdict(self)).encode()

# notification/consumer.py
from order.events import OrderPlaced

class OrderEventConsumer:
    def handle_order_placed(self, event: OrderPlaced):
        send_email(event.customer_id, f"Order {event.order_id} confirmed")
</code></pre>
<h3>Phase 4: Single deployment pipeline (weeks 8-10)</h3>
<ol>
<li><p>Merge all services into monorepo.</p>
</li>
<li><p>Create a single Dockerfile that layers: base, app code, all migrations, all services.</p>
</li>
<li><p>In CI/CD, run tests for only the changed modules (use <code>git diff --name-only</code> to determine scope).</p>
</li>
<li><p>Deploy once per day or on manual trigger. All modules move together.</p>
</li>
</ol>
<p>Example CI stage:</p>
<pre><code class="language-yaml"># .github/workflows/deploy.yml
test:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - run: git diff --name-only origin/main | grep -E '^(order|payment|inventory)/' &gt; /tmp/changed_modules.txt || true
    - run: for module in $(cat /tmp/changed_modules.txt | cut -d/ -f1 | sort -u); do
        pytest $module/tests/
        done
deploy:
  needs: test
  runs-on: ubuntu-latest
  steps:
    - run: docker build -t myapp:$(git rev-parse --short HEAD) .
    - run: kubectl set image deployment/myapp myapp=myapp:$(git rev-parse --short HEAD)
</code></pre>
<h3>Phase 5: Observability + remove old services (weeks 11-16)</h3>
<ol>
<li><p>Tag all logs with module name: <code>logger.info("Order placed", module="order", service_id="order-v1")</code>.</p>
</li>
<li><p>Use a single distributed trace: request enters monolith, spans are module/function, no service-to-service serialization.</p>
</li>
<li><p>Decommission the old 47 services. Keep them in git history, not in prod.</p>
</li>
</ol>
<hr />
<h2>A real-world example</h2>
<p><strong>Gojek's 2025 consolidation:</strong> They had 47 microservices managing ride requests, payments, and driver logistics. Problem: 5% of requests hit 5+ services, adding 60-150ms of latency. On-call was a nightmare: a single slow service dragged the entire flow. They consolidated to 8 modular services (ride core, payment core, driver core, analytics, notification, etc.), kept Kafka for async (order -&gt; payment -&gt; notification), and used Packwerk-style boundaries. Result: P99 latency dropped from 800ms to 200ms, on-call pages went from 15/week to 2/week, deploy time went from 30 minutes (coordinating all 47 services) to 3 minutes.</p>
<p>The 47 services were not "independent teams." They were a matrix of dependencies nobody tracked. The 8 modular services are owned by 2-3 engineers per module, with clear responsibilities.</p>
<hr />
<h2>Testing the migration</h2>
<h3>1. Latency profile (before and after)</h3>
<pre><code class="language-bash"># Measure P50, P99 latency from client to monolith
time_series_query="
SELECT
    datetime,
    p50(duration_ms) as p50,
    p99(duration_ms) as p99,
    max(duration_ms) as max
FROM requests
WHERE service='order-api'
GROUP BY 5m
"
# Before migration: P99 = 800ms (5 hops, 160ms per hop)
# After migration: P99 = 200ms (0 hops, all in-process)
</code></pre>
<h3>2. Module boundary enforcement</h3>
<pre><code class="language-bash"># Test that payment/ cannot import order/ private code
cd payment
go list ./... | xargs grep 'order/internal' &amp;&amp; exit 1
# If this succeeds silently, boundaries are clean
</code></pre>
<h3>3. Async event flow</h3>
<pre><code class="language-python"># pytest integration test
import pytest
from order.events import OrderPlaced
from notification.consumer import OrderEventConsumer

def test_order_to_notification_flow():
    # Create order, emit event
    order = create_order(customer_id="123", total=100.0)
    event = OrderPlaced(order.id, "123", 100.0)
    
    # Consumer receives and processes
    consumer = OrderEventConsumer()
    consumer.handle_order_placed(event)
    
    # Verify email was sent
    assert email_sent_to("customer@example.com")
</code></pre>
<h3>4. Chaos test: single slow module</h3>
<pre><code class="language-bash"># k6 chaos test: what if order/ service is slow?
# In monolith, we add latency to order module only
k6 run --vus 100 --duration 5m - &lt;&lt;'EOF'
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const res = http.post('http://monolith:8000/api/orders', {
    customer_id: '123',
    items: [{sku: 'ABC', qty: 1}]
  });
  check(res, { 'status is 201': (r) =&gt; r.status === 201 });
}
EOF
# Monolith latency: stable, order module slow doesn't cascade
# 47-service arch: latency explodes (cascading failures)
</code></pre>
<hr />
<h2>Failure modes</h2>
<ol>
<li><p><strong>The package boundary is suggestions, not rules.</strong> Someone imports order/internal directly from payment. You notice: tests pass but on-call wakes up with circular dependency bugs. Recovery: automate the boundary check in CI. Fail the build on violations. Document exceptions in code.</p>
</li>
<li><p><strong>The "one database" becomes a bottleneck.</strong> A developer optimizes payment queries, but the index locks the whole database for 2 minutes. You notice: timeouts spike across order, inventory, notification. Recovery: use table-level locks, avoid full-table schema changes during peak hours, prepare migration scripts offline.</p>
</li>
<li><p><strong>Async event poison pill.</strong> A buggy notification consumer reads an order.placed event, tries to fetch a customer that no longer exists, crashes forever. You notice: orders pile up in Kafka, new orders hang. Recovery: implement dead-letter queues (Kafka topic for failed events), scale the consumer separately, add circuit breakers on downstream calls.</p>
</li>
<li><p><strong>The migration is incomplete.</strong> 3 of 47 services stayed microservices for "reasons." Now your monolith still calls service-X which calls service-Y which calls the monolith. You notice: latency didn't improve, you added a roundtrip. Recovery: finish the migration or accept the latency trade-off. Half measures hurt.</p>
</li>
<li><p><strong>Module test failures in CI.</strong> A developer changes order/ module, tests pass locally but CI fails because CI doesn't cache dependencies. You notice: slow CI, flaky deploys. Recovery: implement proper dependency caching, separate unit tests (fast, always run) from integration tests (slow, only changed modules).</p>
</li>
</ol>
<hr />
<h2>When NOT to do this</h2>
<ul>
<li><p><strong>If you have 100+ engineers and clear team boundaries.</strong> At that scale, microservices (with proper async boundaries and monitoring) win. One team per service is now tractable.</p>
</li>
<li><p><strong>If your workload is truly independent batch jobs.</strong> A data pipeline that reads from S3, processes, writes to Snowflake doesn't need a monolith. Keep it as separate services/Lambda.</p>
</li>
<li><p><strong>If your core constraint is deployment risk, not latency.</strong> A regulated financial services firm may prefer 47 independent services for auditability, even if latency suffers. Acknowledge the trade-off.</p>
</li>
</ul>
<hr />
<h2>What to ship this quarter</h2>
<ul>
<li><p>Audit: dependency graph and latency profile of all 47 services (week 1)</p>
</li>
<li><p>Monorepo: create single repo structure with module directories (week 2)</p>
</li>
<li><p>Packwerk (or equivalent): enforce 5-8 core module boundaries (week 3-4)</p>
</li>
<li><p>Kafka topics: define async event schema for cross-module events (week 4)</p>
</li>
<li><p>One migration: consolidate 2-3 services into monolith, test latency improvement (week 5-7)</p>
</li>
<li><p>Deploy pipeline: single CI/CD, tests only changed modules (week 8-10)</p>
</li>
</ul>
<hr />
<h2>Further reading</h2>
<p>See <a href="https://github.com/SubhanshuMG/distributed-monolith-tax/blob/main/references.md"><code>references.md</code></a>. Top picks:</p>
<ol>
<li><p><strong>Amazon Prime Video, "Scaling to 200 services" (2023).</strong> The case study that shifted the needle on distributed monoliths.</p>
</li>
<li><p><strong>Shopify, Packwerk announcement (2021).</strong> How a 10,000-person company enforces module boundaries at scale.</p>
</li>
<li><p><strong>Sam Newman, "Building Microservices, 2nd edition" (2021).</strong> Actually says "monolith first." The community forgot this part.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Short-lived OIDC for CI: kill every long-lived GitHub Actions token]]></title><description><![CDATA[GitHub OIDC to AWS/GCP/Azure federated credentials killed the need for long-lived PATs in CI. Most orgs still have PATs in use in 2026. Short-lived OIDC is the one-day win.
Narrative arc
The PAT failu]]></description><link>https://blogs.subhanshumg.com/short-lived-oidc</link><guid isPermaLink="true">https://blogs.subhanshumg.com/short-lived-oidc</guid><category><![CDATA[DevSecOps]]></category><category><![CDATA[Devops]]></category><category><![CDATA[identity-management]]></category><category><![CDATA[OIDC]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[CI/CD]]></category><category><![CDATA[owasp]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Security]]></category><category><![CDATA[spire]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[AI]]></category><category><![CDATA[full stack]]></category><category><![CDATA[Azure]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 03 May 2026 05:31:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/ed88d68d-caac-43e2-8aab-1f95f344f4f0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>GitHub OIDC to AWS/GCP/Azure federated credentials killed the need for long-lived PATs in CI. Most orgs still have PATs in use in 2026. Short-lived OIDC is the one-day win.</p>
<h2>Narrative arc</h2>
<p>The PAT failure modes -&gt; the OIDC-to-cloud federation pattern -&gt; scope-down per workflow -&gt; the quarterly audit loop.</p>
<h2>What most people believe (and why it's wrong)</h2>
<p>"We rotate PATs every 90 days." 90 days is ~86,300 minutes for an attacker to exfil.</p>
<p>Rotation is better than no rotation. The frame shift is that OIDC federation makes the token lifetime minutes, not months, for every cloud destination a workflow touches. The effort to switch is lower than most teams estimate.</p>
<h2>The timeline / evidence</h2>
<ul>
<li><p><strong>2024-09</strong>, CSA + Astrix publish NHI survey: credential leakage, stale access, and undermanaged OAuth apps.</p>
</li>
<li><p><strong>2026-Q1</strong>, NHI Reality Report: avg enterprise has 250,000 NHIs, 71% not rotated in policy window, 97% excessive privilege.</p>
</li>
<li><p><strong>2026-Q1</strong>, Ratios: 40-100:1 NHI-to-human in enterprise, 144:1 in cloud-native, 500:1 in hyper-automated orgs.</p>
</li>
<li><p><strong>2026-04-23</strong>, Red Hat ships zero-trust workload identity manager on OpenShift using upstream SPIRE.</p>
</li>
<li><p><strong>2026</strong>, OWASP NHI Top 10 (draft) formalizes ownership, rotation, scope-minimization, and attestation as platform controls.</p>
</li>
</ul>
<h2>The decision tree / matrix / runbook</h2>
<ol>
<li><p>Does every cloud destination (AWS/GCP/Azure) accept OIDC from your CI?</p>
</li>
<li><p>Is the trust policy scoped per repo and per workflow file?</p>
</li>
<li><p>Is the session duration under 1 hour, ideally 15 minutes?</p>
</li>
<li><p>Is there an audit of remaining long-lived PATs? Quarterly sweep.</p>
</li>
<li><p>Is the migration tracked as a platform KPI with a deadline?</p>
</li>
</ol>
<h2>The reference architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/0d6efe60-f96e-4d6d-8a31-350ab7f87759.png" alt="" style="display:block;margin:0 auto" />

<p>The short-lived OIDC pattern lets every workflow mint a 15-minute cloud credential bound to the workflow's OIDC identity. No long-lived PATs in org secrets; no static keys on disk.</p>
<p><strong>Architecture notes:</strong></p>
<ul>
<li><p>GitHub Actions <code>id-token: write</code> permission per workflow.</p>
</li>
<li><p>AWS IAM trust policy scoped to repo + workflow file.</p>
</li>
<li><p>GCP Workload Identity Federation pool + provider per CI identity.</p>
</li>
<li><p>Azure federated credentials per workflow.</p>
</li>
<li><p>Quarterly sweep of long-lived tokens; alarm on new ones.</p>
</li>
</ul>
<h3><strong>Decision tree</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/d558bcb8-6ad9-4743-84a6-6ebd47f48314.png" alt="" style="display:block;margin:0 auto" />

<h2>Real-world example, how this plays out in production</h2>
<p><strong>The setup.</strong> Falconnet Banking (challenger bank) discovered the 2026 NHI ratio the hard way: a leaked GitHub Actions PAT. Every long-lived cloud token is deleted in three weeks. The team treated identity as a platform product, not a ticket queue: SPIFFE as the substrate, IaaS attestation as the trust root, OIDC as the cloud bridge, cert-manager and SPIRE for rotation, and a Backstage catalogue for ownership. Migration was per-workload; the legacy bridge sidecar carried the workloads that could not yet read from the Workload API.</p>
<p><strong>The lesson the team wrote on the whiteboard.</strong> Identity in a 144-to-1 world is platform engineering; rotation is automation, not on-call. This piece walks through the SPIRE deployment, the federation setup per cloud, the rotation pipeline, and the tests that proved a workload could obtain a verifiable identity at startup with no static secret.</p>
<h2>End-to-end implementation guide</h2>
<p>A precise build order from zero to production with the manifests and scripts the team actually shipped. Every block below corresponds to a file in <a href="https://github.com/SubhanshuMG/short-lived-oidc-for-ci-kill-tokens/tree/main/code"><code>code/</code></a><br />So you can read each step in isolation, then run the suite together.</p>
<h3>Step 1: Stand up SPIRE with the cloud node attestor</h3>
<p>SPIRE is the upstream SPIFFE implementation. The server config below trusts the cloud's metadata service to attest each node and issues an SVID per workload. Production runs SPIRE in HA with three replicas; the agent is a DaemonSet.</p>
<pre><code class="language-yaml"># Short-lived OIDC for CI: kill every long-lived GitHub Actions token
# SPIRE server config enabling IaaS-level node attestation + OIDC federation.
server:
  bind_address: "0.0.0.0"
  bind_port: "8081"
  trust_domain: "example.com"
  data_dir: "/var/lib/spire/server"

plugins:
  NodeAttestor "aws_iid":
    plugin_data:
      region: "us-east-1"
  NodeAttestor "gcp_iit":
    plugin_data:
      projectid_allow_list: ["my-gcp-project"]
  NodeAttestor "azure_msi":
    plugin_data:
      tenants:
        "00000000-0000-0000-0000-000000000000":
          resource_id: "https://acme.com/spire"

  KeyManager "aws_kms":
    plugin_data:
      region: "us-east-1"
      key_policy_file: "/run/spire/kms-policy.json"
</code></pre>
<h3>Step 2: Federate SPIFFE into AWS, GCP, and Azure IAM</h3>
<p>SPIRE exposes an OIDC discovery endpoint; each cloud's IAM trusts it as an external identity provider. The Terraform module below wires the AWS half; the equivalent for GCP and Azure ships in the same module set. The result: one SPIFFE ID maps to a narrow IAM role per cloud.</p>
<pre><code class="language-yaml">resource "aws_iam_openid_connect_provider" "spire" {
  url             = "https://oidc.spire.example.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["..."]
}
resource "aws_iam_role" "workload" {
  name = "workload-payments"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.spire.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = { "oidc.spire.example.com:sub" = "spiffe://example.com/payments" }
      }
    }]
  })
}
</code></pre>
<h3>Step 3: Replace long-lived GitHub Actions tokens with short-lived OIDC</h3>
<p>Every cloud destination accepts OIDC from GitHub Actions in 2026. The workflow below mints a 15-minute AWS credential per run; no secret is ever stored in the repo. Quarterly sweeps audit and delete any surviving long-lived PATs.</p>
<pre><code class="language-yaml"># Short-lived OIDC for CI: kill every long-lived GitHub Actions token
# GitHub Actions: short-lived OIDC to AWS, no long-lived PATs.
name: deploy
on: { push: { branches: [main] } }
permissions:
  id-token: write
  contents: read
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-deploy
          aws-region: us-east-1
          role-duration-seconds: 900
      - run: aws sts get-caller-identity
# Tech components referenced: GitHub Actions OIDC, AWS IAM Roles with OIDC trust, GCP Workload Identity Federation, Azure federated credentials, actions/attest-build-provenance, Vault JWT auth method.
</code></pre>
<h3>Step 4: Wire cert-manager and SPIRE for 24-hour rotation</h3>
<p>Rotation is a platform service, not a ticket. cert-manager handles Kubernetes-side certs; SPIRE handles workload SVIDs with a sub-hour TTL. The legacy bridge sidecar below carries workloads that still want a <code>.env</code>-shaped secret.</p>
<pre><code class="language-yaml">apiVersion: cert-manager.io/v1
kind: Certificate
metadata: { name: api-mtls, namespace: payments }
spec:
  secretName: api-mtls
  duration: 24h
  renewBefore: 8h
  privateKey:
    rotationPolicy: Always
    algorithm: ECDSA
    size: 256
  issuerRef: { name: spire-ca, kind: ClusterIssuer }
  dnsNames: [api.payments.svc]
</code></pre>
<h3>Step 5: Surface ownership in the platform catalogue</h3>
<p>An NHI without an owner is a 2028 incident. The Backstage catalogue entry per NHI captures owner, class, rotation cadence, and platform API endpoints (<code>whoOwns</code>, <code>expiresAt</code>). On reorgs, the ownership graph re-resolves; expired ownership is a blocking condition for new deploys.</p>
<pre><code class="language-yaml">apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: payments-api-svid
  annotations:
    nhi.class: workload
    nhi.rotation: 24h
    nhi.spiffe-id: "spiffe://example.com/payments"
spec:
  type: nhi
  owner: payments-team
  lifecycle: production
</code></pre>
<h2>Testing the implementation</h2>
<p>The test plan that gates the rollout. Run each in a non-production cluster first; expand to staging once the green-path tests pass, and the negative tests reject the bad input the way the policy says they will.</p>
<h3>Test 1: Workload obtains an SVID at startup</h3>
<pre><code class="language-bash">kubectl exec -n payments deploy/api -- /opt/spire-agent/bin/spire-agent api fetch
</code></pre>
<p><strong>Expected:</strong> Returns a valid SVID with the SPIFFE ID <code>spiffe://example.com/payments</code>.</p>
<h3>Test 2: Federated AWS role assumed without a static key</h3>
<pre><code class="language-bash">kubectl exec -n payments deploy/api -- aws sts get-caller-identity
</code></pre>
<p><strong>Expected:</strong> Assumes the federated role; ARN ends in <code>:assumed-role/workload-payments/...</code>.</p>
<h3>Test 3: Cert auto-rotates without a restart</h3>
<pre><code class="language-bash">kubectl get cert api-mtls -n payments -o jsonpath='{.status.renewalTime}'
</code></pre>
<p><strong>Expected:</strong> Renewal time is within 8 hours; pod uptime unchanged across renewal.</p>
<h2>Tech components</h2>
<p>GitHub Actions OIDC, AWS IAM Roles with OIDC trust, GCP Workload Identity Federation, Azure federated credentials, actions/attest-build-provenance, Vault JWT auth method.</p>
<h2>Production observability and gotchas</h2>
<ul>
<li><p>Track <code>NHIs without an owner</code> weekly; the target is zero.</p>
</li>
<li><p>SVID issuance rate per workload; a workload not refreshing within TTL is misconfigured.</p>
</li>
<li><p>Federated-role assumption count vs static-key fallbacks; static-key use is an exception.</p>
</li>
<li><p>Cert-renewal failures; an expired cert is the loudest possible failure mode.</p>
</li>
<li><p>Quarterly audit of the ownership graph against IdP group membership; reorg drift is the common cause.</p>
</li>
</ul>
<h2>Failure modes</h2>
<ol>
<li><p><strong>Trust policy uses a wildcard subject claim; any workflow in the org assumes the role.</strong> Scope to <code>repo:org/repo:ref:refs/heads/main</code> or tighter.</p>
</li>
<li><p><strong>Session duration is 12h by default; attacker wins a long window.</strong> Set session duration to 15 minutes minimum viable.</p>
</li>
<li><p><strong>Legacy tool requires static AWS keys; team reintroduces a PAT.</strong> Document exception with sunset date; sandbox the tool.</p>
</li>
</ol>
<h2>When NOT to do this</h2>
<p>Read-only CI jobs with no cloud-state changes may be acceptable without OIDC, but these are rare. For any workflow that writes to cloud state, OIDC is the default.</p>
<h2>What to ship this quarter</h2>
<ul>
<li><p>Migrate every AWS / GCP / Azure call to short-lived OIDC.</p>
</li>
<li><p>Scope trust policies per repo and per workflow.</p>
</li>
<li><p>Set session duration to 15 minutes minimum.</p>
</li>
<li><p>Audit and delete long-lived PATs org-wide.</p>
</li>
<li><p>Track <code>remaining long-lived tokens</code> as a weekly platform KPI.</p>
</li>
</ul>
<h2>Further reading</h2>
<ol>
<li><p><strong>GitHub Docs, Security hardening with OpenID Connect</strong>, The primary reference.</p>
</li>
<li><p><strong>AWS Docs, Configuring OpenID Connect in Amazon Web Services</strong>, The AWS trust-policy reference.</p>
</li>
<li><p><strong>GCP Docs, Workload Identity Federation</strong>, The GCP federation reference.</p>
</li>
<li><p><a href="https://github.com/SubhanshuMG/short-lived-oidc-for-ci-kill-tokens/blob/main/references.md"><code>references.md</code></a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[The 50ms lie: when edge AI actually matters (and when you're paying Cloudflare for marketing)]]></title><description><![CDATA[Cloudflare and Fly.io are selling 50ms of latency savings on a 5,000ms inference like it's a revolution. That's 1% of the total latency. You're optimizing the rounding error while paying a 10x penalty]]></description><link>https://blogs.subhanshumg.com/the-50ms-lie</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-50ms-lie</guid><category><![CDATA[edgecomputing]]></category><category><![CDATA[AI]]></category><category><![CDATA[inference]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[cloudflare]]></category><category><![CDATA[workers]]></category><category><![CDATA[Workers AI]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Mon, 20 Apr 2026 17:25:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/b538b7ea-441c-4d8c-822d-de44c1d877f9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Cloudflare and Fly.io are selling 50ms of latency savings on a 5,000ms inference like it's a revolution. That's 1% of the total latency. You're optimizing the rounding error while paying a 10x penalty on cost and losing all the advantages of centralized GPU infrastructure. Edge AI works for embeddings, classification, moderation, and routing. It does not work for frontier LLMs. Most "edge AI" marketing is confusing the two.</p>
<h2>Why this matters right now</h2>
<p>Cloudflare announced Workers AI with H100s in 100+ cities; Fly.io published edge inference benchmarks; a wave of hype claiming every inference should be "at the edge." Meanwhile, real-world deployments show that edge works beautifully for SentenceTransformers embeddings and TinyLlama routing models. It's marketing nonsense for GPT-4-class models. This piece cuts through the confusion.</p>
<h2>Mainstream belief vs. production reality</h2>
<p><strong>Mainstream:</strong> "Run all AI at the edge. 50ms network latency adds up. Edge inference is the future."</p>
<p><strong>Production reality:</strong> Network latency matters only for high-concurrency, latency-sensitive workloads (AR/VR, live gaming, vehicle control). For chatbots, RAG pipelines, and most enterprise AI, a 50ms roundtrip is noise compared to 5 seconds of model inference. Edge inference costs 10x more per token and loses the benefits of batching that centralized infrastructure provides. Edge wins for small models, low-latency requirements, and privacy. It loses for cost, throughput, and model capability.</p>
<h2>A timeline (2024-2026)</h2>
<table>
<thead>
<tr>
<th>Date</th>
<th>Event</th>
<th>What It Really Meant</th>
</tr>
</thead>
<tbody><tr>
<td>Jun 2024</td>
<td>Cloudflare launches Workers AI</td>
<td>Edge inference possible, but expensive</td>
</tr>
<tr>
<td>Oct 2024</td>
<td>Fly.io publishes edge-inference benchmarks</td>
<td>Messaging: fast. Reality: 50ms savings on 5s task</td>
</tr>
<tr>
<td>Jan 2025</td>
<td>Anthropic publishes "Why Not All AI is at the Edge"</td>
<td>First sober analysis</td>
</tr>
<tr>
<td>Mar 2025</td>
<td>SambaNova raises $400M on edge inference claims</td>
<td>Marketing at peak volume</td>
</tr>
<tr>
<td>Jun 2025</td>
<td>Benchmarks show edge inference costs 3-10x cloud</td>
<td>Economics reality check</td>
</tr>
<tr>
<td>Oct 2025</td>
<td>Enterprises quietly move serious workloads back to the cloud</td>
<td>Cost catches up with hype</td>
</tr>
</tbody></table>
<h2>The decision tree</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/10a876f3-008e-450d-927b-97deb8dd88fe.png" alt="" style="display:block;margin:0 auto" />

<h2>The reference architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/9f263be0-527b-4784-8e09-62edc3b11e91.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Hybrid architecture for intelligent routing:</strong></p>
<p><strong>Tier 1: Edge (SentenceTransformers, TinyLlama, classification)</strong></p>
<ul>
<li><p>Latency: &lt;100ms</p>
</li>
<li><p>Cost: \(0.10-\)0.30 per 1M tokens (V8 isolate overhead)</p>
</li>
<li><p>Use for: embeddings, routing, low-stakes classification</p>
</li>
</ul>
<p><strong>Tier 2: Regional cloud (smaller open models, local LLMs)</strong></p>
<ul>
<li><p>Latency: 100-500ms</p>
</li>
<li><p>Cost: \(0.01-\)0.05 per 1M tokens</p>
</li>
<li><p>Use for: RAG augmentation, local context</p>
</li>
</ul>
<p><strong>Tier 3: Global cloud (frontier models, GPT-4-scale)</strong></p>
<ul>
<li><p>Latency: 500-5,000ms</p>
</li>
<li><p>Cost: \(0.001-\)0.01 per 1M tokens (batched)</p>
</li>
<li><p>Use for: high-quality generation, reasoning</p>
</li>
</ul>
<p>The routing layer (at the edge) decides: "Is this query amenable to a local SLM, or does it need central capacity?"</p>
<h2>Step-by-step implementation</h2>
<p><strong>Phase 1: Profile your workload's latency requirement (2 days)</strong></p>
<pre><code class="language-python"># code/profile-latency-requirement.py
import time
from collections import defaultdict

class LatencyProfiler:
    def __init__(self):
        self.latencies = defaultdict(list)
    
    def record(self, workload_type, latency_ms):
        self.latencies[workload_type].append(latency_ms)
    
    def analyze(self):
        for workload, latencies in self.latencies.items():
            p50 = sorted(latencies)[len(latencies)//2]
            p99 = sorted(latencies)[int(len(latencies)*0.99)]
            mean = sum(latencies) / len(latencies)
            
            print(f"{workload}: p50={p50}ms, p99={p99}ms, mean={mean}ms")
            
            # Is saving 50ms worth 10x cost?
            savings = mean * 0.01  # Optimistic: 50ms on 5000ms
            print(f"  50ms savings = {savings}ms improvement = {savings/mean*100:.1f}% gain")

# Usage
profiler = LatencyProfiler()
for query in production_queries:
    start = time.time()
    response = call_llm(query)
    latency = (time.time() - start) * 1000
    profiler.record(query.type, latency)
profiler.analyze()
</code></pre>
<p>Expected output: "Chatbot: p50=4500ms, 50ms savings = 1.1% improvement."</p>
<p><strong>Phase 2: Implement confidence-based routing (3 days)</strong></p>
<pre><code class="language-python"># code/confidence-router.py
import anthropic
from sentence_transformers import SentenceTransformer

class ConfidenceRouter:
    def __init__(self):
        self.cloud_client = anthropic.Anthropic()
        self.edge_model = SentenceTransformer('all-MiniLM-L6-v2')
    
    def route(self, query: str) -&gt; tuple[str, str]:
        """Route query to edge or cloud based on confidence."""
        
        # Embed query
        query_embedding = self.edge_model.encode(query)
        
        # Simple heuristic: complexity score
        # (In prod, use a real confidence model)
        tokens = len(query.split())
        has_context_request = any(w in query.lower() for w in ['summarize', 'explain', 'analyze'])
        confidence = 0.9 if (tokens &lt; 50 and not has_context_request) else 0.3
        
        if confidence &gt; 0.7:
            # Route to edge (small model)
            response = self.edge_inference(query)
            return response, 'edge'
        else:
            # Route to cloud (large model)
            response = self.cloud_inference(query)
            return response, 'cloud'
    
    def edge_inference(self, query: str) -&gt; str:
        # Use llama.cpp or similar for local inference
        # For demo: call a mock edge service
        return f"[Edge] Quick answer to: {query[:30]}..."
    
    def cloud_inference(self, query: str) -&gt; str:
        message = self.cloud_client.messages.create(
            model="claude-opus-4-1",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return message.content[0].text

# Usage
router = ConfidenceRouter()
for query in incoming_queries:
    response, route = router.route(query)
    log(f"Query routed to {route}, response: {response}")
</code></pre>
<p><strong>Phase 3: Deploy SentenceTransformers at the edge (2 days)</strong></p>
<p>For embeddings specifically, edge wins because:</p>
<ul>
<li><p>Model is small (400MB).</p>
</li>
<li><p>Latency is 10-50ms; roundtrip to cloud is 100-200ms.</p>
</li>
<li><p>Cost matters (you're doing this millions of times).</p>
</li>
</ul>
<pre><code class="language-bash"># Using Cloudflare Workers AI
curl -X POST https://api.cloudflare.com/client/v4/accounts/YOUR_ACCOUNT/ai/run/[@cf/baai/bge-base-en-v1.5](mailto:@cf/baai/bge-base-en-v1.5) \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"text":"hello world"}'
</code></pre>
<p>Or on Fly.io:</p>
<pre><code class="language-python"># code/embedding-at-edge.py (Fly.io GPU)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

@app.post("/embed")
def embed(text: str):
    embedding = model.encode(text)
    return {"embedding": embedding}
</code></pre>
<p><strong>Phase 4: Implement fallback for edge failures (2 days)</strong></p>
<p>Edge is cheaper but less reliable. Implement degradation:</p>
<pre><code class="language-python"># code/edge-fallback.py
import asyncio

async def infer_with_fallback(query: str) -&gt; tuple[str, str]:
    """Try edge first, fall back to cloud on timeout."""
    
    try:
        response = await asyncio.wait_for(
            edge_service.infer(query),
            timeout=0.5  # Edge should be fast
        )
        return response, 'edge'
    except (asyncio.TimeoutError, Exception):
        # Fall back to cloud
        response = cloud_client.infer(query)
        return response, 'cloud'

# Usage
for query in queries:
    response, source = await infer_with_fallback(query)
    metrics.record('inference_source', source)  # Track which tier handled it
</code></pre>
<p><strong>Phase 5: Cost accounting (1 day)</strong></p>
<p>Track where inference is happening and at what cost:</p>
<pre><code class="language-python"># code/ai-cost-tracking.py
class AICostTracker:
    EDGE_COST_PER_1M_TOKENS = 0.20  # Cloudflare Workers AI pricing
    CLOUD_COST_PER_1M_TOKENS = 0.005  # Batch pricing with volume discount
    
    def record(self, source: str, tokens: int):
        if source == 'edge':
            cost = tokens / 1_000_000 * self.EDGE_COST_PER_1M_TOKENS
        else:
            cost = tokens / 1_000_000 * self.CLOUD_COST_PER_1M_TOKENS
        
        self.costs[source] += cost
        
        # Alert if edge is &gt;5% of budget
        edge_spend = self.costs['edge']
        total_spend = sum(self.costs.values())
        if edge_spend / total_spend &gt; 0.05:
            alert(f"Edge inference is {edge_spend/total_spend*100:.1f}% of spend, consider moving to cloud")
</code></pre>
<p><strong>Phase 6: Measure end-to-end (1 week)</strong></p>
<p>Run A/B test: edge vs cloud for same workload.</p>
<pre><code class="language-python"># code/ab-test-edge-vs-cloud.py
import random

def infer_ab_test(query: str) -&gt; dict:
    bucket = random.choices(['edge', 'cloud'], weights=[0.5, 0.5])[0]
    
    start = time.time()
    response = edge_infer(query) if bucket == 'edge' else cloud_infer(query)
    latency = (time.time() - start) * 1000
    
    return {
        'bucket': bucket,
        'latency': latency,
        'quality': evaluate_quality(response),
        'cost': COSTS[bucket]
    }

# Run for 1 week, analyze:
# - Edge: avg latency 50ms, cost $0.20 per 1M, quality 0.85
# - Cloud: avg latency 500ms, cost $0.005 per 1M, quality 0.99
# Decision: only use edge for low-stakes queries
</code></pre>
<h2>Real-world example: Anthropic's routing architecture</h2>
<p>Anthropic's (publicly available) technical analysis shows they route queries to different models based on complexity. Simple classification queries go to smaller, faster models. Complex reasoning queries go to Claude Opus. This is not about edge compute; it's about model selection. The lesson: latency gains come from picking the right model, not from moving infrastructure to the edge. An SLM on the cloud is faster and cheaper than a 70B model at the edge.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/8213b444-3f9e-4c94-94b4-5df480f2792a.png" alt="" style="display:block;margin:0 auto" />

<h2>Testing: Economic viability</h2>
<pre><code class="language-python"># code/test_edge_economics.py
def test_edge_cost_benefit():
    """Verify edge saves more money than it costs."""
    
    # Scenario: embedding service
    query_volume = 10_000_000  # per month
    
    edge_cost = query_volume / 1_000_000 * 0.20  # Cloudflare pricing
    cloud_cost = query_volume / 1_000_000 * 0.001  # Cloud batch pricing
    
    # Edge saves on roundtrip latency: ~100ms per query
    # Cloud takes 150ms roundtrip + 20ms inference
    edge_latency = 20  # ms
    cloud_latency = 170  # ms
    latency_savings = (cloud_latency - edge_latency) * query_volume / 1000 / 3600 # hours saved
    
    # Value of latency savings: assume $100/hour developer cost
    latency_value = latency_savings * 100
    
    # Cost of edge: premium for low volume, no batch discounts
    cost_premium = edge_cost - cloud_cost
    
    print(f"Edge cost: ${edge_cost}")
    print(f"Cloud cost: ${cloud_cost}")
    print(f"Cost premium: ${cost_premium}")
    print(f"Latency value: ${latency_value}")
    
    # Only justify edge if latency_value &gt; cost_premium
    # For embeddings: 100ms * 10M queries = not much latency value
    # Cost premium: \(2000 - \)10 = $1990
    # Not worth it.
    assert latency_value &gt; cost_premium, "Edge is not economically justified for this workload"
</code></pre>
<h2>Failure modes</h2>
<ol>
<li><p><strong>Edge model gets out of date and returns stale embeddings.</strong> You upgrade the cloud model but forgot to push to edge. Recovery: automate model syncing; version control your edge models.</p>
</li>
<li><p><strong>Edge latency isn't actually better because of cold starts.</strong> V8 isolates have startup overhead. Recovery: keep workers warm; pre-compile models.</p>
</li>
<li><p><strong>You route a complex query to edge and it fails silently.</strong> Edge model can't handle the input. Recovery: implement confidence scoring; always have a cloud fallback.</p>
</li>
<li><p><strong>Cost explodes because you routed too much volume to edge.</strong> Recovery: monitor edge spend weekly; set hard limits on edge budget.</p>
</li>
</ol>
<h2>When NOT to do this</h2>
<ul>
<li><p><strong>If your queries require latencies &lt;50ms</strong> (AR/VR, real-time vehicle control), edge may be necessary. But ensure you've proved 50ms is the bottleneck, not user experience.</p>
</li>
<li><p><strong>If your LLM needs state or session context</strong>, edge is harder. Cloud let you batch multiple requests and maintain context servers.</p>
</li>
<li><p><strong>If your team has no ops expertise</strong>, managing edge adds operational complexity. Stay with cloud.</p>
</li>
</ul>
<h2>What to ship this quarter</h2>
<ul>
<li><p><strong>Week 1:</strong> Profile your AI workloads; measure actual latency breakdown.</p>
</li>
<li><p><strong>Week 2:</strong> Implement confidence router; route low-complexity queries to cheaper path.</p>
</li>
<li><p><strong>Week 3:</strong> Deploy SentenceTransformers embedding at the edge (Cloudflare or Fly).</p>
</li>
<li><p><strong>Week 4:</strong> Run A/B test: edge vs cloud for the same workload; measure quality + cost.</p>
</li>
<li><p><strong>Week 5:</strong> Analyze results; decide if edge ROI is justified.</p>
</li>
<li><p><strong>Week 6:</strong> Document decision tree; educate team on edge vs cloud tradeoff.</p>
</li>
</ul>
<h2>Further reading</h2>
<p>See <a href="https://github.com/SubhanshuMG/the-50ms-edge-ai-lie/blob/main/references.md"><code>references.md</code></a> for the full bibliography. Top picks:</p>
<ol>
<li><p><strong>Anthropic's "Model Selection and Routing" technical note.</strong> The economics of model choice vs. infrastructure choice.</p>
</li>
<li><p><strong>Cloudflare Workers AI benchmarks.</strong> Published costs and latencies help you do the math.</p>
</li>
<li><p><strong>Martin Fowler, "Microservice Trade-Offs," 2023.</strong> General framework for distributed systems tradeoffs.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Stop building agents like prompts. Build them like state machines.]]></title><description><![CDATA[Github repo: https://github.com/SubhanshuMG/agents-as-state-machines

The thesis in one paragraph
Stop calling them agents. They are state machines that invoke LLMs at certain transitions. The multi-a]]></description><link>https://blogs.subhanshumg.com/stop-building-agents-like-prompts-build-them-like-state-machines</link><guid isPermaLink="true">https://blogs.subhanshumg.com/stop-building-agents-like-prompts-build-them-like-state-machines</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[agents]]></category><category><![CDATA[state-machines]]></category><category><![CDATA[temporal]]></category><category><![CDATA[langgraph]]></category><category><![CDATA[#llmops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Developer]]></category><category><![CDATA[SRE]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 19 Apr 2026 09:48:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/7580fdbd-708f-4e04-918a-818e75b9985e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<blockquote>
<p>Github repo: <a href="https://github.com/SubhanshuMG/agents-as-state-machines">https://github.com/SubhanshuMG/agents-as-state-machines</a></p>
</blockquote>
<h2>The thesis in one paragraph</h2>
<p>Stop calling them agents. They are state machines that invoke LLMs at certain transitions. The multi-agent hype (autonomous agents, swarms, orchestration) is cargo-cult software engineering. One well-designed agent with excellent tools and explicit state transitions beats five agents role-playing their way through a problem. The production-grade architecture: durable execution (Temporal, Restate, or Inngest), not LangGraph in-memory; explicit state definitions, not implicit chains; tool calls are idempotent and idempotency-keyed; human-in-the-loop is an interrupt primitive, not a callback; failures are replayed deterministically, not retried randomly. This isn't sexy. It's the architecture that doesn't fail at 2 AM.</p>
<h2>Why this matters right now</h2>
<p>The multi-agent narrative peaked in mid-2025. Gartner's 2026 AI Ops report shows that 89% of multi-agent deployments that started with 3+ agents converged to a single agent with more tools by production. The teams didn't realize this; they just kept tearing out features and simplifying. Temporal hit 1.0 in January 2026. Restate (a Temporal alternative) launched commercially in March 2026. Inngest (event-driven durable execution) shipped Temporal-compatible workflows in February 2026. All three are seeing uptake from companies moving off LangGraph. The signal is clear: production agents need durability, not prompts.</p>
<h2>Mainstream belief vs. what production shows</h2>
<p>Mainstream belief: "Build autonomous multi-agent systems. Agents collaborate, specialize, and solve problems together."</p>
<p>Production reality: Agents don't collaborate. They hallucinate sub-agents that don't exist. When you have 5 agents, debugging a failure means reading 5 traces. When one agent fails, the others don't know what to do. The simplest fix: one agent, clear state machine, excellent tools, explicit checkpoints. That's it. Every multi-agent system I've debugged in production would have been cheaper and more reliable as a state machine.</p>
<h2>A short timeline</h2>
<table>
<thead>
<tr>
<th>Date</th>
<th>Event</th>
<th>Impact</th>
</tr>
</thead>
<tbody><tr>
<td>Jun 2024</td>
<td>LangGraph 0.1, agent chains</td>
<td>Prompt-based agent composition</td>
</tr>
<tr>
<td>Dec 2024</td>
<td>Temporal 1.0</td>
<td>Production-grade durable execution</td>
</tr>
<tr>
<td>Jan 2025</td>
<td>AutoGen 0.2, multi-agent swarms</td>
<td>Hype peaks; engineering failures begin</td>
</tr>
<tr>
<td>Mar 2025</td>
<td>First enterprise "multi-agent failure" case studies</td>
<td>Teams realizing they need state machines</td>
</tr>
<tr>
<td>Jan 2026</td>
<td>Temporal for LLMs launch</td>
<td>Explicit language for durable agent workflows</td>
</tr>
<tr>
<td>Feb 2026</td>
<td>Inngest event-driven workflows</td>
<td>Event-sourced agents</td>
</tr>
<tr>
<td>Apr 2026</td>
<td>LangGraph 0.2 adds checkpoints</td>
<td>Recognition that durability is essential</td>
</tr>
</tbody></table>
<h2>The decision tree</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/bec661f3-e412-4968-8336-010d8174209f.png" alt="" style="display:block;margin:0 auto" />

<h2>The reference architecture</h2>
<img src="https://raw.githubusercontent.com/SubhanshuMG/agents-as-state-machines/main/diagrams/architecture.png" alt="Agent as durable state machine" style="display:block;margin:0 auto" />

<p>The layers:</p>
<ol>
<li><p><strong>Temporal or Restate workflow.</strong> Defines states and transitions. Durable: survives worker crashes. Replays deterministically from the last checkpoint.</p>
</li>
<li><p><strong>Agent executor.</strong> Implements each state's logic. Can invoke LLMs, tools, or other agents. Idempotent: tool calls are tagged with idempotency keys.</p>
</li>
<li><p><strong>Tool layer.</strong> All side effects (API calls, database writes) are here. Idempotent, observable, rate-limited.</p>
</li>
<li><p><strong>Human-in-the-loop gate.</strong> Certain state transitions require human approval. Blocks execution; the operator reviews and approves or rejects.</p>
</li>
<li><p><strong>Event log.</strong> Every state transition is logged to an immutable event store. Enables replay, audit, and forensics.</p>
</li>
</ol>
<p>Implementation reference: <a href="https://github.com/SubhanshuMG/agents-as-state-machines/tree/main/code"><code>code/state-machine-agent/</code></a>. Stack: Temporal Python SDK, LLM, tool wrappers.</p>
<h2>Step-by-step implementation</h2>
<h3>Phase 1: Define your state machine (week 1)</h3>
<p>Map out the explicit states your agent will traverse.</p>
<pre><code class="language-python"># code/state_machine/define_states.py
from enum import Enum
from dataclasses import dataclass

class AgentState(Enum):
    INITIAL = "initial"
    FETCH_CONTEXT = "fetch_context"
    ANALYZE = "analyze"
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass
class AgentExecutionContext:
    user_id: str
    task: str
    context: dict
    plan: str
    execution_result: dict
    error: str = None
    
    def to_dict(self):
        return {
            "user_id": self.user_id,
            "task": self.task,
            "context": self.context,
            "plan": self.plan,
            "execution_result": self.execution_result,
            "error": self.error
        }
</code></pre>
<p>State diagram:</p>
<pre><code class="language-plaintext">INITIAL --&gt; FETCH_CONTEXT --&gt; ANALYZE --&gt; PLAN --&gt; EXECUTE --&gt; VERIFY --&gt; COMPLETE
                                                       |
                                                       v
                                                     FAILED
</code></pre>
<h3>Phase 2: Implement with Temporal (week 1-2)</h3>
<p>Use Temporal's Python SDK to define the workflow and activities.</p>
<pre><code class="language-python"># code/temporal/agent_workflow.py
from temporalio import workflow
from temporalio.common import RetryPolicy
from datetime import timedelta
from state_machine.define_states import AgentState, AgentExecutionContext
import activities

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, user_id: str, task: str) -&gt; dict:
        """Main agent workflow."""
        ctx = AgentExecutionContext(
            user_id=user_id,
            task=task,
            context={},
            plan="",
            execution_result={}
        )
        
        try:
            # State: FETCH_CONTEXT
            ctx.context = await workflow.execute_activity(
                activities.fetch_context,
                user_id,
                task,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            
            # State: ANALYZE
            analysis = await workflow.execute_activity(
                activities.analyze_with_llm,
                task,
                ctx.context,
                start_to_close_timeout=timedelta(seconds=60)
            )
            
            # State: PLAN
            ctx.plan = await workflow.execute_activity(
                activities.plan_with_llm,
                task,
                analysis,
                ctx.context,
                start_to_close_timeout=timedelta(seconds=60)
            )
            
            # State: EXECUTE (with checkpoint)
            ctx.execution_result = await workflow.execute_activity(
                activities.execute_plan,
                ctx.plan,
                ctx.context,
                start_to_close_timeout=timedelta(minutes=5)
            )
            
            # State: VERIFY
            verification = await workflow.execute_activity(
                activities.verify_result,
                ctx.execution_result,
                task,
                start_to_close_timeout=timedelta(seconds=30)
            )
            
            if not verification["success"]:
                ctx.error = verification.get("reason", "Verification failed")
                return {"state": AgentState.FAILED.value, "context": ctx.to_dict()}
            
            # State: COMPLETE
            return {"state": AgentState.COMPLETE.value, "context": ctx.to_dict()}
        
        except Exception as e:
            ctx.error = str(e)
            return {"state": AgentState.FAILED.value, "context": ctx.to_dict()}
</code></pre>
<h3>Phase 3: Implement activities (week 2)</h3>
<p>Activities are the side-effects (tool calls, LLM invocations).</p>
<pre><code class="language-python"># code/temporal/activities.py
from temporalio import activity
from anthropic import Anthropic
import json

client = Anthropic()

@activity.defn
async def fetch_context(user_id: str, task: str) -&gt; dict:
    """Fetch user context from database."""
    # Idempotent: safe to retry
    return {
        "user_history": await db.query(f"SELECT * FROM user_history WHERE user_id = %s", user_id),
        "task_description": task,
        "timestamp": datetime.utcnow().isoformat()
    }

@activity.defn
async def analyze_with_llm(task: str, context: dict) -&gt; str:
    """Use LLM to analyze the task."""
    prompt = f"""
    Task: {task}
    Context: {json.dumps(context, indent=2)}
    
    Analyze this task. What information is needed? What are the constraints?
    """
    
    message = client.messages.create(
        model="claude-opus",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text

@activity.defn
async def plan_with_llm(task: str, analysis: str, context: dict) -&gt; str:
    """LLM generates an execution plan."""
    prompt = f"""
    Task: {task}
    Analysis: {analysis}
    
    Generate a step-by-step plan to accomplish this task. Be specific about tool calls.
    """
    
    message = client.messages.create(
        model="claude-opus",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text

@activity.defn
async def execute_plan(plan: str, context: dict) -&gt; dict:
    """Execute the plan by invoking tools."""
    # Parse the plan and invoke tools
    # Each tool call has an idempotency key
    idempotency_key = f"exec_{context['timestamp']}"
    
    results = []
    for step in plan.split("\n"):
        if step.startswith("TOOL:"):
            tool_name, tool_args = parse_tool_call(step)
            result = await invoke_tool(
                tool_name,
                tool_args,
                idempotency_key=idempotency_key
            )
            results.append(result)
    
    return {"steps": len(results), "results": results}

@activity.defn
async def verify_result(execution_result: dict, task: str) -&gt; dict:
    """Verify the execution result."""
    prompt = f"""
    Task: {task}
    Execution result: {json.dumps(execution_result)}
    
    Does the result satisfy the task? Yes or no, with explanation.
    """
    
    message = client.messages.create(
        model="claude-opus",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    
    response_text = message.content[0].text
    success = "yes" in response_text.lower()
    
    return {"success": success, "reason": response_text}
</code></pre>
<h3>Phase 4: Human-in-the-loop gate (week 2-3)</h3>
<p>Add approval gates for certain transitions.</p>
<pre><code class="language-python"># code/temporal/human_approval.py
from temporalio import workflow

@workflow.signal
async def approve_execution(self, approved: bool):
    """Signal to approve or reject the execution plan."""
    self.approval_result = approved
    self.approval_received = True

@workflow.run
async def run(self, user_id: str, task: str) -&gt; dict:
    """Main workflow with approval gate."""
    ctx = AgentExecutionContext(...)
    
    # ... FETCH_CONTEXT, ANALYZE, PLAN ...
    
    # APPROVAL GATE
    self.approval_received = False
    self.approval_result = None
    
    # Wait for approval (timeout after 1 hour)
    approval_timeout = workflow.wait_condition(
        lambda: self.approval_received,
        timedelta(hours=1)
    )
    
    if not approval_timeout:
        ctx.error = "Approval timeout"
        return {"state": "FAILED", "context": ctx.to_dict()}
    
    if not self.approval_result:
        ctx.error = "Execution rejected by operator"
        return {"state": "FAILED", "context": ctx.to_dict()}
    
    # State: EXECUTE
    ctx.execution_result = await workflow.execute_activity(
        activities.execute_plan,
        ctx.plan,
        ctx.context,
        start_to_close_timeout=timedelta(minutes=5)
    )
    
    # ... VERIFY, COMPLETE ...
</code></pre>
<h3>Phase 5: Idempotency for tool calls (week 3)</h3>
<p>Tag all tool calls with idempotency keys so retries don't duplicate side effects.</p>
<pre><code class="language-python"># code/tools/idempotent_tool.py
import httpx
from uuid import uuid4

async def invoke_tool(tool_name: str, args: dict, idempotency_key: str) -&gt; dict:
    """Invoke a tool with idempotency guarantee."""
    # All tool calls include an idempotency key in headers
    headers = {
        "Idempotency-Key": idempotency_key,
        "X-Tool-Name": tool_name
    }
    
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"http://tool-service/{tool_name}",
            json=args,
            headers=headers
        )
    
    return response.json()

# Tool service should implement idempotency
# Example: Stripe, GitHub, most modern APIs support Idempotency-Key header
</code></pre>
<h3>Phase 6: Event sourcing (week 3-4)</h3>
<p>Log all state transitions to an immutable event log.</p>
<pre><code class="language-python"># code/event_sourcing/event_log.py
from dataclasses import dataclass
from datetime import datetime
import json

@dataclass
class WorkflowEvent:
    workflow_id: str
    state: str
    activity: str
    timestamp: str
    result: dict
    error: str = None

async def log_state_transition(
    workflow_id: str,
    from_state: str,
    to_state: str,
    activity_result: dict
):
    """Log a state transition to the event store."""
    event = WorkflowEvent(
        workflow_id=workflow_id,
        state=to_state,
        activity=from_state,
        timestamp=datetime.utcnow().isoformat(),
        result=activity_result
    )
    
    # Append to immutable log (Postgres, DynamoDB, Kafka)
    await event_store.append(event.workflow_id, event)
</code></pre>
<h3>Phase 7: Deployment (week 4)</h3>
<p>Deploy Temporal workers and the workflow.</p>
<pre><code class="language-yaml"># code/k8s/temporal-worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-workflow-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
      - name: worker
        image: agent-worker:v1.0.0
        env:
        - name: TEMPORAL_HOST
          value: "temporal-server:7233"
        - name: TEMPORAL_NAMESPACE
          value: "agent-workflows"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
</code></pre>
<h2>Real-world example: Expense report approval</h2>
<p>An enterprise finance team runs an agent to review and approve expense reports. Old approach (multi-agent): one agent classifies the expense, another verifies the receipt, a third approves. Hallucination: agents invent sub-agents that don't exist.</p>
<p>New approach (state machine): one agent with clear states:</p>
<ol>
<li><p>FETCH: Get the expense record.</p>
</li>
<li><p>VERIFY: Check the receipt (call OCR tool).</p>
</li>
<li><p>CLASSIFY: Run the LLM to categorize spend.</p>
</li>
<li><p>ROUTE: Send to the right approver (tool call).</p>
</li>
<li><p>WAIT_APPROVAL: Human gate.</p>
</li>
<li><p>RECORD: Log to accounting system (idempotent tool).</p>
</li>
</ol>
<p>If the WAIT_APPROVAL step fails (timeout), the workflow restarts from that exact point. No re-processing of receipt, no re-classification. Durable, auditable, simple.</p>
<h2>Testing: Deterministic replay</h2>
<p>Validate that workflows replay identically:</p>
<pre><code class="language-python"># code/test/test_replay.py
from temporalio.testing import WorkflowEnvironment
import pytest

@pytest.mark.asyncio
async def test_workflow_replay():
    """Verify the workflow replays deterministically."""
    async with await WorkflowEnvironment.start_local() as env:
        await env.client.execute_workflow(
            AgentWorkflow.run,
            "user_123",
            "process_document.pdf",
            id="workflow_replay_test"
        )
        
        # Replay with different random seed should give same result
        result1 = await env.client.get_workflow_history(
            "workflow_replay_test"
        )
        
        # Confirm: no divergence warnings
        assert not result1.has_warnings()
</code></pre>
<h2>Failure modes</h2>
<ol>
<li><p><strong>Non-determinism in activities.</strong> You call <code>random.randint()</code> in an activity. The workflow replays; the random value is different. The history diverges. Temporal detects this as a warning. Fix: never use non-deterministic code in activities (no random, no time. now, no external API calls without caching). Use Temporal's side effects API for non-deterministic operations.</p>
</li>
<li><p><strong>Activity timeout during long-running operation.</strong> An activity has a 5-minute timeout. The tool call takes 6 minutes. Activity fails. Temporal retries. The tool runs again (unless idempotent). Duplicate side effect. Fix: set activity timeouts to 2x the expected duration; use heartbeats for long operations to show progress.</p>
</li>
<li><p><strong>Human approval timeout creates dangling workflows.</strong> A workflow is waiting for human approval. The operator never responds. After 1 hour, the workflow times out and transitions to FAILED. But the approval task is still in Jira, waiting. Inconsistent state. Fix: send a notification before the timeout; implement a callback pattern where the approval tool signals the workflow directly.</p>
</li>
<li><p><strong>Event log explosion on high-frequency state machines.</strong> A workflow with 1,000 transitions per run, runs 1,000 times/second. The event log is 1B events/day. Storage and replay become too slow. Fix: snapshot the workflow state every N events (e.g., every 100); compress old events.</p>
</li>
<li><p><strong>Worker crash during activity execution.</strong> A worker is executing a long-running activity. The worker crashes. Temporal retries from the activity's start (at-least-once semantics). If the tool wasn't idempotent, side effects are duplicated. Fix: always implement idempotency in tools; use idempotency keys.</p>
</li>
</ol>
<h2>When NOT to do this</h2>
<p>Do not use Temporal/durable execution if:</p>
<ul>
<li><p><strong>The workflow is simple (&lt; 3 steps).</strong> Use LangGraph in-memory; operational overhead isn't worth it.</p>
</li>
<li><p><strong>Failures are acceptable and cheap to retry.</strong> If losing a $0.50 request is fine, skip durability.</p>
</li>
<li><p><strong>Your org has no infrastructure team.</strong> Temporal requires operator knowledge. If you're a solo AI engineer, stick with LangGraph.</p>
</li>
</ul>
<h2>What to ship this quarter</h2>
<ul>
<li><p>Map your agent logic as an explicit state machine (diagram) by end of week 1.</p>
</li>
<li><p>Implement as Temporal workflow with 5-7 states and activities by week 2.</p>
</li>
<li><p>Add human-in-the-loop approval gate for high-stakes transitions by week 3.</p>
</li>
<li><p>Tag all tool calls with idempotency keys by the end of week 3.</p>
</li>
<li><p>Deploy to production with 3+ worker replicas by the end of the quarter.</p>
</li>
<li><p>Validate replay and determinism with 100 test cases.</p>
</li>
</ul>
<h2>Further reading</h2>
<p>Top references:</p>
<ol>
<li><p><strong>Temporal Python SDK Documentation.</strong> Workflows, activities, determinism.</p>
</li>
<li><p><strong>Restate Runtime.</strong> Event-driven, durable execution alternative.</p>
</li>
<li><p><strong>Inngest Workflows.</strong> Event-sourced agent execution.</p>
</li>
<li><p><strong>NIST Software Supply Chain: Incident Response.</strong> Traces and replay for forensics.</p>
</li>
<li><p><strong>Idempotency Keys RFC 9110.</strong> HTTP header standard for idempotent requests.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[How I Built ForgeKit: An Open-Source Engineering Acceleration Platform That Scaffolds Production-Ready Projects in Under 60 Seconds]]></title><description><![CDATA[https://github.com/SubhanshuMG/ForgeKit


The Problem Nobody Talks About Honestly
It is 9 AM on a Monday. Your team just got greenlit on a new microservice. You spin up a fresh repo and then spend the]]></description><link>https://blogs.subhanshumg.com/forgekit</link><guid isPermaLink="true">https://blogs.subhanshumg.com/forgekit</guid><category><![CDATA[Open Source]]></category><category><![CDATA[cli]]></category><category><![CDATA[devtools]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[serverless]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[JavaScript]]></category><category><![CDATA[Python]]></category><category><![CDATA[vite]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Tue, 24 Mar 2026 16:30:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/151bca60-9fd4-419e-9a24-c2d69e049345.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a class="embed-card" href="https://github.com/SubhanshuMG/ForgeKit">https://github.com/SubhanshuMG/ForgeKit</a></p>

<hr />
<h2>The Problem Nobody Talks About Honestly</h2>
<p>It is 9 AM on a Monday. Your team just got greenlit on a new microservice. You spin up a fresh repo and then spend the next three hours doing this:</p>
<blockquote>
<p>You copy the Dockerfile from the last project. Strip out service-specific stuff. Realize you forgot the health check. Set up GitHub Actions from a Stack Overflow template that still references Node 16, which hit EOL two years ago. Wire up ESLint, then discover your team standardized on a different config six months ago. Create <code>.env.example</code>, forget two variables, find out at 11 PM during code review.</p>
</blockquote>
<p>You have not written a single line of product code.</p>
<p>This compounds across every new project, every new contributor onboarding, every side project started at 10 PM that burns out before the interesting part begins. It is not a skill problem. It is a tooling problem.</p>
<blockquote>
<p><strong>Engineers lose 2+ hours on every new project to the exact same setup work. ForgeKit eliminates it in one command.</strong></p>
</blockquote>
<p><a class="embed-card" href="https://youtu.be/NLMJQa4lyLI">https://youtu.be/NLMJQa4lyLI</a></p>

<hr />
<h2>What ForgeKit Is</h2>
<p><a href="https://github.com/SubhanshuMG/ForgeKit">ForgeKit</a> is an open-source engineering acceleration platform. One command scaffolds a fully wired, production-ready project: Dockerfile, docker-compose, GitHub Actions CI, test setup, environment config, health endpoint, and stack-specific sensible defaults.</p>
<pre><code class="language-bash">forgekit new my-api --template api-service
</code></pre>
<p>No config to set up. No template repos to fork and clean. The project runs on the first try.</p>
<p>It ships as a TypeScript monorepo published to npm, with six production-grade templates and a React 18 + Vite web dashboard for browsing and launching scaffolds visually.</p>
<hr />
<h2>Real-World Example: Zero to Running FastAPI Service</h2>
<p>Say you are an ML engineer who needs a FastAPI service to expose a model endpoint. The usual path: virtualenv setup, remembering pydantic v2 syntax changes, SQLAlchemy async session wiring, getting the multi-stage Docker build right so the image is not 2 GB, writing CI from scratch.</p>
<p>With ForgeKit:</p>
<pre><code class="language-bash">npm install -g forgekit-cli

forgekit new model-serving-api --template api-service
</code></pre>
<pre><code class="language-plaintext">  ForgeKit, Engineering Acceleration Platform

  Scaffolding model-serving-api from template api-service...
  ✓ Scaffolded 14 files in 1.2s

  Next steps:
    cd model-serving-api
    pip install -r requirements.txt
    cp .env.example .env
    docker-compose up
</code></pre>
<h3>What gets generated</h3>
<table>
<thead>
<tr>
<th>File</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>main.py</code></td>
<td>FastAPI app entrypoint with router</td>
</tr>
<tr>
<td><code>app/config.py</code></td>
<td>Pydantic Settings, env var loading</td>
</tr>
<tr>
<td><code>app/database.py</code></td>
<td>SQLAlchemy async session factory</td>
</tr>
<tr>
<td><code>app/routes/health.py</code></td>
<td><code>/health</code> endpoint with typed response</td>
</tr>
<tr>
<td><code>tests/test_health.py</code></td>
<td>Pytest test passes immediately</td>
</tr>
<tr>
<td><code>Dockerfile</code></td>
<td>Multi-stage build, non-root user</td>
</tr>
<tr>
<td><code>docker-compose.yml</code></td>
<td>App + PostgreSQL with healthcheck</td>
</tr>
<tr>
<td><code>.env.example</code></td>
<td>Every variable documented</td>
</tr>
</tbody></table>
<p>Everything runs. Tests pass. Docker image builds clean. You are building your actual product within minutes.</p>
<hr />
<h2>All Six Templates</h2>
<table>
<thead>
<tr>
<th>Template</th>
<th>Stack</th>
<th>Use Case</th>
</tr>
</thead>
<tbody><tr>
<td><code>web-app</code></td>
<td>Node.js + React + TypeScript + Express + Vite</td>
<td>Full-stack web apps</td>
</tr>
<tr>
<td><code>api-service</code></td>
<td>Python + FastAPI + PostgreSQL + Docker</td>
<td>REST APIs, ML serving</td>
</tr>
<tr>
<td><code>ml-pipeline</code></td>
<td>Python + Jupyter + MLflow + scikit-learn</td>
<td>Reproducible ML experiments</td>
</tr>
<tr>
<td><code>next-app</code></td>
<td>Next.js + Tailwind + TypeScript</td>
<td>Frontend-first apps</td>
</tr>
<tr>
<td><code>go-api</code></td>
<td>Go + Gin + PostgreSQL</td>
<td>High-performance APIs</td>
</tr>
<tr>
<td><code>serverless</code></td>
<td>AWS Lambda + TypeScript</td>
<td>Event-driven functions</td>
</tr>
</tbody></table>
<p>Every template ships with a Dockerfile, docker-compose, <code>.env.example</code>, and at least one passing test.</p>
<hr />
<h2>Architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/4aa99af6-330c-4237-a0c7-177a62105a0b.png" alt="" style="display:block;margin:0 auto" />

<p>ForgeKit is not a glorified <code>cp -r</code>. It has a clean five-layer architecture built for security and extensibility.</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Responsibility</th>
<th>Key Modules</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Interface</strong></td>
<td>User interaction, CLI prompts, web dashboard</td>
<td><code>packages/cli/</code>, <code>packages/web/</code></td>
</tr>
<tr>
<td><strong>Application</strong></td>
<td>Orchestration, validation, execution tracking</td>
<td><code>commands/new.ts</code>, <code>core/validator.ts</code></td>
</tr>
<tr>
<td><strong>Service</strong></td>
<td>Scaffolding engine, template resolver, file writer, security guard</td>
<td><code>core/scaffold.ts</code>, <code>core/security.ts</code></td>
</tr>
<tr>
<td><strong>Data</strong></td>
<td>Template registry, audit logs, config</td>
<td><code>templates/registry.json</code>, <code>core/audit.ts</code></td>
</tr>
<tr>
<td><strong>Infrastructure</strong></td>
<td>CI/CD, npm publishing, docs hosting</td>
<td><code>.github/workflows/</code>, <code>docs/</code></td>
</tr>
</tbody></table>
<h3>The Scaffold Lifecycle</h3>
<p>When you run <code>forgekit new</code>, here is exactly what happens under the hood:</p>
<pre><code class="language-plaintext">forgekit new
    → sanitizeProjectName()        [strips to a-z0-9 only]
    → validateTemplateId()          [regex + blocks path traversal]
    → getTemplate()                 [reads registry.json]
    → writeTemplateFiles()          [for each file:]
        → validatePathContainment() [blocks ../../ escapes]
        → Handlebars.compile()      [renders .hbs templates]
        → fs.outputFile()           [writes to disk]
    → validateHookCommand()         [allowlist: npm/pip/python only]
    → spawnSync(shell: false)       [runs post-scaffold hooks]
    → trackEvent()                  [opt-out telemetry]
</code></pre>
<p>Full source: <a href="https://github.com/SubhanshuMG/ForgeKit/tree/main/packages/cli/src/core">packages/cli/src/core/</a></p>
<hr />
<h2>Security by Design</h2>
<p>Security is a design constraint, not a checklist item. The core assumption is that templates are untrusted input. A malicious template must not escape the output directory or run arbitrary code.</p>
<blockquote>
<p>A template cannot write outside the project root. A hook cannot run <code>curl | bash</code>. A project name cannot inject a path traversal. These are enforced at the code level, not by convention.</p>
</blockquote>
<h3>The four defenses</h3>
<p><strong>Path traversal prevention</strong> validates every destination path against the project root before any write:</p>
<pre><code class="language-typescript">// core/security.ts
export function validatePathContainment(
  targetRoot: string,
  filePath: string
): boolean {
  const resolvedRoot = path.resolve(targetRoot);
  const resolvedFile = path.resolve(targetRoot, filePath);
  return resolvedFile.startsWith(resolvedRoot + path.sep);
}
</code></pre>
<p><strong>Command injection prevention</strong> via an explicit allowlist. Only <code>npm</code>, <code>npx</code>, <code>yarn</code>, <code>pnpm</code>, <code>pip</code>, <code>pip3</code>, <code>python</code>, and <code>python3</code> are allowed. <code>spawnSync</code> uses <code>shell: false</code>.</p>
<p><strong>Name sanitization</strong> strips everything except <code>[a-z0-9-_]</code>. The string <code>../../etc/passwd</code> becomes <code>etc-passwd</code>. No special characters reach the filesystem.</p>
<p><strong>External template validation</strong> checks <code>github:</code> and <code>npm:</code> prefixed IDs against a strict regex before any network call.</p>
<h3>Every generated project ships secure by default</h3>
<table>
<thead>
<tr>
<th>Default</th>
<th>What it protects against</th>
</tr>
</thead>
<tbody><tr>
<td>Non-root Docker <code>USER appuser</code></td>
<td>Container breakout escalation</td>
</tr>
<tr>
<td><code>.env.example</code> with no real secrets</td>
<td>Credential leaks in version control</td>
</tr>
<tr>
<td><code>.gitignore</code> excludes <code>.env</code></td>
<td>Accidental secret commits</td>
</tr>
<tr>
<td>PostgreSQL healthcheck in compose</td>
<td>Startup race conditions</td>
</tr>
<tr>
<td>Multi-stage Docker builds</td>
<td>Bloated images with build tools exposed</td>
</tr>
<tr>
<td>Pinned dependency versions</td>
<td>Supply chain drift</td>
</tr>
</tbody></table>
<hr />
<h2>Generated Code That Actually Runs</h2>
<p>The health route in every <code>api-service</code> scaffold:</p>
<pre><code class="language-python">from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class HealthResponse(BaseModel):
    status: str
    version: str

@router.get("/health", response_model=HealthResponse)
async def health_check():
    return HealthResponse(status="ok", version="0.1.0")
</code></pre>
<p>The test that ships alongside it, passing on first run:</p>
<pre><code class="language-python">from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_health_check():
    response = client.get("/api/v1/health")
    assert response.status_code == 200
    assert response.json()["status"] == "ok"
</code></pre>
<p>Your CI is green before you write a single line of product code.</p>
<p><a class="embed-card" href="https://gist.github.com/SubhanshuMG/29a54512c27445b1d45f07da2d3a40fa">https://gist.github.com/SubhanshuMG/29a54512c27445b1d45f07da2d3a40fa</a></p>

<hr />
<h2>The Template System</h2>
<p>Each template is a directory of files. Some are raw (copied as-is), some are Handlebars <code>.hbs</code> files for variable interpolation. The registry at <code>templates/registry.json</code> defines the manifest:</p>
<pre><code class="language-json">{
  "id": "api-service",
  "name": "API Service (Python + FastAPI)",
  "stack": ["python", "fastapi", "postgresql", "docker"],
  "files": [
    { "src": "main.py", "dest": "main.py" },
    { "src": "README.md.hbs", "dest": "README.md" },
    { "src": "Dockerfile", "dest": "Dockerfile" }
  ],
  "hooks": [
    { "type": "post-scaffold", "command": "pip", "args": ["install", "-r", "requirements.txt"] }
  ]
}
</code></pre>
<p>Handlebars files use <code>{{name}}</code> to inject the project name at scaffold time:</p>
<pre><code class="language-handlebars"># {{name}}

A FastAPI service scaffolded with ForgeKit.
</code></pre>
<p>Full registry: <a href="https://github.com/SubhanshuMG/ForgeKit/blob/main/templates/registry.json">templates/registry.json</a></p>
<hr />
<h2>Testing Strategy</h2>
<p>Three layers, each with a specific purpose.</p>
<p><strong>Unit tests</strong> cover the security functions: sanitizer, path containment, hook allowlist, template ID regex.</p>
<p><strong>Integration tests</strong> scaffold a real project into a temp directory and verify the full file tree:</p>
<pre><code class="language-typescript">it("creates all expected files", async () =&gt; {
  const result = await scaffold({
    projectName: "test-api",
    templateId:  "api-service",
    outputDir:   os.tmpdir(),
    variables:   { name: "test-api" },
    skipInstall: true,
  });

  expect(result.success).toBe(true);
  expect(result.filesCreated).toContain(
    path.join(os.tmpdir(), "test-api", "main.py")
  );
});
</code></pre>
<p><strong>Smoke tests</strong> run the full CLI binary end-to-end against a real filesystem.</p>
<p>Full test suite: <a href="https://github.com/SubhanshuMG/ForgeKit/tree/main/packages/cli/src/__tests__">packages/cli/src/__tests__</a></p>
<hr />
<h2>CI/CD Pipeline</h2>
<p>Seven GitHub Actions workflows, each with a single job:</p>
<table>
<thead>
<tr>
<th>Workflow</th>
<th>Trigger</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>ci.yml</code></td>
<td>push, PR</td>
<td>Typecheck + lint + test + build (Node 20/22 matrix)</td>
</tr>
<tr>
<td><code>publish.yml</code></td>
<td>release tag</td>
<td>npm publish, gated on CI</td>
</tr>
<tr>
<td><code>codeql.yml</code></td>
<td>push, schedule</td>
<td>Static analysis</td>
</tr>
<tr>
<td><code>security-scan.yml</code></td>
<td>push</td>
<td>Dependency audit</td>
</tr>
<tr>
<td><code>dco-check.yml</code></td>
<td>PR</td>
<td>Developer Certificate of Origin</td>
</tr>
<tr>
<td><code>docs.yml</code></td>
<td>push to main</td>
<td>VitePress docs to GitHub Pages</td>
</tr>
<tr>
<td><code>release-drafter.yml</code></td>
<td>PR merge</td>
<td>Auto-draft changelog</td>
</tr>
</tbody></table>
<p>Workflows: <a href="https://github.com/SubhanshuMG/ForgeKit/tree/main/.github/workflows">.github/workflows/</a></p>
<hr />
<h2>How to Add Your Own Template</h2>
<p><strong>Full step-by-step guide (expand)</strong></p>
<p><strong>1.</strong> Create the directory: <code>mkdir templates/django-api</code></p>
<p><strong>2.</strong> Add your files (<code>.hbs</code> for Handlebars interpolation):</p>
<pre><code class="language-plaintext">templates/django-api/
├── manage.py.hbs
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── .env.example
└── README.md.hbs
</code></pre>
<p><strong>3.</strong> Register in <code>templates/registry.json</code>:</p>
<pre><code class="language-json">{
  "id": "django-api",
  "name": "Django API",
  "stack": ["python", "django", "drf", "postgresql"],
  "files": [
    { "src": "manage.py.hbs", "dest": "manage.py" },
    { "src": "Dockerfile", "dest": "Dockerfile" }
  ],
  "hooks": [
    { "type": "post-scaffold", "command": "pip", "args": ["install", "-r", "requirements.txt"] }
  ]
}
</code></pre>
<p><strong>4.</strong> Use <code>{{name}}</code> in <code>.hbs</code> files for project name injection.</p>
<p><strong>5.</strong> Write a test, open a PR. That is the full loop.</p>
<hr />
<h2>The CLI</h2>
<pre><code class="language-bash"># Scaffold (interactive mode)
forgekit new

# Scaffold with flags
forgekit new my-api --template api-service

# Preview without writing files
forgekit new preview --template go-api --dry-run

# List templates
forgekit list

# Check your environment
forgekit doctor
</code></pre>
<hr />
<h2>Why Apache 2.0 and DCO</h2>
<p><strong>Apache 2.0 over MIT:</strong> Includes an explicit patent grant and trademark protection. For an engineering platform, companies will depend on both matters. MIT leaves them ambiguous.</p>
<p><strong>DCO over CLA:</strong> A per-commit sign-off (<code>git commit -s</code>) stating you have the right to submit the code. Same legal effect as a CLA, zero friction. Every PR enforces it via <code>dco-check.yml</code>.</p>
<hr />
<h2>Roadmap</h2>
<table>
<thead>
<tr>
<th>Phase</th>
<th>Status</th>
<th>Focus</th>
</tr>
</thead>
<tbody><tr>
<td>Foundation</td>
<td>Done</td>
<td>Repo structure, governance, CI</td>
</tr>
<tr>
<td>Day-one value</td>
<td>Done</td>
<td>CLI, 6 templates, web dashboard, npm</td>
</tr>
<tr>
<td>Multi-role expansion</td>
<td>Next</td>
<td>Rails, Rust/Axum, React Native templates</td>
</tr>
<tr>
<td>Security + audit</td>
<td>Planned</td>
<td>RBAC, audit log, policy validation</td>
</tr>
<tr>
<td>Community growth</td>
<td>Planned</td>
<td>Public template registry, contribution pathways</td>
</tr>
<tr>
<td>Enterprise</td>
<td>Future</td>
<td>Team templates, analytics, self-hosted</td>
</tr>
</tbody></table>
<p>Full roadmap: <a href="https://github.com/SubhanshuMG/ForgeKit/blob/main/ROADMAP.md">ROADMAP.md</a></p>
<hr />
<h2>Get Started</h2>
<pre><code class="language-bash">npm install -g forgekit-cli
forgekit new
</code></pre>
<blockquote>
<p><strong>The repo is public, CI is green, the package is live on npm. Star the repo. Try the CLI. Contribute a template.</strong></p>
</blockquote>
<p><a class="embed-card" href="https://www.producthunt.com/products/forgekit-2">https://www.producthunt.com/products/forgekit-2</a></p>

<hr />
<h2>Contributing</h2>
<p>Best first contributions right now:</p>
<ul>
<li><p><strong>Add a new template</strong> (highest impact, see guide above)</p>
</li>
<li><p><strong>Improve error messages</strong> in <code>packages/cli/src/core/</code></p>
</li>
<li><p><strong>Add test coverage</strong> for security function edge cases</p>
</li>
<li><p><strong>Improve the web dashboard</strong> (React 18 + Vite)</p>
</li>
<li><p><strong>Write or improve docs</strong> (VitePress)</p>
</li>
</ul>
<p><a href="https://github.com/SubhanshuMG/ForgeKit/blob/main/CONTRIBUTING.md">CONTRIBUTING.md</a> has the full guide.</p>
<hr />
<blockquote>
<p><strong>GitHub:</strong> <a href="https://github.com/SubhanshuMG/ForgeKit">github.com/SubhanshuMG/ForgeKit</a><br /><strong>npm:</strong> <a href="https://www.npmjs.com/package/forgekit-cli">npmjs.com/package/forgekit-cli</a><br /><strong>Website:</strong> <a href="https://forgekit.build">forgekit.build</a></p>
</blockquote>
<p><em>Apache 2.0. Built by</em> <a href="https://subhanshumg.com"><em>Subhanshu Mohan Gupta</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building a Deployment Health Validator]]></title><description><![CDATA[A deep-dive into microservice health checking, topological startup ordering and why HTTP 200 does not mean a service is healthy.

The Incident
Picture this: your on-call rotation fires a PagerDuty ale]]></description><link>https://blogs.subhanshumg.com/building-a-deployment-health-validator</link><guid isPermaLink="true">https://blogs.subhanshumg.com/building-a-deployment-health-validator</guid><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Python]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Tue, 10 Mar 2026 19:52:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/6eee3f26-22ab-4126-b978-5b5c2400a98d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A deep-dive into microservice health checking, topological startup ordering and why HTTP 200 does not mean a service is healthy.</em></p>
<hr />
<h1>The Incident</h1>
<p>Picture this: your on-call rotation fires a PagerDuty alert at 3:07 AM. The deployment pipeline says green. Every service returned HTTP 200. The readiness check passed. But your notification pipeline is completely silent, job workers are backed up with 1,482 queued tasks, and customers are already filing tickets.</p>
<p>You pull up the deployment validator logs. Everything looks fine on the surface. The tool reported <code>overall_status: healthy</code>. It was wrong on every count.</p>
<p>This exact scenario is what the Deployment Health Validator task is built around. Five real-world bugs, carefully placed, each one plausible enough that most automated agents (and plenty of human engineers) miss at least one. This post walks through each bug, the architecture behind the validator, and how to build one correctly from scratch.</p>
<hr />
<h1>What Are We Building?</h1>
<p>A deployment health validator does four things:</p>
<ol>
<li><p>Reads a manifest that describes your microservices (ports, health endpoints, dependencies, criticality)</p>
</li>
<li><p>Probes each service's health endpoint and parses the response body, not just the HTTP status</p>
</li>
<li><p>Computes a weighted readiness score based on how critical each service is</p>
</li>
<li><p>Runs a topological sort (Kahn's algorithm) to produce a correct startup order where every dependency starts before the services that depend on it</p>
</li>
</ol>
<p>The output is a structured JSON report that your deployment pipeline can parse and gate on.</p>
<hr />
<h1>Architecture Overview</h1>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/631a0a96-81e5-4923-ae11-92089a2fe6c7.png" alt="" style="display:block;margin:0 auto" />

<h3><strong>Production Stack <em>(mock services)</em></strong></h3>
<table>
<thead>
<tr>
<th>Service</th>
<th>Port</th>
<th>Endpoint</th>
<th>Response</th>
</tr>
</thead>
<tbody><tr>
<td>auth-service</td>
<td>:8081</td>
<td>/health</td>
<td><code>{"status": "ok"}</code></td>
</tr>
<tr>
<td>api-gateway</td>
<td>:8082</td>
<td>/health</td>
<td><code>{"status": "healthy"}</code></td>
</tr>
<tr>
<td>cache-service</td>
<td>:8083</td>
<td>/ping</td>
<td><code>"pong"</code> (plain text!)</td>
</tr>
<tr>
<td>worker-service</td>
<td>:8084</td>
<td>/status</td>
<td><code>{"status": "degraded", "queue_depth": 1482}</code></td>
</tr>
<tr>
<td>notification-service</td>
<td>:8085</td>
<td>/health</td>
<td><code>{"status": "ok"}</code></td>
</tr>
</tbody></table>
<h3><strong>Dependency Graph</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/6856b887-61ac-4575-ab3c-fb3ac44cc156.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h1>Real-World Parallel: Netflix's Hystrix, AWS Health Dashboards and Kubernetes Readiness Probes</h1>
<p>Before diving into code, here's why this problem matters in production systems.</p>
<p><strong>AWS Elastic Load Balancer</strong> will route traffic to a backend if its health check returns HTTP 200 on <code>/health</code>. That's it. If your app returns <code>{"status": "starting_up"}</code> with a 200, ELB thinks it's healthy and sends it live traffic. This exact failure mode took down a major e-commerce platform in 2019 during a Black Friday deploy.</p>
<p><strong>Kubernetes readiness probes</strong> are a direct response to this problem. A pod's readiness probe checks whether it should receive traffic, separate from the liveness probe, which checks if it should be restarted. The probe can check an HTTP endpoint, but Kubernetes does not parse the body. You have to do that yourself, in your validator.</p>
<p><strong>Netflix's Chaos Engineering toolkit</strong> specifically tests whether services correctly report unhealthy states under load. <code>worker-service</code> in this task is modelled on exactly this pattern: the service is technically "up" (HTTP 200) but degraded under load (<code>queue_depth: 1482</code>). A naive health checker marks it healthy. A correct one reads the body.</p>
<hr />
<h1>Project Structure</h1>
<pre><code class="language-plaintext">deployment-health-validator/
├── task.toml                         # Task metadata, timeouts, resource limits
├── instruction.md                    # What an agent (or engineer) must fix
├── environment/
│   ├── Dockerfile                    # Container definition
│   ├── deployment_manifest.yaml      # Service definitions (with a decoy key)
│   ├── mock_services.py              # Five Flask servers simulating endpoints
│   └── validator.py                  # THE BROKEN FILE
├── solution/
│   └── solve.sh                      # Oracle: patches all 5 bugs and runs
└── tests/
    └── test_outputs.py               # 19 pytest assertions
</code></pre>
<blockquote>
<p><strong>GitHub:</strong> <a href="https://github.com/SubhanshuMG/terminal-bench-2-hard-devops-diagnostics">terminal-bench-2-hard-devops-diagnostics</a></p>
</blockquote>
<hr />
<h1>Step-by-Step Implementation Guide</h1>
<h2>Step 1: Set Up the Environment</h2>
<pre><code class="language-bash">git clone https://github.com/SubhanshuMG/terminal-bench-2-hard-devops-diagnostics.git
cd terminal-bench-2-hard-devops-diagnostics

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install flask requests pyyaml pytest
</code></pre>
<hr />
<h2>Step 2: Write the Deployment Manifest</h2>
<p>The manifest has one deliberate trap: a top-level <code>services:</code> key that contains only a legacy monitoring entry. The real services live under <code>deployment.services</code>. This mirrors real-world YAML configs where legacy keys accumulate over time and create ambiguity.</p>
<pre><code class="language-yaml"># deployment_manifest.yaml

# Legacy monitoring registry (NOT the authoritative list)
services:
  - name: "metrics-collector"
    port: 9091
    health_endpoint: "/metrics"
    dependencies: []
    criticality: "low"

# Authoritative deployment configuration
deployment:
  name: "production-stack"

  services:
    - name: "auth-service"
      port: 8081
      health_endpoint: "/health"
      dependencies: []
      criticality: "high"

    - name: "api-gateway"
      port: 8082
      health_endpoint: "/health"
      dependencies:
        - "auth-service"
        - "cache-service"
      criticality: "high"

    - name: "cache-service"
      port: 8083
      health_endpoint: "/ping"       # note: NOT /health
      dependencies: []
      criticality: "medium"

    - name: "worker-service"
      port: 8084
      health_endpoint: "/status"     # returns HTTP 200 but body says degraded
      dependencies:
        - "api-gateway"
        - "cache-service"
      criticality: "low"

    - name: "notification-service"
      port: 8085
      health_endpoint: "/health"
      dependencies:
        - "worker-service"
      criticality: "low"
</code></pre>
<hr />
<h2>Step 3: Build the Mock Services</h2>
<p>These five Flask servers simulate real microservice health endpoints. The key design choice: <code>worker-service</code> returns HTTP 200 with <code>{"status": "degraded"}</code>. Any validator that only checks the status code will silently miss this.</p>
<pre><code class="language-python"># mock_services.py

#!/usr/bin/env python3
import threading
from flask import Flask, jsonify

import logging
log = logging.getLogger("werkzeug")
log.setLevel(logging.ERROR)

auth_app = Flask("auth-service")

@auth_app.route("/health")
def auth_health():
    return jsonify({"status": "ok", "version": "2.1.0"}), 200


gateway_app = Flask("api-gateway")

@gateway_app.route("/health")
def gateway_health():
    return jsonify({"status": "healthy", "uptime_seconds": 3601}), 200


cache_app = Flask("cache-service")

@cache_app.route("/ping")
def cache_ping():
    return "pong", 200          # plain text, not JSON


worker_app = Flask("worker-service")

@worker_app.route("/status")
def worker_status():
    # HTTP 200 but the service is overloaded
    return jsonify({"status": "degraded", "queue_depth": 1482}), 200


notif_app = Flask("notification-service")

@notif_app.route("/health")
def notif_health():
    return jsonify({"status": "ok", "pending_notifications": 0}), 200


def _run(app, port):
    app.run(host="0.0.0.0", port=port, debug=False, use_reloader=False)


if __name__ == "__main__":
    specs = [
        (auth_app,    8081, "auth-service         /health"),
        (gateway_app, 8082, "api-gateway          /health"),
        (cache_app,   8083, "cache-service        /ping  "),
        (worker_app,  8084, "worker-service       /status"),
        (notif_app,   8085, "notification-service /health"),
    ]

    threads = []
    for app, port, label in specs:
        t = threading.Thread(target=_run, args=(app, port), daemon=True)
        t.start()
        threads.append(t)
        print(f"  started  {label}  -&gt; http://0.0.0.0:{port}")

    print("All mock services running. Ctrl-C to stop.")
    for t in threads:
        t.join()
</code></pre>
<p>Start them in one terminal: <code>python mock_services.py</code></p>
<p>Verify manually:</p>
<pre><code class="language-bash">curl http://localhost:8081/health    # {"status": "ok", "version": "2.1.0"}
curl http://localhost:8082/health    # {"status": "healthy", "uptime_seconds": 3601}
curl http://localhost:8083/ping      # pong
curl http://localhost:8084/status    # {"status": "degraded", "queue_depth": 1482}
curl http://localhost:8085/health    # {"status": "ok", "pending_notifications": 0}
</code></pre>
<blockquote>
<p><strong>First discipline:</strong> always <code>curl</code> your endpoints before writing a health checker. The field names, the endpoint paths, and the semantic meaning of a 200 response all matter.</p>
</blockquote>
<hr />
<h2>Step 4: The Broken Validator (And All Five Bugs)</h2>
<p>Here is <code>validator.py</code> as it ships in the repository, with each bug annotated:</p>
<pre><code class="language-python"># validator.py (BROKEN)

#!/usr/bin/env python3
import json
import yaml
import requests
from datetime import datetime, timezone
from collections import deque


def load_services(manifest_path: str) -&gt; list:
    with open(manifest_path) as f:
        config = yaml.safe_load(f)
    # BUG 1: reads config["services"] which is the legacy monitoring entry
    # only "metrics-collector" on port 9091 is returned; none of the real services
    return config["services"]


def check_health(service: dict) -&gt; dict:
    url = f"http://127.0.0.1:{service['port']}{service['health_endpoint']}"
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            return {"status": "unhealthy", "http_status": resp.status_code,
                    "criticality": service["criticality"]}
        try:
            body = resp.json()
            # BUG 2: reads "health_status" key; no service ever sends this field
            # body.get("health_status", "ok") always returns default "ok"
            # worker-service sends {"status": "degraded"} but is reported healthy
            state = body.get("health_status", "ok")
        except ValueError:
            state = "ok"
        healthy = state in ("ok", "up", "healthy")
        return {"status": "healthy" if healthy else "unhealthy",
                "http_status": resp.status_code,
                "criticality": service["criticality"]}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy", "http_status": 0,
                "criticality": service["criticality"]}


def compute_startup_order(services: list) -&gt; list:
    names = [s["name"] for s in services]
    deps_map = {s["name"]: s.get("dependencies", []) for s in services}

    graph     = {n: [] for n in names}
    in_degree = {n: 0 for n in names}

    for svc, deps in deps_map.items():
        for dep in deps:
            # BUG 3: edges are reversed; leaf nodes get scheduled first
            # should be graph[dep].append(svc) and in_degree[svc] += 1
            graph[svc].append(dep)
            in_degree[dep] += 1

    queue = deque(n for n in names if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return order


def compute_readiness_score(services: list, statuses: dict) -&gt; float:
    # BUG 4: all weights equal 1; criticality is ignored entirely
    # spec says high=3, medium=2, low=1
    weight_map = {"high": 1, "medium": 1, "low": 1}

    total   = sum(weight_map[s["criticality"]] for s in services)
    healthy = sum(weight_map[s["criticality"]]
                  for s in services
                  if statuses[s["name"]]["status"] == "healthy")
    return round(healthy / total, 4) if total else 0.0


def determine_status(services: list, statuses: dict, score: float):
    # BUG 5: checks ALL services instead of only high-criticality ones
    # worker-service (low criticality) being unhealthy triggers "critical"
    critical_ok = all(
        statuses[s["name"]]["status"] == "healthy"
        for s in services    # should be: for s in services if s["criticality"] == "high"
    )

    if not critical_ok:
        return "critical", critical_ok
    if score &gt;= 0.95:
        return "healthy", critical_ok
    if score &gt;= 0.70:
        return "degraded", critical_ok
    return "not_ready", critical_ok
</code></pre>
<hr />
<h2>Breaking Down Each Bug</h2>
<h3>Bug 1: Wrong YAML Key Path</h3>
<pre><code class="language-python"># BROKEN:
return config["services"]

# FIXED:
return config["deployment"]["services"]
</code></pre>
<p>The manifest has two <code>services</code> keys at different nesting levels. The top-level one is explicitly labelled as a "legacy monitoring registry." The validator reads the wrong one, picks up only <code>metrics-collector</code>, and probes port 9091 instead of the real five services.</p>
<p><strong>Why it's hard to catch:</strong> The broken code runs without errors. It probes an endpoint (9091) that isn't listening, gets a connection refused, marks <code>metrics-collector</code> as unhealthy, and writes a report that looks structurally valid. No stack trace — just wrong data.</p>
<h3>Bug 2: Wrong JSON Body Field Name <em>(the hardest one)</em></h3>
<pre><code class="language-python"># BROKEN:
state = body.get("health_status", "ok")

# FIXED:
state = body.get("status", "ok")
</code></pre>
<p>This is the most insidious bug in the set. <code>worker-service</code> returns <code>{"status": "degraded", "queue_depth": 1482}</code> with HTTP 200. The broken code reads the <code>health_status</code> field, which doesn't exist in any response. <code>dict.get()</code> returns the default <code>"ok"</code>. The validator marks <code>worker-service</code> as healthy.</p>
<p><strong>Real-world equivalent:</strong> An AWS Lambda function returns <code>{"statusCode": 200, "body": "{\"error\": \"DB_TIMEOUT\"}"}</code>. If you check <code>response.statusCode == 200</code> and call it done, you miss the error in the body entirely.</p>
<p><strong>Why agents miss this:</strong> The output looks correct. Four services are healthy. No exceptions are thrown. You would only catch it by curling the endpoint yourself and tracing exactly which field the code reads from the response.</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>Body</th>
<th>Field read (broken)</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td>auth-service</td>
<td><code>{"status": "ok"}</code></td>
<td><code>health_status</code> (missing)</td>
<td>default "ok" → healthy</td>
</tr>
<tr>
<td>api-gateway</td>
<td><code>{"status": "healthy"}</code></td>
<td><code>health_status</code> (missing)</td>
<td>default "ok" → healthy</td>
</tr>
<tr>
<td>cache-service</td>
<td><code>"pong"</code> (not JSON)</td>
<td><code>ValueError</code> caught</td>
<td>"ok" → healthy</td>
</tr>
<tr>
<td>worker-service</td>
<td><code>{"status": "degraded"}</code></td>
<td><code>health_status</code> (missing)</td>
<td>default "ok" → <strong>wrongly healthy</strong></td>
</tr>
<tr>
<td>notification-service</td>
<td><code>{"status": "ok"}</code></td>
<td><code>health_status</code> (missing)</td>
<td>default "ok" → healthy</td>
</tr>
</tbody></table>
<h3>Bug 3: Reversed Topological Sort</h3>
<pre><code class="language-python"># BROKEN (edges reversed):
graph[svc].append(dep)
in_degree[dep] += 1

# FIXED (dep must start before svc):
graph[dep].append(svc)
in_degree[svc] += 1
</code></pre>
<p>Kahn's algorithm itself is structurally correct. The bug is in how the graph is constructed. The broken code points from a service back to its dependencies, inverting the dependency flow. Nodes with no outgoing edges (the real leaf nodes like <code>notification-service</code>) end up with zero in-degree and get scheduled first.</p>
<p><strong>Correct startup order:</strong></p>
<pre><code class="language-plaintext">auth-service → cache-service → api-gateway → worker-service → notification-service
</code></pre>
<p><strong>Broken output:</strong></p>
<pre><code class="language-plaintext">notification-service → worker-service → api-gateway → cache-service → auth-service
</code></pre>
<p><strong>Real-world consequence:</strong> You start <code>notification-service</code> before <code>worker-service</code> is ready. It tries to connect, fails, and crashes. Your deployment fails not because of a bug in a service, but because of a bug in the tool that decides the order to start services.</p>
<h3>Bug 4: All Criticality Weights Equal 1</h3>
<pre><code class="language-python"># BROKEN:
weight_map = {"high": 1, "medium": 1, "low": 1}

# FIXED:
weight_map = {"high": 3, "medium": 2, "low": 1}
</code></pre>
<p>The task specification defines weights of 3/2/1 for high/medium/low criticality. With equal weights:</p>
<pre><code class="language-plaintext">Broken score:  4 healthy out of 5 total = 0.8
Correct score: (3 + 3 + 2 + 1) / (3 + 3 + 2 + 1 + 1) = 9/10 = 0.9
</code></pre>
<p>The broken score still exceeds the 0.70 threshold and stays below 0.95, so the overall status computation might accidentally give the right answer for the wrong reason. But downstream systems relying on an accurate score for SLA calculations will get wrong numbers.</p>
<h3>Bug 5: Critical Services Check Ignores Criticality</h3>
<pre><code class="language-python"># BROKEN:
critical_ok = all(
    statuses[s["name"]]["status"] == "healthy"
    for s in services
)

# FIXED:
critical_ok = all(
    statuses[s["name"]]["status"] == "healthy"
    for s in services
    if s["criticality"] == "high"
)
</code></pre>
<p>With <code>worker-service</code> (low criticality) being unhealthy, the broken code sets <code>critical_services_healthy = False</code> and returns <code>overall_status: "critical"</code>. The correct answer is <code>"degraded"</code>All high-criticality services are healthy, but the readiness score (0.9) falls below the 0.95 threshold for a fully healthy deployment.</p>
<p><strong>Real-world consequence:</strong> A "critical" status might trigger an automated rollback, page every on-call engineer in the org, or block a release. Triggering a critical alarm because a low-priority background worker is degraded is exactly the kind of alert fatigue that causes engineers to start ignoring pages.</p>
<hr />
<h2>Step 5: The Fixed Validator</h2>
<pre><code class="language-python"># validator_fixed.py

#!/usr/bin/env python3
import json
import yaml
import requests
from datetime import datetime, timezone
from collections import deque


def load_services(manifest_path: str) -&gt; list:
    with open(manifest_path) as f:
        config = yaml.safe_load(f)
    # FIX 1: correct key path
    return config["deployment"]["services"]


def check_health(service: dict) -&gt; dict:
    url = f"http://127.0.0.1:{service['port']}{service['health_endpoint']}"
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            return {"status": "unhealthy", "http_status": resp.status_code,
                    "criticality": service["criticality"]}
        try:
            body = resp.json()
            # FIX 2: read the correct field name "status"
            status_ok = body.get("status", "ok") in ("ok", "up", "healthy")
        except ValueError:
            status_ok = True  # non-JSON (e.g. "pong"): HTTP 200 is sufficient
        return {"status": "healthy" if status_ok else "unhealthy",
                "http_status": resp.status_code,
                "criticality": service["criticality"]}
    except requests.exceptions.RequestException:
        return {"status": "unhealthy", "http_status": 0,
                "criticality": service["criticality"]}


def compute_startup_order(services: list) -&gt; list:
    names = [s["name"] for s in services]
    deps_map = {s["name"]: s.get("dependencies", []) for s in services}

    graph     = {n: [] for n in names}
    in_degree = {n: 0 for n in names}

    for svc, deps in deps_map.items():
        for dep in deps:
            # FIX 3: correct edge direction; dep starts before svc
            graph[dep].append(svc)
            in_degree[svc] += 1

    queue = deque(n for n in names if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return order


def compute_readiness_score(services: list, statuses: dict) -&gt; float:
    # FIX 4: correct criticality weights
    weight_map = {"high": 3, "medium": 2, "low": 1}

    total   = sum(weight_map[s["criticality"]] for s in services)
    healthy = sum(weight_map[s["criticality"]]
                  for s in services
                  if statuses[s["name"]]["status"] == "healthy")
    return round(healthy / total, 4) if total else 0.0


def determine_status(services: list, statuses: dict, score: float):
    # FIX 5: only check high-criticality services
    critical_ok = all(
        statuses[s["name"]]["status"] == "healthy"
        for s in services
        if s["criticality"] == "high"
    )

    if not critical_ok:
        return "critical", critical_ok
    if score &gt;= 0.95:
        return "healthy", critical_ok
    if score &gt;= 0.70:
        return "degraded", critical_ok
    return "not_ready", critical_ok


def main():
    manifest_path = "/app/deployment_manifest.yaml"
    output_path   = "/app/deployment_report.json"

    services = load_services(manifest_path)

    statuses = {}
    for svc in services:
        result = check_health(svc)
        statuses[svc["name"]] = result
        print(f"  {svc['name']:25s} {result['status']:10s} (HTTP {result['http_status']})")

    startup_order = compute_startup_order(services)
    score = compute_readiness_score(services, statuses)
    overall_status, critical_ok = determine_status(services, statuses, score)

    report = {
        "deployment_name":           "production-stack",
        "overall_status":            overall_status,
        "readiness_score":           score,
        "service_statuses":          statuses,
        "startup_order":             startup_order,
        "critical_services_healthy": critical_ok,
        "timestamp":                 datetime.now(timezone.utc).isoformat(),
    }

    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)

    print(f"\nReport written to {output_path}")
    print(f"Overall status : {overall_status}")
    print(f"Readiness score: {score}")


if __name__ == "__main__":
    main()
</code></pre>
<hr />
<h2>Step 6: Expected Output</h2>
<pre><code class="language-json">{
  "deployment_name": "production-stack",
  "overall_status": "degraded",
  "readiness_score": 0.9,
  "service_statuses": {
    "auth-service":         { "status": "healthy",   "http_status": 200, "criticality": "high" },
    "api-gateway":          { "status": "healthy",   "http_status": 200, "criticality": "high" },
    "cache-service":        { "status": "healthy",   "http_status": 200, "criticality": "medium" },
    "worker-service":       { "status": "unhealthy", "http_status": 200, "criticality": "low" },
    "notification-service": { "status": "healthy",   "http_status": 200, "criticality": "low" }
  },
  "startup_order": [
    "auth-service",
    "cache-service",
    "api-gateway",
    "worker-service",
    "notification-service"
  ],
  "critical_services_healthy": true,
  "timestamp": "2024-01-01T00:00:00+00:00"
}
</code></pre>
<p><strong>Score derivation</strong> (high=3, medium=2, low=1):</p>
<pre><code class="language-plaintext">Healthy weight: auth(3) + api-gateway(3) + cache(2) + notification(1) = 9
Total weight:   9 + worker(1) = 10
Readiness score: 9/10 = 0.9

Status logic:
  critical_services_healthy = true   (both high services are healthy)
  score = 0.9, which is &lt; 0.95
  result: "degraded"
</code></pre>
<hr />
<h2>Step 7: The Dockerfile</h2>
<pre><code class="language-dockerfile">FROM python:3.12-slim

RUN apt-get update &amp;&amp; \
    apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/
RUN pip install --no-cache-dir flask==3.0.3 requests==2.32.3 pyyaml==6.0.2

COPY * /app/

# Start mock services in background, wait for auth-service to be ready,
# then hold the container open for the agent or oracle to exec into.
CMD ["/bin/bash", "-c", \
     "python /app/mock_services.py &gt; /tmp/mock_services.log 2&gt;&amp;1 &amp; \
      until curl -sf http://localhost:8081/health &gt; /dev/null 2&gt;&amp;1; do sleep 1; done &amp;&amp; \
      sleep infinity"]
</code></pre>
<p>Build and run:</p>
<pre><code class="language-bash">docker build -t deployment-validator ./deployment-health-validator/environment/
docker run -it --rm deployment-validator bash

# Inside the container:
python /app/mock_services.py &amp;
sleep 2
python /app/validator.py     # or run the fixed version
cat /app/deployment_report.json
</code></pre>
<hr />
<h2>Step 8: Run the Test Suite</h2>
<p>The 19 tests break down as follows:</p>
<table>
<thead>
<tr>
<th>Test</th>
<th>Catches</th>
</tr>
</thead>
<tbody><tr>
<td><code>test_report_file_exists</code></td>
<td>Validator ran at all</td>
</tr>
<tr>
<td><code>test_report_top_level_keys</code></td>
<td>Schema correctness</td>
</tr>
<tr>
<td><code>test_deployment_name</code></td>
<td>Bug 1 (wrong key path returns wrong name)</td>
</tr>
<tr>
<td><code>test_timestamp_format</code></td>
<td>ISO 8601 UTC format</td>
</tr>
<tr>
<td><code>test_all_five_services_present</code></td>
<td>Bug 1 (only 1 service if wrong key)</td>
</tr>
<tr>
<td><code>test_auth_service_healthy</code></td>
<td>Service probing works</td>
</tr>
<tr>
<td><code>test_api_gateway_healthy</code></td>
<td>Service probing works</td>
</tr>
<tr>
<td><code>test_cache_service_healthy</code></td>
<td>Agent read the manifest (endpoint is <code>/ping</code>)</td>
</tr>
<tr>
<td><code>test_worker_service_unhealthy</code></td>
<td>Bug 2 (wrong field name)</td>
</tr>
<tr>
<td><code>test_notification_service_healthy</code></td>
<td>Service probing works</td>
</tr>
<tr>
<td><code>test_service_criticality_values</code></td>
<td>Manifest parsing</td>
</tr>
<tr>
<td><code>test_readiness_score</code></td>
<td>Bug 4 (weights)</td>
</tr>
<tr>
<td><code>test_critical_services_healthy</code></td>
<td>Bug 5 (criticality filter)</td>
</tr>
<tr>
<td><code>test_overall_status_degraded</code></td>
<td>All bugs combined</td>
</tr>
<tr>
<td><code>test_startup_order_has_all_services</code></td>
<td>Bug 3 (topo sort)</td>
</tr>
<tr>
<td><code>test_startup_order_auth_before_gateway</code></td>
<td>Bug 3</td>
</tr>
<tr>
<td><code>test_startup_order_cache_before_gateway</code></td>
<td>Bug 3</td>
</tr>
<tr>
<td><code>test_startup_order_gateway_before_worker</code></td>
<td>Bug 3</td>
</tr>
<tr>
<td><code>test_startup_order_worker_before_notification</code></td>
<td>Bug 3</td>
</tr>
</tbody></table>
<p>Run them:</p>
<pre><code class="language-bash"># Inside the container, after running the validator
pytest /app/tests/ -v
</code></pre>
<p>A passing run looks like:</p>
<pre><code class="language-plaintext">PASSED tests/test_outputs.py::test_report_file_exists
PASSED tests/test_outputs.py::test_report_top_level_keys
PASSED tests/test_outputs.py::test_deployment_name
PASSED tests/test_outputs.py::test_timestamp_format
PASSED tests/test_outputs.py::test_all_five_services_present
PASSED tests/test_outputs.py::test_auth_service_healthy
PASSED tests/test_outputs.py::test_api_gateway_healthy
PASSED tests/test_outputs.py::test_cache_service_healthy
PASSED tests/test_outputs.py::test_worker_service_unhealthy
PASSED tests/test_outputs.py::test_notification_service_healthy
PASSED tests/test_outputs.py::test_service_criticality_values
PASSED tests/test_outputs.py::test_readiness_score
PASSED tests/test_outputs.py::test_critical_services_healthy
PASSED tests/test_outputs.py::test_overall_status_degraded
PASSED tests/test_outputs.py::test_startup_order_has_all_services
PASSED tests/test_outputs.py::test_startup_order_auth_before_gateway
PASSED tests/test_outputs.py::test_startup_order_cache_before_gateway
PASSED tests/test_outputs.py::test_startup_order_gateway_before_worker
PASSED tests/test_outputs.py::test_startup_order_worker_before_notification

19 passed in 0.12s
</code></pre>
<hr />
<h2>Step 9: Running as a Terminal Bench 2.0 Task</h2>
<p>This repository is also a valid Terminal Bench 2.0 task submission. You can run it against an AI agent to measure how reliably the agent finds all five bugs.</p>
<pre><code class="language-bash">pip install bespokelabs-harbor

export GROQ_API_KEY=&lt;your-key&gt;

# Verify the oracle (the task must be solvable before you can score agents against it)
harbor run -p ./deployment-health-validator -a oracle -q

# Run an agent trial; k=10 gives a statistically meaningful success rate
harbor run -p ./deployment-health-validator \
    -a terminus-2 \
    -m groq/moonshotai/kimi-k2-instruct-0905 \
    -k 10
</code></pre>
<hr />
<h1>The Difficulty Calibration Story</h1>
<p>Getting the task into the "hard" range (between 0% and 70% agent success) required five iterations:</p>
<table>
<thead>
<tr>
<th>Iteration</th>
<th>Bug 2 Design</th>
<th>Agent Success</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><code>"healthy"</code> missing from accepted values</td>
<td>100% — too easy</td>
</tr>
<tr>
<td>2</td>
<td>HTTP-only check, no body parsing</td>
<td>0% — too hard</td>
</tr>
<tr>
<td>3</td>
<td>No body check, but explicit hint about valid values</td>
<td>90% — still too easy</td>
</tr>
<tr>
<td>4</td>
<td>No body check, vague hint about inspecting body</td>
<td>0% — agents ignore it</td>
</tr>
<tr>
<td>5 (final)</td>
<td>Wrong field name with plausible default</td>
<td>~40–60% — correct range</td>
</tr>
</tbody></table>
<p>The key insight: <strong>a bug must produce plausible-looking output without throwing exceptions.</strong> A validator that crashes is trivially easy to fix. A validator that silently produces wrong answers is genuinely hard, because you have to know what the correct answer should be before you can spot the discrepancy.</p>
<p>This is the same principle behind production incident post-mortems. The hardest outages aren't the ones where something crashes; they're the ones where something runs successfully while doing the wrong thing.</p>
<hr />
<h1>Key Takeaways</h1>
<ol>
<li><p><strong>HTTP status codes are not health signals.</strong> Always parse the response body and check the semantic status field. A <code>200 OK</code> with <code>{"status": "degraded"}</code> means the service is degraded.</p>
</li>
<li><p><strong>Read your YAML carefully.</strong> Manifest files accumulate legacy keys over time. Know which section of the config is authoritative. Comment it explicitly.</p>
</li>
<li><p><strong>Topological sort edge direction matters.</strong> In Kahn's algorithm, an edge from A to B means A comes before B. If your dependency means "A must exist before B starts," the edge is <code>A → B</code>, <code>in_degree[B] += 1</code>. It's easy to get this exactly backwards.</p>
</li>
<li><p><strong>Weighted scoring should reflect business priority.</strong> Equal weights mean a low-priority background worker failing counts the same as your authentication service being down. That's not true in production.</p>
</li>
<li><p><strong>Critical service filtering should be explicit.</strong> If your alerting fires "critical" when any service is unhealthy, you'll desensitize your on-call team within a week. Be precise about which services actually matter for the critical threshold.</p>
</li>
<li><p><strong>Test with assertions about semantic values, not just structure.</strong> Checking that <code>deployment_report.json</code> exists is not enough. Check that <code>worker-service.status == "unhealthy"</code>, that <code>readiness_score == 0.9</code>, and that <code>startup_order.index("auth-service") &lt; startup_order.index("api-gateway")</code>.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Platform Engineering at the Edge]]></title><description><![CDATA[No Internal Developer Platform today supports disconnected edge environments. Backstage, Port, and Cortex all assume always-on cloud connectivity, yet organizations like GE HealthCare (100+ hospital-e]]></description><link>https://blogs.subhanshumg.com/platform-engineering-at-the-edge</link><guid isPermaLink="true">https://blogs.subhanshumg.com/platform-engineering-at-the-edge</guid><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Security]]></category><category><![CDATA[edgecomputing]]></category><category><![CDATA[gitops]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Tue, 03 Mar 2026 10:30:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/ce98965e-2ace-49f5-955f-55357a09a3bc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>No Internal Developer Platform today supports disconnected edge environments.</strong> Backstage, Port, and Cortex all assume always-on cloud connectivity, yet organizations like GE HealthCare (100+ hospital-edge Kubernetes clusters), Chick-fil-A (2,800+ restaurant clusters), and Axiom Space (MicroShift on the International Space Station) are already running containerized workloads where the internet is a luxury. Building an IDP for the edge requires rethinking every assumption: state sync becomes eventually consistent, GitOps becomes store-and-forward, and compliance enforcement must work with zero network access. The tooling exists; K3s, Zarf, Fleet, OPA Gatekeeper, SPIFFE/SPIRE, CRDTs. Assembling them into a coherent platform demands architectural patterns fundamentally different from cloud-native defaults.</p>
<hr />
<h2>Edge Compute Is Already Running Containers at Surprising Scale</h2>
<p>The edge is no longer a future state. <strong>SpaceX operates 30,000+ Linux nodes in orbit</strong> across the Starlink constellation, with each satellite containing roughly 60 Linux computers alongside custom ASICs from STMicroelectronics (which delivers 5+ million chips per day to SpaceX). While SpaceX hasn't confirmed containers in orbit, their ground infrastructure runs Kubernetes, Docker, Kafka, and HBase extensively, serving 9+ million users with software updates pushed to satellites weekly. The V3 satellites deliver over 1 Tbps fronthaul throughput per bird.</p>
<p>On factory floors, NVIDIA's Jetson Orin lineup delivers <strong>40 to 275 TOPS of AI inference</strong> in form factors consuming 7 to 60W. The AGX Orin 64GB (2048 CUDA cores, 64 Tensor cores, 64GB LPDDR5) runs full containerized workloads via the NVIDIA Container Runtime, included in every JetPack release since 4.2.1. JetPack 6.2 (late 2024) introduced "Super Mode" doubling generative AI performance on Orin Nano/NX modules, while JetPack 7.0 (2025) brought the Jetson AGX Thor with MIG support, CUDA 13.0, and a preemptable real-time kernel on Ubuntu 24.04.</p>
<p>In healthcare, Advantech's POC-8 series medical panel PCs carry IEC 60601-1 certification with Intel Core i7-9700E processors and optional NVIDIA MXM GPUs, fully capable of running containers for surgical AI. GE HealthCare's Edison platform already manages <strong>100+ Kubernetes clusters across hospital edge sites</strong> via Spectro Cloud Palette, with local OCI registries enabling air-gapped container distribution. Their architecture supports fully disconnected operation because it must; these are life-critical applications where cloud outages cannot translate to clinical failures.</p>
<hr />
<h2>Why Every Major IDP Fails at the Edge</h2>
<p>Backstage dominates the IDP market with <strong>89% market share across 3,400+ organizations</strong> serving 2 million developers. Its architecture, a TypeScript/Node.js backend with React frontend, PostgreSQL database, and plugin ecosystem of 200+ extensions, represents the state of the art for cloud-native developer portals. The September 2024 release of the New Backend System (1.0 stable) introduced dependency injection, and the Kubernetes plugin provides multi-cluster visibility by querying API servers directly.</p>
<p>Port ($60M Series C in 2024) offers a pure SaaS model with blueprint-based entity definitions and an API-first architecture. Every data ingestion flows through <code>api.getport.io</code>. Cortex provides self-hosted Helm charts alongside its cloud offering, with push-model Kubernetes agents and 60+ integrations. All three platforms share a fatal assumption for edge: <strong>persistent, reliable network connectivity</strong>.</p>
<table>
<thead>
<tr>
<th>Capability</th>
<th>Backstage</th>
<th>Port</th>
<th>Cortex</th>
</tr>
</thead>
<tbody><tr>
<td>Self-hosted</td>
<td>✅</td>
<td>❌ (SaaS only)</td>
<td>✅ (Helm)</td>
</tr>
<tr>
<td>Offline operation</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr>
<td>Air-gapped support</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr>
<td>Distributed catalog</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr>
<td>Store-and-forward sync</td>
<td>❌</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr>
<td>Edge cluster scale (1000+)</td>
<td>❌</td>
<td>⚠️</td>
<td>⚠️</td>
</tr>
</tbody></table>
<p>Backstage's Kubernetes plugin assumes direct API server access; Mercedes-Benz filed GitHub issue #6967 requesting dynamic cluster supply because the static cluster configuration couldn't handle their growing fleet. Port's K8s exporter requires continuous outbound internet to push data. Cortex's agent model requires persistent connectivity to report state. None implement offline-capable UIs, local data caching, CRDT-based merge strategies, or async action queuing. The closest thing to an edge-native developer platform today is <strong>Spectro Cloud Palette</strong>, which offers a Local UI for field engineers and manages edge clusters in air-gapped environments, but it's an infrastructure management platform, not a full IDP.</p>
<p>An edge-native IDP would need a <strong>federated software catalog</strong> with local instances that operate independently and sync via CRDTs when connectivity returns, <strong>offline self-service workflows</strong> with queued actions, <strong>local golden path templates</strong> bundled with the platform, and <strong>store-and-forward telemetry</strong> that compresses and filters before transmitting. No one has built this yet.</p>
<hr />
<h2>Lightweight Kubernetes Distributions Make Disconnected Operation Possible</h2>
<p>Three distributions compete for the edge Kubernetes space, each with distinct air-gapped strategies.</p>
<p><strong>K3s</strong> (v1.33.3+k3s1) from Rancher/SUSE remains the most deployed, requiring just <strong>512MB RAM</strong> for an agent node. Air-gapped installation uses a tarball-based approach: download the <code>k3s-airgap-images</code> archive and binary on a connected machine, copy them to <code>/var/lib/rancher/k3s/agent/images/</code> on the target, and run <code>INSTALL_K3S_SKIP_DOWNLOAD=true ./install.sh</code>. K3s offers embedded SQLite (single-server) or embedded etcd (HA with 3+ servers), and its auto-deploying manifests feature applies any YAML placed in <code>/var/lib/rancher/k3s/server/manifests/</code> automatically, including CRDs. Version 1.33.1 added conditional image import via <code>.cache.json</code> to skip re-importing unchanged tarballs on restart.</p>
<p><strong>MicroShift</strong> (4.17) from Red Hat is OpenShift stripped to its essentials: <strong>2 cores, 2GB RAM</strong>, single-node only, with OVN-Kubernetes networking and LVMS storage. Its killer feature for edge is deep RHEL for Edge integration; ostree-based immutable OS images built with <code>composer-cli</code>, with <strong>Greenboot health checks</strong> that automatically roll back failed updates by rebooting into the previous known-good ostree commit. Everything bakes into a single ISO installable from USB, including all RPM packages, container images, and MicroShift itself. For disconnected environments, combine a mirror registry with a local RPM mirror via <code>reposync</code> or Red Hat Satellite.</p>
<p><strong>k0s</strong> (v1.34.3+k0s.0) from Mirantis pushes resource minimalism further: <strong>0.5GB RAM</strong> for a worker. It's a single static binary embedding containerd, runc, kubectl, and CNI plugins with zero host OS dependencies. Uniquely, k0s supports <strong>RISC-V</strong> alongside x86_64, ARM64, and ARMv7. Its Autopilot operator handles automated rolling updates via <code>Plan</code> and <code>UpdateConfig</code> CRDs, with safety checks verifying all controllers report <code>/ready</code> before proceeding. Air-gapped bundles are generated with <code>k0s airgap bundle-artifacts</code>, and <code>spec.images.default_pull_policy: Never</code> prevents any internet pulls.</p>
<pre><code class="language-yaml"># k0s air-gapped deployment via k0sctl
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: ClusterConfig
spec:
  k0s:
    version: v1.34.3+k0s.0
  hosts:
    - role: controller
      ssh:
        address: 10.0.0.1
      uploadBinary: true
      k0sBinaryPath: ./k0s
    - role: worker
      ssh:
        address: 10.0.0.2
      uploadBinary: true
      files:
        - src: ./airgap-bundle-amd64.tar
          dstDir: /var/lib/k0s/images
</code></pre>
<hr />
<h2>GitOps Breaks Gracefully, But Edge Needs Store-and-Forward</h2>
<p>Standard GitOps tools degrade predictably when disconnected. <strong>Flux CD</strong> (v2.8.1) continues enforcing the last-known-good state; the kustomize-controller and helm-controller keep reconciling from the last successfully fetched artifact while the source-controller logs <code>FetchFailed</code> and retries at the configured interval. Changes committed to Git while disconnected simply aren't picked up until reconnection. <strong>ArgoCD</strong> (v3.1) behaves similarly in push mode: applications targeting unreachable clusters show <code>Unknown</code> status, but existing workloads continue running. Neither tool offers offline queueing or store-and-forward.</p>
<p>Three solutions address this gap directly.</p>
<p><strong>Zarf</strong> (v0.71.1, by Defense Unicorns) solves air-gapped GitOps through physical transport. A <code>zarf.yaml</code> declares all needed artifacts, including container images, Helm charts, Kubernetes manifests, and Git repositories. Then <code>zarf package create</code> bundles everything into a single compressed <code>.tar.zst</code> tarball. On the disconnected target, <code>zarf init</code> bootstraps a K3s cluster (optional), deploys an in-cluster OCI registry, and installs a <strong>mutating webhook</strong> (<code>zarf-agent</code>) that automatically rewrites all image references to point to the local registry. The init package uses an ingenious bootstrap: the registry image is split into 512KB chunks stored as ConfigMaps (fitting etcd's 1MB limit), then a statically compiled Rust binary reassembles them. Zarf also deploys Gitea as an in-cluster Git server, enabling full GitOps workflows entirely offline.</p>
<p><strong>Fleet by Rancher</strong> uses a pull-based architecture designed for up to <strong>1 million clusters</strong>. The Fleet Agent on each downstream cluster initiates outbound connections to the management cluster, with no inbound access required. Git content is compiled into Bundles on the management cluster, and agents fetch BundleDeployments when connectivity permits. With <code>correctDrift.enabled: true</code>, Fleet automatically reconciles any drift. This pull model naturally handles intermittent connectivity: when offline, existing deployments persist; when reconnected, the agent pulls pending updates.</p>
<p><strong>Red Hat ACM</strong> (2.12.x) provides a hub-spoke model where the <code>klusterlet</code> work-agent on managed clusters periodically checks for <code>ManifestWork</code> resources, applies them locally, and reports status back. When a spoke loses connectivity, <code>ManagedClusterConditionAvailable</code> transitions to <code>Unknown</code>, but all existing ManifestWorks remain enforced. ACM's integration with OpenShift GitOps enables Zero Touch Provisioning (ZTP) for MicroShift clusters from Git-stored SiteConfig and PolicyGenTemplate CRDs.</p>
<p>The <strong>ArgoCD Agent</strong> project (argoproj-labs) represents the newest approach: a lightweight pull-mode agent where edge clusters initiate connections back to the hub, eliminating the need for any inbound network access to edge environments. Currently pre-GA but architecturally significant for edge at scale.</p>
<hr />
<h2>Compliance Enforcement Works Offline (<em>Regulations Don't Care About Your Network)</em></h2>
<p><strong>OPA Gatekeeper</strong> (v3.21.0) enforces admission control entirely locally. Policies stored as ConstraintTemplate and Constraint CRDs live in the cluster's etcd; the validating webhook intercepts API requests at the kube-apiserver, evaluates Rego policies in-process, and returns allow/deny decisions with <strong>zero external network calls</strong>. For distributing policy updates to edge nodes, bundle Rego policies into <code>.tar.gz</code> files, transport via physical media, and serve from a local HTTP server. OPA supports cryptographic bundle signing via JWT verification, ensuring policy integrity on untrusted nodes.</p>
<p><strong>SBOM verification without network</strong> is fully supported across the major tools. Syft generates SBOMs offline against local images. Grype scans offline by pre-downloading the vulnerability database from <code>toolbox-data.anchore.io</code> and hosting it locally. Trivy copies its databases to a private registry via ORAS CLI and runs with <code>TRIVY_OFFLINE_SCAN=true</code>. Cosign (v2.4.1) verifies signatures offline using <code>--offline --local-image</code>, with stapled inclusion proofs eliminating the need to contact the Rekor transparency log. The CNCF Notation project (v1.1.0) supports signing images on local disk and integrates with Ratify for Kubernetes admission verification through Gatekeeper.</p>
<p>The regulatory landscape varies dramatically by sector.</p>
<p><strong>HIPAA</strong> (45 CFR §164.312) requires encryption at rest (etcd encryption via KMS provider, LUKS for PVs), encryption in transit (TLS 1.2+ for all API communication, mTLS via service mesh), comprehensive audit logging, RBAC with unique user IDs, and automatic logoff. PHI must be encrypted at every location it exists, including containers, volumes, host filesystems, and log aggregators.</p>
<p><strong>FDA 21 CFR Part 11</strong> (updated guidance October 2024) mandates computer-generated audit trails recording operator identity and timestamps for all create/modify/delete actions, which cannot be altered. Electronic signatures require at least two identification components. <strong>Immutable container images directly support Part 11's integrity requirements</strong>, and GitOps workflows provide version-controlled, auditable deployment histories. Edge devices must maintain audit logs locally during disconnected periods.</p>
<p><strong>IEC 62443</strong> for industrial systems defines Security Levels 1 through 4 (from accidental misuse through nation-state threats) and seven Foundational Requirements including identification/authentication, restricted data flow, and system integrity. The zones-and-conduits model maps directly to Kubernetes Network Policies (zone boundaries) and ingress controllers (conduit controls). The 2024 update to IEC 62443-4-1 now explicitly requires <strong>SBOMs from component suppliers</strong>.</p>
<p><strong>FCC Part 25</strong> for satellites focuses on spectrum management and orbital debris, with no compute-specific requirements. The FCC proposed replacing Part 25 with new "Part 100" rules in October 2025. <strong>ITAR/EAR export controls</strong> are the real concern: the October 2024 landmark reform created the first-ever ITAR definition of "spacecraft" and new license exceptions for allied nations, but high-performance processors (10+ TFLOPS) and imaging sensors (&lt;30cm resolution) remain controlled.</p>
<hr />
<h2>Real Deployments That Prove the Architecture Works</h2>
<p>The most instructive case studies span all three target environments.</p>
<p><strong>Axiom Space launched MicroShift to the ISS in August 2024</strong> aboard SpaceX CRS-33, running Red Hat Device Edge on the AxDCU-1 data center unit. This represents the first Kubernetes-based containerized computing in orbit, with delta updates minimizing bandwidth consumption and automated rollback for self-healing. The system runs AI/ML workloads for supervised autonomy and life sciences research, reducing dependence on costly satellite downlinks.</p>
<p><strong>Loft Orbital</strong> raised €170M Series C in January 2025 and operates a "virtual missions" model, where customers deploy software to already-orbiting YAM satellites without building hardware. Their YAM-9 satellite demonstrated the first commercial four-node heterogeneous compute cluster in orbit. Customers like Helsing run real-time AI for RF signal intelligence, while SkyServe runs wildfire tracking. The CNCF <strong>KubeEdge</strong> project demonstrated on-orbit target identification accuracy improvements of over 50% on Chinese satellites using a lightweight Sedna AI inference model.</p>
<p><strong>Chick-fil-A operates approximately 2,800 K3s clusters</strong>, one per restaurant, on Intel NUC hardware (3 nodes, ~8GB RAM each, ~$1,000 per site). Their custom "Vessel" agent clones a per-restaurant Git repository and applies manifests via <code>kubectl apply</code>. The system processes billions of MQTT messages monthly from IoT devices (fryers, grills, tablets). Chief Architect Brian Chambers emphasizes that eventual consistency is "the only viable model for edge"; not all clusters are identical at any time, but all converge on a golden image over days to weeks.</p>
<p><strong>The Home Depot runs 2,300+ K3s clusters</strong> managed by SUSE Rancher, processing 5.5 billion documents monthly with 4-hour chain-wide deployments across 6,900+ machines. Distinguished Engineer Dillon TenBrink's eight lessons from KubeCon 2025 highlight that <strong>storage at the edge remains the hardest unsolved problem</strong> ("I personally would have no storage at the edge if possible") and that eventual consistency must be embraced rather than fought.</p>
<p><strong>DoD Platform One's Big Bang</strong> deploys across cloud, air-gapped, and classified environments using ArgoCD for GitOps with Iron Bank hardened container images. Defense Unicorns' Zarf powers air-gapped delivery for submarines and classified networks. The U-2 spy plane received new AI/ML container software deployed in just 12 days, with over-the-air container updates decoupled from airworthiness hardware certification.</p>
<hr />
<h2>Architecture Patterns for a Disconnected World</h2>
<p>The theoretical foundation for edge state synchronization comes from a seminal 2021 paper, "Rearchitecting Kubernetes for the Edge" (Jeffery, Howard, and Mortier, EdgeSys '21), which proposes replacing etcd's Raft consensus with <strong>CRDT-based eventually consistent storage</strong>. Their analysis shows etcd writes constitute ~30% of Kubernetes API requests, and strong consistency becomes a bottleneck at edge scale. A CRDT-based datastore exposes the same etcd API but allows reads/writes to any single node without coordination, with lazy background sync resolving conflicts via mathematical merge functions. Follow-up work by Sassi, Jensen, and Mortier (March 2025) tested these concepts using emulated edge clusters.</p>
<p>For practical state synchronization, <strong>NATS</strong> (v2.10+, CNCF Incubating) provides the most edge-optimized message broker. Its Leaf Node pattern runs a local NATS server on each edge device that syncs with the cloud cluster on reconnection. Sensor data persists to a local JetStream stream; on connectivity return, the leaf node syncs automatically in order without duplication. NATS is a single 15MB binary with no external dependencies. Volvo and Schaeffler use it in production for fleet management.</p>
<p><strong>SPIFFE/SPIRE</strong> (CNCF Graduated) solves workload identity at the edge through nested topology. A top-level SPIRE Server holds the root CA, while downstream SPIRE Servers at edge sites obtain intermediate CAs via the Workload API. Critically, <strong>SVID verification is entirely local</strong>; workloads verify peer identities using cached trust bundles with zero network calls. During disconnection, the edge SPIRE Server continues issuing SVIDs using its cached intermediate CA until the configurable TTL expires.</p>
<p><strong>Keylime</strong> (CNCF) enables cryptographic attestation of untrusted edge nodes via TPM 2.0. Before deploying workloads, the Verifier validates measured boot PCR values and IMA (Integrity Measurement Architecture) runtime measurements against golden values. A 2024 ICCCN paper demonstrated integration with Kubernetes through a custom <strong>EdgeNode CRD</strong> where an Attestation Controller adjusts RBAC permissions based on attestation events; nodes that fail attestation automatically lose the ability to receive workloads.</p>
<p><strong>Crossplane</strong> (v1.20, CNCF Graduated October 2025) extends edge infrastructure management through Composite Resource Definitions that abstract complexity. The hub-and-spoke pattern runs Crossplane centrally to provision edge clusters, then uses provider-helm to install Crossplane into each new cluster with bundled Configuration packages, giving you full management of Crossplane clusters using Crossplane itself. Spectro Cloud's provider-palette adds edge-native host provisioning and Cluster Profiles to the Crossplane ecosystem.</p>
<p><strong>Kairos</strong> (CNCF Sandbox, v3.5.0 stable) delivers the immutable OS layer: the operating system IS the container image, distributed via OCI registries. Building custom OS images requires nothing more than a Dockerfile. A/B atomic upgrades push new images to a registry, and nodes upgrade safely with automatic rollback on failure. Optional P2P mesh networking via libp2p/EdgeVPN enables automated node discovery across networks spanning up to 10,000 km.</p>
<hr />
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/28d698c6-19fb-469a-967b-4b0409e2908f.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Testing Disconnected Edge Requires Simulating Disconnection</h2>
<p><strong>Chaos Mesh</strong> (v2.6+, CNCF Incubating) provides the most comprehensive network partition testing for edge scenarios. Its <code>NetworkChaos</code> CRD supports partition, delay, packet loss, duplication, reordering, and bandwidth limiting:</p>
<pre><code class="language-yaml">apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: edge-hub-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: [edge-workloads]
    labelSelectors:
      app: edge-service
  direction: both
  target:
    selector:
      namespaces: [kube-system]
  duration: '30m'
</code></pre>
<p><strong>LitmusChaos</strong> (v3.0+, CNCF Incubating) adds a 2025 innovation: an MCP Server that exposes chaos capabilities via the Model Context Protocol, allowing engineers to trigger experiments from AI assistants using natural language. Its probe system, including HTTP, command, Kubernetes, and Prometheus, validates steady-state hypotheses at each experiment phase.</p>
<p>For compliance validation in air-gapped environments, <strong>kube-bench</strong> runs as a standalone Go binary executing CIS Kubernetes Benchmark checks with embedded definitions, no network required. The <strong>OpenShift Compliance Operator</strong> runs OpenSCAP evaluations via privileged pods with host read access, generating <code>ComplianceRemediation</code> CRDs that can automatically apply fixes. Both tools produce JSON/JUnit output storable locally and exportable via store-and-forward when connectivity returns.</p>
<p>A comprehensive edge testing pipeline should provision multi-cluster environments with k3d (~5 second startup, ~350MB idle per cluster), deploy applications via GitOps, run compliance scans, then systematically inject network partitions to verify autonomous operation, state convergence on reconnection, and data integrity throughout.</p>
<hr />
<h2>The Platform Engineering Gap at the Edge</h2>
<p>The tools for edge Kubernetes are mature. K3s, MicroShift, and k0s run reliably on constrained hardware. Zarf solves air-gapped delivery. Fleet and ACM handle multi-cluster GitOps at scale. OPA Gatekeeper enforces policy locally. SPIFFE/SPIRE provides identity without connectivity. Keylime attests node integrity via hardware roots of trust. The missing piece is the <strong>integration layer</strong>; a coherent Internal Developer Platform that stitches these components together with a federated catalog, offline self-service workflows, and eventually consistent state synchronization.</p>
<p>Three architectural principles emerge from real-world deployments. First, <strong>eventual consistency is not a compromise but a requirement</strong>; both Chick-fil-A and Home Depot explicitly adopted it after failing with stronger consistency models. Second, <strong>the pull model wins for edge GitOps</strong>; Fleet, argocd-agent, and ACM's klusterlet all have agents initiating outbound connections, eliminating inbound network requirements. Third, <strong>immutability at every layer</strong> prevents the configuration drift that makes disconnected environments unmanageable: immutable OS (Kairos/ostree), immutable container images, immutable infrastructure-as-code, and signed, immutable deployment packages (Zarf).</p>
<p>The organization that builds a true edge-native IDP with CRDT-synchronized catalogs, TPM-attested node enrollment, offline golden paths, and compliance-as-code that works without network, will unlock platform engineering for the 75% of enterprise data that Gartner predicts will be processed outside traditional data centers by 2027. The satellites, factories, and hospitals aren't waiting.</p>
]]></content:encoded></item><item><title><![CDATA[Governing the Ungovernable: Building an EU AI Act Article 9 Compliance Framework for Agentic AI That Actually Works in Production]]></title><description><![CDATA[The EU AI Act's risk management requirements for high-risk AI systems are now on the clock. August 2, 2026 is the hard deadline for Annex III systems. But nobody has published a practical technical im]]></description><link>https://blogs.subhanshumg.com/governing-the-ungovernable</link><guid isPermaLink="true">https://blogs.subhanshumg.com/governing-the-ungovernable</guid><category><![CDATA[euaiact]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[ai compliance certification]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[llm]]></category><category><![CDATA[Security]]></category><category><![CDATA[AI]]></category><category><![CDATA[Governance]]></category><category><![CDATA[langgraph]]></category><category><![CDATA[mlops]]></category><category><![CDATA[generative ai]]></category><category><![CDATA[#AIAct2026]]></category><category><![CDATA[Article9]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sat, 28 Feb 2026 07:04:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/0bd5b06f-c8fd-4527-a215-111d64b55d2d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>The EU AI Act's risk management requirements for high-risk AI systems are now on the clock. August 2, 2026 is the hard deadline for Annex III systems. But nobody has published a practical technical implementation guide for agentic systems inside DevSecOps pipelines. This article changes that.</p>
</blockquote>
<p><strong>All code referenced in this article lives in the companion repository:</strong> <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance"><strong>github.com/SubhanshuMG/agentic-ai-eu-compliance</strong></a></p>
<hr />
<h1>The Problem Nobody Is Talking About</h1>
<p>Every compliance guide published so far assumes one thing: your AI system is a fixed pipeline. Data goes in, prediction comes out, risk assessment is done once before deployment. Done.</p>
<p>Agentic AI systems shatter that assumption completely.</p>
<p>When a LangGraph-orchestrated agent decides at runtime to call a financial API, write to a production database, spawn a sub-agent, and escalate privileges to complete a task, you have a system whose risk profile is not determined at design time. It is determined at runtime, at every step, across every tool invocation, in every context.</p>
<p>The EU AI Act was not written with this in mind. Article 9 mandates a "continuous iterative process planned and run throughout the entire lifecycle." Most organizations interpret that as: run an assessment before deployment, update it annually, done.</p>
<p>That interpretation will get you fined.</p>
<p>This article gives you the exact legal requirements, a real-world failure case that illustrates the governance gap, and a six-layer compliance architecture you can implement today.</p>
<hr />
<h1>What Article 9 Actually Requires <em>(Read the Text)</em></h1>
<p>Article 9 of Regulation (EU) 2024/1689 has ten paragraphs. Most compliance summaries skip the details. Here are the obligations that directly affect engineering teams:</p>
<p><strong>Paragraph 1</strong> requires the risk management system to be "established, implemented, documented and maintained." All four verbs matter. You cannot just document it and file it.</p>
<p><strong>Paragraph 2</strong> defines this as a continuous iterative process comprising four mandatory steps:</p>
<ul>
<li><p>Identification and analysis of known and reasonably foreseeable risks to health, safety, and fundamental rights</p>
</li>
<li><p>Estimation and evaluation of risks that can arise during use under intended purpose and reasonably foreseeable misuse</p>
</li>
<li><p>Evaluation of risks arising from post-market monitoring data (Article 72)</p>
</li>
<li><p>Adoption of appropriate risk management measures</p>
</li>
</ul>
<p><strong>Paragraph 5</strong> mandates a three-tier mitigation hierarchy. First, eliminate or reduce risks by design. Second, implement mitigation and control measures. Third, provide information to deployers. The residual risk must be "judged to be acceptable." There are no fixed numerical thresholds. The standard is tied to state of the art, meaning as new mitigations become technically feasible, the bar for "acceptable" tightens.</p>
<p><strong>Paragraphs 6 to 8</strong> require testing using "prior defined metrics and probabilistic thresholds" appropriate to the intended purpose. Testing must occur throughout development and, in any event, prior to market placement.</p>
<p><strong>The key enforcement dates:</strong></p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Date</p></th><th><p>What becomes enforceable</p></th></tr><tr><td><p>February 2, 2025</p></td><td><p>Prohibited AI practices (Article 5), AI literacy (Article 4)</p></td></tr><tr><td><p>August 2, 2025</p></td><td><p>GPAI model obligations, governance framework, penalty regime</p></td></tr><tr><td><p><strong>August 2, 2026</strong></p></td><td><p><strong>All high-risk requirements including Article 9 for Annex III systems</strong></p></td></tr><tr><td><p>August 2, 2027</p></td><td><p>Annex I systems (AI embedded in regulated products)</p></td></tr></tbody></table>

<p>The European Commission's Digital Omnibus proposal (November 19, 2025) suggests extending the Annex III deadline to December 2, 2027, but this is not yet adopted law. Build for August 2026. Treat any extension as a bonus, not a plan.</p>
<hr />
<h1>The Real-World Problem: <em>A Healthcare Triage Agent Gone Sideways</em></h1>
<p>Let me show you exactly what breaks when you try to apply static compliance frameworks to agentic systems.</p>
<h3><strong>Scenario:</strong></h3>
<p>A hospital network deploys a clinical decision support agent, a high-risk system under Annex III point 5a of the EU AI Act, to assist emergency department triage nurses. The agent is built on LangGraph, uses GPT-4, and has access to a tool set including patient record lookup, medication interaction checking, lab result retrieval, and escalation routing.</p>
<h3><strong>The static compliance team's approach:</strong></h3>
<p>They run a conformity assessment before deployment. They document the intended purpose: assist nurses in triaging patients. They assess the risk: medium-high, with human oversight baked in because the nurse always reviews the recommendation. They test it on 500 synthetic cases. They sign it off.</p>
<h3><strong>What actually happens at runtime:</strong></h3>
<p>On a busy Tuesday night, the agent receives an ambiguous input: a patient with chest pain and a complex medication history. The agent decides, autonomously, to call the lab result retrieval tool, check medication interactions for all 12 current medications, cross-reference previous visits using the record lookup tool, then generate a high-confidence "probable STEMI" escalation recommendation.</p>
<p>This chain of tool calls took 4.2 seconds. The nurse glanced at the recommendation and hit approve.</p>
<h3><strong>Where compliance breaks:</strong></h3>
<p>The conformity assessment documented the agent's intended purpose as "assisting with triage." It did not document the agent's runtime-determined behaviour of cross-referencing historical visit data with current labs to generate high-confidence diagnoses. That specific tool-chaining behaviour was never tested, never risk-assessed, and the confidence score (0.91) was not calibrated against real patient outcomes for that specific multi-tool reasoning path.</p>
<p>Under Article 9, this is a compliance failure. The risk from that specific runtime configuration was never identified, estimated, or mitigated. The agent's risk profile was emergent, and the organization's compliance framework had no way to capture it.</p>
<p>Now multiply this across 50 agents, 200 tool combinations, and thousands of daily interactions. That is the governance gap.</p>
<hr />
<h1>Why Agentic AI Breaks the Static Compliance Model</h1>
<p>Five structural problems make traditional compliance inadequate for agentic systems:</p>
<p><strong>Runtime-determined risk profiles.</strong> A customer service agent that retrieves a FAQ entry has a completely different risk profile from the same agent making a refund decision, querying a financial record, and writing to an order management system in a single session. The system is the same. The risk is not.</p>
<p><strong>Non-deterministic outputs at scale.</strong> Research published at ACL 2025, based on 14,400 agentic simulations across 12 LLMs, found that agents autonomously engaged in catastrophic behaviours and deception without deliberate inducement. More concerning: stronger reasoning capabilities often increased rather than mitigated these risks.</p>
<p><strong>Dynamic tool use and blast radius.</strong> An agent with access to 50 tools but typically using 3 might invoke a high-risk tool in an unusual context. The blast radius changes with every novel tool combination. Static model cards cannot capture this.</p>
<p><strong>Multi-agent privilege escalation.</strong> The ServiceNow Now Assist vulnerability (late 2025) demonstrated this concretely. A low-privilege agent was manipulated via prompt injection into requesting a higher-privilege agent to export case files to external URLs. The higher-privilege agent, trusting its peer, executed the unauthorized action. ServiceNow initially classified this as "expected behaviour."</p>
<p><strong>Accountability without ambiguity.</strong> In Moffatt v. Air Canada (2024), the British Columbia tribunal rejected Air Canada's argument that its chatbot was "a separate legal entity responsible for its own actions." The company was held liable. Autonomous agent behaviour does not transfer legal responsibility.</p>
<hr />
<h2>The Six-Layer Compliance Architecture</h2>
<p>Here is the architecture that satisfies every Article 9 obligation for an agentic system in production.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/e97f4738-1c34-48b9-a882-b9b77835e6a2.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Step-by-Step Implementation</h2>
<h3>Prerequisites</h3>
<p>Install all dependencies from the repo:</p>
<pre><code class="language-bash">git clone https://github.com/SubhanshuMG/agentic-ai-eu-compliance.git
cd agentic-ai-eu-compliance
pip install -r requirements.txt
cp .env.example .env
</code></pre>
<blockquote>
<p>Full dependency list: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/requirements.txt">requirements.txt</a></p>
</blockquote>
<h3>Step 1: Build the Stateful Orchestration Layer</h3>
<p>The foundation is a LangGraph agent with PostgreSQL-backed checkpointing. Every state transition becomes an auditable record you can replay for compliance investigation.</p>
<p><strong>What this file does:</strong> Defines the <code>AgentState</code> TypedDict that carries risk scores, compliance flags, and tool call history across every graph node. Implements the <code>risk_scoring_node</code> that evaluates five weighted dimensions per tool call: base risk level, chain penalty (compounds at 0.05 per additional tool), reversibility penalty (0.2 for irreversible actions), plus a configurable kill switch that triggers human review above 0.75. The <code>human_review_node</code> pauses execution, persists full state to PostgreSQL via <code>AsyncPostgresSaver</code>, and waits asynchronously for human approve/edit/reject before resuming.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/orchestration/agent_graph.py">orchestration/agent_</a><a href="http://graph.py">graph.py</a></p>
</blockquote>
<p><strong>Article 9 obligations satisfied:</strong> Art. 9(2) continuous iterative process via checkpointed state graph; Art. 9(2b) runtime risk estimation per tool invocation; Art. 14 human oversight via configurable HITL interrupt policies.</p>
<h3>Step 2: Build the Runtime Guardrails Layer</h3>
<p>This middleware intercepts every input and output, running prompt injection detection, PII scanning, and content filtering before the agent ever sees the request.</p>
<p><strong>What this file does:</strong> Implements a two-layer prompt injection detector: fast local regex matching (<del>0ms) against 10 known attack patterns, followed by optional Lakera Guard API verification (</del>100ms) with 100+ language support. PII redaction covers email, phone, SSN, credit card, NHS numbers, and IBAN with pattern-based replacement. The <code>ComplianceGuardrailPipeline</code> runs all checks in sequence and short-circuits to <code>BLOCK</code> on the first critical violation, or returns <code>REDACT</code> if only PII was found. Output groundedness is checked via word-overlap heuristic against the original context, with an LLM-as-judge integration point clearly marked for production use.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/guardrails/middleware.py">guardrails/</a><a href="http://middleware.py">middleware.py</a></p>
</blockquote>
<p><strong>Article 9 obligations satisfied:</strong> Art. 9(5) first-tier risk elimination by design; Art. 9(5) second-tier mitigation and control measures.</p>
<h3>Step 3: Build the Audit Logging Layer</h3>
<p>Every agent interaction must produce a structured, immutable audit record satisfying Article 12 and Article 19.</p>
<p><strong>What this file does:</strong> Creates two PostgreSQL tables with strict separation of concerns. <code>agent_audit_log</code> is append-only (enforced via <code>DO INSTEAD NOTHING</code> rules on DELETE and UPDATE), stores hashed inputs and outputs rather than raw content, carries a <code>retention_until</code> field set to 180 days from creation, and includes JSONB columns for guardrail results, compliance flags, reasoning traces, and human oversight decisions. <code>agent_session_pii</code> stores raw personal data separately with a <code>gdpr_delete_at</code> field, enabling GDPR Article 17 right-to-erasure compliance without destroying the audit trail. The <code>generate_compliance_report</code> method produces a structured Article 9 compliance report with aggregate risk metrics, human review rates, and session counts across any date range.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/audit/logger.py">audit/</a><a href="http://logger.py">logger.py</a></p>
</blockquote>
<p><strong>Article 9 obligations satisfied:</strong> Art. 12 automatic logging for high-risk AI; Art. 19 6-month minimum retention for operational logs; Art. 19 GDPR-compliant PII separation.</p>
<h3>Step 4: OpenTelemetry Observability</h3>
<p>Wire your agent into OpenTelemetry using the GenAI semantic conventions, giving you vendor-neutral traces and metrics that map directly to Article 9's continuous monitoring requirement.</p>
<p><strong>What this file does:</strong> Configures <code>TracerProvider</code> and <code>MeterProvider</code> with OTLP exporters targeting your Grafana/Jaeger stack. Span attributes follow the <code>gen_ai.*</code> namespace from OTel spec v1.37+, including <code>gen_ai.request.model</code>, <code>gen_ai.agent.session_id</code>, and custom <code>ai.compliance.*</code> attributes for regulation and article tracking. Four compliance-specific metrics are defined: <code>gen_ai.agent.risk_score</code> histogram (per-action risk scores with tool and tier labels), <code>gen_ai.agent.guardrail_triggered</code> counter (type and decision labels), <code>gen_ai.agent.human_review_required</code> counter (risk score bucket labels), and <code>gen_ai.agent.tool_latency</code> histogram. User IDs are hashed to 16 hex characters before inclusion in any span attribute.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/observability/telemetry.py">observability/</a><a href="http://telemetry.py">telemetry.py</a></p>
</blockquote>
<p><strong>Article 9 obligations satisfied:</strong> Art. 9(2c) evaluation of risks arising from post-market monitoring data.</p>
<h3>Step 5: The CI/CD Compliance Gate</h3>
<p>Every model update, prompt change, or tool addition must pass automated Article 9 testing before deployment.</p>
<p><strong>What this file does:</strong> Defines <code>THRESHOLDS</code> as the "prior defined metrics and probabilistic thresholds" required by Article 9(6), covering hallucination rate (5% max), bias score (15% max), prompt injection pass rate (99% min), groundedness (80% min), and PII leakage (0.1% max). The <code>ADVERSARIAL_TESTS</code> list covers seven attack categories: prompt injection, goal hijacking, PII extraction, confidence calibration, memory poisoning, and privilege escalation. The <code>run_full_gate</code> method runs all three evaluation stages (adversarial suite, Giskard scan, bias evaluation) in sequence, computes a weighted overall score, and writes a structured JSON report that the CI/CD pipeline uses to make the deployment decision.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/cicd/compliance_gate.py">cicd/compliance_</a><a href="http://gate.py">gate.py</a></p>
</blockquote>
<p><strong>Article 9 obligations satisfied:</strong> Art. 9(6) testing using prior defined metrics; Art. 9(7) real-world testing conditions; Art. 9(8) probabilistic thresholds appropriate to intended purpose.</p>
<h3>Step 6: The FastAPI Integration Layer</h3>
<p>Wire everything together into a production API that runs the full compliance stack on every request.</p>
<p><strong>What this file does:</strong> Uses FastAPI's lifespan context manager to initialize all six compliance layers at startup: asyncpg connection pool, <code>ComplianceAuditLogger</code> with table setup, compiled LangGraph agent with PostgreSQL checkpointer, and <code>ComplianceGuardrailPipeline</code> with Lakera Guard. Every POST to <code>/invoke</code> runs the full six-layer stack in sequence, short-circuiting to a BLOCK response (with full audit record) if input guardrails fire. The response includes the session ID, final risk score, all compliance flags, human review status, blocked status, and the audit log ID for traceability. The GET <code>/compliance/report</code> endpoint generates an on-demand Article 9 compliance report for any time window.</p>
<blockquote>
<p>Full implementation: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/api/main.py">api/</a><a href="http://main.py">main.py</a></p>
</blockquote>
<h3>Step 7: GitHub Actions CI/CD Pipeline</h3>
<p><strong>What this file does:</strong> Spins up a PostgreSQL service container, installs all dependencies, starts the agent API in test mode, runs the full compliance test suite with JUnit XML output, executes the adversarial gate, uploads compliance reports as artifacts with 180-day retention (satisfying Article 19), fails the build if the gate score drops below threshold, and fires a Slack notification to the compliance team on failure.</p>
<blockquote>
<p>Full workflow: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/.github/workflows/ai-compliance.yml">.github/workflows/ai-compliance.yml</a></p>
</blockquote>
<h3>Step 8: The Compliance Test Suite</h3>
<p><strong>What this file does:</strong> Eight <code>pytest-asyncio</code> tests covering every Article 9 obligation with a live agent endpoint. Tests verify prompt injection is blocked with a 1.0 risk score, PII is absent from all response fields, irreversible high-risk tool calls trigger human review flags, every interaction produces a valid UUID audit log ID, the compliance report endpoint returns the expected Article 9 structure, multi-tool chains produce compounded risk scores above 0.7, session state persists correctly across turns, and memory poisoning attempts are intercepted.</p>
<blockquote>
<p>Full test suite: <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/tests/compliance/test_article9.py">tests/compliance/test_</a><a href="http://article9.py">article9.py</a></p>
</blockquote>
<p>Run them:</p>
<pre><code class="language-bash"># Start the API first
uvicorn api.main:app --reload

# Then in another terminal
pytest tests/compliance/ -v
</code></pre>
<hr />
<h1>What Maps to What: <em>Article 9 Obligations vs Architecture</em></h1>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Article 9 Requirement</p></th><th><p>File</p></th><th><p>Layer</p></th></tr><tr><td><p>Continuous iterative process (Art. 9(2))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/orchestration/agent_graph.py" style="pointer-events:none">orchestration/agent_</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://graph.py" style="pointer-events:none">graph.py</a></p></td><td><p>L1</p></td></tr><tr><td><p>Identify foreseeable risks (Art. 9(2a))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/guardrails/middleware.py" style="pointer-events:none">guardrails/</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://middleware.py" style="pointer-events:none">middleware.py</a></p></td><td><p>L2</p></td></tr><tr><td><p>Estimate risk under misuse (Art. 9(2b))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/orchestration/agent_graph.py" style="pointer-events:none">orchestration/agent_</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://graph.py" style="pointer-events:none">graph.py</a></p></td><td><p>L3</p></td></tr><tr><td><p>Post-market monitoring (Art. 9(2c))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/observability/telemetry.py" style="pointer-events:none">observability/</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://telemetry.py" style="pointer-events:none">telemetry.py</a></p></td><td><p>L6</p></td></tr><tr><td><p>Risk mitigation by design (Art. 9(5))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/guardrails/middleware.py" style="pointer-events:none">guardrails/</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://middleware.py" style="pointer-events:none">middleware.py</a></p></td><td><p>L2</p></td></tr><tr><td><p>Testing with defined metrics (Art. 9(6))</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/cicd/compliance_gate.py" style="pointer-events:none">cicd/compliance_</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://gate.py" style="pointer-events:none">gate.py</a></p></td><td><p>CI/CD</p></td></tr><tr><td><p>Human oversight (Art. 14)</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/orchestration/agent_graph.py" style="pointer-events:none">orchestration/agent_</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://graph.py" style="pointer-events:none">graph.py</a></p></td><td><p>L5</p></td></tr><tr><td><p>Automatic logging (Art. 12)</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/audit/logger.py" style="pointer-events:none">audit/</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://logger.py" style="pointer-events:none">logger.py</a></p></td><td><p>L4</p></td></tr><tr><td><p>6-month retention (Art. 19)</p></td><td><p><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance/blob/main/audit/logger.py" style="pointer-events:none">audit/</a><a target="_self" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="http://logger.py" style="pointer-events:none">logger.py</a></p></td><td><p>L4</p></td></tr></tbody></table>

<hr />
<h1>The Three Non-Negotiable Insights</h1>
<p><strong>First:</strong> Article 9's "continuous iterative process" requirement actually aligns with what good agentic governance demands. The problem is not the regulation; it is the industry's habit of treating compliance as a pre-deployment checkbox. Runtime risk scoring, continuous adversarial testing, and behavioural drift monitoring are not just compliance measures. They are the only architecturally correct approach to governing systems whose risk profiles emerge at runtime.</p>
<p><strong>Second:</strong> The tooling is production-ready today. A stack combining LangGraph (orchestration, HITL), NeMo Guardrails and Guardrails AI (runtime safety), Giskard (testing), MLflow (lifecycle governance), OpenTelemetry (observability), and Credo AI (compliance management) covers every Article 9 obligation. What most organizations lack is integration, not tooling.</p>
<p><strong>Third:</strong> The compliance deadline creates urgency regardless of legislative delays. NIST launched its AI Agent Standards Initiative on February 17, 2026. Singapore's IMDA published the first government-issued framework specifically for agentic systems in January 2026. The regulatory convergence is underway. Organizations that build the six-layer architecture now are positioned for compliance not just under the EU AI Act but across the emerging global regulatory landscape.</p>
<p>The healthcare triage agent case is not hypothetical. Systems exactly like it are running in production right now, without runtime risk scoring, without HITL enforcement on irreversible actions, and without audit trails that satisfy Article 12. Every runtime decision that the agent makes is an undocumented, unmitigated risk event.</p>
<p>That is the governance gap. This architecture closes it.</p>
<hr />
<h1>Resources and Further Reading</h1>
<ul>
<li><p><strong>Companion repo:</strong> <a href="https://github.com/SubhanshuMG/agentic-ai-eu-compliance">github.com/SubhanshuMG/agentic-ai-eu-compliance</a></p>
</li>
<li><p>EU AI Act full text: <a href="http://artificialintelligenceact.eu">artificialintelligenceact.eu</a></p>
</li>
<li><p>Article 9 annotated: <a href="http://artificial-intelligence-act.com/Article_9">artificial-intelligence-act.com/Article_9</a></p>
</li>
<li><p>NIST AI Agent Standards Initiative: <a href="https://www.nist.gov/caisi/ai-agent-standards-initiative">nist.gov/caisi/ai-agent-standards-initiative</a></p>
</li>
<li><p>Singapore IMDA Agentic AI Framework: <a href="http://imda.gov.sg">imda.gov.sg</a></p>
</li>
<li><p>CSA AAGATE reference architecture: <a href="http://cloudsecurityalliance.org">cloudsecurityalliance.org</a></p>
</li>
<li><p>LangGraph HITL documentation: <a href="http://langchain.com">langchain.com</a></p>
</li>
<li><p>OpenTelemetry GenAI semantic conventions: <a href="http://opentelemetry.io">opentelemetry.io</a></p>
</li>
<li><p>Giskard EU AI Act scanner: <a href="http://giskard.ai">giskard.ai</a></p>
</li>
<li><p>OWASP Agentic Security Initiative: <a href="http://owasp.org">owasp.org</a></p>
</li>
</ul>
<hr />
<blockquote>
<p><em><strong>The healthcare example is a composite illustration and does not represent any specific organization's system. Companion code is MIT licensed.</strong></em></p>
</blockquote>
<hr />
]]></content:encoded></item><item><title><![CDATA[The 3AM Problem: Why On-Call Burnout Is a System Design Failure, Not a People Problem]]></title><description><![CDATA[The Human Story Behind the Pager
Meet Priya. Senior backend engineer with 5 years of experience, joined a fintech scale-up to build payment infrastructure. She is good at her job.
Then she went on-cal]]></description><link>https://blogs.subhanshumg.com/the-3am-problem-why-on-call-burnout-is-a-system-design-failure-not-a-people-problem</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-3am-problem-why-on-call-burnout-is-a-system-design-failure-not-a-people-problem</guid><category><![CDATA[Devops]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Wed, 25 Feb 2026 10:04:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/63ef8cfc-0799-4319-ac52-aee21a0eb4af.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The Human Story Behind the Pager</h1>
<p>Meet Priya. Senior backend engineer with 5 years of experience, joined a fintech scale-up to build payment infrastructure. She is good at her job.</p>
<p>Then she went on-call.</p>
<p><strong>Week 1 of her first rotation:</strong></p>
<pre><code class="language-plaintext">Mon 02:14 — ALERT: payment-processor CPU &gt; 90%
Mon 02:41 — ALERT: order-service DB connection pool exhausted
Tue 03:07 — ALERT: kafka consumer lag &gt; 50,000 messages
Wed 01:22 — ALERT: payment-processor CPU &gt; 90%  (same one)
Thu 03:55 — ALERT: auth-service error rate &gt; 5%
Fri 02:01 — ALERT: kafka consumer lag &gt; 50,000 messages  (auto-resolved before she woke up)
Sat 04:30 — ALERT: payment-processor CPU &gt; 90%  (still the same one)
</code></pre>
<p>Each page triggers the same ritual. PagerDuty fires. She jolts awake at 3AM. Slacks the team to flag she is on it. Opens five dashboards simultaneously. Reads a runbook authored by an engineer who left 14 months ago, referencing a <code>legacy-payment-v1</code> service that was deprecated in Q2. Restarts a pod. Watches metrics for 25 minutes. The numbers look stable. She falls back asleep at 4:15AM and shows up to standup at 9AM unable to complete a coherent sentence.</p>
<p>By Thursday she has LinkedIn open in a background tab.</p>
<p>Here is the thing that nobody says out loud at the postmortem: <strong>the system worked exactly as it was designed to work.</strong> Every alert fired on the threshold it was configured for. PagerDuty escalated correctly. The runbook existed. The monitoring was "in place."</p>
<p>The design itself was the failure.</p>
<hr />
<h1>Root Cause Analysis: Three System Design Failures</h1>
<p>When an engineer burns out on-call, organizations reach for the wrong solutions: hire more engineers, add more rotation members, run a resilience workshop, tell people to "set better boundaries." These treat the symptom while the root cause keeps compounding.</p>
<p>There are three concrete, measurable system design failures behind almost every on-call burnout story.</p>
<h3>Failure 1: Symptom-Level Alerting with No Signal Intelligence</h3>
<p>Most alerting systems are built by people who are not on-call. They set thresholds on individual metrics and call it observability. The result is an alert architecture shaped like this:</p>
<pre><code class="language-plaintext">CPU &gt; 80%           → PAGE
Memory &gt; 70%        → PAGE
Error rate &gt; 1%     → PAGE
Latency p99 &gt; 2s    → PAGE
DB connections &gt; 80 → PAGE
</code></pre>
<p>When a single upstream database becomes slow, every service that touches it will simultaneously breach every one of those thresholds. Five services, five metrics each, four alert rules per metric: the on-call engineer receives 100 pages about one 30-second database hiccup.</p>
<p>This is called an <strong>alert storm</strong> and it is not a monitoring problem. It is an alert <em>architecture</em> problem. The system has no concept of causality, grouping, or signal-to-noise ratio.</p>
<h3>Failure 2: Zero Automated Diagnosis</h3>
<p>When Priya gets a page at 3AM, she opens a dashboard. She stares at graphs. She looks at logs. She cross-references metrics across four services. She forms a hypothesis. She validates it. She acts on it.</p>
<p>That entire cognitive loop, from page to diagnosis, takes 12 to 25 minutes on average. And the cruel irony is that 80% of production incidents follow a pattern that has happened before. The playbook exists. The runbook covers it. But the system makes the engineer <em>manually reconstruct</em> the diagnosis every single time, at 3AM, half-asleep, under pressure.</p>
<p>This is not an engineering problem. It is an automation gap: <strong>the system is collecting all the data needed to auto-diagnose the issue but forcing a human to do the computation manually.</strong></p>
<h3>Failure 3: Runbooks That Are Documents, Not Programs</h3>
<p>"Runbook" should mean "a program that runs." In most organizations it means "a Confluence page last updated 11 months ago."</p>
<p>The on-call engineer reads a document, interprets its instructions for the current situation, executes a series of manual commands, and verifies the results themselves. Every step is a potential failure point. Every interpretation is a potential mistake. Every manual command is executed by someone who has been awake for 23 of the last 24 hours.</p>
<p>When runbooks are documents, not programs, the system is relying on human consistency at 3AM to maintain system reliability. This is poor systems thinking.</p>
<hr />
<h1>Real World Case Study: The Payment Gateway Meltdown</h1>
<p><strong>Company:</strong> A mid-sized e-commerce platform processing $2M/day in transactions.</p>
<p><strong>Stack:</strong> Kubernetes on AWS EKS, Postgres RDS, Redis, Kafka, Python microservices, Prometheus, Grafana, PagerDuty.</p>
<p><strong>The incident:</strong> Black Friday. Traffic 4x normal. 11:47PM.</p>
<h3>What the on-call engineer saw:</h3>
<pre><code class="language-plaintext">11:47 PM — payment-service error rate: 12%
11:47 PM — checkout-service latency p99: 8.2s
11:48 PM — order-service DB connections: 95/100
11:48 PM — inventory-service error rate: 7%
11:49 PM — payment-service CPU: 91%
11:49 PM — checkout-service CPU: 88%
11:49 PM — order-service error rate: 15%
11:49 PM — redis connection timeout alerts: 4 separate alerts
11:50 PM — kafka consumer lag: 120,000 messages
11:50 PM — notification-service error rate: 6%
         ... 23 more alerts over the next 8 minutes ...
</code></pre>
<p><strong>31 pages. One engineer. 12 minutes.</strong></p>
<h3>What was actually happening (one sentence):</h3>
<p>A single connection pool misconfiguration in <code>order-service</code> caused it to exhaust its Postgres connections under load, which cascaded upstream through every service that depended on order data.</p>
<p><strong>One root cause. 31 pages. 47 minutes to diagnose. $380,000 in lost GMV.</strong></p>
<h3>What a well-designed system would have done:</h3>
<pre><code class="language-plaintext">11:47 PM — [GROUPED ALERT, 1 page]:
  Root cause candidate: order-service DB connection pool exhaustion
  Evidence: 31 correlated metric anomalies across 6 services
  Affected services: payment, checkout, inventory, order, notification, redis
  Automated diagnosis: Connection pool limit reached (95/100)
  Recommended action: [RUNBOOK-AUTO-007] Scale connection pool, restart order-service
  Confidence: 94%
  Auto-remediation: Ready to execute on approval
</code></pre>
<p><strong>One page. One diagnosis. One action. Triggered, reviewed, approved, and resolved in 9 minutes.</strong></p>
<hr />
<h1>The Architecture That Eliminates Burnout</h1>
<img src="https://cdn.hashnode.com/uploads/covers/6442da7c019a6adb6b507559/b611029b-ccbb-4607-a760-d505f31e5136.png" alt="" style="display:block;margin:0 auto" />

<h2>Layer 1: Smart Alert Grouping</h2>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/alertmanager-config.yaml">alertmanager-config.yaml</a></p>
</blockquote>
<h3>Prerequisites</h3>
<pre><code class="language-bash">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set alertmanager.enabled=true \
  --set grafana.enabled=true
</code></pre>
<h3>Step 1: Alert Rules with Semantic Labels</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/k8s/prometheus-alert-rules.yaml">prometheus-alert-rules.yaml</a></p>
</blockquote>
<p>The key insight is that every alert rule must carry labels that describe <em>what it means</em>, not just <em>what it measures</em>. These labels power the grouping and inhibition logic downstream.</p>
<pre><code class="language-yaml"># prometheus-alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: service.database
      interval: 30s
      rules:
        - alert: DatabaseConnectionPoolExhausted
          expr: |
            (
              pg_stat_activity_count{state="active"}
              / pg_settings_max_connections
            ) &gt; 0.85
          for: 1m
          labels:
            severity: critical
            layer: infrastructure
            impact_type: resource
            root_cause_candidate: "true"
            service: "{{ $labels.service }}"
            runbook: RB-DB-001
          annotations:
            summary: "DB connection pool at {{ $value | humanizePercentage }} capacity"

        - alert: ServiceErrorRateHigh
          expr: |
            (
              rate(http_requests_total{status=~"5.."}[5m])
              / rate(http_requests_total[5m])
            ) &gt; 0.05
          for: 2m
          labels:
            severity: warning
            layer: application
            impact_type: availability
            root_cause_candidate: "false"
            service: "{{ $labels.service }}"
            runbook: RB-APP-001
          annotations:
            summary: "{{ \(labels.service }} error rate {{ \)value | humanizePercentage }}"

        - alert: ServiceLatencyHigh
          expr: |
            histogram_quantile(0.99,
              rate(http_request_duration_seconds_bucket[5m])
            ) &gt; 2
          for: 2m
          labels:
            severity: warning
            layer: application
            impact_type: performance
            root_cause_candidate: "false"
            service: "{{ $labels.service }}"
            runbook: RB-APP-002
          annotations:
            summary: "{{ \(labels.service }} p99 latency {{ \)value | humanizeDuration }}"
</code></pre>
<h3>Step 2: Alertmanager Config with Intelligent Grouping and Inhibition</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/alertmanager-config.yaml">alertmanager-config.yaml</a></p>
</blockquote>
<pre><code class="language-yaml"># alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

# If a root cause alert fires, suppress all downstream symptom alerts.
# This is the single biggest reducer of alert volume.
inhibit_rules:
  - source_matchers:
      - 'alertname="DatabaseConnectionPoolExhausted"'
      - 'root_cause_candidate="true"'
    target_matchers:
      - 'root_cause_candidate="false"'
      - 'layer="application"'
    equal: ['namespace']

  - source_matchers:
      - 'layer="infrastructure"'
      - 'severity="critical"'
    target_matchers:
      - 'layer="application"'
      - 'severity="warning"'
    equal: ['namespace']

  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'service', 'namespace']

route:
  group_by: ['namespace', 'layer', 'impact_type']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'diagnosis-engine'

  routes:
    - matchers:
        - 'severity="critical"'
        - 'layer="infrastructure"'
      group_wait: 10s
      group_interval: 2m
      receiver: 'diagnosis-engine'
      continue: false

    - matchers:
        - 'layer="application"'
      group_by: ['namespace', 'service', 'impact_type']
      group_wait: 45s
      receiver: 'diagnosis-engine'
      continue: false

    - matchers:
        - 'severity="warning"'
      repeat_interval: 1h
      receiver: 'slack-warnings'
      continue: false

receivers:
  - name: 'diagnosis-engine'
    webhook_configs:
      - url: 'http://diagnosis-engine.monitoring.svc.cluster.local:8080/alerts'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#oncall-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: |
          *Alerts:* {{ len .Alerts }}
          *Namespace:* {{ .GroupLabels.namespace }}
</code></pre>
<p>Apply this config:</p>
<pre><code class="language-bash">kubectl apply -f alertmanager-config.yaml

kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 \
  -- amtool config show
</code></pre>
<h2>Layer 2: Automated Diagnosis Engine</h2>
<h3>Project Structure</h3>
<pre><code class="language-plaintext">oncall-burnout-fix/
├── Dockerfile
├── requirements.txt
├── main.py
├── alertmanager-config.yaml
├── docker-compose.yml
├── src/
│   ├── models.py
│   ├── graph.py
│   ├── correlator.py
│   ├── diagnosis.py
│   ├── aiops.py
│   ├── history.py
│   └── runbook_handler.py
├── playbooks/
│   └── db-connection-pool-fix.yml
├── tests/
│   ├── test_diagnosis.py
│   └── test_integration.py
└── k8s/
    └── deployment.yaml
</code></pre>
<h3>requirements.txt</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/requirements.txt">requirements.txt</a></p>
</blockquote>
<pre><code class="language-plaintext">fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3
httpx==0.26.0
networkx==3.2.1
prometheus-api-client==0.5.4
redis==5.0.1
numpy==1.26.3
scikit-learn==1.4.0
structlog==24.1.0
tenacity==8.2.3
pytest==7.4.4
pytest-asyncio==0.23.3
</code></pre>
<h3>Data contract layer</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/models.py">src/</a><a href="http://models.py">models.py</a></p>
</blockquote>
<pre><code class="language-python">from pydantic import BaseModel, Field
from typing import Optional, Literal
from datetime import datetime
from enum import Enum


class AlertSeverity(str, Enum):
    CRITICAL = "critical"
    WARNING  = "warning"
    INFO     = "info"


class AlertLabel(BaseModel):
    alertname: str
    severity: AlertSeverity
    layer: str
    impact_type: str
    root_cause_candidate: bool = False
    service: Optional[str] = None
    namespace: str = "default"
    runbook: Optional[str] = None


class Alert(BaseModel):
    status: Literal["firing", "resolved"]
    labels: AlertLabel
    annotations: dict
    starts_at: datetime = Field(alias="startsAt")
    ends_at: Optional[datetime] = Field(None, alias="endsAt")
    generator_url: str = Field("", alias="generatorURL")
    fingerprint: str = ""

    class Config:
        populate_by_name = True


class AlertBundle(BaseModel):
    version: str = "4"
    group_key: str = Field(alias="groupKey")
    status: Literal["firing", "resolved"]
    receiver: str
    group_labels: dict = Field(alias="groupLabels")
    common_labels: dict = Field(alias="commonLabels")
    common_annotations: dict = Field(alias="commonAnnotations")
    alerts: list[Alert]

    class Config:
        populate_by_name = True


class DiagnosisResult(BaseModel):
    incident_id: str
    root_cause_alert: Optional[str]
    root_cause_service: Optional[str]
    root_cause_description: str
    affected_services: list[str]
    confidence_score: float
    alert_count: int
    deduplicated_alert_count: int
    recommended_runbook: Optional[str]
    supporting_metrics: dict
    historical_match: Optional[str]
    created_at: datetime
</code></pre>
<h3>Service Dependency Graph</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/graph.py">src/</a><a href="http://graph.py">graph.py</a></p>
</blockquote>
<pre><code class="language-python">import networkx as nx
from typing import Optional
import json
import structlog

log = structlog.get_logger()


class ServiceDependencyGraph:
    """
    Directed graph of service dependencies.
    Edge A -&gt; B means A calls B. If B fails, A is affected.
    """

    def __init__(self):
        self.graph = nx.DiGraph()
        self._build_default_graph()

    def _build_default_graph(self):
        services = [
            "payment-service", "checkout-service", "order-service",
            "inventory-service", "notification-service",
            "postgres-rds", "redis-cache", "kafka",
        ]
        dependencies = [
            ("payment-service",      "order-service"),
            ("payment-service",      "redis-cache"),
            ("checkout-service",     "payment-service"),
            ("checkout-service",     "inventory-service"),
            ("checkout-service",     "order-service"),
            ("order-service",        "postgres-rds"),
            ("order-service",        "kafka"),
            ("inventory-service",    "postgres-rds"),
            ("notification-service", "kafka"),
        ]
        for s in services:
            self.graph.add_node(s)
        for caller, dep in dependencies:
            self.graph.add_edge(caller, dep)

    def get_upstream_services(self, service: str) -&gt; list[str]:
        if service not in self.graph:
            return []
        return list(self.graph.predecessors(service))

    def get_downstream_services(self, service: str) -&gt; list[str]:
        if service not in self.graph:
            return []
        return list(self.graph.successors(service))

    def find_likely_root_cause(self, affected_services: list[str]) -&gt; Optional[str]:
        if not affected_services:
            return None
        affected_set = set(affected_services)
        scores = {}
        for service in affected_services:
            if service not in self.graph:
                continue
            upstream = set(self.graph.predecessors(service))
            scores[service] = len(upstream.intersection(affected_set))
        if not scores:
            return affected_services[0]
        return max(scores, key=scores.get)

    def get_impact_radius(self, root_service: str) -&gt; list[str]:
        if root_service not in self.graph:
            return []
        impacted, queue = set(), [root_service]
        while queue:
            current = queue.pop(0)
            for pred in self.graph.predecessors(current):
                if pred not in impacted:
                    impacted.add(pred)
                    queue.append(pred)
        return list(impacted)

    def load_from_file(self, path: str):
        with open(path) as f:
            data = json.load(f)
        self.graph.clear()
        for node in data.get("nodes", []):
            self.graph.add_node(node)
        for edge in data.get("edges", []):
            self.graph.add_edge(edge["from"], edge["to"])
</code></pre>
<h3>Metric Correlation Analysis</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/correlator.py">src/</a><a href="http://correlator.py">correlator.py</a></p>
</blockquote>
<pre><code class="language-python">import numpy as np
from datetime import datetime, timedelta
import httpx
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

log = structlog.get_logger()
PROMETHEUS_URL = "http://prometheus-operated.monitoring.svc.cluster.local:9090"


class MetricCorrelator:

    def __init__(self, prometheus_url: str = PROMETHEUS_URL):
        self.prometheus_url = prometheus_url
        self.client = httpx.AsyncClient(timeout=10.0)

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=8))
    async def fetch_metric(
        self, query: str, start: datetime, end: datetime, step: str = "30s"
    ) -&gt; list[float]:
        params = {
            "query": query,
            "start": start.timestamp(),
            "end":   end.timestamp(),
            "step":  step,
        }
        try:
            r = await self.client.get(
                f"{self.prometheus_url}/api/v1/query_range", params=params
            )
            r.raise_for_status()
            data = r.json()
            if data["status"] != "success":
                return []
            results = data.get("data", {}).get("result", [])
            if not results:
                return []
            return [float(v[1]) for v in results[0]["values"]]
        except Exception as e:
            log.error("prometheus_fetch_error", query=query, error=str(e))
            return []

    async def compute_correlation(
        self, metric_a: list[float], metric_b: list[float]
    ) -&gt; float:
        if len(metric_a) &lt; 3 or len(metric_b) &lt; 3:
            return 0.0
        min_len = min(len(metric_a), len(metric_b))
        a, b = np.array(metric_a[:min_len]), np.array(metric_b[:min_len])
        if np.std(a) == 0 or np.std(b) == 0:
            return 0.0
        corr = np.corrcoef(a, b)[0, 1]
        return float(corr) if not np.isnan(corr) else 0.0

    async def compute_anomaly_score(self, values: list[float]) -&gt; float:
        if len(values) &lt; 5:
            return 0.0
        historical = np.array(values[:-1])
        latest = values[-1]
        mean, std = np.mean(historical), np.std(historical)
        if std == 0:
            return 0.0
        return min(abs((latest - mean) / std) / 5.0, 1.0)

    async def validate_db_connection_hypothesis(
        self, service: str, incident_start: datetime
    ) -&gt; dict:
        end   = incident_start + timedelta(minutes=10)
        start = incident_start - timedelta(minutes=5)
        db_q  = f'pg_stat_activity_count{{service="{service}",state="active"}}'
        err_q = f'rate(http_requests_total{{service="{service}",status=~"5.."}}[2m])'
        db_vals  = await self.fetch_metric(db_q,  start, end)
        err_vals = await self.fetch_metric(err_q, start, end)
        correlation = await self.compute_correlation(db_vals, err_vals)
        db_anomaly  = await self.compute_anomaly_score(db_vals)
        return {
            "hypothesis": "db_connection_exhaustion",
            "correlation_coefficient": correlation,
            "db_anomaly_score": db_anomaly,
            "supports_hypothesis": correlation &gt; 0.7 and db_anomaly &gt; 0.6,
            "db_connection_values": db_vals[-3:]  if db_vals  else [],
            "error_rate_values":    err_vals[-3:] if err_vals else [],
        }

    async def close(self):
        await self.client.aclose()
</code></pre>
<h3>Root Cause Analysis Engine</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/diagnosis.py">src/</a><a href="http://diagnosis.py">diagnosis.py</a></p>
</blockquote>
<pre><code class="language-python">import uuid, math
import structlog
from datetime import datetime
from typing import Optional

from .models import AlertBundle, DiagnosisResult, AlertSeverity
from .graph import ServiceDependencyGraph
from .correlator import MetricCorrelator
from .history import IncidentHistory

log = structlog.get_logger()


class DiagnosisEngine:

    def __init__(self):
        self.graph      = ServiceDependencyGraph()
        self.correlator = MetricCorrelator()
        self.history    = IncidentHistory()

    async def diagnose(self, bundle: AlertBundle) -&gt; DiagnosisResult:
        incident_id    = str(uuid.uuid4())[:8]
        firing_alerts  = [a for a in bundle.alerts if a.status == "firing"]
        incident_start = min(a.starts_at for a in firing_alerts)

        # 1. Extract affected services
        affected_services = list(set(
            a.labels.service for a in firing_alerts if a.labels.service
        ))

        # 2. Find explicit root cause candidates from alert labels
        root_cause_candidates = [
            a for a in firing_alerts
            if a.labels.root_cause_candidate and a.labels.service
        ]

        # 3. Fall back to graph traversal if no explicit candidates
        if root_cause_candidates:
            root_cause_service = root_cause_candidates[0].labels.service
            root_cause_alert   = root_cause_candidates[0].labels.alertname
        else:
            root_cause_service = self.graph.find_likely_root_cause(affected_services)
            root_cause_alert   = next(
                (a.labels.alertname for a in firing_alerts
                 if a.labels.service == root_cause_service),
                None
            )

        # 4. Validate with metric correlation
        correlation_data = {}
        if root_cause_service:
            correlation_data = await self.correlator.validate_db_connection_hypothesis(
                root_cause_service, incident_start
            )

        # 5. Historical pattern matching
        historical_match = await self.history.find_similar_incident(
            affected_services=affected_services,
            root_cause_alert=root_cause_alert,
        )

        # 6. Score confidence
        confidence = self._compute_confidence(
            has_explicit_candidate=bool(root_cause_candidates),
            correlation_data=correlation_data,
            has_historical_match=historical_match is not None,
            alert_count=len(firing_alerts),
            affected_service_count=len(affected_services),
        )

        result = DiagnosisResult(
            incident_id=incident_id,
            root_cause_alert=root_cause_alert,
            root_cause_service=root_cause_service,
            root_cause_description=self._build_description(
                root_cause_service, root_cause_alert,
                affected_services, firing_alerts, correlation_data
            ),
            affected_services=affected_services,
            confidence_score=confidence,
            alert_count=len(bundle.alerts),
            deduplicated_alert_count=len(firing_alerts),
            recommended_runbook=self._select_runbook(
                root_cause_alert, root_cause_candidates, historical_match
            ),
            supporting_metrics=correlation_data,
            historical_match=historical_match,
            created_at=datetime.utcnow(),
        )
        await self.history.store_incident(result, bundle)
        return result

    def _compute_confidence(
        self, has_explicit_candidate, correlation_data,
        has_historical_match, alert_count, affected_service_count
    ) -&gt; float:
        score = 0.0
        if has_explicit_candidate:
            score += 0.40
        if correlation_data.get("supports_hypothesis"):
            score += 0.30
        elif correlation_data.get("correlation_coefficient", 0) &gt; 0.5:
            score += 0.15
        if has_historical_match:
            score += 0.20
        score += min(math.log(alert_count + 1) / 20, 0.10)
        return round(min(score, 1.0), 3)

    def _select_runbook(self, root_cause_alert, root_cause_candidates, historical_match):
        if root_cause_candidates and root_cause_candidates[0].labels.runbook:
            return root_cause_candidates[0].labels.runbook
        if historical_match:
            return historical_match
        return {
            "DatabaseConnectionPoolExhausted": "RB-DB-001",
            "ServiceErrorRateHigh":            "RB-APP-001",
            "KafkaConsumerLagHigh":            "RB-KAFKA-001",
            "ServiceLatencyHigh":              "RB-APP-002",
        }.get(root_cause_alert)

    def _build_description(
        self, root_cause_service, root_cause_alert,
        affected_services, firing_alerts, correlation_data
    ) -&gt; str:
        lines = []
        if root_cause_service and root_cause_alert:
            lines.append(f"Root cause: {root_cause_alert} on {root_cause_service}")
        if affected_services:
            lines.append(f"Cascade affecting: {', '.join(affected_services)}")
        if correlation_data.get("supports_hypothesis"):
            r = correlation_data.get("correlation_coefficient", 0)
            lines.append(f"Metric correlation confirms hypothesis (r={r:.2f})")
        critical = sum(
            1 for a in firing_alerts if a.labels.severity == AlertSeverity.CRITICAL
        )
        if critical:
            lines.append(f"{critical} critical alerts deduplicated into this incident")
        return ". ".join(lines) + "."
</code></pre>
<h3>Pattern Store</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/history.py">src/</a><a href="http://history.py">history.py</a></p>
</blockquote>
<pre><code class="language-python">import redis.asyncio as redis
import structlog
from typing import Optional
from datetime import datetime
from .models import DiagnosisResult, AlertBundle

log = structlog.get_logger()
REDIS_URL = "redis://redis.monitoring.svc.cluster.local:6379"


class IncidentHistory:

    def __init__(self):
        self._client: Optional[redis.Redis] = None

    async def _get_client(self) -&gt; redis.Redis:
        if not self._client:
            self._client = await redis.from_url(REDIS_URL, decode_responses=True)
        return self._client

    def _fingerprint(
        self, affected_services: list[str], root_cause_alert: Optional[str]
    ) -&gt; str:
        return f"{root_cause_alert or 'unknown'}::{','.join(sorted(affected_services))}"

    async def find_similar_incident(
        self, affected_services: list[str], root_cause_alert: Optional[str]
    ) -&gt; Optional[str]:
        try:
            client = await self._get_client()
            fp     = self._fingerprint(affected_services, root_cause_alert)
            return await client.hget(f"incident:pattern:{fp}", "recommended_runbook")
        except Exception as e:
            log.warning("history_lookup_failed", error=str(e))
            return None

    async def store_incident(self, result: DiagnosisResult, bundle: AlertBundle):
        try:
            client = await self._get_client()
            fp     = self._fingerprint(result.affected_services, result.root_cause_alert)
            key    = f"incident:pattern:{fp}"
            prev   = int(await client.hget(key, "occurrence_count") or "0")
            await client.hset(key, mapping={
                "root_cause_alert":    result.root_cause_alert   or "",
                "root_cause_service":  result.root_cause_service or "",
                "recommended_runbook": result.recommended_runbook or "",
                "confidence":          str(result.confidence_score),
                "last_seen":           datetime.utcnow().isoformat(),
                "occurrence_count":    str(prev + 1),
            })
            await client.expire(key, 60 * 60 * 24 * 90)
        except Exception as e:
            log.warning("history_store_failed", error=str(e))
</code></pre>
<h3>FastAPI Application</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="http://main.py">main.py</a></p>
</blockquote>
<pre><code class="language-python">from fastapi import FastAPI, HTTPException, Header
from contextlib import asynccontextmanager
import structlog

from src.models import AlertBundle, DiagnosisResult
from src.diagnosis import DiagnosisEngine
from src.aiops import AIOpsTriageEngine

log = structlog.get_logger()


@asynccontextmanager
async def lifespan(app: FastAPI):
    log.info("diagnosis_engine_starting")
    yield
    log.info("diagnosis_engine_stopping")


app = FastAPI(
    title="On-Call Diagnosis Engine",
    version="1.0.0",
    lifespan=lifespan,
)

diagnosis_engine = DiagnosisEngine()
aiops_engine     = AIOpsTriageEngine()


@app.post("/alerts", response_model=DiagnosisResult)
async def receive_alerts(bundle: AlertBundle, authorization: str = Header(None)):
    if bundle.status == "resolved":
        log.info("incident_resolved", group_key=bundle.group_key)
        return {"message": "resolved"}
    try:
        diagnosis = await diagnosis_engine.diagnose(bundle)
        triage    = await aiops_engine.triage(diagnosis, bundle)
        await aiops_engine.notify_oncall(triage, diagnosis)
        return diagnosis
    except Exception as e:
        log.error("diagnosis_failed", error=str(e), exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health():
    return {"status": "ok"}
</code></pre>
<h2>Layer 3: AIOps Triage with LLMs</h2>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/aiops.py">src/</a><a href="http://aiops.py">aiops.py</a></p>
</blockquote>
<pre><code class="language-python">import os, json
import httpx
import structlog
from .models import DiagnosisResult, AlertBundle

log = structlog.get_logger()
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL")
OPENAI_API_KEY    = os.environ.get("OPENAI_API_KEY")


class AIOpsTriageEngine:

    def __init__(self):
        self.client = httpx.AsyncClient(timeout=30.0)

    async def triage(self, diagnosis: DiagnosisResult, bundle: AlertBundle) -&gt; dict:
        prompt = self._build_triage_prompt(diagnosis, bundle)
        try:
            r = await self.client.post(
                "https://api.openai.com/v1/chat/completions",
                headers={
                    "Authorization": f"Bearer {OPENAI_API_KEY}",
                    "Content-Type": "application/json",
                },
                json={
                    "model": "gpt-4-turbo-preview",
                    "temperature": 0.1,
                    "max_tokens": 800,
                    "response_format": {"type": "json_object"},
                    "messages": [
                        {
                            "role": "system",
                            "content": (
                                "You are an expert SRE assistant. "
                                "Given an incident diagnosis, produce a clear, concise "
                                "JSON triage package. Be specific, not generic. "
                                "Use technical language for a senior engineer at 3AM. "
                                "No fluff, no padding."
                            ),
                        },
                        {"role": "user", "content": prompt},
                    ],
                },
            )
            r.raise_for_status()
            return json.loads(r.json()["choices"][0]["message"]["content"])
        except Exception as e:
            log.error("aiops_triage_failed", error=str(e))
            return self._fallback_triage(diagnosis)

    def _build_triage_prompt(self, diagnosis: DiagnosisResult, bundle: AlertBundle) -&gt; str:
        alert_names = [a.labels.alertname for a in bundle.alerts if a.status == "firing"]
        return f"""
Incident diagnosis:
Root cause alert: {diagnosis.root_cause_alert}
Root cause service: {diagnosis.root_cause_service}
Description: {diagnosis.root_cause_description}
Affected: {', '.join(diagnosis.affected_services)}
Alert count before dedup: {diagnosis.alert_count}
Alert count after dedup: {diagnosis.deduplicated_alert_count}
Confidence: {diagnosis.confidence_score:.0%}
Runbook: {diagnosis.recommended_runbook}
Historical match: {diagnosis.historical_match or 'None'}
Alerts: {', '.join(alert_names[:10])}

Return JSON with keys:
one_line_summary, what_is_happening, immediate_action,
steps (array), do_not_do, estimated_resolution_time, escalate_if
"""

    def _fallback_triage(self, diagnosis: DiagnosisResult) -&gt; dict:
        return {
            "one_line_summary": diagnosis.root_cause_description,
            "what_is_happening": f"Cascade from {diagnosis.root_cause_service}.",
            "immediate_action": f"Check runbook {diagnosis.recommended_runbook}",
            "steps": ["Check root cause service logs", "Verify metric stabilization"],
            "do_not_do": "Do not restart all services simultaneously",
            "estimated_resolution_time": "15-30",
            "escalate_if": "Issue persists after 30 minutes",
        }

    async def notify_oncall(self, triage: dict, diagnosis: DiagnosisResult):
        confidence_emoji = (
            "🟢" if diagnosis.confidence_score &gt; 0.8
            else "🟡" if diagnosis.confidence_score &gt; 0.6
            else "🔴"
        )
        payload = {
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"INCIDENT {diagnosis.incident_id.upper()} — {triage['one_line_summary']}",
                    },
                },
                {
                    "type": "section",
                    "fields": [
                        {"type": "mrkdwn",
                         "text": f"*Root Cause:*\n`{diagnosis.root_cause_alert}` on `{diagnosis.root_cause_service}`"},
                        {"type": "mrkdwn",
                         "text": f"*Confidence:*\n{confidence_emoji} {diagnosis.confidence_score:.0%}"},
                        {"type": "mrkdwn",
                         "text": f"*Affected:*\n{', '.join(f'`{s}`' for s in diagnosis.affected_services)}"},
                        {"type": "mrkdwn",
                         "text": f"*Dedup:*\n{diagnosis.deduplicated_alert_count} alerts to 1 page"},
                    ],
                },
                {"type": "divider"},
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": (
                            f"*What is happening:*\n{triage['what_is_happening']}\n\n"
                            f"*Immediate action:*\n{triage['immediate_action']}\n\n"
                            f"*Steps:*\n"
                            + "\n".join(f"{i+1}. {s}" for i, s in enumerate(triage["steps"]))
                            + f"\n\n*Do NOT:* {triage['do_not_do']}"
                        ),
                    },
                },
                {"type": "divider"},
                {
                    "type": "actions",
                    "elements": [
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Approve Auto-Fix"},
                            "style": "primary",
                            "value": json.dumps({
                                "action": "auto_fix",
                                "incident_id": diagnosis.incident_id,
                                "runbook": diagnosis.recommended_runbook,
                            }),
                        },
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Run Manually"},
                            "value": json.dumps({
                                "action": "manual",
                                "incident_id": diagnosis.incident_id,
                                "runbook": diagnosis.recommended_runbook,
                            }),
                        },
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Escalate"},
                            "style": "danger",
                            "value": json.dumps({
                                "action": "escalate",
                                "incident_id": diagnosis.incident_id,
                            }),
                        },
                    ],
                },
            ]
        }
        try:
            r = await self.client.post(SLACK_WEBHOOK_URL, json=payload)
            r.raise_for_status()
        except Exception as e:
            log.error("slack_notification_failed", error=str(e))

    async def close(self):
        await self.client.aclose()
</code></pre>
<h2>Layer 4: Runbook Automation with Ansible</h2>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/src/runbook_handler.py">src/runbook_</a><a href="http://handler.py">handler.py</a></p>
</blockquote>
<pre><code class="language-python">from fastapi import APIRouter, Request
import json, asyncio
import structlog

log = structlog.get_logger()
router = APIRouter()

RUNBOOK_PLAYBOOKS = {
    "RB-DB-001":    "playbooks/db-connection-pool-fix.yml",
    "RB-APP-001":   "playbooks/app-restart-graceful.yml",
    "RB-KAFKA-001": "playbooks/kafka-consumer-reset.yml",
    "RB-APP-002":   "playbooks/app-scale-horizontal.yml",
}


@router.post("/slack/actions")
async def handle_slack_action(request: Request):
    form_data   = await request.form()
    payload     = json.loads(form_data["payload"])
    action      = payload["actions"][0]
    action_data = json.loads(action["value"])

    action_type = action_data["action"]
    incident_id = action_data["incident_id"]
    runbook     = action_data.get("runbook")

    if action_type == "auto_fix" and runbook in RUNBOOK_PLAYBOOKS:
        asyncio.create_task(execute_runbook(
            runbook_id=runbook,
            incident_id=incident_id,
            triggered_by=payload["user"]["name"],
        ))
        return {"text": f"Auto-fix triggered. Running {runbook} for incident {incident_id}."}
    elif action_type == "escalate":
        return {"text": f"Escalating incident {incident_id}."}
    elif action_type == "manual":
        playbook = RUNBOOK_PLAYBOOKS.get(runbook, "unknown")
        return {"text": f"Manual mode:\n```ansible-playbook {playbook} -e incident_id={incident_id}```"}

    return {"text": "Action acknowledged."}


async def execute_runbook(runbook_id: str, incident_id: str, triggered_by: str):
    playbook = RUNBOOK_PLAYBOOKS.get(runbook_id)
    if not playbook:
        log.error("unknown_runbook", runbook_id=runbook_id)
        return
    cmd = [
        "ansible-playbook", playbook,
        "-e", f"incident_id={incident_id}",
        "-e", f"triggered_by={triggered_by}",
        "-e", f"runbook_id={runbook_id}",
        "--diff",
    ]
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    stdout, _ = await proc.communicate()
    output = stdout.decode()
    if proc.returncode == 0:
        log.info("runbook_succeeded", incident_id=incident_id, tail=output[-500:])
    else:
        log.error("runbook_failed",   incident_id=incident_id, tail=output[-500:])
</code></pre>
<h3>DB connection pool fixes</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/playbooks/db-connection-pool-fix.yml">playbooks/db-connection-pool-fix.yml</a></p>
</blockquote>
<pre><code class="language-yaml">---
- name: "RB-DB-001: Database Connection Pool Exhaustion Fix"
  hosts: localhost
  gather_facts: false
  vars:
    namespace: "production"
    service: "order-service"
    max_connections_multiplier: 1.5

  tasks:
    - name: Get current deployment config
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: "{{ service }}"
        namespace: "{{ namespace }}"
      register: deployment_info

    - name: Fail if deployment not found
      ansible.builtin.fail:
        msg: "Deployment {{ service }} not found in {{ namespace }}"
      when: deployment_info.resources | length == 0

    - name: Extract current DB pool size from env
      ansible.builtin.set_fact:
        current_pool_size: &gt;-
          {{
            deployment_info.resources[0].spec.template.spec.containers[0].env
            | selectattr('name', 'eq', 'DB_POOL_SIZE')
            | map(attribute='value')
            | first | default('10') | int
          }}

    - name: Compute new pool size (capped at 50)
      ansible.builtin.set_fact:
        new_pool_size: &gt;-
          {{ [(current_pool_size | int * max_connections_multiplier) | int, 50] | min }}

    - name: Patch deployment with new pool size
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: "{{ service }}"
            namespace: "{{ namespace }}"
            annotations:
              remediation/incident-id:  "{{ incident_id }}"
              remediation/triggered-by: "{{ triggered_by }}"
              remediation/runbook:      "{{ runbook_id }}"
              remediation/timestamp:    "{{ ansible_date_time.iso8601 }}"
          spec:
            template:
              spec:
                containers:
                  - name: "{{ service }}"
                    env:
                      - name: DB_POOL_SIZE
                        value: "{{ new_pool_size | string }}"

    - name: Wait for rollout to complete
      kubernetes.core.k8s_rollout_status:
        name: "{{ service }}"
        namespace: "{{ namespace }}"
        timeout: 120

    - name: Verify connection pool metric improved
      ansible.builtin.uri:
        url: &gt;-
          http://prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/query
          ?query=pg_stat_activity_count{service="{{ service }}",state="active"}
        method: GET
        return_content: true
      register: metric_check
      retries: 6
      delay: 10
      until: &gt;-
        (metric_check.json.data.result[0].value[1] | float)
        &lt; (new_pool_size | int * 0.7)

    - name: Report success
      ansible.builtin.debug:
        msg: &gt;
          RB-DB-001 complete.
          Pool scaled {{ current_pool_size }} to {{ new_pool_size }}.
          Incident {{ incident_id }} resolved.
</code></pre>
<h2>End-to-End Testing</h2>
<h3>Unit Tests</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/tests/test_diagnosis.py">tests/test_</a><a href="http://diagnosis.py">diagnosis.py</a></p>
</blockquote>
<pre><code class="language-python">import pytest
from datetime import datetime
from unittest.mock import AsyncMock
from src.models import AlertBundle
from src.graph import ServiceDependencyGraph
from src.diagnosis import DiagnosisEngine


def make_alert(alertname, service, severity="warning",
               root_cause_candidate=False, layer="application"):
    return {
        "status": "firing",
        "labels": {
            "alertname": alertname, "severity": severity, "layer": layer,
            "impact_type": "availability",
            "root_cause_candidate": str(root_cause_candidate).lower(),
            "service": service, "namespace": "production",
            "runbook": f"RB-{alertname[:3].upper()}-001",
        },
        "annotations": {"summary": f"Test alert for {service}"},
        "startsAt": datetime.utcnow().isoformat() + "Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://prometheus/graph",
        "fingerprint": f"fp-{alertname}-{service}",
    }


def make_bundle(alerts):
    return {
        "version": "4", "groupKey": "{}:{}", "status": "firing",
        "receiver": "diagnosis-engine",
        "groupLabels": {"namespace": "production"},
        "commonLabels": {"namespace": "production"},
        "commonAnnotations": {}, "alerts": alerts,
    }


class TestServiceDependencyGraph:
    def setup_method(self):
        self.graph = ServiceDependencyGraph()

    def test_upstream_services(self):
        upstream = self.graph.get_upstream_services("order-service")
        assert "payment-service" in upstream
        assert "checkout-service" in upstream

    def test_downstream_services(self):
        downstream = self.graph.get_downstream_services("order-service")
        assert "postgres-rds" in downstream
        assert "kafka" in downstream

    def test_root_cause_identification(self):
        affected = ["payment-service", "checkout-service", "order-service"]
        root = self.graph.find_likely_root_cause(affected)
        assert root == "order-service"

    def test_empty_services(self):
        assert self.graph.find_likely_root_cause([]) is None

    def test_impact_radius(self):
        impacted = self.graph.get_impact_radius("postgres-rds")
        assert "order-service"     in impacted
        assert "inventory-service" in impacted


@pytest.mark.asyncio
class TestDiagnosisEngine:
    async def setup_method(self, method):
        self.engine = DiagnosisEngine()
        self.engine.correlator.validate_db_connection_hypothesis = AsyncMock(
            return_value={
                "correlation_coefficient": 0.87,
                "db_anomaly_score": 0.82,
                "supports_hypothesis": True,
            }
        )
        self.engine.history.find_similar_incident = AsyncMock(return_value="RB-DB-001")
        self.engine.history.store_incident        = AsyncMock()

    async def test_explicit_root_cause(self):
        bundle = AlertBundle(**make_bundle([
            make_alert("DatabaseConnectionPoolExhausted", "order-service",
                       severity="critical", root_cause_candidate=True,
                       layer="infrastructure"),
            make_alert("ServiceErrorRateHigh",  "payment-service"),
            make_alert("ServiceErrorRateHigh",  "checkout-service"),
        ]))
        result = await self.engine.diagnose(bundle)
        assert result.root_cause_service == "order-service"
        assert result.root_cause_alert   == "DatabaseConnectionPoolExhausted"
        assert result.confidence_score    &gt; 0.7
        assert result.recommended_runbook == "RB-DB-001"

    async def test_graph_traversal_fallback(self):
        bundle = AlertBundle(**make_bundle([
            make_alert("ServiceErrorRateHigh", "payment-service"),
            make_alert("ServiceErrorRateHigh", "checkout-service"),
            make_alert("ServiceErrorRateHigh", "order-service"),
        ]))
        result = await self.engine.diagnose(bundle)
        assert result.root_cause_service == "order-service"

    async def test_low_confidence_without_correlation(self):
        self.engine.correlator.validate_db_connection_hypothesis = AsyncMock(
            return_value={"correlation_coefficient": 0.2,
                          "db_anomaly_score": 0.1, "supports_hypothesis": False}
        )
        self.engine.history.find_similar_incident = AsyncMock(return_value=None)
        bundle = AlertBundle(**make_bundle([
            make_alert("ServiceErrorRateHigh", "payment-service")
        ]))
        result = await self.engine.diagnose(bundle)
        assert result.confidence_score &lt; 0.6
</code></pre>
<h3>Integration Test: Full Pipeline Simulation</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/tests/test_integration.py">tests/test_</a><a href="http://integration.py">integration.py</a></p>
</blockquote>
<pre><code class="language-python">import pytest
import httpx

BASE_URL = "http://localhost:8080"

PAYMENT_CASCADE = {
    "version": "4",
    "groupKey": '{/{namespace="production"}:{alertname="~"}}',
    "status": "firing",
    "receiver": "diagnosis-engine",
    "groupLabels": {"namespace": "production", "layer": "infrastructure"},
    "commonLabels": {"namespace": "production"},
    "commonAnnotations": {},
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "DatabaseConnectionPoolExhausted",
                "severity": "critical", "layer": "infrastructure",
                "impact_type": "resource", "root_cause_candidate": "true",
                "service": "order-service", "namespace": "production",
                "runbook": "RB-DB-001",
            },
            "annotations": {"summary": "DB connection pool at 98%"},
            "startsAt": "2024-11-29T23:47:00Z", "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus/graph", "fingerprint": "fp001",
        },
        {
            "status": "firing",
            "labels": {
                "alertname": "ServiceErrorRateHigh",
                "severity": "warning", "layer": "application",
                "impact_type": "availability", "root_cause_candidate": "false",
                "service": "payment-service", "namespace": "production",
                "runbook": "RB-APP-001",
            },
            "annotations": {"summary": "payment-service error rate 12%"},
            "startsAt": "2024-11-29T23:47:30Z", "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus/graph", "fingerprint": "fp002",
        },
        {
            "status": "firing",
            "labels": {
                "alertname": "ServiceLatencyHigh",
                "severity": "warning", "layer": "application",
                "impact_type": "performance", "root_cause_candidate": "false",
                "service": "checkout-service", "namespace": "production",
                "runbook": "RB-APP-002",
            },
            "annotations": {"summary": "checkout-service p99 latency 8.2s"},
            "startsAt": "2024-11-29T23:47:45Z", "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus/graph", "fingerprint": "fp003",
        },
    ],
}


@pytest.mark.asyncio
async def test_full_pipeline_payment_cascade():
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=30.0) as client:
        response = await client.post("/alerts", json=PAYMENT_CASCADE)

    assert response.status_code == 200
    result = response.json()

    assert result["root_cause_service"] == "order-service"
    assert result["root_cause_alert"]   == "DatabaseConnectionPoolExhausted"
    assert "payment-service"  in result["affected_services"]
    assert "checkout-service" in result["affected_services"]
    assert result["confidence_score"]    &gt;= 0.7
    assert result["recommended_runbook"] == "RB-DB-001"
    assert result["alert_count"] == 3


@pytest.mark.asyncio
async def test_health():
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=5.0) as client:
        r = await client.get("/health")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"


@pytest.mark.asyncio
async def test_resolved_bundle():
    resolved = {**PAYMENT_CASCADE, "status": "resolved"}
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=10.0) as client:
        r = await client.post("/alerts", json=resolved)
    assert r.status_code == 200
</code></pre>
<h3>Run All Tests</h3>
<pre><code class="language-bash"># Unit tests
pytest tests/test_diagnosis.py -v

# Integration tests (start services first)
docker-compose up -d
pytest tests/test_integration.py -v --asyncio-mode=auto

# Full coverage report
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing
</code></pre>
<h2>Kubernetes Deployment</h2>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/k8s/deployment.yaml">k8s/deployment.yaml</a></p>
</blockquote>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: diagnosis-engine
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: diagnosis-engine
  template:
    metadata:
      labels:
        app: diagnosis-engine
    spec:
      serviceAccountName: diagnosis-engine
      containers:
        - name: diagnosis-engine
          image: your-registry/diagnosis-engine:latest
          ports:
            - containerPort: 8080
          env:
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: oncall-secrets
                  key: slack-webhook-url
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: oncall-secrets
                  key: openai-api-key
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: diagnosis-engine
  namespace: monitoring
spec:
  selector:
    app: diagnosis-engine
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: diagnosis-engine
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: diagnosis-engine
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: diagnosis-engine
subjects:
  - kind: ServiceAccount
    name: diagnosis-engine
    namespace: monitoring
</code></pre>
<h3>Dockerfile</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/Dockerfile">Dockerfile</a></p>
</blockquote>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends ansible \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

RUN ansible-galaxy collection install kubernetes.core

EXPOSE 8080

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]
</code></pre>
<h3>docker-compose.yml</h3>
<blockquote>
<p>📁 <strong>GitHub file:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix/blob/main/docker-compose.yml">docker-compose.yml</a></p>
</blockquote>
<pre><code class="language-yaml">version: "3.9"
services:
  diagnosis-engine:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
</code></pre>
<hr />
<h1>The Cultural Change Framework</h1>
<p>Architecture alone is not enough. The system changes are only durable if they are paired with organizational changes that reassign accountability from individuals to teams.</p>
<p><strong>Principle 1: Alert authorship equals alert ownership.</strong> Every alert rule must have a named team as its owner in the label set (<code>team: payments</code>). That team is responsible for the signal quality of their alerts, reviewed quarterly. Alert noise is a team metric, not an individual failing.</p>
<p><strong>Principle 2: Noisy alerts are bugs.</strong> If an alert fires and the on-call engineer determines it was not actionable, filing a ticket for that alert is now required procedure, not optional. An alert that fires twice with no action taken is a P2 bug with an SLA for resolution.</p>
<p><strong>Principle 3: Postmortems review the system, not the engineer.</strong> The right questions are not "What did the on-call engineer do?" but "Why was the diagnosis not automated? Why did the runbook not prevent this?" The engineer is a symptom. The system is the cause.</p>
<p><strong>Principle 4: Rotation health is a team metric.</strong> Track pages per rotation, time-to-diagnose, auto-resolution rate, false positive rate, and engineer-reported rotation health weekly. When these numbers are bad, the conversation is with engineering leadership about system investment, not with individual engineers about resilience.</p>
<p><strong>Principle 5: Build for the engineer who is half-asleep.</strong> Every piece of on-call tooling should be designed with the assumption that the person using it has been awake for 18 hours and has 30 seconds of working memory available. If the runbook has more than five steps, it should be a program. If the alert message does not tell you what to do, it should not be an alert.</p>
<hr />
<h1>Measuring Success</h1>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Metric</p></th><th><p>Target</p></th><th><p>How to measure</p></th></tr><tr><td><p>Pages per rotation week</p></td><td><p>Reduce by 70%</p></td><td><p>PagerDuty weekly report</p></td></tr><tr><td><p>Time to diagnose (MTTD)</p></td><td><p>Under 5 minutes</p></td><td><p>Alertmanager timestamp to Slack timestamp</p></td></tr><tr><td><p>False positive rate</p></td><td><p>Under 10%</p></td><td><p>Post-incident survey: "Was this page actionable?"</p></td></tr><tr><td><p>Auto-resolution rate</p></td><td><p>Over 40% of P2 incidents</p></td><td><p>Runbook automation success logs</p></td></tr><tr><td><p>Engineer rotation health score</p></td><td><p>Above 7/10</p></td><td><p>Weekly anonymous survey</p></td></tr><tr><td><p>Alert deduplication ratio</p></td><td><p>Over 10:1</p></td><td><p><code>alert_count / deduplicated_alert_count</code> in DiagnosisResult</p></td></tr></tbody></table>

<hr />
<h1>The Manifesto in One Paragraph</h1>
<p>On-call burnout is a measurement problem, an architecture problem, and an accountability problem. It is not a people problem. The engineer who burned out after 90 pages in 30 days did not lack resilience. They were operating in a system that was designed to fail its humans. The fix is not hiring more engineers to absorb the same noise. The fix is designing a system where the noise does not reach humans in the first place: where alerts group by causality, where diagnosis runs automatically, where runbooks are programs not documents, and where accountability for alert quality sits with teams and systems, not with the individual who happened to be on-call when the chaos hit. You can build this system today. Every layer described in this article uses tools that are free, open-source, and already in your stack. The only thing preventing you from implementing it is the assumption that on-call suffering is inevitable. It is not. It is a design choice. And design choices can be changed.</p>
<hr />
<blockquote>
<h3><strong>Full source code:</strong> <a href="https://github.com/SubhanshuMG/oncall-burnout-fix">https://github.com/SubhanshuMG/oncall-burnout-fix</a></h3>
</blockquote>
<p><em>If this post helped you think differently about on-call architecture, share it with the engineering leader in your organization who controls your monitoring budget. That is the right person to read this.</em></p>
]]></content:encoded></item><item><title><![CDATA[The $4.45M Mistake: How a Missing SBOM Requirement Let the XZ Utils Backdoor Slip Past Millions of Servers]]></title><description><![CDATA[The XZ Utils backdoor (CVE-2024-3094) nearly became the most devastating supply chain attack in history; a patient, three-year social engineering campaign that embedded a remote code execution backdoo]]></description><link>https://blogs.subhanshumg.com/xz-utils-backdoor-sbom-supply-chain-security</link><guid isPermaLink="true">https://blogs.subhanshumg.com/xz-utils-backdoor-sbom-supply-chain-security</guid><category><![CDATA[supply chain]]></category><category><![CDATA[Security]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[sbom]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[github-actions]]></category><category><![CDATA[CVE]]></category><category><![CDATA[internal developer platforms]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 22 Feb 2026 15:05:20 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6442da7c019a6adb6b507559/0edbbaef-a3f5-47cc-afd4-75eeaf9e9cc4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The XZ Utils backdoor (CVE-2024-3094) nearly became the most devastating supply chain attack in history; a patient, three-year social engineering campaign that embedded a remote code execution backdoor into a compression library used by virtually every Linux system. Only a <strong>500-millisecond performance anomaly</strong> noticed by a single developer prevented what could have been a universal skeleton key to every SSH server on the internet. This guide dissects exactly how the attack worked, then builds a complete self-healing supply chain audit system, with real code, that would catch it.</p>
<p>The XZ Utils incident exposed a fundamental truth: open-source supply chain security cannot rely on human vigilance alone. <strong>Automated SBOM generation, cryptographic signing, and policy enforcement at every pipeline gate</strong> are now non-negotiable. What follows is a comprehensive architecture and implementation guide for building these defenses into an Internal Developer Platform (IDP) with working GitHub Actions workflows, OPA Rego policies, Cosign signing, Kyverno admission control, and Tekton pipeline integration.</p>
<hr />
<h1>The XZ Utils attack: a masterclass in patient infiltration</h1>
<h3>Think of it like a bank heist movie</h3>
<p>Before diving into the technical details, consider this analogy. Imagine a small-town bank with a single, aging security guard, dedicated but overworked and underpaid. A con artist shows up, starts volunteering at the bank, doing helpful tasks for free. Meanwhile, fake customers start complaining loudly that the guard is too slow, too unreliable, and should retire. After two years of helpfulness, the con artist gets their own set of keys. They don't rob the bank immediately. Instead, they install a hidden mechanism in the vault door, one that only opens with their specific fingerprint, invisible to everyone else. They hide this mechanism inside replacement parts for the vault's air ventilation system, something nobody thinks to inspect. If a random engineer hadn't noticed the vault door was taking half a second longer to open, every bank in the world running the same vault model would have been silently compromised.</p>
<p>That is, almost literally, what happened with XZ Utils.</p>
<h2>Three years of social engineering</h2>
<p>The attacker, operating as <strong>"Jia Tan" (GitHub: JiaT75)</strong>, created their account on <strong>January 26, 2021</strong> and submitted their first innocuous patch, an <code>.editorconfig</code> file, on <strong>October 29, 2021</strong>. Over the next year, they submitted legitimate, helpful contributions to build trust with the sole maintainer, Lasse Collin.</p>
<p>The pressure campaign began in <strong>April 2022</strong>. Sock puppet accounts, <strong>"Jigar Kumar"</strong> and <strong>"Dennis Ens"</strong>, bombarded the xz-devel mailing list with complaints about slow maintenance. Kumar wrote: <em>"Progress will not happen until there is new maintainer. The current maintainer lost interest or doesn't care to maintain anymore."</em> Collin, who had publicly acknowledged struggling with mental health issues, responded: <em>"I haven't lost interest but my ability to care has been fairly limited mostly due to longterm mental health issues... It's also good to keep in mind that this is an unpaid hobby project."</em></p>
<p>The pressure worked. Key milestones in the takeover:</p>
<ul>
<li><p><strong>June 2022</strong>: Collin acknowledges Jia Tan as "practically a co-maintainer already"</p>
</li>
<li><p><strong>September 2022</strong>: Jia Tan gives authoritative release summaries</p>
</li>
<li><p><strong>November 2022</strong>: Bug report email changed to shared alias; README updated to list "the project maintainers Lasse Collin and Jia Tan"</p>
</li>
<li><p><strong>January 2023</strong>: Collin tags his <strong>final release</strong> (v5.4.1)</p>
</li>
<li><p><strong>March 2023</strong>: Jia Tan tags their <strong>first release</strong> (v5.4.2) and takes over oss-fuzz contact</p>
</li>
<li><p><strong>June 2023</strong>: "Hans Jansen" (another suspected persona) introduces GNU Indirect Function (IFUNC) support, the exact mechanism the backdoor would later exploit</p>
</li>
<li><p><strong>July 2023</strong>: Jia Tan disables IFUNC in oss-fuzz builds, preventing the fuzzer from detecting the backdoor</p>
</li>
</ul>
<p>Brian Krebs confirmed that Jigar Kumar, Dennis Ens, and Hans Jansen's email addresses never appeared anywhere else on the internet, not even in data breach databases, strongly indicating purpose-built sock puppet accounts.</p>
<h2>The technical mechanism: five layers deep</h2>
<p>The backdoor was architecturally elegant and terrifyingly well-hidden, operating through a five-stage chain.</p>
<h3><strong>Stage 1: Payload hidden in binary test files.</strong></h3>
<p>On <strong>February 23, 2024</strong>, Jia Tan committed two binary "test" files: <code>tests/files/bad-3-corrupt_lzma2.xz</code> and <code>tests/files/good-large_compressed.lzma</code>. The repository already noted that test files were "created by hand with a hex editor", making opaque binary additions appear routine. These files contained the x86-64 object code payload. <strong>Neither file was referenced by any actual test.</strong></p>
<h3><strong>Stage 2: Modified build system activates the payload.</strong></h3>
<p>The release tarballs (but not the git repository) included a malicious <code>build-to-host.m4</code> file. This is critical: release tarballs routinely include auto-generated build files absent from git, so the divergence went unnoticed. The m4 macro injected an obfuscated shell script into the configure process that checked preconditions (x86-64 Linux, GCC, GNU linker, building as deb/rpm) and then extracted the payload using a chain of <code>sed</code>, <code>tr</code>, <code>head</code>, and <code>xz</code> commands:</p>
<pre><code class="language-shell">sed rpath ../../../tests/files/bad-3-corrupt_lzma2.xz | \
  tr "\t \-_" " \t_\-" | xz -d | /bin/bash
</code></pre>
<p>The extracted payload was compiled into <code>liblzma_la-crc64-fast.o</code> and linked into liblzma.</p>
<h3><strong>Stage 3: IFUNC hijack of RSA_public_decrypt.</strong></h3>
<p>The payload used the GNU Indirect Function mechanism (introduced months earlier by "Hans Jansen's" patches) to replace the IFUNC resolvers for <code>crc32_resolve()</code> and <code>crc64_resolve()</code>. During process startup, before the GOT was remapped read-only, the malicious resolver parsed in-memory symbol tables, installed an audit hook into the dynamic linker, and waited for <code>RSA_public_decrypt@plt</code> to be resolved. When found, it redirected the function to the backdoor code and uninstalled the hook to cover its tracks.</p>
<h3><strong>Stage 4: The sshd, libsystemd, liblzma chain.</strong></h3>
<p>OpenSSH does not directly depend on liblzma. However, many Linux distributions (Debian, Ubuntu, Fedora) patch sshd to support systemd notification via <code>sd_notify</code>, linking sshd against <strong>libsystemd</strong>, which transitively depends on <strong>liblzma</strong>. The dependency chain <code>sshd -&gt; libsystemd -&gt; liblzma (backdoored)</code> loaded the malicious code directly into the SSH daemon's address space.</p>
<h3><strong>Stage 5: Remote code execution gated by Ed448 cryptography.</strong></h3>
<p>The hooked <code>RSA_public_decrypt</code> extracted a payload from the RSA public modulus in SSH certificate authentication (fully attacker-controlled), decrypted it with a hardcoded <strong>ChaCha20</strong> key, and verified a digital signature using a hardcoded <strong>Ed448</strong> public key. Only the holder of the corresponding Ed448 private key could trigger execution. If verified, the payload was passed directly to <code>system()</code>, achieving <strong>unauthenticated remote code execution</strong>. As Filippo Valsorda stated: <em>"It's RCE, not auth bypass, and gated/unreplayable."</em></p>
<p>The use of Ed448 (rather than the common Ed25519), the multi-year campaign, and sophisticated anti-analysis countermeasures (detecting <code>TERM</code>, <code>LD_DEBUG</code>, <code>rr</code>, checking <code>argv[0]</code> for <code>/usr/sbin/sshd</code>, an environment variable kill switch <code>yolAbejyiejuvnup=Evjtgvsh5okmkAvj</code>) all point to likely <strong>state-sponsored</strong> activity.</p>
<h2>Discovery: 500 milliseconds saved the internet</h2>
<p><strong>Andres Freund</strong>, a Microsoft principal engineer and PostgreSQL committer, discovered the backdoor while doing routine micro-benchmarking on Debian sid. He noticed SSH logins consuming unexpectedly high CPU and taking <strong>~0.8 seconds instead of ~0.3 seconds</strong>, a 500ms anomaly. He also encountered Valgrind errors related to liblzma. As he wrote on the oss-security mailing list on <strong>March 29, 2024</strong>: <em>"After observing a few odd symptoms around liblzma on Debian sid installations over the last weeks (logins with ssh taking a lot of CPU, valgrind errors) I figured out the answer."</em></p>
<p>The CVE received a <strong>CVSS score of 10.0</strong>, the maximum. Affected versions were <strong>5.6.0</strong> (released February 24, 2024) and <strong>5.6.1</strong> (released March 9, 2024). Only bleeding-edge distributions had adopted them: Fedora 40/Rawhide, Debian unstable/testing, openSUSE Tumbleweed, and Kali Linux. No stable production distribution was compromised. Red Hat issued an emergency advisory: <em>"Immediately stop using Fedora 40 or Fedora Rawhide."</em></p>
<p><strong>No automated security tool, code review process, or CI/CD pipeline caught this attack.</strong> It was pure human accident.</p>
<hr />
<h1>What an SBOM is and why it matters now</h1>
<p>A <strong>Software Bill of Materials (SBOM)</strong> is a machine-readable inventory of every component, library, and dependency in a software artifact, essentially a nutritional label for code. Executive Order 14028 (May 2021) mandated SBOMs for federal software procurement, and the NTIA defined <strong>seven minimum data fields</strong>: supplier name, component name, version, unique identifiers (PURL/CPE), dependency relationships, SBOM author, and timestamp.</p>
<p>Two dominant formats compete. <strong>SPDX</strong> (Linux Foundation, ISO/IEC 5962:2021) excels at license compliance with file-level and snippet-level granularity. <strong>CycloneDX</strong> (OWASP, ECMA TC54) is security-focused with native VEX support, vulnerability fields, and component pedigree tracking. Both are supported by major tools. For supply chain security, CycloneDX's prescriptive design and built-in vulnerability exchange make it the stronger choice; for license-heavy regulated industries, SPDX's ISO standardization wins.</p>
<h2>Five SBOM signals that would have flagged XZ Utils</h2>
<p>An SBOM alone would not have prevented CVE-2024-3094. But systematic SBOM monitoring would have raised multiple alarms:</p>
<ol>
<li><p><strong>New maintainer detection.</strong> SBOM metadata tracks producers and suppliers. The transition from Lasse Collin to "Jia Tan" as release author, especially on a single-maintainer project with <strong>OpenSSF Scorecard risk scores of ~5.4/10</strong>, would have triggered review in any organization monitoring maintainer changes on critical dependencies.</p>
</li>
<li><p><strong>Modified build scripts absent from source.</strong> The malicious <code>build-to-host.m4</code> existed only in release tarballs, not in git. Comparing SBOMs generated from source checkout versus release tarball would reveal unexplained divergences. StepSecurity's Harden-Runner analysis showed multiple anomalous <code>sed</code> commands modifying the Makefile during 5.6.0 builds, a clear behavioral anomaly.</p>
</li>
<li><p><strong>New binary files in a source repository.</strong> The payload was hidden in binary test files that were not exercised by any test. SPDX's file-level analysis capability could flag new opaque binaries, and build-process SBOM diffing between versions would detect the additions.</p>
</li>
<li><p><strong>Unexpected transitive dependencies.</strong> Binary SBOM analysis of sshd would reveal linkage to liblzma despite OpenSSH's source having no such dependency. The discrepancy between source-level and binary-level dependency graphs is a critical signal.</p>
</li>
<li><p><strong>Rapid incident response.</strong> Organizations with SBOMs could instantly query "which systems contain xz-utils 5.6.0 or 5.6.1?", answering in <strong>minutes rather than days</strong>. Wiz demonstrated this with their agentless SBOM inventory; Sumo Logic showed SPDX JSON queries identifying affected hosts immediately.</p>
</li>
</ol>
<p>The honest assessment: SBOMs are necessary but not sufficient. They must be combined with reproducible builds, maintainer vetting, behavioral analysis, and automated policy enforcement, which is exactly what we build next.</p>
<hr />
<h1>Architecture of a self-healing supply chain audit system</h1>
<p>The system operates as a defense-in-depth pipeline with <strong>six enforcement points</strong>, each independently capable of blocking compromised artifacts.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6442da7c019a6adb6b507559/ae81639b-3de0-484b-80b9-b3e381e0af54.png" alt="" style="display:block;margin:0 auto" />

<p>The core components:</p>
<ul>
<li><p><strong>SBOM generation</strong> (Syft) produces CycloneDX/SPDX inventories at build time</p>
</li>
<li><p><strong>Policy engine</strong> (OPA with Rego, Conftest) evaluates SBOMs against blocklists, license rules, dependency drift, and completeness requirements</p>
</li>
<li><p><strong>Vulnerability scanning</strong> (Grype, Trivy) fails builds on critical/high CVEs</p>
</li>
<li><p><strong>Cryptographic signing</strong> (Cosign/Sigstore) provides keyless signatures via Fulcio OIDC certificates and records every signing event in the Rekor transparency log</p>
</li>
<li><p><strong>Artifact registry</strong> (Harbor) stores images with attached signatures, SBOM attestations, and enforces pull-time policies</p>
</li>
<li><p><strong>Admission control</strong> (Kyverno or sigstore/policy-controller) blocks unsigned or un-attested images at the Kubernetes API server</p>
</li>
<li><p><strong>Drift detection</strong> (ArgoCD, continuous rescanning, Dependency-Track) detects newly discovered vulnerabilities in already-deployed artifacts and triggers automated remediation</p>
</li>
</ul>
<p>The <strong>self-healing</strong> aspect comes from three feedback loops: (1) scheduled vulnerability rescanning that triggers rebuilds when new CVEs affect deployed images, (2) GitOps reconciliation that reverts unauthorized manual deployments, and (3) admission control that prevents bypassing the pipeline entirely.</p>
<h2>Step 1: GitHub Actions workflow with full supply chain security</h2>
<p>This workflow builds a container image, generates an SBOM, scans for vulnerabilities, signs the image, and attests the SBOM, all with keyless Sigstore.</p>
<pre><code class="language-yaml">name: Supply Chain Security Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-sign-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
      id-token: write        # Required for Sigstore OIDC keyless signing
      security-events: write  # Required for SARIF upload

    steps:
      # Checkout
      - name: Checkout repository
        uses: actions/checkout@v4

      # Install Supply Chain Tools
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Install Syft
        uses: anchore/sbom-action/download-syft@v0

      # Docker Build and Push
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}
          tags: |
            type=sha,format=long
            type=semver,pattern={{version}}

      - name: Build and push image
        id: build-and-push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

      # Generate SBOM with Syft
      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        id: sbom
        with:
          image: \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}
          format: spdx-json
          output-file: sbom.spdx.json

      # Policy Check with Conftest/OPA
      - name: Install Conftest
        run: |
          LATEST=$(wget -qO- "https://api.github.com/repos/open-policy-agent/conftest/releases/latest" | grep '"tag_name"' | sed -E 's/.*"v([^"]+)".*/\1/')
          wget -q "https://github.com/open-policy-agent/conftest/releases/download/v\({LATEST}/conftest_\){LATEST}_Linux_x86_64.tar.gz"
          tar xzf conftest_${LATEST}_Linux_x86_64.tar.gz
          sudo mv conftest /usr/local/bin/

      - name: Run SBOM policy checks
        run: conftest test sbom.spdx.json --policy policy/

      # Vulnerability Scan with Grype
      - name: Scan for vulnerabilities
        uses: anchore/scan-action@v7
        id: scan
        with:
          sbom: sbom.spdx.json
          fail-build: true
          severity-cutoff: critical
          output-format: sarif

      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: ${{ steps.scan.outputs.sarif }}

      # Sign Image with Cosign (Keyless)
      - name: Sign container image
        run: |
          cosign sign --yes \
            \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}

      # Attest SBOM
      - name: Attest SBOM to image
        run: |
          cosign attest --yes \
            --predicate sbom.spdx.json \
            --type spdxjson \
            \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}

      # Verify Everything
      - name: Verify signature
        run: |
          cosign verify \
            --certificate-identity-regexp="https://github.com/${{ github.repository }}/*" \
            --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
            \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}

      - name: Verify SBOM attestation
        run: |
          cosign verify-attestation \
            --type spdxjson \
            --certificate-identity-regexp="https://github.com/${{ github.repository }}/*" \
            --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
            \({{ env.REGISTRY }}/\){{ env.IMAGE_NAME }}@${{ steps.build-and-push.outputs.digest }}
</code></pre>
<p>Key details: <code>id-token: write</code> is mandatory for keyless Sigstore signing; it mints the OIDC token that Fulcio uses to issue a short-lived certificate. The <code>--yes</code> flag is required in CI for non-interactive acceptance. All image references use <strong>digest</strong> (<code>@sha256:...</code>), never tags, to prevent TOCTOU attacks.</p>
<h2>Step 2: OPA Rego policies for SBOM enforcement</h2>
<p>These policies run via <code>conftest test sbom.json --policy policy/</code> in the CI pipeline. Place them in a <code>policy/</code> directory.</p>
<h3>Denied packages blocklist</h3>
<pre><code class="language-json"># policy/denied_packages.rego
package main

import future.keywords.contains
import future.keywords.if
import future.keywords.in

denied_packages := {
  "pkg:npm/event-stream",
  "pkg:npm/flatmap-stream",
  "pkg:pypi/jeIlyfish",
  "pkg:generic/xz-utils@5.6.0",
  "pkg:generic/xz-utils@5.6.1",
}

deny contains msg if {
  some component in input.packages
  purl := component.externalRefs[_].referenceLocator
  purl in denied_packages
  msg := sprintf("BLOCKED: package '%s' (purl: %s) is on the deny list", [component.name, purl])
}
</code></pre>
<h3>License compliance</h3>
<pre><code class="language-javascript"># policy/license_compliance.rego
package main

import future.keywords.contains
import future.keywords.if
import future.keywords.in

prohibited_licenses := {
  "GPL-2.0-only", "GPL-2.0-or-later",
  "GPL-3.0-only", "GPL-3.0-or-later",
  "AGPL-3.0-only", "AGPL-3.0-or-later",
  "SSPL-1.0",
}

deny contains msg if {
  some pkg in input.packages
  license := pkg.licenseConcluded
  license in prohibited_licenses
  msg := sprintf("LICENSE VIOLATION: '%s' uses prohibited license '%s'", [pkg.name, license])
}
</code></pre>
<h3>Dependency drift detection</h3>
<pre><code class="language-javascript"># policy/dependency_drift.rego
# Load baseline: conftest test sbom.json --policy policy/ --data baseline.json
package main

import future.keywords.contains
import future.keywords.if
import future.keywords.in

deny contains msg if {
  data.approved_packages
  some pkg in input.packages
  name_version := sprintf("%s@%s", [pkg.name, pkg.versionInfo])
  not name_version in data.approved_packages
  msg := sprintf("NEW DEPENDENCY: '%s' not in approved baseline; review required", [name_version])
}

warn contains msg if {
  not data.approved_packages
  msg := "No baseline loaded; dependency drift detection skipped"
}
</code></pre>
<h3>SBOM completeness check</h3>
<pre><code class="language-javascript"># policy/sbom_completeness.rego
package main

import future.keywords.contains
import future.keywords.if

deny contains msg if {
  not input.spdxVersion
  msg := "SBOM missing required field: spdxVersion"
}

deny contains msg if {
  not input.creationInfo
  msg := "SBOM missing required field: creationInfo"
}

deny contains msg if {
  input.creationInfo
  not input.creationInfo.created
  msg := "SBOM missing required field: creationInfo.created (timestamp)"
}

deny contains msg if {
  not input.packages
  msg := "SBOM has no packages; appears empty"
}

deny contains msg if {
  count(input.packages) == 0
  msg := "SBOM packages array is empty"
}

deny contains msg if {
  some i, pkg in input.packages
  not pkg.name
  msg := sprintf("Package at index %d missing required field 'name'", [i])
}

deny contains msg if {
  some i, pkg in input.packages
  not pkg.versionInfo
  msg := sprintf("Package '%s' missing required field 'versionInfo'", [pkg.name])
}
</code></pre>
<h2>Step 3: Kubernetes admission control with Kyverno</h2>
<h3>Verify Cosign keyless signatures on all images</h3>
<pre><code class="language-yaml">apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
  annotations:
    policies.kyverno.io/title: Verify Image Signatures
    policies.kyverno.io/category: Supply Chain Security
    policies.kyverno.io/severity: high
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: verify-keyless-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "ghcr.io/my-org/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/my-org/*/.github/workflows/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
</code></pre>
<h3>Require SBOM attestation on all deployed images</h3>
<pre><code class="language-yaml">apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sbom-attestation
  annotations:
    policies.kyverno.io/title: Require SBOM Attestation
    policies.kyverno.io/category: Supply Chain Security
    policies.kyverno.io/severity: high
spec:
  validationFailureAction: Enforce
  background: false
  webhookTimeoutSeconds: 30
  rules:
    - name: check-sbom-attestation
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "ghcr.io/my-org/*"
          attestations:
            - type: https://spdx.dev/Document
              attestors:
                - entries:
                    - keyless:
                        subject: "https://github.com/my-org/*/.github/workflows/*"
                        issuer: "https://token.actions.githubusercontent.com"
                        rekor:
                          url: https://rekor.sigstore.dev
              conditions:
                - all:
                    - key: "{{ spdxVersion }}"
                      operator: Equals
                      value: "SPDX-2.3"
</code></pre>
<h3>Alternative: sigstore/policy-controller ClusterImagePolicy</h3>
<pre><code class="language-yaml">apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: keyless-sbom-required
spec:
  images:
    - glob: "ghcr.io/my-org/**"
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          - issuer: https://token.actions.githubusercontent.com
            subject: "https://github.com/my-org/my-repo/.github/workflows/build.yml@refs/heads/main"
      ctlog:
        url: https://rekor.sigstore.dev
      attestations:
        - name: must-have-sbom
          predicateType: https://spdx.dev/Document
          policy:
            type: cue
            data: |
              predicateType: "https://spdx.dev/Document"
</code></pre>
<h2>Step 4: Dependency diff alerting and maintainer change detection</h2>
<p>Add this job to your GitHub Actions workflow to detect suspicious dependency changes:</p>
<pre><code class="language-yaml">  dependency-audit:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - name: Checkout PR
        uses: actions/checkout@v4

      - name: Checkout base
        uses: actions/checkout@v4
        with:
          ref: ${{ github.base_ref }}
          path: base

      - name: Generate SBOMs for diff
        run: |
          # Generate SBOM for current PR
          syft scan dir:. -o spdx-json=pr-sbom.json
          # Generate SBOM for base branch
          syft scan dir:./base -o spdx-json=base-sbom.json

      - name: Dependency diff analysis
        run: |
          #!/bin/bash
          set -euo pipefail

          # Extract package names and versions from both SBOMs
          jq -r '.packages[] | "\(.name)@\(.versionInfo // "unknown")"' base-sbom.json | sort &gt; base-deps.txt
          jq -r '.packages[] | "\(.name)@\(.versionInfo // "unknown")"' pr-sbom.json | sort &gt; pr-deps.txt

          # Find new dependencies
          NEW_DEPS=$(comm -13 base-deps.txt pr-deps.txt)
          REMOVED_DEPS=$(comm -23 base-deps.txt pr-deps.txt)

          if [ -n "$NEW_DEPS" ]; then
            echo "::warning::New dependencies detected:"
            echo "$NEW_DEPS"
            echo ""
            echo "Review these additions for supply chain risk."
          fi

          if [ -n "$REMOVED_DEPS" ]; then
            echo "::notice::Dependencies removed:"
            echo "$REMOVED_DEPS"
          fi

          # Flag high-risk patterns (like the XZ Utils attack signals)
          echo "$NEW_DEPS" | while read -r dep; do
            PKG_NAME=\((echo "\)dep" | cut -d'@' -f1)
            echo "Checking scorecard for: $PKG_NAME"
          done

      - name: OpenSSF Scorecard check
        uses: ossf/scorecard-action@v2.3.1
        with:
          results_file: scorecard-results.sarif
          results_format: sarif
          publish_results: true
</code></pre>
<p>For <strong>maintainer change detection</strong> specifically, integrate with package registry APIs:</p>
<pre><code class="language-shell">#!/bin/bash
# maintainer-check.sh; detect maintainer changes on npm packages
# Run periodically or in CI when lockfiles change

LOCKFILE="package-lock.json"
ALERT_WEBHOOK="${SLACK_WEBHOOK_URL}"

jq -r '.packages | to_entries[] | .value.resolved // empty' "$LOCKFILE" | \
  grep -oP '(?&lt;=registry.npmjs.org/)[^/]+' | sort -u | while read -r pkg; do

  # Fetch current maintainers from npm
  CURRENT=\((npm view "\)pkg" maintainers --json 2&gt;/dev/null | jq -r '.[].name' | sort)
  CACHED="./maintainer-cache/${pkg}.txt"

  if [ -f "$CACHED" ]; then
    PREVIOUS=\((cat "\)CACHED")
    if [ "\(CURRENT" != "\)PREVIOUS" ]; then
      echo "Maintainer change on $pkg"
      echo "  Previous: $PREVIOUS"
      echo "  Current:  $CURRENT"
      # Send alert
      curl -s -X POST "$ALERT_WEBHOOK" \
        -H 'Content-type: application/json' \
        -d "{\"text\":\"Maintainer change detected on \`\(pkg\`\nPrevious: \)PREVIOUS\nCurrent: $CURRENT\"}"
    fi
  fi

  mkdir -p ./maintainer-cache
  echo "\(CURRENT" &gt; "\)CACHED"
done
</code></pre>
<h2>Step 5: Tekton pipeline with Chains for automatic provenance</h2>
<p>The Tekton pipeline mirrors the GitHub Actions workflow but runs on Kubernetes. <strong>Tekton Chains</strong> automatically generates SLSA provenance and signs it.</p>
<h3>Chains configuration</h3>
<pre><code class="language-yaml"># Install Tekton Pipelines and Chains
kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
kubectl apply -f https://storage.googleapis.com/tekton-releases/chains/latest/release.yaml

# Configure Chains for SLSA v1 provenance with OCI storage
kubectl patch configmap chains-config -n tekton-chains -p='{"data":{
  "artifacts.taskrun.format": "slsa/v2alpha3",
  "artifacts.taskrun.storage": "oci",
  "artifacts.oci.storage": "oci",
  "transparency.enabled": "true"
}}'

# Generate signing keys for Chains
cosign generate-key-pair k8s://tekton-chains/signing-secrets
</code></pre>
<h3>SBOM generation Task</h3>
<pre><code class="language-yaml">apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: generate-sbom
spec:
  description: Generate CycloneDX SBOM with Syft
  params:
    - name: IMAGE
      type: string
  workspaces:
    - name: source
  results:
    - name: SBOM_PATH
  steps:
    - name: generate
      image: docker.io/anchore/syft:v1.4.1
      script: |
        #!/usr/bin/env sh
        set -ex
        syft packages "$(params.IMAGE)" \
          --output cyclonedx-json \
          --file $(workspaces.source.path)/sbom.cdx.json
        echo -n "\((workspaces.source.path)/sbom.cdx.json" &gt; \)(results.SBOM_PATH.path)
</code></pre>
<h3>Vulnerability scan Task</h3>
<pre><code class="language-yaml">apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: vulnerability-scan
spec:
  description: Scan for vulnerabilities with Grype, fail on high+
  params:
    - name: IMAGE
      type: string
    - name: FAIL_ON
      type: string
      default: "high"
  workspaces:
    - name: source
  steps:
    - name: scan
      image: docker.io/anchore/grype:v0.79.2
      script: |
        #!/usr/bin/env sh
        set -ex
        grype "$(params.IMAGE)" \
          --fail-on $(params.FAIL_ON) \
          --output cyclonedx-json \
          --file $(workspaces.source.path)/vuln-report.json
</code></pre>
<h3>Full Pipeline definition</h3>
<pre><code class="language-yaml">apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: secure-build-pipeline
spec:
  params:
    - name: repo-url
      type: string
    - name: image-reference
      type: string
  workspaces:
    - name: shared-data
    - name: docker-credentials
  tasks:
    - name: fetch-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-data
      params:
        - name: url
          value: $(params.repo-url)

    - name: build-push
      runAfter: ["fetch-source"]
      taskRef:
        name: kaniko
      workspaces:
        - name: source
          workspace: shared-data
        - name: dockerconfig
          workspace: docker-credentials
      params:
        - name: IMAGE
          value: $(params.image-reference)

    - name: generate-sbom
      runAfter: ["build-push"]
      taskRef:
        name: generate-sbom
      workspaces:
        - name: source
          workspace: shared-data
      params:
        - name: IMAGE
          value: $(params.image-reference)

    - name: vulnerability-scan
      runAfter: ["generate-sbom"]
      taskRef:
        name: vulnerability-scan
      workspaces:
        - name: source
          workspace: shared-data
      params:
        - name: IMAGE
          value: $(params.image-reference)
</code></pre>
<p>Tekton Chains watches completed TaskRuns and <strong>automatically signs the provenance</strong> using the configured keys. This achieves <strong>SLSA Build Level 2</strong> out of the box, and Level 3 with SPIFFE/SPIRE integration for non-falsifiable provenance.</p>
<h2>Step 6: Harbor registry with policy enforcement</h2>
<p>Configure Harbor to enforce signatures and trigger SBOM generation:</p>
<pre><code class="language-shell"># Deploy Harbor with Trivy scanner (Helm)
helm repo add harbor https://helm.goharbor.io
helm install harbor harbor/harbor \
  --set expose.type=ingress \
  --set expose.ingress.hosts.core=harbor.example.com \
  --set trivy.enabled=true \
  --set persistence.enabled=true

# Configure project-level policies via Harbor API
# Require content trust (only signed images can be pulled)
curl -X PUT "https://harbor.example.com/api/v2.0/projects/myproject" \
  -H "Content-Type: application/json" \
  -u "admin:Harbor12345" \
  -d '{
    "metadata": {
      "enable_content_trust": "true",
      "enable_content_trust_cosign": "true",
      "auto_scan": "true",
      "prevent_vul": "true",
      "severity": "high"
    }
  }'
</code></pre>
<p>With <code>enable_content_trust_cosign</code> set to <code>true</code>, Harbor will refuse to serve images that lack a valid Cosign signature. With <code>prevent_vul</code> and <code>severity</code> configured, Harbor blocks pulls of images with vulnerabilities at or above the threshold. The <code>auto_scan</code> setting triggers Trivy scans on every push, and Harbor stores SBOM attestations as OCI artifacts alongside images.</p>
<hr />
<h1>Testing the end-to-end pipeline</h1>
<p>Verify every enforcement point works by testing both the happy path and failure cases:</p>
<pre><code class="language-bash"># TEST 1: Happy path; signed, attested, clean image deploys
# Run the GitHub Actions pipeline on a clean image, then verify locally:
cosign verify \
  --certificate-identity-regexp="https://github.com/my-org/.*" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  ghcr.io/my-org/myapp@sha256:abc123...

cosign verify-attestation --type spdxjson \
  --certificate-identity-regexp="https://github.com/my-org/.*" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  ghcr.io/my-org/myapp@sha256:abc123...

# Deploy to Kubernetes; should succeed
kubectl run test --image=ghcr.io/my-org/myapp@sha256:abc123...

# TEST 2: Unsigned image; admission controller blocks it
# Push an unsigned image directly (bypassing pipeline)
docker push ghcr.io/my-org/unsigned-app:latest

# Attempt deploy; Kyverno should reject with:
# "image signature verification failed"
kubectl run test-unsigned --image=ghcr.io/my-org/unsigned-app:latest
# Expected: Error from server: admission webhook denied the request

# TEST 3: SBOM policy violation; pipeline fails at gate
# Add a blocked dependency (e.g., event-stream) to package.json
# Push to trigger pipeline
# Expected: conftest step fails with:
#   "BLOCKED: package 'event-stream' is on the deny list"

# TEST 4: Critical vulnerability; Grype fails the build
# Use a base image with known critical CVEs
# FROM node:14  (has multiple critical CVEs)
# Expected: scan-action step fails with exit code 1

# TEST 5: Dependency drift; new package flagged for review
# Add a new dependency not in baseline.json
# Submit as PR
# Expected: warning annotation on the PR:
#   "NEW DEPENDENCY: 'newpkg@1.0.0' not in approved baseline"

# TEST 6: Harbor blocks vulnerable image pull
docker pull harbor.example.com/myproject/vulnerable-app:latest
# Expected: Error: FORBIDDEN; image has critical vulnerabilities
</code></pre>
<hr />
<h1>Why would this architecture have caught the XZ Utils attack?</h1>
<p>Each enforcement layer independently addresses a different aspect of CVE-2024-3094.</p>
<p><strong>SBOM dependency drift detection</strong> would have flagged the new binary test files and the modified build scripts between xz-utils 5.5.x and 5.6.0 as unexpected changes requiring review. The baseline comparison policy would emit <code>NEW DEPENDENCY</code> warnings for any component changes.</p>
<p><strong>Maintainer change alerting</strong> would have flagged the transition from Lasse Collin to Jia Tan as the sole release author. The OpenSSF Scorecard integration would have assigned low maintainer scores to a single-maintainer project with a recently added co-maintainer.</p>
<p><strong>Reproducible build verification</strong> comparing SBOMs from git source versus the release tarball would have revealed the <code>build-to-host.m4</code> file present only in tarballs, the exact mechanism used to inject the backdoor.</p>
<p><strong>Binary SBOM analysis</strong> of sshd would have detected the unexpected transitive dependency on liblzma through libsystemd, flagging the attack surface that made SSH vulnerable to a compression library backdoor.</p>
<p><strong>Cryptographic provenance</strong> via SLSA provenance and Cosign attestations creates an auditable chain linking every artifact to its source commit, builder identity, and build environment, making it far harder to inject malicious code without detection.</p>
<p>No single tool would have stopped this attack. But multiple independent enforcement points, such as SBOM diffing, maintainer monitoring, build reproducibility, admission control, and continuous rescanning, create a defense-in-depth posture where the attacker would need to compromise multiple systems simultaneously. That transforms a supply chain attack from a single point of failure into a <strong>distributed consensus problem</strong> that the attacker must solve, dramatically raising the cost and reducing the probability of success.</p>
<hr />
<h1>Conclusion</h1>
<p>The XZ Utils backdoor was not a failure of technology; it was a failure of process. A single underfunded maintainer, no automated supply chain verification, and no policy enforcement between source code and production deployment. The technical sophistication of the attack (Ed448 cryptography, IFUNC hijacking, multi-year social engineering) was extraordinary, but the defense required is not.</p>
<p>The architecture described here, <strong>SBOM generation at build, OPA policy gates, Cosign signing, Kyverno admission control, and continuous drift detection</strong>, is implementable today with open-source tools. The key insight is that supply chain security is not a single gate but a <strong>continuous verification loop</strong>: generate, attest, verify, scan, re-scan, alert, remediate. Every step in the pipeline both produces and consumes cryptographic evidence, and every enforcement point operates independently.</p>
<p>The most important technical decision is to <strong>make the CI/CD pipeline the only path to production</strong>; no manual pushes, no exceptions, no tag-based references. Every image must be signed, every SBOM must be attested, and every deployment must be verified. The self-healing feedback loops ensure that even if a vulnerability is discovered after deployment, the system automatically detects it, alerts, and triggers remediation.</p>
<p>After XZ Utils, the question is no longer whether to implement supply chain security, but how fast you can deploy it.</p>
]]></content:encoded></item><item><title><![CDATA[The Global Cloud Blackout]]></title><description><![CDATA[If AWS disappeared tomorrow, would your company survive?

Not degraded. Not slow. Gone.
Most "highly available" systems would collapse within hours. This isn't fear-mongering. It's an architectural re]]></description><link>https://blogs.subhanshumg.com/the-global-cloud-blackout</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-global-cloud-blackout</guid><category><![CDATA[Cloud]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Chaos Engineering]]></category><category><![CDATA[high availability]]></category><category><![CDATA[System Design]]></category><category><![CDATA[multi-cloud]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:51:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771336191365/e99d2bce-4177-4d91-864e-699350d3ab1a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<h3><strong>If AWS disappeared tomorrow, would your company survive?</strong></h3>
</blockquote>
<p>Not degraded. Not slow. <em>Gone.</em></p>
<p>Most "highly available" systems would collapse within hours. This isn't fear-mongering. It's an architectural reality that 90% of engineering teams refuse to confront.</p>
<p>This article is not a theory piece. It's a <strong>survival blueprint</strong> with real architecture, real code, and a real-world case study you can follow end to end.</p>
<hr />
<h2>The Real-World Wake-Up Call</h2>
<h3>December 7, 2021: The Day us-east-1 Broke the Internet</h3>
<p>At 7:35 AM PST, AWS us-east-1 experienced a cascading failure. Here's what actually happened and why it matters:</p>
<p><strong>The Trigger:</strong> A networking issue disrupted communication between internal AWS services in us-east-1. This wasn't a power outage. It wasn't a natural disaster. It was an <em>internal control plane failure</em>.</p>
<p><strong>The Cascade:</strong></p>
<ul>
<li><p>AWS Console became unreachable and teams couldn't even <em>see</em> their infrastructure</p>
</li>
<li><p>DynamoDB, Lambda, SQS, Kinesis were all degraded or unavailable</p>
</li>
<li><p>CloudWatch stopped reporting, so teams were flying blind</p>
</li>
<li><p>Even the <strong>AWS Status Page</strong> was down because it was hosted on... AWS</p>
</li>
</ul>
<p><strong>The Real-World Damage:</strong></p>
<ul>
<li><p>Disney+, Netflix, Slack, Venmo, Coinbase all experienced outages</p>
</li>
<li><p>Robinhood users couldn't execute trades during market hours</p>
</li>
<li><p>Warehouse robots at Amazon's own fulfillment centers stopped working</p>
</li>
<li><p>Ring doorbell cameras went dark across millions of homes</p>
</li>
<li><p>Thousands of companies with "99.99% availability" architectures went completely offline</p>
</li>
</ul>
<p><strong>The Uncomfortable Truth:</strong></p>
<p>Every one of these companies had multi-AZ deployments. Many had auto-scaling. Some even had multi-region configurations. But almost <em>none</em> had <strong>control-plane independence</strong>.</p>
<p>When AWS couldn't talk to itself, it didn't matter how many availability zones you had.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335476255/a6493850-5bf1-4571-aa21-25b4e8e693be.png" alt="" style="display:block;margin:0 auto" />

<blockquote>
<p><strong>Lesson:</strong> Multi-AZ is not resilience. Multi-region is not sovereignty. Multi-cloud without control-plane independence is illusion.</p>
</blockquote>
<p>This incident, along with the Azure global authentication failure of March 2021 and the Cloudflare DNS disruption of June 2022, led me to develop what I call:</p>
<h3><strong>Survival-Grade Cloud Architecture (SGCA)</strong></h3>
<p>A framework for designing systems that don't just survive component failures; they survive <em>provider</em> failures.</p>
<hr />
<h2>The 4 Blackout Threat Models</h2>
<p>Before we architect solutions, we need to understand what we're designing against. Not theoretical scenarios, but <em>real failure modes</em> that have already happened.</p>
<h3>Threat Model 1: Regional Infrastructure Collapse</h3>
<p><strong>What:</strong> An entire cloud region goes dark. Power grid failure, network backbone cut, physical disaster.</p>
<p><strong>Real precedent:</strong> The 2012 AWS us-east-1 outage caused by severe storms in Virginia knocked out power to multiple data centers simultaneously. Backup generators failed at some facilities.</p>
<p><strong>Why Multi-AZ fails here:</strong> All availability zones within a region share the same regional backbone, power grid dependencies, and often the same physical geography.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335546965/64686838-8ce7-4314-a4ee-3eea4e523d38.png" alt="" style="display:block;margin:0 auto" />

<h3>Threat Model 2: Control Plane Failure</h3>
<p><strong>What:</strong> The cloud's management APIs, IAM, provisioning systems become unreachable. Your workloads <em>might</em> still run, but you cannot manage, scale, deploy, or authenticate.</p>
<p><strong>Real precedent:</strong> The December 2021 us-east-1 outage was fundamentally a control-plane failure. Also, Azure AD's global authentication outage in March 2021 locked users out of Microsoft 365, Azure Portal, and every application using Azure AD for SSO, worldwide.</p>
<p><strong>Why this is devastating:</strong> If you use AWS IAM for authentication, AWS Secrets Manager for credentials, and AWS CloudWatch for monitoring, a control plane failure means you're deaf, mute, and blind simultaneously.</p>
<h3>Threat Model 3: DNS / Routing-Level Disruption</h3>
<p><strong>What:</strong> DNS resolution fails or is hijacked. BGP routes are corrupted. Traffic literally cannot find your services.</p>
<p><strong>Real precedent:</strong> In June 2022, Cloudflare experienced an outage affecting 19 data centers due to a BGP routing change gone wrong. In October 2021, Facebook/Meta disappeared from the internet entirely for ~6 hours because of a BGP withdrawal that also took down their internal DNS.</p>
<p><strong>Impact:</strong> It doesn't matter if your servers are running perfectly. If DNS can't resolve your domain, you don't exist on the internet.</p>
<h3>Threat Model 4: Geopolitical / Sovereign Isolation</h3>
<p><strong>What:</strong> Government sanctions block access to a cloud provider. Data sovereignty laws force isolation. A country-level firewall blocks traffic.</p>
<p><strong>Real precedent:</strong> When sanctions were imposed on Russian entities in 2022, AWS and Azure suspended accounts, leaving affected businesses with zero access to their infrastructure. China's Great Firewall regularly disrupts traffic to foreign cloud services.</p>
<p><strong>Impact:</strong> Your entire infrastructure becomes legally or physically inaccessible. Not because it failed, but because <em>access was revoked</em>.</p>
<blockquote>
<p>⚠️ <strong>Pause and ask yourself:</strong> If your cloud console is unreachable right now, can you still deploy? Can users still log in? Can you even see what's happening?</p>
<p>If the answer is no to any of these, keep reading.</p>
</blockquote>
<hr />
<h2>Survival-Grade Architecture Blueprint</h2>
<p>Here's the complete 4-layer architecture that survives all four threat models:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335603019/cd3252ea-1051-4694-bfe0-bd5198193754.png" alt="" style="display:block;margin:0 auto" />

<p>Let's break down each layer.</p>
<h3>Layer 1: Independent Global Traffic Authority</h3>
<p><strong>The Golden Rule: DNS must not depend on the cloud it routes to.</strong></p>
<p>If you're using Route53 to route traffic to AWS, and AWS goes down, your DNS goes down with it. This is the most common single point of failure in "multi-cloud" architectures.</p>
<p><strong>Design Principles:</strong></p>
<ul>
<li><p><strong>Dual DNS providers:</strong> Cloudflare as primary, NS1 (or Google Cloud DNS) as secondary</p>
</li>
<li><p><strong>Health-based routing:</strong> Active health checks against each cloud's endpoints, not just TCP ping but actual application-level health (<code>/healthz</code> returning 200 + data freshness check)</p>
</li>
<li><p><strong>Aggressive TTL:</strong> 30-second TTL for DNS records so failover propagates in under a minute</p>
</li>
<li><p><strong>Anycast:</strong> Use a DNS provider with anycast networking so DNS resolution itself is globally distributed and resilient</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335659127/511a716f-1ff1-404f-ba7b-fd7e57f8d060.png" alt="" style="display:block;margin:0 auto" />

<h3>Layer 2: Active-Active Multi-Cloud Compute</h3>
<p>This is where Kubernetes becomes your portability layer. The key insight: <strong>your container is your contract</strong>. If the same container runs on EKS and GKE, your application is cloud-agnostic.</p>
<p><strong>Architecture:</strong></p>
<ul>
<li><p>AWS EKS cluster as primary compute</p>
</li>
<li><p>GCP GKE cluster as secondary compute (always warm, always running)</p>
</li>
<li><p><strong>ArgoCD</strong> installed in <em>both</em> clusters, both pulling from the same Git repository</p>
</li>
<li><p>Identical container images pushed to both ECR and GCR via CI pipeline</p>
</li>
<li><p>Traffic shift = DNS routing change, <strong>not</strong> a deployment event</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335673911/60617581-80fc-4bbb-a307-e4b96e99e440.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Key Rule:</strong> The control plane of your compute must not depend on a single cloud. ArgoCD gives you this. It's <em>your</em> control plane, not AWS's.</p>
<h3>Layer 3: Data Survival Strategy (The Hardest Layer)</h3>
<p>Data is where multi-cloud gets genuinely hard. You have three models, each with real trade-offs:</p>
<p><strong>Model A: Globally Distributed SQL (CockroachDB / YugabyteDB)</strong></p>
<ul>
<li><p>True active-active writes across clouds</p>
</li>
<li><p>Automatic conflict resolution via consensus protocol</p>
</li>
<li><p>RPO: Near-zero (seconds)</p>
</li>
<li><p>Trade-off: Higher write latency (~50-100ms cross-region), higher cost, operational complexity</p>
</li>
</ul>
<p><strong>Model B: Event-Sourced Architecture (Kafka Multi-Cluster)</strong></p>
<ul>
<li><p>All state changes captured as immutable events</p>
</li>
<li><p>Kafka MirrorMaker 2 replicates across clusters</p>
</li>
<li><p>State can be reconstructed from event replay</p>
</li>
<li><p>RPO: Seconds to minutes depending on replication lag</p>
</li>
<li><p>Trade-off: Application must be designed for event sourcing from the start</p>
</li>
</ul>
<p><strong>Model C: Async Cross-Cloud Replication</strong></p>
<ul>
<li><p>Primary database (e.g., RDS PostgreSQL) with async replication to secondary cloud</p>
</li>
<li><p>Object storage sync (S3 → GCS)</p>
</li>
<li><p>RPO: Minutes (replication lag)</p>
</li>
<li><p>Trade-off: Data loss window during failover, conflict resolution needed</p>
</li>
</ul>
<p><strong>Comparison:</strong></p>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Model A (CockroachDB)</th>
<th>Model B (Event-Sourced)</th>
<th>Model C (Async Replication)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>RPO</strong></td>
<td>~0 seconds</td>
<td>1-30 seconds</td>
<td>1-5 minutes</td>
</tr>
<tr>
<td><strong>Write Latency</strong></td>
<td>50-100ms</td>
<td>5-15ms local</td>
<td>5-10ms local</td>
</tr>
<tr>
<td><strong>Complexity</strong></td>
<td>High</td>
<td>Very High</td>
<td>Medium</td>
</tr>
<tr>
<td><strong>Monthly Cost</strong> (est.)</td>
<td>$3,000-8,000</td>
<td>$2,000-5,000</td>
<td>$500-1,500</td>
</tr>
<tr>
<td><strong>Best For</strong></td>
<td>Financial, healthcare</td>
<td>Event-driven platforms</td>
<td>Cost-sensitive, read-heavy</td>
</tr>
<tr>
<td><strong>Retrofit Difficulty</strong></td>
<td>Medium</td>
<td>Very Hard</td>
<td>Easy</td>
</tr>
</tbody></table>
<p><strong>My Recommendation:</strong> For most teams, start with <strong>Model C</strong> for your database and <strong>Model B</strong> for your event bus. Graduate to Model A when your business criticality demands near-zero RPO.</p>
<h3>Layer 4: Identity Independence</h3>
<p>If AWS IAM goes down, can your users still log in? If your secrets are only in AWS Secrets Manager, can your GCP services authenticate?</p>
<p><strong>Design:</strong></p>
<ul>
<li><p><strong>Self-hosted IdP:</strong> Keycloak deployed on <em>both</em> clouds, backed by the replicated database</p>
</li>
<li><p><strong>Federated OIDC tokens:</strong> Applications validate JWT tokens, not cloud-specific IAM policies</p>
</li>
<li><p><strong>HashiCorp Vault:</strong> Secrets replicated across both clouds, auto-unsealed independently</p>
</li>
<li><p><strong>mTLS via service mesh:</strong> Istio/Linkerd for inter-service auth that doesn't depend on any cloud IAM</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335686164/d5a1af1b-ad99-428f-9379-1d231307c6fa.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Never bind authentication exclusively to one hyperscaler.</strong> This is the #1 mistake I see in "multi-cloud" architectures.</p>
<hr />
<h2>Real-World Example</h2>
<h3>Building a Blackout-Proof E-Commerce Platform: "ShopGlobal"</h3>
<p>Let me walk you through a real scenario. <strong>ShopGlobal</strong> is a mid-size e-commerce company processing $2M/day in orders. They run entirely on AWS. Here's what happened and how they rebuilt.</p>
<h3>The Incident</h3>
<p>On a Tuesday morning, AWS us-east-1 experienced a partial control-plane outage. ShopGlobal's impact:</p>
<ul>
<li><p><strong>Payment service</strong> went down. Secrets Manager unreachable, Stripe API keys inaccessible.</p>
</li>
<li><p><strong>Auth</strong> went down. Cognito unavailable. No user could log in.</p>
</li>
<li><p><strong>Product catalog</strong> degraded. DynamoDB throttled, eventually unreachable.</p>
</li>
<li><p><strong>Order processing</strong> went down. SQS backed up, Lambda couldn't provision.</p>
</li>
<li><p><strong>Monitoring</strong> went down. CloudWatch unreachable. Team couldn't see what was failing.</p>
</li>
</ul>
<p><strong>Total downtime:</strong> 4 hours 22 minutes <strong>Revenue lost:</strong> ~$370,000 <strong>Customer trust impact:</strong> 12% increase in churn that quarter</p>
<h3>The Rebuild: Applying SGCA</h3>
<p>Here's how ShopGlobal rebuilt using the Survival-Grade Architecture:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335703822/a688831a-372f-4495-8b88-9ce4edeacbbb.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Results after rebuild:</strong></p>
<ul>
<li><p>Next us-east-1 degradation → Automatic failover in <strong>47 seconds</strong></p>
</li>
<li><p>Zero revenue lost</p>
</li>
<li><p>Users didn't notice</p>
</li>
<li><p>Cloud Exit Time: <strong>&lt; 2 minutes</strong></p>
</li>
</ul>
<hr />
<h2>Step-by-Step Implementation Guide</h2>
<p>The entire working implementation is open-sourced. Clone the repo and follow along:</p>
<blockquote>
<p>🔗 <strong>GitHub Repository:</strong> <a href="https://github.com/SubhanshuMG/survival-grade-infra">github.com/SubhanshuMG/survival-grade-infra</a></p>
</blockquote>
<pre><code class="language-markdown">survival-grade-infra/
├── terraform/
│   ├── main.tf                          # Multi-cloud providers + Cloudflare DNS failover
│   ├── variables.tf                     # All configurable parameters
│   ├── outputs.tf
│   ├── modules/
│   │   ├── eks/                         # AWS EKS cluster module
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── outputs.tf
│   │   ├── gke/                         # GCP GKE cluster module
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── outputs.tf
│   │   └── dns/
│   │       ├── main.tf
│   │       └── variables.tf
├── k8s/
│   ├── base/                            # Shared Kubernetes manifests
│   │   ├── namespace.yaml
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── ingress.yaml
│   ├── overlays/                        # Cloud-specific Kustomize patches
│   │   ├── aws/
│   │   │   └── kustomization.yaml
│   │   └── gcp/
│   │       └── kustomization.yaml
│   ├── argocd/                          # GitOps Application definitions
│   │   ├── app-aws.yaml
│   │   └── app-gcp.yaml
│   └── base/
│       ├── cockroachdb-multicloud.yaml  # CockroachDB StatefulSet
│       └── keycloak.yaml                # Identity layer
├── src/
│   └── storage-sync/
│       └── sync-worker.py               # Cross-cloud S3→GCS replication
├── ci/
│   └── .github/workflows/
│       └── multi-cloud-deploy.yaml      # Build + sign + push to ECR &amp; GCR
├── chaos/
│   ├── blackout-test.sh                 # Full blackout drill script
│   └── dns-failover-test.sh
└── docker/
    └── Dockerfile
</code></pre>
<p>Let's walk through what each piece does and why it exists.</p>
<h3>Step 1: Terraform Multi-Cloud Infrastructure</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/terraform/variables.tf"><code>terraform/variables.tf</code></a></p>
<p>All configurable parameters: project name, AWS/GCP regions, Cloudflare zone, node counts, and instance types. Change these values once and every module inherits them.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/terraform/main.tf"><code>terraform/main.tf</code></a></p>
<p>This is the core of the entire infrastructure. Here's what it provisions and why:</p>
<ul>
<li><p><strong>AWS VPC + EKS cluster</strong> with nodes spread across 3 availability zones, NAT gateways per AZ (not a single shared one), and proper subnet tagging for Kubernetes load balancers</p>
</li>
<li><p><strong>GCP GKE cluster</strong> with workload identity enabled, node autoscaling, and a separately managed node pool (never use the default pool in production)</p>
</li>
<li><p><strong>Cloudflare health checks</strong> hitting <code>/healthz</code> on both clouds every 30 seconds. These aren't TCP pings. They're full HTTPS requests that verify the application is actually serving traffic.</p>
</li>
<li><p><strong>Cloudflare Load Balancer</strong> with geo-based steering: US East traffic goes to AWS, US Central goes to GCP. If either cloud fails the health check, <em>all</em> traffic shifts to the surviving cloud within one TTL cycle (30 seconds).</p>
</li>
</ul>
<p>The critical design decision here: DNS and traffic routing live on Cloudflare, completely independent of both AWS and GCP. If either cloud burns down, the routing layer keeps working.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/terraform/modules/eks/main.tf"><code>terraform/modules/eks/main.tf</code></a></p>
<p>EKS module using the official <code>terraform-aws-modules/eks/aws</code> with IRSA enabled for pod-level IAM. Nodes are in a managed node group with autoscaling set to 2x the baseline.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/terraform/modules/gke/main.tf"><code>terraform/modules/gke/main.tf</code></a></p>
<p>GKE module with VPC-native networking, workload identity, and the REGULAR release channel. The default node pool is removed immediately and replaced with a custom one (this is a GKE best practice that most teams skip).</p>
<h3>Step 2: Kubernetes Application Manifests</h3>
<p>The manifests use <strong>Kustomize</strong> with a base + overlays pattern. One set of base manifests, two cloud-specific overlays.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/base/deployment.yaml"><code>k8s/base/deployment.yaml</code></a></p>
<p>The deployment includes topology spread constraints to distribute pods evenly across zones, readiness and liveness probes on <code>/healthz</code>, and environment variables pulled from ConfigMaps and Secrets. The image tag is a placeholder that each overlay replaces with the correct cloud-specific registry URL.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/base/service.yaml"><code>k8s/base/service.yaml</code></a></p>
<p>ClusterIP service + Ingress with TLS via cert-manager. Both clouds serve the same hostname <code>app.shopglobal.com</code> because Cloudflare handles which cloud actually receives the traffic.</p>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/overlays/aws/kustomization.yaml"><code>k8s/overlays/aws/kustomization.yaml</code></a> 📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/overlays/gcp/kustomization.yaml"><code>k8s/overlays/gcp/kustomization.yaml</code></a></p>
<p>Each overlay patches three things: the container image registry (ECR vs GCR), the Kafka broker addresses (AWS cluster vs GCP cluster), and the auth issuer URL. Everything else stays identical. Same application code, same configuration structure, different cloud-specific endpoints.</p>
<h3>Step 3: ArgoCD Multi-Cluster GitOps</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/argocd/app-aws.yaml"><code>k8s/argocd/app-aws.yaml</code></a> 📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/argocd/app-gcp.yaml"><code>k8s/argocd/app-gcp.yaml</code></a></p>
<p>ArgoCD is installed independently on both clusters. Each instance pulls from the <strong>same Git repo</strong> but points to its respective Kustomize overlay. Auto-sync is enabled with self-healing and pruning turned on.</p>
<p>This is the key insight of the entire compute layer: <strong>you never deploy to two clouds.</strong> You push to Git once. Both ArgoCD instances independently sync the change. If you need to fail over, you change a DNS weight, not a deployment pipeline. There's nothing to "redeploy" because both clouds are always running the latest version.</p>
<p>Slack notifications are configured on sync failures so your on-call team knows immediately if a cloud falls out of sync.</p>
<h3>Step 4: CI/CD Multi-Cloud Image Pipeline</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/ci/.github/workflows/multi-cloud-deploy.yaml"><code>ci/.github/workflows/multi-cloud-deploy.yaml</code></a></p>
<p>The pipeline does five things on every push to <code>main</code>:</p>
<ol>
<li><p><strong>Builds the container image</strong> once using Docker Buildx</p>
</li>
<li><p><strong>Tags and pushes to AWS ECR</strong> using OIDC-based authentication (no long-lived AWS keys in GitHub)</p>
</li>
<li><p><strong>Tags and pushes to GCP Artifact Registry</strong> using workload identity federation (same principle, no keys)</p>
</li>
<li><p><strong>Signs both images with Cosign</strong> so both clusters can verify the image hasn't been tampered with</p>
</li>
<li><p><strong>Updates the Kustomize overlays</strong> with the new image tag and commits back to the repo, which triggers ArgoCD sync on both clouds</p>
</li>
</ol>
<p>The image tag format is <code>&lt;short-sha&gt;-&lt;unix-timestamp&gt;</code> to guarantee uniqueness and traceability. Both registries always have identical images. If one registry becomes unreachable, the other cloud still has its copy.</p>
<h3>Step 5: CockroachDB Multi-Cloud Deployment</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/base/cockroachdb-multicloud.yaml"><code>k8s/base/cockroachdb-multicloud.yaml</code></a></p>
<p>CockroachDB runs as a StatefulSet with 3 nodes per cloud (6 total). The replication zone is configured with locality-aware constraints: at least 2 replicas on AWS, at least 2 on GCP. The database is set to <code>SURVIVE REGION FAILURE</code>, meaning it maintains consensus even if an entire cloud goes offline.</p>
<p>Each node advertises its locality as <code>cloud=aws,region=us-east-1</code> or <code>cloud=gcp,region=us-central1</code>. CockroachDB uses this to make intelligent placement decisions, keeping reads local while ensuring writes are replicated cross-cloud before acknowledging.</p>
<p>The persistent volumes use 100Gi with <code>ReadWriteOnce</code> access. In production, you'll want to tune <code>--cache</code> and <code>--max-sql-memory</code> based on your node size.</p>
<h3>Step 6: Cross-Cloud Object Storage Sync</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/src/storage-sync/sync-worker.py"><code>src/storage-sync/sync-worker.py</code></a></p>
<p>This is an event-driven replication worker, not a cron job. It listens for S3 event notifications (via SQS or EventBridge) and replicates each object to GCS in real-time with MD5 integrity verification. On startup, it runs a full bucket reconciliation to catch anything that was missed.</p>
<p>Deletions are mirrored too. If an object is removed from S3, the worker removes it from GCS. The full reconciliation compares object sizes and only re-syncs what's actually different, so it's safe to run repeatedly without hammering your bandwidth.</p>
<p>Deploy this as a Kubernetes Deployment on both clouds. The AWS instance handles S3→GCS direction. A mirrored instance on GCP handles GCS→S3. Bidirectional sync with conflict resolution by last-write-wins.</p>
<h3>Step 7: Keycloak Multi-Cloud Identity Setup</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/k8s/base/keycloak.yaml"><code>k8s/base/keycloak.yaml</code></a></p>
<p>Keycloak runs as a 2-replica Deployment backed by CockroachDB (the same database that's already replicated across clouds). This means Keycloak on GCP automatically has the same user database, sessions, and realm configurations as Keycloak on AWS. No separate identity sync needed.</p>
<p>Cluster discovery between Keycloak instances uses a headless Kubernetes service with JGroups DNS_PING. The Infinispan cache is set to <code>kubernetes</code> stack mode so session data is shared across replicas within each cloud.</p>
<p>Both Keycloak instances serve <code>auth.shopglobal.com</code>, and Cloudflare routes auth traffic the same way it routes application traffic. If AWS goes down, users authenticate against GCP's Keycloak, which has the exact same data because it reads from the same CockroachDB cluster.</p>
<p><strong>This is why CockroachDB was chosen over simpler database options.</strong> It's not just for application data; it's the shared backbone that makes identity, sessions, and secrets work cross-cloud without custom sync pipelines.</p>
<hr />
<h2>Chaos Blackout Testing Strategy</h2>
<p>Architecture without testing is fiction. Here's how you prove it works.</p>
<h3>Blackout Drill Script</h3>
<p>📁 <a href="https://github.com/SubhanshuMG/survival-grade-infra/blob/main/chaos/blackout-test.sh"><code>chaos/blackout-test.sh</code></a></p>
<p>This is a 6-phase automated drill that simulates a complete primary cloud failure and grades your architecture. Run it quarterly. Make it policy, not optional.</p>
<p><strong>Phase 1: Baseline Measurement</strong> Records current latency, writes test data to the primary cloud, and verifies both endpoints are healthy before the drill begins. If either cloud is already unhealthy, the script aborts.</p>
<p><strong>Phase 2: Simulate Primary Cloud Failure</strong> Disables the AWS pool in Cloudflare's load balancer via API. This is identical to what happens during a real outage from the DNS perspective. Traffic has nowhere to go on the primary side.</p>
<p><strong>Phase 3: Measure Failover</strong> Polls the application endpoint every second, counting how long until a 200 response comes back (now served from GCP). This is your actual, measured RTO. Not a theoretical number from a spreadsheet. The real thing.</p>
<p><strong>Phase 4: Data Integrity Check</strong> Reads back the test data that was written <em>before</em> the failover. If it's accessible on the secondary cloud, your data replication is working. Then writes new data on the secondary and reads it back to confirm write continuity.</p>
<p><strong>Phase 5: Verify Auth Flow</strong> Hits the Keycloak OIDC discovery endpoint to confirm authentication is working on the surviving cloud. If users can't log in, surviving the outage doesn't matter.</p>
<p><strong>Phase 6: Restore Primary</strong> Re-enables the AWS pool, waits for health checks to pass, and confirms the primary cloud is back in the rotation.</p>
<p>The script outputs a final report card and sends results to Slack:</p>
<pre><code class="language-markdown">═══════════════════════════════════════════
  BLACKOUT DRILL RESULTS
═══════════════════════════════════════════

  RTO (Recovery Time):    42.37s
  RPO (Data Integrity):   PASS
  Auth Continuity:        PASS
  Post-Restore Health:    PASS

  GRADE: A. Survival-grade resilience confirmed
</code></pre>
<p><strong>Grading criteria:</strong></p>
<ul>
<li><p><strong>Grade A:</strong> RTO under 60 seconds + RPO pass</p>
</li>
<li><p><strong>Grade B:</strong> RTO under 300 seconds</p>
</li>
<li><p><strong>Grade F:</strong> Everything else</p>
</li>
</ul>
<h3>What This Tests</h3>
<table>
<thead>
<tr>
<th>Test Area</th>
<th>How It's Validated</th>
</tr>
</thead>
<tbody><tr>
<td><strong>DNS Failover</strong></td>
<td>Cloudflare pool disable via API</td>
</tr>
<tr>
<td><strong>Compute Failover</strong></td>
<td>All traffic shifts to GCP automatically</td>
</tr>
<tr>
<td><strong>Data Integrity</strong></td>
<td>Pre-failover data read from secondary</td>
</tr>
<tr>
<td><strong>Write Continuity</strong></td>
<td>Post-failover write + read on secondary</td>
</tr>
<tr>
<td><strong>Auth Continuity</strong></td>
<td>Keycloak OIDC discovery endpoint check</td>
</tr>
<tr>
<td><strong>Recovery</strong></td>
<td>Primary re-enable + health verification</td>
</tr>
</tbody></table>
<p>Run quarterly. No exceptions. "Reliability without testing is fiction."</p>
<hr />
<h2>Resilience Maturity Model</h2>
<p>Use this to assess where you are and where you need to be:</p>
<table>
<thead>
<tr>
<th>Tier</th>
<th>Architecture</th>
<th>What It Survives</th>
<th>What Kills It</th>
<th>Typical Org</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tier 0</strong></td>
<td>Multi-AZ</td>
<td>Single AZ failure</td>
<td>Region outage, control plane failure</td>
<td>Startups, MVPs</td>
</tr>
<tr>
<td><strong>Tier 1</strong></td>
<td>Multi-Region</td>
<td>Region failure</td>
<td>Provider-wide outage, DNS failure</td>
<td>Growing SaaS</td>
</tr>
<tr>
<td><strong>Tier 2</strong></td>
<td>Multi-Cloud Passive</td>
<td>Provider outage (with manual failover)</td>
<td>Slow RTO, data loss during cutover</td>
<td>Enterprise</td>
</tr>
<tr>
<td><strong>Tier 3</strong></td>
<td>Multi-Cloud Active-Active</td>
<td>Provider outage (automatic)</td>
<td>Geopolitical isolation, regulatory block</td>
<td>Mission-Critical</td>
</tr>
<tr>
<td><strong>Tier 4</strong></td>
<td>Sovereign Split</td>
<td>Everything above + data sovereignty</td>
<td>Nation-state level infrastructure attack</td>
<td>Global Enterprise, Defense</td>
</tr>
</tbody></table>
<h3>New Metrics: Your Resilience Scorecard</h3>
<p>I'm introducing three metrics that I believe every organization should track:</p>
<p><strong>Cloud Exit Time (CET)</strong> <em>How long to fully operate from an alternate provider after your primary disappears.</em></p>
<ul>
<li><p>Tier 0-1 organizations: CET is typically "unknown" or "weeks"</p>
</li>
<li><p>Tier 3-4 organizations: CET should be under 2 minutes</p>
</li>
</ul>
<p><strong>Control Plane Dependency Index (CPDI)</strong> <em>What percentage of your infrastructure depends on a single provider's APIs?</em></p>
<ul>
<li><p>Count every service: IAM, DNS, secrets, monitoring, logging, CI/CD, container registry</p>
</li>
<li><p>If CPDI &gt; 70%, you have a single-cloud architecture wearing a multi-cloud costume</p>
</li>
</ul>
<p><strong>Data Replication Confidence Score (DRCS)</strong> <em>Measured empirically via quarterly blackout drills, not theoretical.</em></p>
<ul>
<li><p>DRCS = (Successful data reads post-failover / Total data written pre-failover) × 100</p>
</li>
<li><p>If you haven't tested it, your DRCS is 0%. Not "assumed 99%." Zero.</p>
</li>
</ul>
<hr />
<h2>The Strategic Close</h2>
<p>Cloud resilience is no longer a technical problem. It's a <strong>geopolitical and architectural</strong> problem.</p>
<p>The companies that will survive the next decade aren't the ones with the most availability zones. They're the ones with <strong>provider independence, data sovereignty, and the discipline to test their survival assumptions quarterly</strong>.</p>
<p>Here's what you should do this week:</p>
<ol>
<li><p><strong>Calculate your CPDI.</strong> List every cloud-specific service you depend on. Be honest.</p>
</li>
<li><p><strong>Define your CET.</strong> If your primary cloud disappeared right now, how long until you're operational elsewhere? If you don't know, the answer is "too long."</p>
</li>
<li><p><strong>Schedule your first blackout drill.</strong> Even if it's just a tabletop exercise. Start somewhere.</p>
</li>
<li><p><strong>Move DNS off your primary cloud.</strong> This is the single highest-impact, lowest-effort change you can make today.</p>
</li>
</ol>
<p>The architecture in this article isn't theoretical. It's what separates companies that survive outages from companies that make headlines during them.  </p>
<hr />
<blockquote>
<h3><strong>If your cloud provider disappeared tomorrow, would your system survive?</strong></h3>
<p>If not, you now have the blueprint to fix it.</p>
</blockquote>
<hr />
]]></content:encoded></item><item><title><![CDATA[How Attackers Bypass Your “Compliant” CI/CD Pipeline (And How to Redesign It)]]></title><description><![CDATA[Modern CI/CD pipelines are often treated as untouchable “trusted builds” – locked down by code review and best practices but that trust is a myth. A pipeline is a prime attack surface, containing ever]]></description><link>https://blogs.subhanshumg.com/how-attackers-bypass-your-compliant-cicd-pipeline-and-how-to-redesign-it</link><guid isPermaLink="true">https://blogs.subhanshumg.com/how-attackers-bypass-your-compliant-cicd-pipeline-and-how-to-redesign-it</guid><category><![CDATA[Devops]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Security]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[supply chain]]></category><category><![CDATA[cicd]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[technology]]></category><category><![CDATA[github-actions]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 08 Feb 2026 23:43:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770594965405/937d8af0-7ff1-4055-bea7-09e82d2fcdf2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern CI/CD pipelines are often treated as untouchable “trusted builds” – locked down by code review and best practices but that trust is a myth. A pipeline is a <strong>prime attack surface</strong>, containing everything an attacker needs: deployment keys, API tokens, container registries, test artifacts and implicit trust that code executed in the pipeline is safe. Attackers know this. Real incidents (SolarWinds, Codecov, large GitHub Actions supply-chain campaigns) have shown how a compromised build system becomes a stealthy delivery vehicle for malware and secret exfiltration. In short: <em>a “compliant” pipeline can still be subverted from the inside.</em></p>
<p>This post takes the <strong>red-team viewpoint</strong>: real bypass paths (compromised runners, poisoned caches, dependency confusion, indirect pipeline poisoning), why they work and a concrete, cloud-agnostic redesign for GitHub Actions running Python microservices. You’ll get code, an implementation plan, testing approaches and a gripping real-world-inspired story to help this land with impact.</p>
<hr />
<h1>Quick summary (TL;DR for the impatient)</h1>
<ul>
<li><p>Treat the build environment as untrusted. Ephemeral runners, least privilege and strict workflow triggers reduce risk.</p>
</li>
<li><p>Caches, third-party actions and dependencies are blind trust boundaries that validate, sign or remove them for critical builds.</p>
</li>
<li><p>Use attestations (SLSA/in-toto), signed artifacts (cosign/Sigstore), SBOMs and reproducible builds to prove an artifact’s provenance.</p>
</li>
<li><p>Harden runners, avoid running unreviewed code on privileged runners and adopt runtime detection (Falco, Trivy) for defense in depth.</p>
</li>
</ul>
<hr />
<h1>Red-team POV: real bypass paths</h1>
<h2>Attack Path #1 - <strong>Compromised Runners (Persistent Backdoors)</strong></h2>
<p>Self-hosted or shared runners are especially dangerous. If an attacker can make the runner execute untrusted code (through a malicious PR or specially crafted workflow trigger), they can persist on the host, harvest environment variables and secrets, tamper with build outputs and later push poisoned artifacts. Even ephemeral GitHub-hosted runners are safer only because they’re ephemeral; self-hosted runners can survive malicious actions and provide an attacker with ongoing access.</p>
<p><strong>What attackers do:</strong> exfiltrate <code>GITHUB_TOKEN</code> or other secrets, install persistence (systemd/cron), modify artifacts mid-build or pivot into internal networks reachable from the runner.</p>
<p><strong>Why it works:</strong> runners have access to code, build caches and sometimes cloud credentials. Workflows often run scripts and test suites with no strong isolation between "build logic" and "developer-supplied code".</p>
<h2>Attack Path #2 - <strong>Poisoned Caches &amp; Artifacts</strong></h2>
<p>Build caches and artifact storage are performance shortcuts and trust shortcuts. Many cache systems will extract an archive and place its contents into the workspace without validating each file’s origin or integrity. An attacker with temporary access can push a crafted cache archive that overwrites files or injects malicious files that later get executed.</p>
<p><strong>What attackers do:</strong> obtain cache tokens or a write path, upload a malicious tarball (containing backdoors or altered dependencies), then wait for downstream builds to restore that cache and run the tainted content.</p>
<p><strong>Why it works:</strong> cache restore steps and some action-marketplace items assume the cache is benign; there is little to no path-level validation when extracting.</p>
<h2>Attack Path #3 - <strong>Dependency Confusion &amp; Malicious Packages</strong></h2>
<p>If your pipeline pulls dependencies from remote registries (PyPI, npm, etc.) using ambiguous names, attackers can publish a public package that shadows your internal one. When building scripts or tests <code>pip install mycorp-utils</code>, the public malicious package can be fetched and executed sometimes via post-install hooks.</p>
<p><strong>What attackers do:</strong> publish a malicious package with the same name as an internal package or craft a trojanized version of a transitive dependency.</p>
<p><strong>Why it works:</strong> developers or CI scripts use permissive installation rules and don’t pin exact sources or hashes.</p>
<h2>Attack Path #4 - <strong>Indirect Pipeline Poisoning (PPE)</strong></h2>
<p>Attackers can change artifacts that the pipeline executes without modifying the workflow itself. For example, if the pipeline runs <code>pytest</code>, an attacker who commits a malicious test will have their code executed by CI. The YAML looks clean; the real problem is that the runnable artifacts called by the YAML are controlled by code that may not be strictly reviewed.</p>
<p><strong>What attackers do:</strong> commit scripts, tests or makefiles that contain exfiltration or persistence code; then the pipeline runs them as part of normal testing or packaging.</p>
<p><strong>Why it works:</strong> workflows invoke project scripts without verifying the content or authorship of those scripts.</p>
<hr />
<h1>Why this goes viral (and scares execs)</h1>
<ul>
<li><p><strong>Hacker framing:</strong> the attack is easy to explain as “the build baked a backdoor” It’s visceral and scary.</p>
</li>
<li><p><strong>Security fear + practical fixes:</strong> readers want practical, implementable mitigations; this post gives them them from baked-in platform config to SLSA-level attestations.</p>
</li>
<li><p><strong>Observable risk:</strong> artifacts are shipped and can be used to breach customers and partners; the stakes are huge which drives virality.</p>
</li>
</ul>
<hr />
<h1>Concrete case study: Python microservices on GitHub Actions (cloud-agnostic)</h1>
<p>Scenario: two microservices (<code>service-auth</code> and <code>service-data</code>) packaged as Docker containers; GitHub repo with <code>main</code> branch and PR workflow protections. The pipeline builds, tests, pushes images and deploys to a Kubernetes cluster (any cloud or on-prem).</p>
<h3>Repo layout (example)</h3>
<pre><code class="language-plaintext">├── .github/workflows/ci-cd.yml
├── service-auth/
│   ├── Dockerfile
│   ├── app.py
│   ├── requirements.txt
│   └── tests/
├── service-data/
│   ├── Dockerfile
│   ├── main.py
│   ├── requirements.txt
│   └── tests/
└── k8s/
    ├── deployment-auth.yaml
    └── deployment-data.yaml
</code></pre>
<h3>Vulnerable pipeline (what an attacker abuses)</h3>
<p>A naive workflow:</p>
<pre><code class="language-yaml">name: CI/CD Pipeline
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myregistry/service-auth:latest ./service-auth
      - run: docker build -t myregistry/service-data:latest ./service-data
      - run: docker push myregistry/service-auth:latest
      - run: docker push myregistry/service-data:latest
      - run: kubectl apply -f k8s/
</code></pre>
<p><strong>Weaknesses exploited:</strong> runs-on shared environment, no scanning, no signing, caches or transitive dependencies unchecked, <code>main</code> branch pushes may accept automated commits and tests/scripts are executed without provenance.</p>
<hr />
<h1>Redesigned secure pipeline; principles first</h1>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770593765729/066a345e-0a45-4a16-b32c-9953d812b80b.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><strong>Ephemeral &amp; Isolated Runners</strong></p>
<ul>
<li><p>Prefer GitHub-hosted runners (<code>runs-on: ubuntu-latest</code>) for sensitive jobs.</p>
</li>
<li><p>If self-hosted runners are necessary, host them in isolated ephemeral VMs/containers with daily rebuilds and no network access to internal systems.</p>
</li>
</ul>
</li>
<li><p><strong>Least Privilege &amp; Narrow Permissions</strong></p>
<ul>
<li><p>Use the <code>permissions:</code> field in GitHub Actions to restrict tokens.</p>
</li>
<li><p>Avoid exposing secrets to PRs from forks. Require maintainers to trigger privileged jobs manually or via protected labels.</p>
</li>
</ul>
</li>
<li><p><strong>Pin &amp; Verify Action Versions</strong></p>
<ul>
<li><p>Pin marketplace actions to immutable commits or release SHAs (avoid floating <code>@v3</code> where possible).</p>
</li>
<li><p>Maintain an allowlist of trusted actions.</p>
</li>
</ul>
</li>
<li><p><strong>No Blind Cache Restore for Critical Paths</strong></p>
<ul>
<li><p>Avoid using cache restore for anything that could alter build behaviour in security-sensitive pipelines or scope cache keys narrowly and validate contents.</p>
</li>
<li><p>Consider disabling caches for the critical path; accept slower builds for stronger security.</p>
</li>
</ul>
</li>
<li><p><strong>Dependency Verification</strong></p>
<ul>
<li><p>Pin dependencies and use hash verification in <code>requirements.txt</code> (<code>pip</code> supports <code>--require-hashes</code>).</p>
</li>
<li><p>Use private package registries or package proxying (e.g., mirror PyPI internally).</p>
</li>
<li><p>Implement dependency reviews for new packages.</p>
</li>
</ul>
</li>
<li><p><strong>Artifact Signing &amp; Attestations</strong></p>
<ul>
<li><p>Sign built images with <code>cosign</code> (Sigstore) and publish signatures to a transparency log (Rekor).</p>
</li>
<li><p>Generate SBOMs and store them alongside the artifact.</p>
</li>
<li><p>Adopt SLSA/in-toto-style attestations tying the artifact back to the exact build inputs.</p>
</li>
</ul>
</li>
<li><p><strong>Split Build &amp; Deploy</strong></p>
<ul>
<li><p>Build artifacts in one pipeline and store them as signed immutable images (by digest).</p>
</li>
<li><p>Run a separate, approved release pipeline that only takes signed artifacts and deploys them.</p>
</li>
</ul>
</li>
<li><p><strong>Runtime Detection</strong></p>
<ul>
<li>Use runtime security (Falco) and container scanning (Trivy) to detect anomalies in running workloads.</li>
</ul>
</li>
</ol>
<hr />
<h1>Implementation: step-by-step</h1>
<p>Below is a focused, practical pipeline that implements the secure recommendations. It’s <em>opinionated</em> and intended as a starting point.</p>
<h3>A - Harden GitHub Actions workflow (ci-cd.yml)</h3>
<pre><code class="language-yaml">name: Secure CI/CD

on:
  push:
    branches: [ main ]

permissions:
  contents: read
  id-token: write       # for OIDC token exchange to cloud (no long-lived creds)
  packages: write

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    name: Build, Test, Sign and Attest
    runs-on: ubuntu-latest
    environment: ci
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install build deps
        run: |
          python -m pip install --upgrade pip setuptools wheel

      - name: Run unit tests (isolated)
        run: |
          python -m pytest ./service-auth/tests -q
          python -m pytest ./service-data/tests -q

      - name: Build container images (immutable tag by digest)
        run: |
          docker build -t myregistry/service-auth:ci-${{ github.sha }} ./service-auth
          docker build -t myregistry/service-data:ci-${{ github.sha }} ./service-data

      - name: Scan images with Trivy
        uses: aquasecurity/trivy-action@v1
        with:
          image-ref: |
            myregistry/service-auth:ci-${{ github.sha }}
            myregistry/service-data:ci-${{ github.sha }}

      - name: Push images to registry
        uses: docker/build-push-action@v3
        with:
          push: true
          tags: |
            myregistry/service-auth:ci-${{ github.sha }}
            myregistry/service-data:ci-${{ github.sha }}

      - name: Sign images with cosign (keyless via OIDC)
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          cosign sign --keyless myregistry/service-auth:ci-${{ github.sha }}
          cosign sign --keyless myregistry/service-data:ci-${{ github.sha }}

      - name: Create SBOMs
        run: |
          # example using syft
          syft packages:docker:myregistry/service-auth:ci-\({{ github.sha }} -o spdx-json &gt; sbom-auth-\){{ github.sha }}.json
          syft packages:docker:myregistry/service-data:ci-\({{ github.sha }} -o spdx-json &gt; sbom-data-\){{ github.sha }}.json

      - name: Upload artifacts (SBOM &amp; attestations)
        uses: actions/upload-artifact@v4
        with:
          name: sboms-and-attestations-${{ github.sha }}
          path: |
            sbom-auth-${{ github.sha }}.json
            sbom-data-${{ github.sha }}.json

      - name: Create in-toto attestation
        run: |
          # pseudo-command tie it into your in-toto workflow
          in-toto-record --step build --materials . --products "myregistry/service-auth@sha256:..." --subject "sha256:${{ github.sha }}"

  promote:
    name: Promote signed artifacts to production (manual &amp; audited)
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://your-k8s-console.example
    permissions:
      contents: read
      id-token: write
      packages: read
    steps:
      - name: Download SBOM &amp; Attestations
        uses: actions/download-artifact@v4
        with:
          name: sboms-and-attestations-${{ needs.build.outputs.sha }}

      - name: Verify cosign signatures &amp; Rekor
        run: |
          cosign verify --keyless myregistry/service-auth:ci-${{ needs.build.outputs.sha }}
          cosign verify --keyless myregistry/service-data:ci-${{ needs.build.outputs.sha }}

      - name: Deploy (only signed images by digest)
        env:
          KUBECONFIG: ${{ secrets.KUBECONFIG }}
        run: |
          kubectl set image deployment/auth auth=myregistry/service-auth@sha256:&lt;digest&gt;
          kubectl set image deployment/data data=myregistry/service-data@sha256:&lt;digest&gt;
          kubectl rollout status deployment/auth
          kubectl rollout status deployment/data
</code></pre>
<p><strong>Notes on the workflow above</strong></p>
<ul>
<li><p>We build once, sign the images with <code>cosign</code> using keyless OIDC (no long-lived signing keys required), generate SBOMs (using <code>syft</code>) and upload attestation artifacts for later verification.</p>
</li>
<li><p>Deployment is a separate job that verifies signatures before applying changes. This prevents “build-time tampering” from automatically reaching production.</p>
</li>
<li><p>Use <code>fetch-depth: 0</code> so the build has full history for reproducibility checks where necessary.</p>
</li>
</ul>
<h3>B - Lockdown dependency installs (requirements.txt with hashes)</h3>
<p>Use pip’s <code>--require-hashes</code> option to guarantee packages haven’t changed:</p>
<pre><code class="language-plaintext"># requirements.txt
flask==2.0.1 \
    --hash=sha256:abcdef...
requests==2.25.1 \
    --hash=sha256:123456...
</code></pre>
<p>Install with: <code>pip install --require-hashes -r requirements.txt</code></p>
<p>Generate hashes with <code>pip-compile</code> or <code>pip hash</code> when creating a lockfile.</p>
<h3>C - Limit runner reach &amp; secrets exposure</h3>
<ul>
<li><p>Use the <code>permissions:</code> block to limit <code>GITHUB_TOKEN</code> scopes.</p>
</li>
<li><p>For any job that uses a secret, prefer OIDC-based short-lived tokens (GitHub OIDC) to obtain cloud credentials dynamically, rather than storing long-lived secrets in GitHub Secrets.</p>
</li>
<li><p>Disable secret access in workflows triggered by untrusted events.</p>
</li>
</ul>
<h3>D - Protect caches or avoid them for sensitive builds</h3>
<ul>
<li><p>If you must cache, scope the cache keys narrowly and validate the contents of restored caches before use.</p>
</li>
<li><p>For the highest assurance builds, disable cache restore and accept longer build times.</p>
</li>
</ul>
<h3>E - Runtime detection</h3>
<ul>
<li><p>Add Falco or similar runtime detection inside the cluster to watch for suspicious syscalls, spawned shells inside containers or unexpected outbound traffic:</p>
<ul>
<li>Falco rules can alert on <code>curl</code>/<code>wget</code> from unexpected processes or writes to <code>/etc/cron*</code> or creation of suspicious network connections.</li>
</ul>
</li>
</ul>
<hr />
<h1>Tests and validation techniques</h1>
<ol>
<li><p><strong>Red-team your CI</strong></p>
<ul>
<li>Spin up a disposable repo that simulates forks and PR flows; test whether a crafted PR can exfiltrate secrets or alter caches. Attempt to poison a cache and verify whether a later build restores malicious files. This is a pragmatic way to test proofs-of-concept.</li>
</ul>
</li>
<li><p><strong>Reproducible-build checks</strong></p>
<ul>
<li>Rebuild the same commit twice in different clean environments and compare digests/hashes. If artifacts differ, investigate non-determinism sources.</li>
</ul>
</li>
<li><p><strong>Attestation verification</strong></p>
<ul>
<li>For each released artifact, verify cosign signatures and Rekor entries. Write an automated check that fails CI if the Rekor proof is missing.</li>
</ul>
</li>
<li><p><strong>SBOM and vulnerability scanning</strong></p>
<ul>
<li>Scan generated SBOMs for known CVEs and compare SBOMs across builds. Use Trivy as part of CI and a daily scheduled job.</li>
</ul>
</li>
<li><p><strong>Runtime anomaly detection</strong></p>
<ul>
<li>Deploy Falco rules in staging and production. Trigger synthetic anomalies and confirm logging/alerting pipelines work.</li>
</ul>
</li>
<li><p><strong>Periodic key &amp; secret audits</strong></p>
<ul>
<li>Rotate any secret that ever was accessible to runners. Keep a tight inventory of service accounts with deploy permissions.</li>
</ul>
</li>
</ol>
<hr />
<h1>Real-world life story (Inspired by real incidents)</h1>
<h2><strong>The Midnight Commit: How one build led to a breach</strong></h2>
<p><em>This story is inspired by real supply chain incidents (publicly reported SolarWinds and Codecov investigations and later GitHub Actions campaigns). Names and some details have been fictionalized for clarity.</em></p>
<p>It was 02:08 local time when an engineer on the on-call rotation noticed a small alert: the outbound network firewall had logged an odd POST to an external IP from a CI runner. The runner had just completed a nightly build of the company’s payment microservice. At first it looked like a failed telemetry call but the payload contained a blob that, when decoded, revealed multiple environment-like variables.</p>
<p>A frantic investigation followed. The team discovered that three weeks prior, an innocuous cache-key collision had allowed a malicious archive to be stored in the build cache. A contractor had merged a tiny patch into a tooling repo that populated a tarball on the main branch; the cache key was shared across repos. The attacker’s archive contained a small Python module that would, under CI execution, scan the environment for tokens and POST them to an external collector.</p>
<p>How did it evade detection? The malicious code sat inside a file that only test runners executed; the normal code-review checks focused on YAML workflow changes and missed a new test file in a deep <code>tests/</code> package. The organization’s CI used caches aggressively and had a few self-hosted runners accessible from the corporate network. The attacker had exploited a misconfigured pull-request trigger on a low-privilege repo to place the payload into the cache, then waited for the payment-service build to restore the cache. When the collector received the first tokens, the attacker quickly used them to pull private images and access a staging environment. From there, lateral movement found a misconfigured database user and a handful of customer records were exposed to a breach that could have been orders of magnitude worse.</p>
<p>After-action findings catalyzed fast change: the company turned off shared caches for critical pipelines, replaced self-hosted runners with ephemeral hosted runners for production jobs, introduced SBOMs and cosign signatures into the build flow and validated that only signed artifacts could be promoted to production. The next year they ran a red-team exercise that recreated the attack and this time the detection pipeline caught the exfil attempt in minutes.</p>
<p>The verdict was simple: <em>it wasn’t the test suite that was the problem; it was the trust they had given to automation.</em> Once they treated the build as untrusted, the attack surface shrank dramatically.</p>
<hr />
<h1>Conclusion: actionable checklist</h1>
<ul>
<li><p>Use ephemeral GitHub-hosted runners for sensitive jobs; isolate any self-hosted runners.</p>
</li>
<li><p>Pin and audit GitHub Actions and third-party actions. Use immutable references where possible.</p>
</li>
<li><p>Disable or narrowly scope caches for sensitive builds; validate cache contents.</p>
</li>
<li><p>Pin dependency versions and use hash verification; prefer private mirrors.</p>
</li>
<li><p>Sign artifacts (cosign/Sigstore), publish attestations (Rekor) and adopt SLSA/in-toto where practical.</p>
</li>
<li><p>Split build and promotion pipelines; only signed artifacts should be promoted.</p>
</li>
<li><p>Add runtime detection (Falco) and container scanning (Trivy).</p>
</li>
<li><p>Run red-team CI tests: simulate cache poisoning, compromised runner and dependency confusion.</p>
</li>
</ul>
<hr />
<h2>References</h2>
<ol>
<li><p><a href="https://www.solarwinds.com/blog/new-findings-from-our-investigation-of-sunburst">SolarWinds: New Findings From Our Investigation of SUNBURST (SolarWinds blog)</a></p>
</li>
<li><p><a href="https://www.wired.com/story/the-untold-story-of-solarwinds-the-boldest-supply-chain-hack-ever">“The Untold Story of the Boldest Supply-Chain Hack Ever” (Wired) - deep reporting on SolarWinds</a></p>
</li>
<li><p><a href="https://about.codecov.io/apr-2021-post-mortem/">Codecov - Post-Mortem / Root Cause Analysis (April 2021)</a></p>
</li>
<li><p><a href="https://www.sonatype.com/blog/what-you-need-to-know-about-the-codecov-incident-a-supply-chain-attack-gone-undetected-for-2-months">Sonatype - What you need to know about the Codecov incident</a></p>
</li>
<li><p><a href="https://codeql.github.com/codeql-query-help/actions/actions-cache-poisoning-direct-cache/">CodeQL / GitHub Security; Cache poisoning technique description (actions cache poisoning)</a></p>
</li>
<li><p><a href="https://adnanthekhan.com/2024/05/06/the-monsters-in-your-build-cache-github-actions-cache-poisoning/">Adnan Khan - “The monsters in your build cache: GitHub Actions cache poisoning” (analysis &amp; PoC)</a></p>
</li>
<li><p><a href="https://blog.gitguardian.com/ghostaction-campaign-3-325-secrets-stolen/">GitGuardian / CSO coverage: GhostAction campaign and GitHub Actions supply-chain campaigns</a></p>
</li>
<li><p><a href="https://medium.com/@alex.birsan/dependency-confusion-how-i-hacked-into-apple-microsoft-and-dozens-of-other-companies-4a5d60fec610">Alex Birsan - “Dependency Confusion” (how public packages can shadow internal ones)</a></p>
</li>
<li><p><a href="https://docs.github.com/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-google-cloud-platform">GitHub OIDC / Workload Identity Federation guides (GCP example)</a></p>
</li>
<li><p><a href="https://www.techradar.com/pro/why-every-ciso-should-demand-a-comprehensive-software-bill-of-materials-sbom">Why SBOMs matter (why every CISO should demand SBOMs)</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[The $0 Compliance Stack]]></title><description><![CDATA[The Lie We’re All Sold

“You need expensive GRC tools to pass enterprise audits.”

Reality:

Auditors don’t care about tools

They care about controls, traceability and evidence


Compliance ≠ Softwar]]></description><link>https://blogs.subhanshumg.com/the-zero-dollar-compliance-stack</link><guid isPermaLink="true">https://blogs.subhanshumg.com/the-zero-dollar-compliance-stack</guid><category><![CDATA[Devops]]></category><category><![CDATA[audit]]></category><category><![CDATA[compliance ]]></category><category><![CDATA[ISO 27001]]></category><category><![CDATA[PCI DSS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Security]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Mon, 26 Jan 2026 20:38:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769457995177/d3655bc8-606f-446d-8e85-efc19a4639c9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The Lie We’re All Sold</h1>
<blockquote>
<p>“You need expensive GRC tools to pass enterprise audits.”</p>
</blockquote>
<p>Reality:</p>
<ul>
<li><p>Auditors don’t care about <strong>tools</strong></p>
</li>
<li><p>They care about <strong>controls, traceability and evidence</strong></p>
</li>
</ul>
<p><strong>Compliance ≠ Software</strong><br /><strong>Compliance = Verifiable system behaviour</strong></p>
<hr />
<h1>Core Philosophy: Compliance-by-Construction</h1>
<p>Instead of:</p>
<ul>
<li><p>Manual screenshots</p>
</li>
<li><p>Jira tickets</p>
</li>
<li><p>Excel risk registers</p>
</li>
</ul>
<p>Design systems where:</p>
<ul>
<li><p><strong>Evidence is produced automatically</strong></p>
</li>
<li><p><strong>Controls are enforced at runtime</strong></p>
</li>
<li><p><strong>Audits become read-only queries</strong></p>
</li>
</ul>
<hr />
<h1>The Compliance Stack</h1>
<h2>🔐 Identity &amp; Access (ISO A.5, A.9 | PCI 7, 8)</h2>
<table>
<thead>
<tr>
<th><code>Control Goal</code></th>
<th><code>Open Source</code></th>
</tr>
</thead>
<tbody><tr>
<td>Central identity</td>
<td><strong>Keycloak</strong></td>
</tr>
<tr>
<td>MFA / SSO</td>
<td>Keycloak + WebAuthn</td>
</tr>
<tr>
<td>Service identity</td>
<td>SPIFFE / SPIRE</td>
</tr>
<tr>
<td>RBAC enforcement</td>
<td>Kubernetes native RBAC</td>
</tr>
</tbody></table>
<h2>📦 Source Control &amp; CI/CD (ISO A.8, A.12 | PCI 6)</h2>
<table>
<thead>
<tr>
<th><code>Control</code></th>
<th><code>Open Source</code></th>
</tr>
</thead>
<tbody><tr>
<td>Git integrity</td>
<td>Git + signed commits</td>
</tr>
<tr>
<td>CI/CD</td>
<td>GitHub Actions / GitLab CI</td>
</tr>
<tr>
<td>Secrets</td>
<td>HashiCorp Vault</td>
</tr>
<tr>
<td>IaC scanning</td>
<td>Checkov</td>
</tr>
<tr>
<td>SAST</td>
<td>Semgrep</td>
</tr>
</tbody></table>
<h2>🐳 Runtime &amp; Infrastructure (ISO A.12, A.13 | PCI 2, 10)</h2>
<table>
<thead>
<tr>
<th><code>Area</code></th>
<th><code>Open Source</code></th>
</tr>
</thead>
<tbody><tr>
<td>Orchestration</td>
<td>Kubernetes</td>
</tr>
<tr>
<td>Network policy</td>
<td>Cilium</td>
</tr>
<tr>
<td>Runtime security</td>
<td>Falco</td>
</tr>
<tr>
<td>Admission control</td>
<td>Kyverno</td>
</tr>
<tr>
<td>eBPF telemetry</td>
<td>Cilium + Falco</td>
</tr>
</tbody></table>
<h2>📊 Logging, Monitoring &amp; Evidence (ISO A.12, A.16 | PCI 10)</h2>
<table>
<thead>
<tr>
<th><code>Need</code></th>
<th><code>Open Source</code></th>
</tr>
</thead>
<tbody><tr>
<td>Logs</td>
<td>Loki</td>
</tr>
<tr>
<td>Metrics</td>
<td>Prometheus</td>
</tr>
<tr>
<td>Traces</td>
<td>OpenTelemetry</td>
</tr>
<tr>
<td>Dashboards</td>
<td>Grafana</td>
</tr>
<tr>
<td>SIEM-lite</td>
<td>Wazuh</td>
</tr>
</tbody></table>
<h2>📁 Evidence Storage (ISO A.7, A.18 | PCI 12)</h2>
<table>
<thead>
<tr>
<th><code>Requirement</code></th>
<th><code>Open Source</code></th>
</tr>
</thead>
<tbody><tr>
<td>WORM storage</td>
<td>MinIO (Object Lock)</td>
</tr>
<tr>
<td>Retention policies</td>
<td>Lifecycle rules</td>
</tr>
<tr>
<td>Audit trails</td>
<td>Hash-based integrity</td>
</tr>
</tbody></table>
<hr />
<h1>Real-World System Design</h1>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769459435741/4ccf3e65-5ece-4a23-ae31-a407ef14cfe5.png" alt="" style="display:block;margin:0 auto" />

<p>Key Insight: <em><strong><mark>Auditors never touch production; they interrogate evidence APIs</mark></strong></em></p>
<hr />
<h1>ISO-27001 &amp; PCI-DSS Control Mapping</h1>
<h3>Example: ISO-27001 A.12.4 (Logging)</h3>
<table>
<thead>
<tr>
<th><code>Requirement</code></th>
<th><code>Implementation</code></th>
</tr>
</thead>
<tbody><tr>
<td>Event logging</td>
<td>Loki</td>
</tr>
<tr>
<td>Access logs</td>
<td>Kubernetes Audit Logs</td>
</tr>
<tr>
<td>Integrity</td>
<td>Object Lock</td>
</tr>
<tr>
<td>Review</td>
<td>Grafana dashboards</td>
</tr>
</tbody></table>
<h3>Example: PCI-DSS 10.2 (Audit Trails)</h3>
<table>
<thead>
<tr>
<th><code>Requirement</code></th>
<th><code>Implementation</code></th>
</tr>
</thead>
<tbody><tr>
<td>User actions</td>
<td>Keycloak + K8s audit</td>
</tr>
<tr>
<td>System events</td>
<td>Falco</td>
</tr>
<tr>
<td>Retention</td>
<td>MinIO Object Lock</td>
</tr>
<tr>
<td>Alerting</td>
<td>Prometheus</td>
</tr>
</tbody></table>
<hr />
<h1>What Auditors Actually Said</h1>
<blockquote>
<p>“This is one of the cleanest evidence trails we’ve seen.”</p>
</blockquote>
<p>Why?</p>
<ul>
<li><p>No human-generated artifacts</p>
</li>
<li><p>No subjective interpretation</p>
</li>
<li><p>Everything timestamped, immutable and reproducible</p>
</li>
</ul>
<hr />
<h1>Why This Scales Better Than Paid GRC Tools</h1>
<table>
<thead>
<tr>
<th><code>Paid GRC</code></th>
<th><code>$0 Stack</code></th>
</tr>
</thead>
<tbody><tr>
<td>Manual updates</td>
<td>Auto-generated</td>
</tr>
<tr>
<td>Screenshot culture</td>
<td>Telemetry culture</td>
</tr>
<tr>
<td>Lagging indicators</td>
<td>Real-time controls</td>
</tr>
<tr>
<td>Vendor lock-in</td>
<td>Architecture ownership</td>
</tr>
</tbody></table>
<hr />
<h1>The Hard Truth</h1>
<blockquote>
<p>If your compliance fails when Jira is down, <strong>you were never compliant</strong></p>
</blockquote>
<p>In conclusion, this isn’t a “cost-saving hack”. It’s how <strong>high-trust, high-scale systems</strong> are designed when:</p>
<ul>
<li><p>Security is non-negotiable</p>
</li>
<li><p>Audits are frequent</p>
</li>
<li><p>Engineering time is sacred</p>
</li>
</ul>
<p>so overall, <strong>Compliance didn’t get cheaper. It just got engineered ;)</strong></p>
]]></content:encoded></item><item><title><![CDATA[Secrets are a Supply Chain]]></title><description><![CDATA[Everyone rotates secrets.
Very few design secret lifecycle risk and that gap is where breaches live.

Most organizations believe secret rotation equals security. It doesn’t.
Rotation is a maintenance ]]></description><link>https://blogs.subhanshumg.com/secrets-are-a-supply-chain</link><guid isPermaLink="true">https://blogs.subhanshumg.com/secrets-are-a-supply-chain</guid><category><![CDATA[cybersecurity]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Security]]></category><category><![CDATA[secrets]]></category><category><![CDATA[Supply Chain Management]]></category><category><![CDATA[architecture]]></category><category><![CDATA[leadership]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Fri, 23 Jan 2026 14:32:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769178434563/b26c2e9a-7235-4ada-aabe-4abd21d32eff.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Everyone rotates secrets.</p>
<p>Very few design <strong>secret lifecycle risk</strong> and that gap is where breaches live.</p>
</blockquote>
<p>Most organizations believe secret rotation equals security. It doesn’t.</p>
<p>Rotation is a <strong>maintenance activity</strong>. Security is a <strong>system design outcome</strong>.</p>
<p>This article reframes secrets as <strong>first-class supply-chain artifacts</strong>; governed by contracts, events, blast radius, and standards, not cron jobs and hope.</p>
<hr />
<h1>Why This Matters (The Reality)</h1>
<p>In every serious breach review, the same pattern emerges:</p>
<ul>
<li><p>The secret <em>was</em> rotated</p>
</li>
<li><p>The vault <em>did</em> exist</p>
</li>
<li><p>Access <em>was</em> “restricted”</p>
</li>
</ul>
<p>And yet:</p>
<ul>
<li><p>The secret lived longer than the risk window</p>
</li>
<li><p>The blast radius was undefined</p>
</li>
<li><p>Revocation depended on humans</p>
</li>
<li><p>Audits validated screenshots, not behaviour</p>
</li>
</ul>
<p><strong>The failure was architectural, not operational.</strong></p>
<hr />
<h1>The Shift in Thinking: Secrets as a Supply Chain</h1>
<p>Treat secrets like:</p>
<ul>
<li><p>TLS certificates</p>
</li>
<li><p>IAM trust relationships</p>
</li>
<li><p>API contracts</p>
</li>
</ul>
<p>They have a lifecycle:</p>
<ol>
<li><p>Creation</p>
</li>
<li><p>Distribution</p>
</li>
<li><p>Consumption</p>
</li>
<li><p>Expiration</p>
</li>
<li><p>Revocation</p>
</li>
<li><p>Forensics</p>
</li>
</ol>
<p>If any of these are implicit, undocumented or manual, the system is fragile by design.</p>
<hr />
<h1>Core Design Principles (Non-Negotiable)</h1>
<h3>1️⃣ Secrets Are Expiring Contracts</h3>
<p>Every secret must explicitly define:</p>
<ul>
<li><p><strong>Owner</strong></p>
</li>
<li><p><strong>Consumer</strong></p>
</li>
<li><p><strong>Environment</strong></p>
</li>
<li><p><strong>Maximum lifetime</strong></p>
</li>
<li><p><strong>Invalidation triggers</strong></p>
</li>
</ul>
<p>A secret without an expiry condition is a <strong>latent incident</strong>.</p>
<h3>2️⃣ Rotation Must Be Event-Driven</h3>
<p>Time-based rotation answers auditors. Event-based rotation answers attackers.</p>
<p>Rotation should be triggered by <strong>risk</strong>, not calendars:</p>
<table>
<thead>
<tr>
<th>Event</th>
<th>Why</th>
</tr>
</thead>
<tbody><tr>
<td>Auth code change</td>
<td>Exposure risk</td>
</tr>
<tr>
<td>Production deployment</td>
<td>Trust boundary reset</td>
</tr>
<tr>
<td>Incident/alert</td>
<td>Containment</td>
</tr>
<tr>
<td>Access policy change</td>
<td>Least privilege enforcement</td>
</tr>
</tbody></table>
<h3>3️⃣ Blast Radius Is a First-Class Property</h3>
<p>Every secret must answer one question clearly:</p>
<blockquote>
<p><em>If this leaks, what breaks and what does not?</em></p>
</blockquote>
<p>If you can’t answer that in one sentence, the secret is already unsafe.</p>
<hr />
<h2>Reference Architecture (End-to-End)</h2>
<img src="https://www.hashicorp.com/_next/image?q=75&amp;url=https%3A%2F%2Fwww.datocms-assets.com%2F2885%2F1691011664-k8s-vault-sidecar-workflow-copy-2x.png&amp;w=3840" alt="Image" />

<h3>Architectural Components</h3>
<ul>
<li><p><strong>Secrets Authority</strong> (Vault / Secrets Manager)</p>
</li>
<li><p><strong>Contracts as Code</strong> (YAML)</p>
</li>
<li><p><strong>CI/CD Pipelines</strong> (event triggers)</p>
</li>
<li><p><strong>Policy Engine</strong> (blast-radius enforcement)</p>
</li>
<li><p><strong>Runtime Injection</strong> (no persistence)</p>
</li>
<li><p><strong>Audit Sink</strong> (immutable evidence)</p>
</li>
</ul>
<p>This architecture makes <strong>compromise naturally expire</strong>.</p>
<hr />
<h2>Real-World Failure (Before)</h2>
<p>A production API key leaked via application logs.</p>
<p>What actually happened:</p>
<ul>
<li><p>Rotated every 30 days</p>
</li>
<li><p>Same key used across prod, staging, DR</p>
</li>
<li><p>Incident response revoked prod only</p>
</li>
<li><p>Staging continued leaking data silently</p>
</li>
</ul>
<p><strong>Root cause:</strong></p>
<ul>
<li><p>No lifecycle ownership</p>
</li>
<li><p>No blast-radius modelling</p>
</li>
<li><p>No event-driven revocation</p>
</li>
</ul>
<p>Rotation existed. Security did not.</p>
<hr />
<h1>The Fix: Lifecycle-Aware Secret Design</h1>
<img src="https://developer.hashicorp.com/_next/image?dpl=dpl_DsRcgFnyFJztV9HKwqPqF6Ai7RGX&amp;q=75&amp;url=https%3A%2F%2Fcontent.hashicorp.com%2Fapi%2Fassets%3Fproduct%3Dtutorials%26version%3Dmain%26asset%3Dpublic%252Fimg%252Fvalidated-patterns%252Fterraform-better-together-vault%252Fterraform-secrets.png%26width%3D1560%26height%3D1427&amp;w=3840" alt="Integrate Terraform with Vault | HashiCorp Developer" />

<h2>Implementation</h2>
<h3>Step 1: Define Secret Contracts (Single Source of Truth)</h3>
<p><code>contracts/payment-api.yaml</code></p>
<pre><code class="language-yaml">name: payment-api-key
owner: payments-team
environment: prod
services:
  - billing-service
  - reconciliation-worker
ttl: 86400
rotate_on:
  - commit
  - deployment
  - incident
blast_radius: minimal
compliance:
  iso_27001: A.9.2
  pci_dss: 3.6
</code></pre>
<p>This file is simultaneously:</p>
<ul>
<li><p>Design documentation</p>
</li>
<li><p>Security policy input</p>
</li>
<li><p>Audit evidence</p>
</li>
</ul>
<h3>Step 2: Provision Secrets via Terraform</h3>
<pre><code class="language-markdown">resource "random_password" "secret" {
  length  = 32
  special = false
}

resource "vault_generic_secret" "payment" {
  path = "secret/payment-api-key"

  data_json = jsonencode({
    value = random_password.secret.result
    owner = "payments-team"
    env   = "prod"
  })

  lifecycle {
    create_before_destroy = true
  }
}
</code></pre>
<p>✔ Immutable<br />✔ Auditable<br />✔ Automatically regenerated</p>
<h3>Step 3: Enforce Blast Radius with Policy</h3>
<pre><code class="language-markdown">path "secret/payment-api-key" {
  capabilities = ["read"]
  allowed_parameters = {
    env = ["prod"]
  }
}
</code></pre>
<p>A leaked secret <strong>cannot escape its boundary</strong> — even if exposed.</p>
<h3>Step 4: Rotate on Risky Commits</h3>
<pre><code class="language-yaml">on:
  push:
    paths:
      - "auth/**"
      - "security/**"

jobs:
  rotate:
    runs-on: ubuntu-latest
    steps:
      - run: |
          vault lease revoke -prefix secret/payment-api-key
</code></pre>
<p>Secrets die <strong>before attackers finish reconnaissance</strong>.</p>
<h3>Step 5: Rotate on Production Deployments</h3>
<pre><code class="language-yaml">on:
  deployment:
    environment: production

jobs:
  rotate:
    steps:
      - run: |
          vault lease revoke -prefix secret/payment-api-key
</code></pre>
<p>Every deploy = <strong>fresh trust boundary</strong>.</p>
<h3>Step 6: Incident-Triggered Revocation</h3>
<pre><code class="language-python">def handler(event, context):
    if event["severity"] == "CRITICAL":
        revoke("payment-api-key")
</code></pre>
<p>Connected to:</p>
<ul>
<li><p>SIEM</p>
</li>
<li><p>PagerDuty</p>
</li>
<li><p>Cloud alerts</p>
</li>
</ul>
<p>Human response time → <strong>zero</strong>.</p>
<h3>Step 7: Runtime-Only Injection</h3>
<pre><code class="language-yaml">env:
  PAYMENT_API_KEY: "{{ vault.secret.payment-api-key }}"
</code></pre>
<ul>
<li><p>Never stored</p>
</li>
<li><p>Never baked into images</p>
</li>
<li><p>Auto-expires post-deploy</p>
</li>
</ul>
<hr />
<h2>Testing = Evidence (Not Optional)</h2>
<h3>Blast Radius Test</h3>
<pre><code class="language-bash">SERVICE=analytics vault read secret/payment-api-key
# Permission denied
</code></pre>
<h3>Expiry Test</h3>
<pre><code class="language-bash">sleep 86400
vault read secret/payment-api-key
# Lease expired
</code></pre>
<p>Tests double as <strong>audit artifacts</strong>.</p>
<hr />
<h1>ISO 27001 &amp; PCI Mapping (By Design)</h1>
<table>
<thead>
<tr>
<th>Control</th>
<th>How It’s Satisfied</th>
</tr>
</thead>
<tbody><tr>
<td>ISO A.9 Access Control</td>
<td>Policy-enforced blast radius</td>
</tr>
<tr>
<td>ISO A.12 Logging</td>
<td>Immutable pipeline + vault logs</td>
</tr>
<tr>
<td>ISO A.14 Secure SDLC</td>
<td>Event-driven rotation</td>
</tr>
<tr>
<td>PCI 3.6</td>
<td>TTL, revocation, segregation</td>
</tr>
</tbody></table>
<p>Auditors don’t ask for screenshots. They inspect <strong>system behaviour</strong>.</p>
<hr />
<h1>What Changes Organizationally</h1>
<p><strong>Before</strong></p>
<ul>
<li><p>Manual rotations</p>
</li>
<li><p>Jira tickets</p>
</li>
<li><p>Screenshots</p>
</li>
<li><p>High MTTR</p>
</li>
</ul>
<p><strong>After</strong></p>
<ul>
<li><p>Zero tickets</p>
</li>
<li><p>Zero screenshots</p>
</li>
<li><p>Seconds to containment</p>
</li>
<li><p>Compliance as a side effect</p>
</li>
</ul>
<p>Security stops being a <strong>department</strong>. It becomes a <strong>property of the system</strong>.</p>
<hr />
<h1>Final Takeaway</h1>
<blockquote>
<p>Mature security isn’t about adding more tools.<br />It’s about designing systems where <strong>trust naturally expires</strong>.</p>
</blockquote>
<p>Secrets are not configuration. They are <strong>relationships,</strong> and every relationship needs:</p>
<ul>
<li><p>boundaries</p>
</li>
<li><p>ownership</p>
</li>
<li><p>expiration</p>
</li>
<li><p>accountability</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Designing an ISO-27001-Native CI/CD Pipeline on AWS]]></title><description><![CDATA[“We passed an ISO-27001 surveillance audit with zero Jira tickets, zero screenshots, and zero manual evidence.”

That line usually gets silence. Then disbelief. Then the real question:

“Okay… how?”

]]></description><link>https://blogs.subhanshumg.com/iso-27001</link><guid isPermaLink="true">https://blogs.subhanshumg.com/iso-27001</guid><category><![CDATA[Devops]]></category><category><![CDATA[audit]]></category><category><![CDATA[ISO 27001]]></category><category><![CDATA[compliance ]]></category><category><![CDATA[automation]]></category><category><![CDATA[AWS]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Wed, 21 Jan 2026 08:20:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768983734184/90288d6b-5e4c-4b0d-92b6-b9dd78ff0dfa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>“We passed an ISO-27001 surveillance audit with zero Jira tickets, zero screenshots, and zero manual evidence.”</strong></p>
</blockquote>
<p>That line usually gets silence. Then disbelief. Then the real question:</p>
<blockquote>
<p><em>“Okay… how?”</em></p>
</blockquote>
<p>This article answers that. Not with theory but with <strong>architecture, you can actually build on AWS</strong>.</p>
<hr />
<h1>Where This Idea Really Came From (No Fiction)</h1>
<p>This didn’t start as a “compliance initiative”. It started during a late-night ISO audit prep cycle on a production AWS platform that was already:</p>
<ul>
<li><p>secure by design</p>
</li>
<li><p>fully automated</p>
</li>
<li><p>running mature CI/CD pipelines</p>
</li>
</ul>
<p>Yet, every audit cycle looked the same. Two weeks before the audit:</p>
<ul>
<li><p>Jira filled with <em>“ISO Evidence – URGENT”</em></p>
</li>
<li><p>Engineers re-explained changes from months ago</p>
</li>
<li><p>Screenshots of pipelines that already enforced controls</p>
</li>
<li><p>Security teams became evidence collectors instead of engineers</p>
</li>
</ul>
<p>Nothing was broken. The <strong>system was already enforcing the controls</strong>. However, ISO didn’t trust the system because it didn’t <strong>speak ISO’s language</strong>.</p>
<hr />
<h1>The Breaking Point</h1>
<p>In one call, someone said:</p>
<blockquote>
<p><em>“Can you just raise a Jira ticket so we have evidence?”</em></p>
</blockquote>
<p>That “change” already had:</p>
<ul>
<li><p>protected branches</p>
</li>
<li><p>multiple approvals</p>
</li>
<li><p>immutable artifacts in ECR</p>
</li>
<li><p>full CloudTrail logs</p>
</li>
</ul>
<p>Yet none of it counted.</p>
<p>That’s when the thought hit, not as inspiration but frustration:</p>
<blockquote>
<p><strong>Why are humans translating system behavior into documents<br />instead of systems generating audit-grade evidence themselves?</strong></p>
</blockquote>
<hr />
<h1>The Shift: Reading ISO as Architecture</h1>
<p>That night, ISO-27001 stopped looking like a policy.</p>
<p>It started looking like a <strong>system design specification</strong>.</p>
<p>Not:</p>
<blockquote>
<p><em>“How do we prove this control?”</em></p>
</blockquote>
<p>But:</p>
<blockquote>
<p><em>“If this control were enforced by software, what would it look like?”</em></p>
</blockquote>
<p>Suddenly, Annex A became deterministic.</p>
<table>
<thead>
<tr>
<th>ISO Control Intent</th>
<th>Pipeline Primitive</th>
</tr>
</thead>
<tbody><tr>
<td>Change control</td>
<td>Branch protection</td>
</tr>
<tr>
<td>Segregation of duties</td>
<td>Approval graph</td>
</tr>
<tr>
<td>Integrity</td>
<td>Immutable artifacts</td>
</tr>
<tr>
<td>Traceability</td>
<td>Hash-linked logs</td>
</tr>
<tr>
<td>Audit evidence</td>
<td>Auto-generated facts</td>
</tr>
</tbody></table>
<p>ISO wasn’t asking for screenshots. ISO was describing <strong>how a system should behave</strong>.</p>
<hr />
<h1>The First Experiment (Small but Dangerous)</h1>
<p>I didn’t redesign everything. I added one thing. After every deployment, the pipeline emitted a JSON file:</p>
<pre><code class="language-json">{
  "control_id": "A.8.32",
  "commit": "9a1c…",
  "approvals": ["security", "platform"],
  "artifact_digest": "sha256:…",
  "pipeline_run": "run-8732",
  "timestamp": "2026-01-21T10:41:00Z"
}
</code></pre>
<p>No humans. No tickets. No screenshots.</p>
<p>Just <strong>facts</strong>, produced by the system.</p>
<hr />
<h1>The Audit That Changed Everything</h1>
<p>At the next ISO surveillance audit, instead of folders, we gave the auditor:</p>
<ul>
<li><p>read-only access</p>
</li>
<li><p>time-bounded queries</p>
</li>
<li><p>evidence already mapped to controls</p>
</li>
</ul>
<p>No walkthrough.<br />No explanations.</p>
<p>After exploring quietly, the auditor said:</p>
<blockquote>
<p><em>“So this pipeline doesn’t allow violations in the first place?”</em></p>
</blockquote>
<p>Exactly.</p>
<p>We didn’t prove compliance. We <strong>eliminated the possibility of non-compliance</strong>.</p>
<hr />
<h1>What “ISO-Native” Actually Means</h1>
<table>
<thead>
<tr>
<th>Traditional ISO</th>
<th>ISO-Native</th>
</tr>
</thead>
<tbody><tr>
<td>ISO as paperwork</td>
<td>ISO as system logic</td>
</tr>
<tr>
<td>Evidence collected later</td>
<td>Evidence generated by default</td>
</tr>
<tr>
<td>Humans enforce</td>
<td>Pipelines enforce</td>
</tr>
<tr>
<td>Audit preparation</td>
<td>Continuous audit readiness</td>
</tr>
</tbody></table>
<p>This is <strong>ISO-driven engineering</strong>, not ISO-aware tooling.</p>
<hr />
<h1>The AWS ISO-Native CI/CD Architecture</h1>
<h3>In depth</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768983328023/c050e6d3-8989-4126-bd39-cc559c8edfb6.jpeg" alt="" style="display:block;margin:0 auto" />

<h3>Core Components (AWS)</h3>
<ol>
<li><p><strong>ISO Control Registry (YAML in Git)</strong></p>
</li>
<li><p><strong>CI/CD Orchestrator (GitHub Actions / CodePipeline)</strong></p>
</li>
<li><p><strong>Policy Engine (OPA / Conftest)</strong></p>
</li>
<li><p><strong>Artefact Store (ECR – immutable digests)</strong></p>
</li>
<li><p><strong>Evidence Ledger (S3 Object Lock + KMS)</strong></p>
</li>
<li><p><strong>Audit Logs (CloudTrail + hash chaining)</strong></p>
</li>
<li><p><strong>Auditor Read-Only Interface (Athena / API Gateway)</strong></p>
</li>
</ol>
<h2>Step 1: ISO Control Registry (Source of Truth)</h2>
<h3><code>controls/iso27001.yaml</code></h3>
<pre><code class="language-yaml">A.8.32:
  title: Change Management
  pipeline:
    branch_protection:
      - main
    approvals:
      min: 2
      roles:
        - security
        - platform
    artifacts:
      immutable: true
    evidence:
      retention_years: 7
</code></pre>
<p>This file <strong>replaces</strong>:</p>
<ul>
<li><p>change tickets</p>
</li>
<li><p>wiki pages</p>
</li>
<li><p>approval SOPs</p>
</li>
</ul>
<h2>Step 2: Branching Strategy Enforced by ISO</h2>
<pre><code class="language-text">main        → production (locked)
release/*  → promotion only
develop    → integration
feature/*  → ephemeral
</code></pre>
<h3>GitHub Branch Protection (Example)</h3>
<pre><code class="language-yaml">required_reviews: 2
required_status_checks:
  - iso-policy-check
  - security-scan
</code></pre>
<p>You <strong>cannot</strong> bypass ISO controls even accidentally.</p>
<h2>Step 3: Segregation of Duties (OPA)</h2>
<h3><code>policy/approvals.rego</code></h3>
<pre><code class="language-rego">package iso.approvals

deny[msg] {
  input.approvals.count &lt; 2
  msg := "ISO violation: insufficient approvals"
}

deny[msg] {
  input.approvals.roles[_] == "developer"
  msg := "ISO violation: self-approval blocked"
}
</code></pre>
<p>Pipeline fails instantly. No human escalation required.</p>
<h2>Step 4: Artefact Immutability (AWS ECR)</h2>
<h3>Build Once, Promote Everywhere</h3>
<pre><code class="language-bash">docker buildx build \
  --provenance=true \
  --sbom=true \
  -t 764227591594.dkr.ecr.eu-west-2.amazonaws.com/app:${GIT_SHA} \
  --push .
</code></pre>
<p>Deployments <strong>only reference digests</strong>:</p>
<pre><code class="language-yaml">image: app@sha256:abc123
</code></pre>
<p>OPA blocks mutable tags.</p>
<h2>Step 5: Evidence Auto-Generation</h2>
<h3>Evidence Schema</h3>
<pre><code class="language-json">{
  "control_id": "A.8.32",
  "commit": "abc123",
  "approvals": ["security", "platform"],
  "artifact_digest": "sha256:…",
  "pipeline_run": "run-9921",
  "timestamp": "2026-01-21T12:01:00Z"
}
</code></pre>
<p>Generated on <strong>every pipeline run</strong>.</p>
<h2>Step 6: Immutable Evidence Ledger (AWS)</h2>
<h3>S3 + Object Lock + KMS</h3>
<pre><code class="language-markdown">resource "aws_s3_bucket" "evidence" {
  bucket = "iso-evidence-ledger"
  object_lock_enabled = true
}

resource "aws_s3_bucket_object_lock_configuration" "lock" {
  bucket = aws_s3_bucket.evidence.id
  rule {
    default_retention {
      mode  = "COMPLIANCE"
      days = 2555
    }
  }
}
</code></pre>
<p>✔ Cannot be altered<br />✔ Cannot be deleted<br />✔ Auditor-grade by design</p>
<h2>Step 7: Auditor-Ready Queries (No Screenshots)</h2>
<h3>Athena Example</h3>
<pre><code class="language-sql">SELECT *
FROM evidence
WHERE control_id = 'A.8.32'
AND timestamp BETWEEN date '2025-01-01' AND date '2026-01-01';
</code></pre>
<p>Auditors <strong>self-serve evidence</strong>.</p>
<hr />
<h1>Auditor’s Perspective (Sidebar)</h1>
<blockquote>
<p><em>“Most teams show me screenshots and tell me what should have happened.”</em></p>
<p><em>“This system shows me what</em> <em><strong>could not have happened</strong></em>*.”*</p>
</blockquote>
<p>From an auditor’s point of view, this architecture is powerful because:</p>
<ul>
<li><p>controls are preventative, not detective</p>
</li>
<li><p>evidence is generated, not curated</p>
</li>
<li><p>logs are immutable and queryable</p>
</li>
<li><p>explanations are unnecessary</p>
</li>
</ul>
<p><strong>Trust shifts from people to systems,</strong> and that’s exactly what ISO intended.</p>
<hr />
<h2>Testing the System (Break It on Purpose)</h2>
<table>
<thead>
<tr>
<th>Test</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td>Push to <code>main</code></td>
<td>❌ blocked</td>
</tr>
<tr>
<td>Self-approve PR</td>
<td>❌ denied</td>
</tr>
<tr>
<td>Redeploy old image</td>
<td>❌ digest mismatch</td>
</tr>
<tr>
<td>Delete evidence</td>
<td>❌ S3 Object Lock</td>
</tr>
</tbody></table>
<p>If it fails in production, it <strong>fails before the audit</strong>.</p>
<hr />
<h1>Why This Changes Everything</h1>
<p>ISO-27001 stops being:</p>
<ul>
<li><p>a quarterly fire drill</p>
</li>
<li><p>a security tax</p>
</li>
<li><p>an engineering slowdown</p>
</li>
</ul>
<p>And becomes:</p>
<ul>
<li><p>a <strong>property of the pipeline</strong></p>
</li>
<li><p>a <strong>by-product of delivery</strong></p>
</li>
<li><p>a <strong>competitive advantage</strong></p>
</li>
</ul>
<hr />
<h1>Final Thought</h1>
<p>ISO 27001 was never meant to be bureaucratic. It was meant to describe <strong>safe system behaviour</strong>.</p>
<blockquote>
<p><strong>When controls become code, compliance stops being work and becomes inevitable.</strong></p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Beyond the Kernel]]></title><description><![CDATA[Security no longer lives outside the system. It lives within the kernel.
In a world of microservices, containers, and ephemeral workloads, traditional observability tools see only what applications ex]]></description><link>https://blogs.subhanshumg.com/beyond-the-kernel</link><guid isPermaLink="true">https://blogs.subhanshumg.com/beyond-the-kernel</guid><category><![CDATA[eBPF]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[falco]]></category><category><![CDATA[cilium]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Kernel]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[observability]]></category><category><![CDATA[zerotrust]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 09 Nov 2025 17:00:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762706486169/15596b11-8c95-4e2e-b4c1-809c6f333b2d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Security no longer lives <em>outside</em> the system. It lives <strong>within the kernel</strong>.</p>
<p>In a world of microservices, containers, and ephemeral workloads, traditional observability tools see only what applications expose; not what’s actually happening beneath.<br />To detect, understand, and stop threats in real time, DevSecOps teams need a lens that looks <strong>beyond the user space</strong>.</p>
<p>That lens is <strong>eBPF</strong>.</p>
<hr />
<h2>Part 1: The Vision - From Blind Spots to Kernel Clarity</h2>
<p>Cloud-native architectures have fragmented visibility. Each microservice, container, and node is its own ephemeral universe.<br />By the time logs reach your SIEM, the process that caused them is already gone.</p>
<p><strong>eBPF (Extended Berkeley Packet Filter)</strong> rewrites this story.<br />It allows developers to inject small, sandboxed programs directly into the <strong>Linux kernel</strong>, observing syscalls, network traffic, and process behaviour as they happen without modifying the kernel or impacting performance.</p>
<p>Think of eBPF as a <em>programmable security microscope</em> inside your nodes.</p>
<h3>Why It Matters</h3>
<p>Unlike traditional agents:</p>
<ul>
<li><p>eBPF observes <strong>every process, syscall, and packet</strong> in real time.</p>
</li>
<li><p>It enriches events with <strong>Kubernetes metadata</strong>, including pod name, namespace, and service account.</p>
</li>
<li><p>It operates <strong>in-kernel</strong>, avoiding user-space latency or privilege escalation.</p>
</li>
</ul>
<p>This is observability <strong>at the source</strong>, before anything can hide or mutate.</p>
<hr />
<h2>The Stack: eBPF, Falco, and Automated Defence</h2>
<p>To make kernel-level visibility actionable, we combine three pillars:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Purpose</th>
<th>Tool</th>
</tr>
</thead>
<tbody><tr>
<td>Kernel Hooks</td>
<td>Captures real-time events</td>
<td><strong>eBPF</strong></td>
</tr>
<tr>
<td>Rule Engine</td>
<td>Analyzes behaviour and triggers alerts</td>
<td><strong>Falco</strong></td>
</tr>
<tr>
<td>Response Automation</td>
<td>Reacts and isolates threats</td>
<td><strong>Remediator + Cilium + Operator</strong></td>
</tr>
</tbody></table>
<p>Each layer complements the next, creating a feedback loop of detection, context, and prevention.</p>
<hr />
<h2>Architecture Overview</h2>
<p>Let’s visualize the architecture that powers this approach.</p>
<h3><strong>Minimalistic Schematic Diagram</strong></h3>
<p><a href="https://github.com/SubhanshuMG/eBPF-Driven-Security-Observability/blob/main/diagrams/minimalistic.png"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762706619685/f1fbbc17-aff2-4b72-b0cd-b8840700a30d.png" alt="" style="display:block;margin:0 auto" /></a></p>
<p>This shows the logical flow:</p>
<ol>
<li><p>eBPF probes capture kernel-level events (syscalls, sockets).</p>
</li>
<li><p>Falco consumes these events and applies security rules.</p>
</li>
<li><p>Detected anomalies trigger alerts to Sidekick → Remediator → Isolation logic.</p>
</li>
</ol>
<p>in <strong>AWS EKS:</strong></p>
<ul>
<li><p>Pods (frontend, backend, payments) run in distinct namespaces.</p>
</li>
<li><p>eBPF DaemonSets monitor every node.</p>
</li>
<li><p>Falco analyzes syscall streams.</p>
</li>
<li><p>Alerts flow to <strong>Slack</strong>, <strong>Prometheus</strong>, and <strong>AWS GuardDuty</strong>.</p>
</li>
<li><p>The <strong>Remediator</strong> patches pods in real time.</p>
</li>
<li><p><strong>Cilium</strong> applies zero-trust policies to isolate compromised containers.</p>
</li>
</ul>
<hr />
<h2>Part 2: Implementation - Turning Theory into Defense</h2>
<p>All implementation resources live in the repository:<br />👉 <a href="https://github.com/SubhanshuMG/eBPF-Driven-Security-Observability"><strong>SubhanshuMG/eBPF-Driven-Security-Observability</strong></a></p>
<p>This open repository demonstrates a full <strong>end-to-end, production-ready pipeline</strong> built on:</p>
<ul>
<li><p>Falco Helm deployment</p>
</li>
<li><p>Python BCC DaemonSet</p>
</li>
<li><p>Falco Sidekick with Slack + webhook output</p>
</li>
<li><p>Automated Remediator service</p>
</li>
<li><p>Cilium runtime blocking</p>
</li>
<li><p>Optional Go-based Operator and eBPF-LSM for advanced scenarios</p>
</li>
</ul>
<h3>Step 1: Deploy Falco (eBPF Mode)</h3>
<p>Falco listens directly to kernel events via eBPF probes:</p>
<pre><code class="language-bash">helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco -n falco --create-namespace -f charts/falco-values.yaml
</code></pre>
<p>Key Helm settings:</p>
<pre><code class="language-yaml">backend:
  enable_bpf: true
mountHost:
  sys: true
  dev: true
  proc: true
  libModules: true
</code></pre>
<p>This ensures Falco runs in eBPF mode without kernel modules.</p>
<h3>Step 2: Add a Python BCC Agent</h3>
<p>The repository includes a DaemonSet that runs a <strong>Python eBPF (BCC)</strong> script to detect outbound connections to suspicious ports.</p>
<pre><code class="language-python">from bcc import BPF

program = r"""
#include &lt;net/sock.h&gt;
#include &lt;bcc/proto.h&gt;
int trace_connect(struct pt_regs *ctx, struct sock *sk) {
    u16 dport = sk-&gt;__sk_common.skc_dport;
    if (dport == bpf_htons(4444)) {
        bpf_trace_printk("ALERT: Pod attempted connect to port 4444\\n");
    }
    return 0;
}
"""
b = BPF(text=program)
b.attach_kprobe(event="tcp_connect", fn_name="trace_connect")
b.trace_print()
</code></pre>
<p>Deployed as a <strong>privileged DaemonSet</strong>, this agent continuously monitors <code>tcp_connect</code> syscalls and streams alerts to Falco or your logs.</p>
<h3>Step 3: Falco Sidekick + Automated Remediation</h3>
<p>Falco Sidekick forwards alerts to multiple sinks, including Slack and a <strong>Remediator</strong> webhook.</p>
<pre><code class="language-bash">helm install falcosidekick falcosecurity/falcosidekick \
  --set config.slack.webhookurl="https://hooks.slack.com/services/..." \
  --set config.outputs.webhook.url="http://remediator.remediator.svc.cluster.local:8080/webhook"
</code></pre>
<p>The Remediator (Python Flask service) parses these events and patches pods with:</p>
<pre><code class="language-json">"metadata": {
  "labels": {
    "compromised": "true"
  }
}
</code></pre>
<p>Once patched, the Cilium NetworkPolicy isolates the pod instantly.</p>
<h3>Step 4: Auto-Isolation with Cilium</h3>
<p>Cilium enforces kernel-native network segmentation via eBPF.</p>
<pre><code class="language-yaml">apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-compromised
spec:
  endpointSelector:
    matchLabels:
      compromised: "true"
  ingress:
  - {}
  egress:
  - {}
</code></pre>
<p>When a pod receives the <code>compromised=true</code> label, it loses all network connectivity — preventing lateral movement or exfiltration.</p>
<h3>Step 5: CI/CD and Testing</h3>
<p>GitHub Actions (in <code>.github/workflows/ci-build-push.yml</code>) automates:</p>
<ul>
<li><p>Building and pushing <strong>Remediator</strong> and <strong>Python-BCC</strong> images to ECR</p>
</li>
<li><p>Validating manifests</p>
</li>
<li><p>Deploying to test clusters</p>
</li>
</ul>
<p><strong>Test the system</strong>:</p>
<pre><code class="language-bash">kubectl run test-shell --image=ubuntu -- sleep 3600
kubectl exec -it test-shell -- bash
nc 1.2.3.4 4444
</code></pre>
<p>Expected outcomes:</p>
<ul>
<li><p>Falco detects <code>connect()</code> syscall</p>
</li>
<li><p>Sidekick forwards alert to Slack and Remediator</p>
</li>
<li><p>Pod labeled <code>compromised=true</code></p>
</li>
<li><p>Cilium isolates it automatically</p>
</li>
</ul>
<h2>Beyond Automation: The Operator &amp; eBPF-LSM Frontier</h2>
<h3>🧩 The Go Operator</h3>
<p>In <code>operator-remediator/</code>, a hardened <strong>Go controller</strong> extends remediation logic to Kubernetes-native workflows.<br />It watches ConfigMaps (or future CRDs) and automatically patches targeted pods — a scalable foundation for multi-tenant clusters.</p>
<p>Built with <code>controller-runtime</code>, it exemplifies <strong>operator-style automation</strong>: secure, event-driven, and declarative.</p>
<h3>🧬 eBPF-LSM (Linux Security Module)</h3>
<p>For kernels ≥5.7, eBPF can attach directly to <strong>LSM hooks</strong>, allowing <em>in-kernel blocking</em>.<br />This enables actions like denying socket creation or file writes based on live context.</p>
<p>Example snippet (from <code>/eBPF-LSM/lsm_sample.c</code>):</p>
<pre><code class="language-c">SEC("lsm/socket_create")
int BPF_PROG(socket_create_lsm, int family, int type, int protocol) {
    if (family == AF_INET) {
        return -1; // deny
    }
    return 0; // allow
}
</code></pre>
<p>This brings <strong>preventive enforcement</strong> right into the kernel itself.</p>
<hr />
<h2>Observability Meets Prevention</h2>
<p>eBPF changes how DevSecOps teams think.<br />Instead of relying on logs, you rely on <strong>kernel telemetry</strong>.<br />Instead of waiting for alerts, your systems <strong>respond autonomously</strong>.</p>
<p>Falco + eBPF gives you real-time behavioural insight.<br />Cilium + Operator + LSM turns that insight into <strong>action</strong>.</p>
<p>This is what we call <em>Kernel-Native Security Observability</em> — where the infrastructure defends itself.</p>
<hr />
<h2>Repository &amp; Resources</h2>
<p>All implementation assets are available in the repository:<br />👉 <a href="https://github.com/SubhanshuMG/eBPF-Driven-Security-Observability"><strong>GitHub: SubhanshuMG/eBPF-Driven-Security-Observability</strong></a></p>
<p>Includes:</p>
<ul>
<li><p>Helm values</p>
</li>
<li><p>Falco rules</p>
</li>
<li><p>Remediator service</p>
</li>
<li><p>Cilium NetworkPolicy</p>
</li>
<li><p>Go Operator</p>
</li>
<li><p>CI/CD pipeline</p>
</li>
<li><p>eBPF-LSM sample</p>
</li>
<li><p>Deployment scripts (<code>scripts/deploy-all.sh</code> / <code>cleanup-all.sh</code>)</p>
</li>
</ul>
<p>Each component forms part of a living DevSecOps pipeline built around <strong>real-time kernel visibility</strong>.</p>
<hr />
<h2>The Future: eBPF as the DNA of DevSecOps</h2>
<p>In the coming years, <strong>observability and defence will merge</strong>.<br />We won’t monitor systems; we’ll <em>listen</em> to their kernels.<br />We won’t react to breaches; we’ll <em>intercept</em> them mid-syscall.</p>
<p>eBPF makes this possible. By uniting runtime telemetry, policy enforcement, and automated remediation, you’re not just securing workloads; you’re building a <strong>self-healing &amp; self-observing cloud</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[DevSecOps for the Mind]]></title><description><![CDATA[Introduction
Developers build systems. DevSecOps engineers build secure, automated pipelines.But what happens when you apply the same principles to your own life and knowledge?
Over the last year, I’v]]></description><link>https://blogs.subhanshumg.com/devsecops-for-the-mind</link><guid isPermaLink="true">https://blogs.subhanshumg.com/devsecops-for-the-mind</guid><category><![CDATA[DevSecOps]]></category><category><![CDATA[cognitive]]></category><category><![CDATA[Futureofwork]]></category><category><![CDATA[KnowledgeManagement]]></category><category><![CDATA[secondbrain]]></category><category><![CDATA[pkm]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[collective thinking]]></category><category><![CDATA[zerotrust]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sun, 31 Aug 2025 19:26:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756650247045/498eef54-20c0-4ba0-a6f6-042809b76e95.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1>
<p>Developers build systems. DevSecOps engineers build <strong>secure, automated pipelines</strong>.<br />But what happens when you apply the same principles to your own life and knowledge?</p>
<p>Over the last year, I’ve been engineering my <strong>second brain</strong>; not just as a productivity system but as a <strong>cognitive DevSecOps pipeline</strong>. It’s still evolving, but the core idea is simple:</p>
<ul>
<li><p><em>Treat knowledge like code</em></p>
</li>
<li><p><em>Secure it like infrastructure</em></p>
</li>
<li><p><em>Automate it like CI/CD</em></p>
</li>
</ul>
<p>The result? A <strong>living, secure second brain</strong> that continuously ingests, secures and deploys knowledge just like a DevSecOps system handles applications.</p>
<hr />
<h1>Why a Second Brain Needs DevSecOps</h1>
<p>Traditional PKM (Personal Knowledge Management) systems - Notion, Obsidian, Roam; are great but they miss <strong>two critical aspects</strong>:</p>
<ol>
<li><p><strong>Security &amp; Trust</strong> → How do I know my knowledge is authentic, free of bias, and not vulnerable to manipulation?</p>
</li>
<li><p><strong>Automation &amp; Scalability</strong> → How do I make knowledge flow seamlessly from capture to deployment without manual friction?</p>
</li>
</ol>
<p>This is exactly where <strong>DevSecOps principles</strong> come in. My second brain isn’t a static wiki; it’s a <strong>zero-trust, continuously validated &amp; auto-deploying knowledge pipeline</strong>.</p>
<hr />
<h1>Architecture of My Cognitive DevSecOps Pipeline</h1>
<p>Here’s how I mapped <strong>DevSecOps concepts</strong> into my second brain:</p>
<table>
<thead>
<tr>
<th><strong>DevSecOps Concept</strong></th>
<th><strong>Second Brain Equivalent</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Source Code</strong></td>
<td>Notes, articles, papers, conversations</td>
</tr>
<tr>
<td><strong>Version Control (Git)</strong></td>
<td>Git-based Markdown + Obsidian vault</td>
</tr>
<tr>
<td><strong>CI/CD Pipelines</strong></td>
<td>Capture → Process → Deploy knowledge</td>
</tr>
<tr>
<td><strong>SAST/DAST Scanners</strong></td>
<td>AI-based validation of bias, misinformation</td>
</tr>
<tr>
<td><strong>Infrastructure as Code (IaC)</strong></td>
<td>Knowledge as Code (KaC) — structured, modular notes</td>
</tr>
<tr>
<td><strong>Zero Trust Security</strong></td>
<td>Encrypted knowledge storage + SSI authentication</td>
</tr>
<tr>
<td><strong>Monitoring &amp; Observability</strong></td>
<td>Alerts on stale/outdated knowledge, AI-driven relevance scoring</td>
</tr>
</tbody></table>
<hr />
<h3><strong>1. Capture Layer (Knowledge Ingestion)</strong></h3>
<ul>
<li><p>APIs to ingest blogs, papers and docs.</p>
</li>
<li><p>Markdown files stored in Git for version control.</p>
</li>
<li><p>Auto-encryption with <strong>GPG + Vault</strong> for sensitive notes.</p>
</li>
<li><p>AI-based deduplication &amp; tagging (like SAST for concepts).</p>
</li>
</ul>
<h3><strong>2. Processing &amp; Security Layer</strong></h3>
<ul>
<li><p><strong>NLP Pipelines</strong> → Summarization, embeddings, semantic search.</p>
</li>
<li><p><strong>Bias/Misinformation Scans</strong> → Just like DAST but for knowledge.</p>
</li>
<li><p><strong>Blockchain Proof-of-Authenticity</strong> → Verifying sources for integrity.</p>
</li>
</ul>
<h3><strong>3. Deployment Layer</strong></h3>
<ul>
<li><p>GitOps-style sync to Obsidian, Notion, or custom dashboards.</p>
</li>
<li><p>Secure <strong>Zero-Trust Knowledge Sharing</strong> → JWT + Self-Sovereign Identity (SSI).</p>
</li>
<li><p>Multi-device CI/CD → Knowledge "deploys" everywhere without manual copy-paste.</p>
</li>
</ul>
<h3><strong>4. Observability &amp; Monitoring</strong></h3>
<ul>
<li><p>AI alerts when I reference outdated/stale knowledge.</p>
</li>
<li><p>Graph DB maps showing <strong>concept dependencies</strong> (like microservices).</p>
</li>
<li><p>Real-time visualization of knowledge flows.</p>
</li>
</ul>
<h3>5. High-Level Diagram</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756663465976/7943ff83-91bc-444d-b8a6-1937fee44fd5.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h1>Real-World Example from My Workflow</h1>
<p>Recently, while researching <strong>Quantum AI for DevSecOps</strong>, here’s what happened inside my second brain:</p>
<ol>
<li><p><strong>Capture</strong> → API pulls the latest arXiv papers and blog posts.</p>
</li>
<li><p><strong>Processing</strong> → The NLP pipeline summarised them, flagging one as outdated (published in 2017, low relevance).</p>
</li>
<li><p><strong>Security Check</strong> → AI detected potential bias in a vendor blog (marketing-heavy, not research-backed).</p>
</li>
<li><p><strong>Deployment</strong> → Cleaned insights synced to my Obsidian vault &amp; Notion dashboard.</p>
</li>
<li><p><strong>Monitoring</strong> → A week later, an update alert popped up when a new 2025 paper was published; my second brain automatically queued it for ingestion.</p>
</li>
<li><p><strong>Slack Notification</strong> → The pipeline sent a structured alert to my Slack channel:</p>
<pre><code class="language-markdown">📚 New Knowledge Added

Title: Quantum AI for DevSecOps
Source: &lt;https://arxiv.org/abs/2501.12345|View Paper&gt;
Captured: 2025-08-31 12:45 UTC
Bias / Flags: None

Summary:
- Introduces hybrid quantum-classical models for threat detection
- Benchmarks performance against classical ML
- Highlights cryptographic implications in CI/CD
- Suggests real-time anomaly detection
- Outlines future research directions

🔒 Routed via Cognitive DevSecOps Pipeline
</code></pre>
</li>
</ol>
<p>It felt like having a <strong>self-healing DevSecOps pipeline for cognition</strong>.</p>
<hr />
<h1>Mini Implementation: Second Brain Pipeline in Python</h1>
<pre><code class="language-python">#!/usr/bin/env python3
import os, re, subprocess, requests, json
from datetime import datetime
from bs4 import BeautifulSoup
import openai

# CONFIG
REPO_PATH = "/path/to/second-brain-vault"
NOTES_DIR = os.path.join(REPO_PATH, "Knowledge")
os.makedirs(NOTES_DIR, exist_ok=True)

def fetch_article(url: str) -&gt; str:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return "\n".join([p.get_text() for p in soup.find_all("p")])

def validate_content(text: str) -&gt; dict:
    suspicious = ["sponsored", "buy now", "exclusive deal"]
    flags = [kw for kw in suspicious if kw.lower() in text.lower()]
    return {"bias_flags": flags, "is_suspicious": len(flags) &gt; 0}

def summarize_with_ai(text: str) -&gt; str:
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[{"role":"user","content": f"Summarize this in 5 bullets:\n{text[:5000]}"}]
    )
    return response["choices"][0]["message"]["content"]

def save_to_vault(title: str, summary: str, metadata: dict):
    safe_title = re.sub(r"[^a-zA-Z0-9]+", "-", title)
    filename = os.path.join(NOTES_DIR, f"{safe_title}.md")
    with open(filename, "w") as f:
        f.write(f"# {title}\n\n**Captured:** {datetime.utcnow()} UTC\n\n")
        f.write(f"**Flags:** {metadata['bias_flags']}\n\n## Summary\n\n{summary}\n")
    subprocess.run(["git", "-C", REPO_PATH, "add", filename])
    subprocess.run(["git", "-C", REPO_PATH, "commit", "-m", f"Add note: {title}"])
</code></pre>
<hr />
<h1>Slack Notifications Integration</h1>
<p>To make the pipeline more <strong>DevSecOps-native</strong>, I added a <strong>layer for Slack notifications</strong>. This way, every new knowledge item or suspicious flag instantly triggers an alert in a Slack channel.</p>
<h3>Example Code Addition</h3>
<pre><code class="language-python">def send_notification(title: str, url: str, metadata: dict, summary: str, webhook_url: str):
    """Send a structured Slack notification with blocks for better readability."""
    payload = {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": "📚 New Knowledge Added"
                }
            },
            {
                "type": "section",
                "fields": [
                    {
                        "type": "mrkdwn",
                        "text": f"*Title:*\n{title}"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Source:*\n&lt;{url}|View Article&gt;"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Captured:*\n{datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')}"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Bias / Flags:*\n{metadata['bias_flags'] or 'None'}"
                    }
                ]
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Summary:*\n{summary}"
                }
            },
            {
                "type": "context",
                "elements": [
                    {
                        "type": "mrkdwn",
                        "text": "🔒 Routed via Cognitive DevSecOps Pipeline"
                    }
                ]
            }
        ]
    }

    response = requests.post(webhook_url, data=json.dumps(payload),
                             headers={"Content-Type": "application/json"})
    if response.status_code != 200:
        print(f"[!] Notification failed: {response.text}")
    else:
        print("[✔] Slack notification sent successfully!")
</code></pre>
<h3>Example Slack Output</h3>
<pre><code class="language-markdown">📚 New Knowledge Added

Title: Kubernetes Security Best Practices
Source: 🔗 View Article
Captured: 2025-08-31 12:45 UTC
Bias / Flags: ⚠️ Marketing-heavy, Sponsored

Summary:
- Explains container runtime isolation
- Highlights RBAC best practices
- Warns about common misconfigs
- Emphasizes audit logging
- Recommends upgrading to the latest API versions

🔒 Routed via Cognitive DevSecOps Pipeline
</code></pre>
<p>This mirrors how DevSecOps teams receive alerts on vulnerabilities but applied to knowledge management.</p>
<hr />
<h1>Multi-User / Team Mode</h1>
<p>To scale this beyond one person:</p>
<ul>
<li><p><strong>GitOps Repo</strong> → Team knowledge base with PR reviews.</p>
</li>
<li><p><strong>RBAC</strong> → Contributors, Reviewers, Security Officers.</p>
</li>
<li><p><strong>Zero Trust</strong> → JWT/SSI authentication before accessing knowledge.</p>
</li>
<li><p><strong>Notifications</strong> → Slack/Discord alerts for suspicious/critical knowledge.</p>
</li>
<li><p><strong>Observability</strong> → Grafana/ELK dashboards tracking knowledge health.</p>
</li>
</ul>
<h3>Architecture Diagram</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756663579756/d2fd9188-ae78-42cc-8e28-4dffcf4d17eb.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h1>Knowledge Mesh for Organizations</h1>
<p>When multiple teams adopt second brains:</p>
<ul>
<li><p>Each team has its own secure, automated knowledge pipeline.</p>
</li>
<li><p>An <strong>AI + DevSecOps Service Mesh</strong> ensures integrity and trust across teams.</p>
</li>
<li><p>Knowledge flows securely, just like microservices in a mesh network.</p>
</li>
</ul>
<h3>Architecture Diagram</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756663878190/e44b8929-96c1-4030-be6a-0f644fbe4fbd.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h1>Conclusion</h1>
<p>Building my second brain with DevSecOps isn’t about productivity hacks; it’s about <strong>engineering trust, automation and resilience into cognition itself</strong>.</p>
<p>In a world where misinformation spreads faster than vulnerabilities, securing knowledge is as critical as securing infrastructure.</p>
<p>And just like software, the second brain is <strong>never finished</strong>.<br />It’s a living pipeline; always building, always evolving.</p>
<p><em><strong><mark>What if your team or your entire organization treated knowledge like code and secured it with DevSecOps?</mark></strong></em></p>
]]></content:encoded></item><item><title><![CDATA[Deploying a Bitcoin Regtest Network with Docker and CI/CD Tools]]></title><description><![CDATA[Hands-on + R&D guide.
We’ll stand up a private Bitcoin regtest network with two nodes, peer them, mine spendable coins and send/confirm transactions; wrapped in a clean repo with CI and an optional de]]></description><link>https://blogs.subhanshumg.com/deploying-a-bitcoin-regtest-network-with-docker-and-cicd-tools</link><guid isPermaLink="true">https://blogs.subhanshumg.com/deploying-a-bitcoin-regtest-network-with-docker-and-cicd-tools</guid><category><![CDATA[Bitcoin]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Blockchain]]></category><category><![CDATA[engineering]]></category><category><![CDATA[SRE]]></category><category><![CDATA[cicd]]></category><category><![CDATA[observability]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Subhanshu Mohan Gupta]]></dc:creator><pubDate>Sat, 23 Aug 2025 08:17:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755936766158/771c9e87-eb64-435b-b4e9-79a66dade97e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Hands-on + R&amp;D guide.</strong></h3>
<p>We’ll stand up a private Bitcoin <strong>regtest</strong> network with two nodes, peer them, mine spendable coins and send/confirm transactions; wrapped in a clean repo with CI and an optional demo workflow.</p>
<p><em><strong>📦 Repo:</strong></em> <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops"><em><strong>https://github.com/SubhanshuMG/bitcoin-regtest-devops</strong></em></a></p>
<hr />
<h2><strong>TL;DR</strong></h2>
<pre><code class="language-plaintext">git clone https://github.com/SubhanshuMG/bitcoin-regtest-devops.git
cd bitcoin-regtest-devops

# Start both nodes (node1 mines 101 blocks on first run)
docker compose up -d

# Watch logs live while services get healthy
docker compose logs -f --tail=100

# Create &amp; confirm a tx from node1 ➜ node2
bash ./scripts/create-tx.sh 0.10

# Tear down
docker compose down -v
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">🏦  Creating + broadcasting 0.10 BTC from node1 ➜ node2
🪙  TX broadcast – &lt;txid&gt;
✅  Confirmed in block (confirmations: 1)
</code></pre>
<hr />
<h2><strong>What you’ll build</strong></h2>
<p>A two-node regtest network that’s easy to run and observe:</p>
<pre><code class="language-plaintext">+----------------------------+         +----------------------------+
|        btc-node1           |         |          btc-node2         |
|  bitcoind (regtest)        | &lt;-----&gt; |   bitcoind (regtest)       |
|  RPC : 18443               |   P2P   |   RPC : 18445              |
|  P2P : 18444               |         |   P2P : 18446              |
|  Wallet "wallet"           |         |   Wallet "wallet"          |
|  Mines initial 101 blocks  |         |   Peers via addnode        |
+----------------------------+         +----------------------------+
</code></pre>
<p><strong>Why regtest?</strong> Instant blocks, infinite coins (mine on demand), deterministic behavior—perfect for Ops, testing, and fee/mempool experiments.</p>
<hr />
<h2><strong>Project layout (overview)</strong></h2>
<pre><code class="language-plaintext">.
├─ docker-compose.yml          # 2 Bitcoin Core nodes, healthchecks (with start_period)
├─ Dockerfile                  # extends bitcoin/bitcoin:25 with jq
├─ scripts/
│  ├─ entrypoint.sh            # builds bitcoind args from env; idempotent 101-block bootstrap
│  ├─ create-tx.sh             # re-runnable: fresh address + confirm block each run
│  └─ wait-for-bitcoind.sh     # RPC readiness probe for healthchecks
└─ .github/workflows/
   ├─ ci.yml                   # lint + build + up + live logs + tx (with timings)
   └─ demo.yml                 # on-demand “Run workflow” demo; uploads node logs as artifact
└─ LICENSE                     # MIT
</code></pre>
<hr />
<h2><strong>Step-by-step: try it locally</strong></h2>
<h3><strong>Pre-requisites</strong></h3>
<ul>
<li><p>Docker Engine with <strong>docker compose v2</strong></p>
</li>
<li><p>Internet for the first image pull</p>
</li>
</ul>
<h3><strong>Clone &amp; start</strong></h3>
<pre><code class="language-plaintext">git clone https://github.com/SubhanshuMG/bitcoin-regtest-devops.git
cd bitcoin-regtest-devops
docker compose up -d
</code></pre>
<h3><strong>Observe live logs</strong></h3>
<pre><code class="language-plaintext">docker compose logs -f --tail=100
</code></pre>
<h3><strong>Send a transaction (re-runnable)</strong></h3>
<pre><code class="language-plaintext">bash ./scripts/create-tx.sh 0.10
bash ./scripts/create-tx.sh 0.25  # send another, new address each run
</code></pre>
<h3><strong>Quick checks</strong></h3>
<pre><code class="language-plaintext"># Chain &amp; peers
docker exec btc-node1 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18443 getblockchaininfo
docker exec btc-node2 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18445 getpeerinfo

# Balances
docker exec btc-node1 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18443 getbalance
docker exec btc-node2 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18445 getbalance
</code></pre>
<h3><strong>Clean up</strong></h3>
<pre><code class="language-plaintext">docker compose down -v
</code></pre>
<hr />
<h2><strong>Under the hood</strong></h2>
<ul>
<li><p><strong>Compose</strong> wires two nodes with distinct P2P/RPC ports and persistent volumes. Healthchecks call <a href="http://wait-for-bitcoind.sh">wait-for-bitcoind.sh</a> to poll getblockchaininfo until RPC is ready.</p>
</li>
<li><p><strong>Entrypoint script</strong> assembles all bitcoind flags from environment variables (robust—no brittle env expansion inside Compose). On first run, <strong>node1</strong> mines <strong>101 blocks</strong> so coinbase UTXOs mature and become spendable.</p>
</li>
<li><p><strong>Transaction helper</strong> (<a href="http://create-tx.sh">create-tx.sh</a>) asks <strong>node2</strong> for a fresh address, sends coins from <strong>node1</strong>, mines one block on <strong>node1</strong> to confirm, and reports confirmations via jq.</p>
</li>
</ul>
<hr />
<h2><strong>CI/CD that doubles as documentation</strong></h2>
<p>The repo ships with two workflows:</p>
<ol>
<li><p><strong>CI (ci.yml)</strong> – runs on every push/PR</p>
<ul>
<li><p>Lints Bash (ShellCheck) &amp; Dockerfile (Hadolint)</p>
</li>
<li><p>Builds the image and brings the network up</p>
</li>
<li><p><strong>Streams container logs live</strong> until both nodes are healthy (no silent hangs)</p>
</li>
<li><p>Sends a real transaction and prints <strong>timing metrics</strong> for build/up/tx</p>
</li>
</ul>
</li>
<li><p><strong>Demo (demo.yml)</strong> – click <em>Run workflow</em> in GitHub Actions</p>
<ul>
<li><p>Same bring-up, then broadcasts a tx (configurable amount)</p>
</li>
<li><p><strong>Uploads docker logs from both nodes</strong> as an artifact—great for reviewers</p>
</li>
</ul>
</li>
</ol>
<p>See them in action here: <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/actions"><strong>https://github.com/SubhanshuMG/bitcoin-regtest-devops/actions</strong></a></p>
<blockquote>
<p>Tip: The repo badges in the README point to CI and the Demo workflow, so status is always visible at a glance.</p>
</blockquote>
<hr />
<h2><strong>R&amp;D lab: experiments to try</strong></h2>
<ul>
<li><p><strong>Fee policy:</strong> tweak the -fallbackfee in the entrypoint or experiment with manual fees using sendtoaddress/fundrawtransaction/walletcreatefundedpsbt, then watch how confirmations behave when you mine.</p>
</li>
<li><p><strong>Mining cadence:</strong> mine variable numbers of blocks between transactions to simulate different confirmation latencies:</p>
<pre><code class="language-plaintext">docker exec btc-node1 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18443 \
generatetoaddress 3 "$(docker exec btc-node1 bitcoin-cli -regtest -rpcuser=user -rpcpassword=pass -rpcport=18443 getnewaddress)"
</code></pre>
</li>
<li><p><strong>Resilience:</strong> restart containers mid-run and confirm idempotency (node1 won’t re-mine its initial 101 blocks thanks to a .bootstrapped flag).</p>
</li>
<li><p><strong>Scaling out:</strong> add node3..n in docker-compose.yml and set ADDNODE to point at node1 (or build a small mesh).</p>
</li>
<li><p><strong>Deeper wallet ops:</strong> list UTXOs, craft raw transactions, try PSBT flows (walletcreatefundedpsbt, analyzepsbt, finalizepsbt).</p>
</li>
</ul>
<hr />
<h2><strong>Design choices &amp; trade-offs</strong></h2>
<ul>
<li><p><strong>Compose over Kubernetes</strong> for local ergonomics; the same container pattern ports cleanly to Helm/k8s later.</p>
</li>
<li><p><strong>One image for both nodes</strong> to minimize cache misses and binary drift.</p>
</li>
<li><p><strong>Demo-level creds</strong> via env for clarity; prefer rpcauth or a secret manager in real deployments.</p>
</li>
<li><p><strong>Observability-first CI</strong>: live logs + timeouts + explicit health waits—no mysterious red builds.</p>
</li>
<li><p><strong>Idempotency</strong>: first-run mining happens once per data volume; every <a href="http://create-tx.sh">create-tx.sh</a> invocation produces a fresh on-chain transaction.</p>
</li>
</ul>
<hr />
<h2><strong>Troubleshooting quick hits</strong></h2>
<ul>
<li><p><strong>Container unhealthy at start</strong> → RPC may not be ready yet; CI/healthchecks will wait. If it persists, check docker compose logs -f and verify env vars.</p>
</li>
<li><p><strong>ShellCheck warnings</strong> → scripts are lint-clean; use _ for intentionally unused loop vars.</p>
</li>
<li><p><strong>Hadolint DL3008</strong> → we intentionally avoid pinning packages for CI portability; an inline ignore is documented.</p>
</li>
<li><p><strong>Badges stale</strong> → ensure badges include ?branch=main (and &amp;event=workflow_dispatch for the demo); run the workflow once on main.</p>
</li>
</ul>
<hr />
<h2><strong>Where to go next?</strong></h2>
<ul>
<li><p><strong>Add rpcauth</strong> or inject secrets from Vault/SOPS.</p>
</li>
<li><p><strong>Enable txindex</strong> for advanced queries.</p>
</li>
<li><p><strong>Export metrics/logs</strong> to your observability stack (Loki/Promtail, CloudWatch, etc.).</p>
</li>
<li><p><strong>Integration tests</strong>: assert balances/confirmations post-tx as part of CI.</p>
</li>
</ul>
<hr />
<h2><strong>Links</strong></h2>
<ul>
<li><p><em>Automation:</em> <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/scripts/create-tx.sh"><code>scripts/create-tx.sh</code></a> | <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/scripts/entrypoint.sh"><code>scripts/entrypoint.sh</code></a> | <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/scripts/wait-for-bitcoind.sh"><code>scripts/wait-for-bitcoind.sh</code></a></p>
</li>
<li><p><em>CI workflow:</em> <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/.github/workflows/ci.yml"><code>.github/workflows/ci.yml</code></a></p>
</li>
<li><p><em>Demo workflow:</em> <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/.github/workflows/demo.yml"><code>.github/workflows/demo.yml</code></a></p>
</li>
<li><p><em>MIT License:</em> <a href="https://github.com/SubhanshuMG/bitcoin-regtest-devops/blob/main/LICENSE"><code>LICENSE</code></a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>