
The agentic SOC is here

Platform teams: here is what you own




Three vendor agentic-SOC platforms in Q1 2026. One platform-engineering charter that decides whether they work.

Microsoft Sentinel AI shipped. Splunk Agentic SOC launched at RSAC. Google SecOps and Torq integrated. Each one is real; each one consumes whatever telemetry, runbooks, and policy your platform team owns. The seam between vendor agent loop and platform substrate is where the value lands or fails to land. This is the own-vs-buy matrix Greendell Critical Infra used to ship its agentic SOC, with the confidence-band policy that decided what the agent did unsupervised.


The thesis

Agentic SOC is a joint venture between the platform and SOC. Map the seams now.


Why this matters now

Microsoft, Splunk, Google SecOps all shipped agentic SOCs in Q1 2026. The platform vs vendor seam is the question every SOC leader is answering this quarter.


Narrative arc

What shipped (Microsoft Sentinel AI, Splunk Agentic SOC, Google SecOps) → the platform-owned layer (telemetry, runbooks, policy) vs the vendor layer (agent loop, UI) → the confidence-band policy that gates autonomous resolution.


What most people believe and why it falls apart

"Buy an agentic SOC vendor." True, and incomplete. The platform-owned half is what determines whether the vendor works.

Vendor agentic SOCs (Microsoft Sentinel AI, Splunk Agentic SOC, Google SecOps + Torq) ship real value; they automate work that previously sat in a ticket queue. The gap is that the platform-owned half (telemetry, policy-as-code runbooks, confidence bands) determines 80% of the outcome.


The timeline

  • 2026-04-09, Microsoft Security Blog: 'The agentic SOC: Rethinking SecOps for the next decade' introduces the new operating model.

  • RSAC 2026: Splunk launches Agentic SOC; Google SecOps + Torq integrate for agentic SecOps loops.

  • 2026:

    • Stellar Cyber ranks top 10 agentic SOC platforms; Pondurance Kanati ships as managed agentic-SOC MDR.

    • Sigma + OCSF adoption crosses the threshold where detection rules are CI-tested and vendor-portable.

    • AgentSOC paper (IEEE 2026) formalizes multi-layer agentic SOC with confidence-band autonomous action.


The decision tree, matrix, and runbook

  1. Who owns the telemetry pipeline? If the vendor does, you're renting your own ground truth.

  2. Who owns the runbook catalog? Structured runbooks are the agent's execution surface.

  3. Who defines the confidence band? Autonomous vs escalate is your policy.

  4. Is the agent audited (every decision logged)?

  5. Is there a human escalation tier, with a named on-call?
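The five questions above can be run as a simple seam audit. A minimal sketch, with hypothetical field names for each ownership check:

```python
from dataclasses import dataclass

@dataclass
class SeamAudit:
    """Hypothetical checklist for the platform/vendor seam; field names are illustrative."""
    platform_owns_telemetry: bool        # Q1: who owns the telemetry pipeline?
    platform_owns_runbooks: bool         # Q2: who owns the runbook catalog?
    platform_owns_confidence_band: bool  # Q3: who defines the confidence band?
    every_decision_audited: bool         # Q4: is the agent audited?
    named_oncall_escalation: bool        # Q5: is there a named human escalation tier?

    def gaps(self):
        checks = {
            "telemetry pipeline is vendor-owned": self.platform_owns_telemetry,
            "runbook catalog is vendor-owned": self.platform_owns_runbooks,
            "confidence band is vendor-defined": self.platform_owns_confidence_band,
            "agent decisions are not audited": self.every_decision_audited,
            "no named on-call escalation tier": self.named_oncall_escalation,
        }
        return [gap for gap, ok in checks.items() if not ok]

audit = SeamAudit(True, True, False, True, False)
gaps = audit.gaps()  # two gaps: confidence band and escalation tier
```

Any non-empty gaps list means the seam is unmapped; a failure on question 3 means you are renting your own policy.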


Real-world scenario, how this plays out under pressure

The setup. Greendell Critical Infra (transit cybersecurity) ran a 24/7 SOC and faced a 2026 Microsoft Sentinel AI rollout. The split was drawn early: the platform team owned the telemetry; the vendor owned the agent loop. The team rebuilt the substrate first (OpenTelemetry + OCSF), authored Sigma rules in a Git repo with CI red-samples, structured the on-call runbooks so agents could execute reversible steps, and let the vendor's agent loop sit on top. The human tier moved up the stack: exception handling and confidence-band policy.

The lesson the team wrote on the whiteboard. Vendors ship the loop; platforms own the substrate. This piece walks the telemetry pipeline, the Sigma + OCSF detection-as-code, the structured runbook format, and the red-vs-green test suite that gated every rule before it reached production.


Concept breakdown: what we are actually building

The concept in one paragraph. An agentic SOC is a layered system: telemetry (OpenTelemetry, OCSF) feeds a detection layer (Sigma rules in Git, CI-tested), which feeds a correlation layer (graph engines), which feeds a response layer (structured runbooks, agents that execute reversible steps). Confidence bands decide what the agent runs unsupervised and what escalates to a human. Platform engineering owns the substrate (telemetry pipeline, runbook catalog, policy); vendor agentic-SOC products own the loop and the UI on top. Detection becomes code, runbooks become YAML, and humans move up the stack to exception handling and policy setting.
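That layer chain can be sketched end to end in a few lines. Every function name, score, and threshold here is illustrative, not a vendor API:

```python
from typing import Optional

# Telemetry layer: reshape a raw event into a vendor-neutral (OCSF-style) record.
def normalize(raw: dict) -> dict:
    return {"actor": raw.get("user"), "cmd": raw.get("cmd", "")}

# Detection layer: a Sigma-style match returning a confidence score, or None.
def detect(event: dict) -> Optional[float]:
    if event["cmd"].endswith(("/bash", "/sh", "/zsh")):
        return 0.9
    return None

# Response layer: the confidence band decides autonomous vs escalate.
def respond(confidence: float, auto_threshold: float = 0.85) -> str:
    return "run_runbook" if confidence >= auto_threshold else "escalate_to_human"

event = normalize({"user": "svc-deploy", "cmd": "/bin/bash"})
confidence = detect(event)
action = respond(confidence) if confidence is not None else "no_alert"
```

The correlation layer is elided here; in production it sits between detect and respond and raises or lowers the confidence score before the band is consulted.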


The reference architecture

Platform owns the substrate: telemetry (OTel + OCSF), policy-as-code runbooks, confidence bands. Vendor owns the agent loop and UI. The seam is documented and auditable.

Architecture notes:

  • OpenTelemetry + OCSF as the telemetry substrate.

  • Runbook catalog: structured YAML, version-controlled, CI-tested.

  • Confidence band policy: platform-owned.

  • Agent loop: vendor-owned; consumes telemetry, executes runbooks.

  • Human escalation tier with named on-call.

Manifests: code/


End-to-end implementation guide

A precise build order from zero to production, with the manifests and scripts the team actually shipped. Every block below corresponds to a file in code/ so you can read each step in isolation, then run the suite together.

Step 1: Pipe telemetry through OpenTelemetry and OCSF

Detection is only as good as the substrate. OpenTelemetry collects logs, metrics, and traces; the OCSF normalizer reshapes them into a vendor-neutral schema. The collector below routes every event into both the SIEM and a long-term object store.

receivers:
  otlp: { protocols: { grpc: {}, http: {} } }
processors:
  ocsf_normalizer: { schema_version: "1.2" }
exporters:
  splunk_hec:
    endpoint: https://splunk.example.com:8088
    token: "${SPLUNK_TOKEN}"
  s3:
    endpoint: https://s3.example.com
    bucket: ocsf-events
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [ocsf_normalizer]
      exporters:  [splunk_hec, s3]
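The `ocsf_normalizer` processor above is assumed to be a custom component rather than a stock collector-contrib processor. The sketch below shows the kind of field mapping such a normalizer would perform, reshaping a raw process event into OCSF Process Activity (class_uid 1007); the input field names are hypothetical:

```python
# Illustrative OCSF mapping: raw process event -> Process Activity (class_uid 1007).
def to_ocsf_process_activity(raw: dict) -> dict:
    return {
        "class_uid": 1007,                  # OCSF class: Process Activity
        "activity_id": 1,                   # 1 = Launch
        "time": raw["timestamp"],
        "process": {"file": {"path": raw["image"]}},
        "actor": {"process": {"file": {"path": raw["parent_image"]}}},
        "metadata": {"version": "1.2"},     # matches schema_version in the config
    }

evt = to_ocsf_process_activity({
    "timestamp": 1767225600,
    "image": "/bin/bash",
    "parent_image": "/usr/bin/containerd-shim-runc-v2",
})
```

Normalizing at the collector, not in the SIEM, is what keeps the detection rules vendor-portable.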

Step 2: Author detection rules in Sigma, test in CI

Sigma is the source of truth; rules compile to Splunk, Elastic, Chronicle, or CrowdStrike. Each rule ships with a red-sample (an event that should fire) and a green-sample (an event that should not). CI runs both before merge.

# Sigma detection rule (vendor-agnostic, compiles to Splunk/Elastic/Chronicle).
title: Suspicious container exec of shell
id: 6fa5e0c8-5a6a-4bd4-ae3d-9b55e0f1a001
status: stable
description: Detects an exec into a container that launches a shell after startup.
logsource:
    category: process_creation
    product: linux
detection:
    selection_proc:
        Image|endswith:
            - '/bash'
            - '/sh'
            - '/zsh'
    selection_parent:
        ParentImage|endswith:
            - '/containerd-shim-runc-v2'
            - '/runc'
    condition: selection_proc and selection_parent
falsepositives:
    - Maintenance exec
    - Kubernetes kubectl exec by trusted operators
level: high
tags:
    - attack.execution
    - attack.t1609
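The red/green discipline around this rule can be modeled with a toy matcher: the condition is `selection_proc and selection_parent`, so a red sample must satisfy both selections and a green sample must miss at least one. This stand-in is for illustration, not sigma-cli itself:

```python
# Toy evaluator for the rule's condition: selection_proc AND selection_parent.
SHELLS = ("/bash", "/sh", "/zsh")
RUNTIMES = ("/containerd-shim-runc-v2", "/runc")

def rule_fires(event: dict) -> bool:
    return (event["Image"].endswith(SHELLS)          # selection_proc
            and event["ParentImage"].endswith(RUNTIMES))  # selection_parent

red = {"Image": "/bin/bash", "ParentImage": "/usr/bin/containerd-shim-runc-v2"}
green = {"Image": "/usr/bin/python3", "ParentImage": "/usr/bin/containerd-shim-runc-v2"}

assert rule_fires(red)        # red sample must fire before the rule can merge
assert not rule_fires(green)  # green sample must stay quiet
```

CI runs the real equivalent of both assertions against the compiled backend query before any rule reaches production.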

Step 3: Compile Sigma to your SIEM and ship via CI

sigma-cli compiles to vendor SPL/EQL/YARA-L. The CI pipeline below compiles, deploys, and verifies the rule fires against a known-bad sample before promoting to production.

name: sigma-ship
on: { pull_request: {}, push: { branches: [main] } }
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install sigma-cli pysigma-backend-splunk
      - run: sigma convert -t splunk -p splunk_windows ./rules > out.spl
      - run: ./test/red-sample.sh < out.spl
      - if: github.ref == 'refs/heads/main'
        run: ./scripts/deploy-splunk.sh out.spl

Step 4: Convert tribal runbooks into structured YAML

Agents cannot execute Confluence pages; humans skim them under pressure. The structured runbook below has preconditions, reversible steps, verification gates, and a mandatory cleanup. The same YAML drives an agentic SOC and an on-call human.

# Structured runbook: handle a container-exec alert.
name: container-exec-triage
version: 1
preconditions:
  - cluster: { in: [prod-a, prod-b, prod-c] }
  - alert_severity: { gte: high }
  - on_call: { available: true }
steps:
  - id: gather_context
    reversible: true
    action: shell
    command: kubectl describe pod ${alert.pod} -n ${alert.namespace}
  - id: check_image_provenance
    reversible: true
    action: shell
    command: cosign verify ${alert.image}
  - id: cordon_node
    reversible: true
    action: shell
    command: kubectl cordon ${alert.node}
  - id: snapshot_pod
    reversible: true
    action: shell
    command: crictl ps -a --name ${alert.pod}
  - id: human_gate
    reversible: false
    action: approval
    approvers: [sre-oncall, security-lead]
  - id: terminate_pod
    reversible: false
    action: shell
    command: kubectl delete pod ${alert.pod} -n ${alert.namespace}
cleanup:
  - kubectl uncordon ${alert.node}
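An executor for a runbook like this has one invariant: reversible steps may run autonomously, the first irreversible step halts for approval, and any failure triggers the cleanup block. A minimal sketch of that loop, with hypothetical step and callback shapes:

```python
# Hypothetical runbook executor enforcing the reversibility contract from the YAML.
def run_runbook(steps, execute, cleanup):
    completed = []
    for step in steps:
        if not step["reversible"]:
            # First irreversible step: stop and wait for the human gate.
            return {"status": "awaiting_approval", "gate": step["id"], "completed": completed}
        try:
            execute(step)
            completed.append(step["id"])
        except Exception:
            cleanup()  # e.g. kubectl uncordon ${alert.node}
            return {"status": "rolled_back", "failed": step["id"], "completed": completed}
    return {"status": "done", "completed": completed}

steps = [
    {"id": "gather_context", "reversible": True},
    {"id": "cordon_node", "reversible": True},
    {"id": "human_gate", "reversible": False},
]
result = run_runbook(steps, execute=lambda s: None, cleanup=lambda: None)
```

The same loop serves both consumers: an agent calls it directly, and an on-call human walks the identical step order by hand.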

Step 5: Define the confidence band the agent operates within

The platform owns the policy that decides what the agent does autonomously and what escalates. Below: a YAML that the agentic SOC platform consumes, mapping action classes to confidence thresholds.

apiVersion: socops.example.com/v1
kind: ConfidenceBand
metadata: { name: tier-1-policies }
spec:
  bands:
    - action_class: enrichment
      auto_threshold: 0.5
    - action_class: containment_reversible
      auto_threshold: 0.85
      escalation_threshold: 0.7
    - action_class: containment_irreversible
      auto_threshold: 1.01     # never autonomous
      escalation_threshold: 0.0
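A sketch of how an agent loop might consume that policy: the thresholds mirror the YAML above, the enrichment escalation floor of 0.0 is an assumption (the manifest omits it), and the decision function is illustrative:

```python
# Thresholds copied from the ConfidenceBand manifest; enrichment floor assumed 0.0.
BANDS = {
    "enrichment":               {"auto": 0.5,  "escalate": 0.0},
    "containment_reversible":   {"auto": 0.85, "escalate": 0.7},
    "containment_irreversible": {"auto": 1.01, "escalate": 0.0},  # never autonomous
}

def decide(action_class: str, confidence: float) -> str:
    band = BANDS[action_class]
    if confidence >= band["auto"]:
        return "autonomous"
    if confidence >= band["escalate"]:
        return "escalate_to_human"
    return "drop"

decide("containment_reversible", 0.90)    # autonomous
decide("containment_reversible", 0.75)    # escalates to a human
decide("containment_irreversible", 0.99)  # escalates: 0.99 never clears 1.01
```

The 1.01 threshold is the point of the design: irreversible containment cannot clear the autonomous bar no matter how confident the agent is.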

Security considerations

  • IAM: the agentic SOC platform is a tenant of the platform's identity plane: telemetry collectors, detection runners, and response agents each carry SPIFFE SVIDs. Vendor SaaS agentic SOCs federate via OIDC into a least-privilege cross-account role.

  • Secrets management: detection rules carry no secrets; the playbook layer references secrets via Vault paths that resolve at runtime. Tokens for the SOC platform's external integrations rotate on the same cadence as the rest of the platform.

  • Vulnerability scanning: Sigma rule packs live in Git; CI scans them for syntax, performance regressions, and references to deprecated fields. Telemetry collectors are signed and admission-gated like any other workload.

  • Network policies: telemetry flows out to the SIEM only; collectors do not reach the internet directly. Vendor agent loops integrate via dedicated VPC endpoints with mTLS, not public APIs.

  • Confidence bands and human gates: policy declares which action classes the agent runs unsupervised, which escalate, and which always wait for a human; every autonomous action is audit-logged with the structured intent and the rule that justified it.


Testing strategy

Unit, integration, and chaos exercises that gate the rollout. Run each in a non-production cluster first; expand to staging once the green-path tests pass and the negative tests reject the bad input the way the policy says they will.

Test 1: Sigma red-sample fires the rule

sigma test ./rules/container-shell-exec.yml --backend splunk --sample test/red.evtx

Expected: MATCHED against the red sample; rule promoted to deploy.

Test 2: Green-sample stays quiet

sigma test ./rules/container-shell-exec.yml --backend splunk --sample test/green.evtx

Expected: NO MATCH; baseline kubectl exec by trusted operators is not flagged.

Test 3: Runbook executes through the gate

runbook run container-exec-triage --simulate --alert ./testdata/alert.json

Expected: Steps gather_context, cordon_node run; terminate_pod waits for human approval.


Scaling and optimization

  • Horizontal scaling: the telemetry pipeline scales with traffic; OpenTelemetry collectors are stateless, so add a partition layer at the OCSF normalizer. Sigma rule evaluation runs in the SIEM and inherits the SIEM's scaling model.

  • Vertical scaling: detection rules with high false-positive rates dominate evaluation cost; tune via the FP budget so the platform stays cost-bounded. The agent loop's cost is per-event under autonomous resolution; budget per-action per-class.

  • Cost optimization: Sigma + OCSF avoids vendor-lock and lets you migrate SIEMs without rewriting rules; the savings show up at procurement renewal. The agent loop saves analyst hours; track the autonomous-resolution rate as a KPI alongside cost.

  • Performance tuning: structured runbooks let agents execute without paging; tune the confidence-band thresholds against the FP rate so escalations are precision-bounded.
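The "precision-bounded escalations" idea can be made concrete: given historical (confidence, outcome) pairs from past alerts, pick the lowest auto-threshold whose precision clears the false-positive budget. The data and budget below are invented for illustration:

```python
# Illustrative threshold tuner: lowest auto_threshold meeting the precision budget.
def tune_threshold(history, min_precision=0.95, candidates=(0.5, 0.7, 0.85, 0.95)):
    """history: list of (confidence, was_true_positive) pairs from past alerts."""
    for t in candidates:  # lowest threshold first, i.e. the most automation
        above = [tp for conf, tp in history if conf >= t]
        if above and sum(above) / len(above) >= min_precision:
            return t
    return None  # no candidate threshold meets the budget

history = [(0.90, True), (0.92, True), (0.60, False),
           (0.88, True), (0.70, False), (0.95, True)]
threshold = tune_threshold(history)
```

With this toy history, 0.5 and 0.7 fail the budget (false positives at 0.60 and 0.70 drag precision down) and 0.85 is the first threshold where every historical alert above it was a true positive.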


Failure scenarios and recovery

  1. Vendor changes telemetry requirements mid-contract. Platform owns the telemetry contract; vendor adapts.

  2. Confidence band drifts because agent tuning evolves. Lock policy at a cadence; review quarterly.

  3. Runbook fails mid-execution; agent cannot recover. Reversibility tag per step; rollback on failure.

When NOT to do this

Small orgs without a dedicated SOC may not be able to justify the platform-side investment the vendor loop depends on. For orgs with 24/7 SOC operations, the platform-owned substrate is the baseline.


What to ship this quarter

  • Stand up OTel + OCSF telemetry substrate.

  • Author 20 structured runbooks for common incidents.

  • Define the confidence-band policy.

  • Pilot one vendor agentic SOC.

  • Document the platform-vendor seam.

Production observability

  • Mean time to detect (MTTD) and mean time to respond (MTTR) tracked per detection class.

  • Sigma rule coverage by MITRE ATT&CK technique; gaps drive the next rule sprint.

  • Confidence-band autonomous-resolution rate; a sudden drop indicates the agent's signal degraded.

  • Runbook execution success rate; runbooks failing mid-execution are bugs, not bad luck.

  • False-positive budget per rule; alert fatigue is the slowest, surest way to lose a SOC.
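The autonomous-resolution rate is the one KPI worth wiring first; a sketch, assuming a hypothetical audit-log record shape:

```python
# Hypothetical audit-log records; one per agent decision, as the policy requires.
audit_log = [
    {"alert_id": "a1", "outcome": "autonomous"},
    {"alert_id": "a2", "outcome": "escalated"},
    {"alert_id": "a3", "outcome": "autonomous"},
    {"alert_id": "a4", "outcome": "autonomous"},
]

def autonomous_resolution_rate(log) -> float:
    return sum(1 for r in log if r["outcome"] == "autonomous") / len(log)

rate = autonomous_resolution_rate(audit_log)  # alert the SOC lead on a sudden drop
```

Track the trend, not the absolute number: a sudden drop means the agent's input signal degraded, not that analysts got slower.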


Tech components

Microsoft Sentinel AI, Splunk Agentic SOC (RSAC 2026), Google SecOps + Torq, Stellar Cyber, Radiant Security, OpenTelemetry pipeline, OCSF schema, structured runbook library.


Final word

Vendors ship the agent loop. Platforms own the substrate. The seam between them is where every agentic-SOC rollout succeeds or fails; map the seam before you sign the contract.


Further reading

  1. Microsoft Security Blog, 'The agentic SOC' (April 2026): the flagship vendor framing.

  2. Splunk Agentic SOC launch at RSAC 2026: the parallel vendor framing.

  3. Elastic Security Labs, 'Why 2026 is the year to upgrade to an Agentic AI SOC': the industry synthesis.

See references.md for the full bibliography.

S10: The Defender's Agentic Stack

Part 1 of 1

Microsoft, Splunk, and Google SecOps shipped agentic SOCs in 2026. Red-team agents recon 24/7. Blue-team agents triage in seconds. The tempo shifted. Human judgment retreated to fewer but higher-stakes calls. Your detection stack, runbooks, and red-team rules of engagement all have to catch up. Platform owns the substrate, not the agent. Telemetry pipeline, policy-as-code runbooks, detection-as-code libraries, and the confidence-band policy that decides what the agent runs unsupervised. This is a platform-engineering charter, not a vendor buy.