Dave Beckett

AI for Production Engineering - Who Gets to Build It

2026-05-07 09:00

The companion post makes the structural argument: production engineering doesn't fit the LLM-eats-domain pattern that coding does. Verifiability is missing, the publishing format is missing, and STPA (which puts correctness in the live control loop rather than the artifact) explains why both gaps are fundamental.

This post is about what that means in practice if you're running an SRE team and trying to work out what AI-for-ops actually buys you, where the useful tools are coming from, and what adopting any of them commits you to. The vendor landscape is busier and more layered than people often assume, and the layering matters because each group has access to a different slice of the data that drives capability, which in turn shapes what they can credibly offer you.

Where AI-for-ops is actually coming from

A useful first pass is to group the players by what operational data they can train on and what they can do with it. These are in no particular order.

  • Hyperscalers building for themselves. Amazon, Google, Microsoft, and the rest. They own the dashboards, logs, post-mortems, change-management trails, meeting transcripts, and training infrastructure (everything), and they're allowed to use all of it for their own operations. Google's SRE teams use their own internal tooling on top of Gemini for incident work; AWS and Azure ship native SRE agents into their respective clouds. Most of this isn't available to you unless you're inside one of those companies.
  • Observability and incident-management vendors training on aggregated customer data. Datadog and PagerDuty are the clearest examples. Datadog has spent years accumulating telemetry across thousands of customers and is now using it to train AI features under the Bits AI and Watchdog labels. PagerDuty's marketing quotes sixteen years of incident data, billions of events, and hundreds of millions of incidents per year. These vendors aren't hyperscalers, but their customers' combined operational corpus is a serious moat in its own right. If you're already a customer, this is the group where AI-for-ops shows up first, as new features in tools you already pay for.
  • Pure-play AI SRE startups. Resolve.ai is the headline example: founded by ex-Splunk people, it reached a $1B valuation in late 2025 and is targeting high-percentage autonomous resolution. Cleric and Traversal are other names worth knowing. These don't own the data; they integrate into your environment via APIs and bet that good agent design and good model use can beat the incumbent vendors' data advantage.
  • Companies with local telemetry but no AI capacity. Self-hosted Prometheus, Grafana, OpenTelemetry, ELK, or similar. You have your own incident and runbook tooling and decent operational discipline, but no internal ML team, and reasonable concerns about sending all your telemetry to a third party. The practical options here are constrained and worth working through deliberately; there's a section on that below.
  • Companies without mature telemetry at all. The AI question doesn't really apply here yet. The first move is the telemetry data plane, not the AI tool.

The reason this picture matters is that the second group is where most of the actual product activity is happening, and the fourth group is where most teams actually sit. The simpler "hyperscalers have it, no one else does" story skips both, and skips most of the choices a Head of SRE would actually be making.

Meeting capture is the new corpus

One genuinely new development deserves its own treatment: incident meeting bots have gone from novelty to standard feature in the last 18 months. Every serious incident-management vendor (for example PagerDuty with Scribe Agent, incident.io with Scribe, Rootly with Meeting Scribe) now ships a bot that joins Zoom, Meet, Teams, or Slack Huddle calls, transcribes in real time, and feeds the transcript into the incident timeline.

Several of these are built on top of Recall.ai, which is becoming the de facto infrastructure layer for meeting capture. The transcripts feed automated post-mortem drafts, late-joiner catch-up summaries, and, increasingly, training signal for the vendor's own AI features.

This is a real corpus shift, and different in kind from runbooks and post-mortems. It's reasoning narrated in real time by the people holding the process model — "page the database team, they own that shard," "don't restart node 3, it's still draining," "last time this looked like this, it was the CDN." That's chain-of-thought training data for production work, in a form that didn't meaningfully exist before.

Two things are worth being clear about, though. First, the verification function is still missing: the bot captures what people said, not whether the call was right. Joining transcripts to outcomes (good calls versus bad ones) is the next problem, and nobody has solved it cleanly. Second, the trust-boundary implications are significant: incident calls contain credentials shared verbally, customer data in screen-shares, vulnerability discussion, and vendor escalation paths. The transcripts are more sensitive than runbooks, not less. When you adopt a meeting bot, you're extending the trust boundary to whoever processes those transcripts, which is typically the incident-management vendor and sometimes Recall.ai underneath.

The main thing to notice is that the vendors who own meeting capture are accumulating one of the richest corpora the field has ever had. That's not necessarily bad, but it changes who has the training-signal advantage, and it's worth knowing when you're deciding whether to bring one of these bots into your incident process.

What if you have local telemetry but no AI capacity?

This is the group that simpler framings tend to skip, and it's probably the largest by company count: decent observability, no internal ML team, and healthy reasons not to send the firehose to a third party.

The choices are constrained and most of them involve trade-offs you should make deliberately:

  • Adopt an incumbent vendor. Datadog, Dynatrace, PagerDuty et al. Your operational data becomes part of their training corpus in exchange for AI features you couldn't build yourself. The economics often work out, but it's a real shift in the trust boundary.
  • Adopt a pure-play AI SRE startup. Cleric, Resolve.ai, Traversal. These integrate into your stack via APIs without necessarily ingesting your full telemetry. Lower data exposure, but you're betting on a smaller company with less proven longevity.
  • Run open-weight models locally. Against your own data, with prompting and retrieval rather than fine-tuning. The capability ceiling is lower than what the vendors can offer: increasingly viable for narrow, well-bounded tasks (config linting, SLO calculation, alert summarisation; see the sketch after this list), and still not viable for the harder agentic work.
  • Wait. The market hasn't settled, the early adopters are paying the integration tax, and the second-mover discount is real.
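As a concrete sketch of what the narrow, well-bounded end of that local-model option can look like, the snippet below summarises a batch of firing alerts with a locally hosted open-weight model. The endpoint shape and the model name are assumptions (an Ollama-style /api/generate on localhost, a model called llama3); substitute whatever local serving stack you actually run. The point is the shape of the task, not the specific server.

    # Hedged sketch: alert summarisation against a locally hosted open-weight model.
    # Assumes an Ollama-style HTTP API on localhost serving a model named "llama3";
    # both are placeholders for whatever local stack you actually run.
    import json
    import urllib.request

    LOCAL_MODEL_URL = "http://localhost:11434/api/generate"  # assumed endpoint shape
    MODEL_NAME = "llama3"                                     # assumed local model name

    def summarise_alerts(alerts: list[dict]) -> str:
        """Condense a batch of firing alerts into a short on-call summary."""
        alert_lines = "\n".join(
            f"- {a['name']} on {a['service']} since {a['since']}: {a['detail']}"
            for a in alerts
        )
        prompt = (
            "You are summarising firing alerts for an on-call engineer.\n"
            "Group related alerts, name the likely affected service, keep it under 5 lines.\n\n"
            f"Alerts:\n{alert_lines}\n\nSummary:"
        )
        body = json.dumps({"model": MODEL_NAME, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            LOCAL_MODEL_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    if __name__ == "__main__":
        print(summarise_alerts([
            {"name": "HighLatencyP99", "service": "checkout", "since": "09:12",
             "detail": "p99 at 2.4s against a 300ms SLO"},
            {"name": "ErrorRateSpike", "service": "checkout", "since": "09:14",
             "detail": "5xx at 4.1%"},
        ]))

Nothing here leaves your network, which is the whole appeal; the trade-off is that you own prompt quality, evaluation, and the upgrade treadmill yourself.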

None of these are obviously right. They depend on how much you trust your existing vendors, how much operational data you're willing to share, and how much you believe the AI-for-ops layer will actually improve your reliability versus just generating demos.

Own the joins

Whatever group you're in, one thing the AI-for-ops shift has done is sharpen the build-vs-buy question on observability and operational data. The data plane decisions you make now constrain what AI-for-ops looks like later.

If telemetry goes to provider A, the AI-for-ops layer goes to provider B, your meeting bot is provider C, and your infrastructure runs on D, you have at least four boundaries where data has to cross. At each boundary you lose context, fidelity, and ownership of the schema, and you add latency. Provider B is doing pattern recognition on whatever provider A chose to export, which is whatever your instrumentation produced, which is whatever your infrastructure happened to emit. The joins that make slices actually verifiable happen across vendors who don't share identifiers and don't see each other's data. When the joins do exist, they let you correlate a latency spike, a deploy, a config change, and a downstream error rate.

Owning the joins is as much a metadata discipline as a vendor choice. Joinable identifiers (consistent trace IDs, span attributes, resource naming across providers) determine whether the joins exist at all.
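To make that concrete, here's a minimal sketch of the correlation described above: given a latency spike, find the deploys and config changes that touched the same service in the preceding window. The event shapes and field names are assumptions standing in for whatever your providers actually emit; the join only works at all if the service identifier means the same thing in every source.

    # Hedged sketch: correlating a latency spike with recent deploys and config
    # changes via a shared service identifier and a time window. Field names are
    # illustrative; the join depends on the identifier being consistent everywhere.
    from datetime import datetime, timedelta

    def parse(ts: str) -> datetime:
        return datetime.fromisoformat(ts)

    # Three "providers", each emitting events that (hopefully) share a service identifier.
    latency_spikes = [
        {"service": "checkout", "at": "2026-05-07T09:12:00", "p99_ms": 2400},
    ]
    deploys = [
        {"service": "checkout", "at": "2026-05-07T09:05:00", "version": "v2.31.0"},
        {"service": "search",   "at": "2026-05-07T09:07:00", "version": "v1.9.2"},
    ]
    config_changes = [
        {"service": "checkout", "at": "2026-05-07T08:58:00", "key": "conn_pool_size"},
    ]

    def correlate(spike: dict, window: timedelta = timedelta(minutes=30)) -> dict:
        """Find deploys and config changes for the spike's service inside the window before it."""
        end = parse(spike["at"])
        start = end - window
        def related(event: dict) -> bool:
            return event["service"] == spike["service"] and start <= parse(event["at"]) <= end
        return {
            "spike": spike,
            "candidate_deploys": [d for d in deploys if related(d)],
            "candidate_config_changes": [c for c in config_changes if related(c)],
        }

    if __name__ == "__main__":
        print(correlate(latency_spikes[0]))

The code is trivial; the hard part is everything it assumes. If your deploy events come from provider D and your spikes from provider A, and the two disagree on what "checkout" is called, the loop above has nothing to join on, and no AI layer on top will fix that.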

A couple of things follow from that. The vendor consolidation pattern the AI-for-ops shift encourages (picking the same vendor for observability, incident management, and AI-for-ops) is a real hedge against the join problem, but also real lock-in, and worth being honest about. And the meeting-bot decision is now a data-plane decision in its own right, with its own trust-boundary implications. Treating it as a nice-to-have feature underweights what it is.

You don't have to build everything yourself, but you should know which boundaries cost you and which don't, and that's a question worth working through before the AI-for-ops layer goes in.

Thoughts

What's distinctive about how AI-for-ops looks in 2026, compared to coding LLMs, is that the data advantages compound. Public code spread because training data was public; production engineering data is private by default, which means the moats persist. Coding LLM capability diffused through public models; AI-for-ops capability concentrates in whoever owns the data, and the meeting-capture wave is making that concentration sharper, not weaker.

If you're leading SRE and trying to make decisions in this space, the takeaway is that the data plane matters more than the model. Where your telemetry lives, who can see it, who joins it together, and who records your incident calls — those decisions shape what AI-for-ops can do for you, far more than the model choice does. They're also the decisions that are hardest to unwind once made.

The structural argument behind this post is in the companion: The Verifiability Gap in Production Engineering.