Coding LLMs can write production code, pass test suites, and refactor entire codebases. That part is familiar enough now that it's easy to forget how recent it is.
Where they still struggle is in places that look superficially similar but behave quite differently in practice. Ask one to safely roll out a schema migration on a sharded database under live traffic, and the gap shows up almost immediately.
The usual explanation is that there isn't enough training data on production engineering, which isn't wrong but doesn't quite get at what's going on. What matters more is that the work doesn't have the same kind of feedback loop.
Why code works
When people talk about why coding works well for LLMs, they often point to the volume of public code. That's part of it, but it's mostly a surface answer. The more important property is that code carries its own notion of correctness with it. Given a candidate solution, you can usually do something concrete with it: compile it, run the tests, compare the output against what you expected, and get a clear signal back without needing a person to interpret it. Crucially, you can do that cheaply enough that it's practical to repeat at scale. You can even do it synthetically.
That's what made reinforcement learning from verifiable rewards (RLVR) actually work in practice for coding and maths, and what drove most of the 2025 capability jump. You don't need a person to score a million attempts if the problem can be reduced to something that returns a clear pass-or-fail signal, and once you have that, the model can iterate.
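To make that pass-or-fail loop concrete, here's a minimal sketch of what a verifiable reward for code can look like, assuming a pytest-style harness; the function name and file layout are invented for illustration, not a description of any particular training pipeline.

```python
import os
import subprocess
import tempfile

def code_reward(candidate_source: str, test_source: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the candidate passes its tests, else 0.0."""
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "solution.py"), "w") as f:
            f.write(candidate_source)
        with open(os.path.join(workdir, "test_solution.py"), "w") as f:
            f.write(test_source)
        try:
            # pytest exits 0 only if every test passed; that exit code is the
            # whole reward signal, cheap enough to collect millions of times.
            result = subprocess.run(
                ["pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```

No rater, no interpretation: the exit code is the verdict, which is exactly the property production engineering work lacks.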
Andrej Karpathy's framing is the cleanest version of this: the predictive feature for "will AI do this well" used to be specifiability (whether a person can write the algorithm down), and the better predictor now is verifiability (whether the result can be checked or scored automatically). Code happens to satisfy both, as do maths problems and a certain class of logic tasks, which makes the recent progress less surprising once you look at it that way.
The shift: verifiability replaced specifiability as the predictor of which jobs LLMs absorb next.
Production engineering fails the test
Some pieces of production engineering are verifiable, such as SLO maths, error budget arithmetic, and capacity forecasting; I'll come back to those later. The central production engineering work isn't, and it fails the verifiability bar in a structural way.
Correctness is multi-dimensional and time-dependent: availability, latency, data integrity, blast radius, cost, recovery time. The signal arrives slowly, probabilistically, and often only via incidents. The same action can be correct in one process state and catastrophic in another, such as deploying during steady state versus during a partial regional outage. From the outside it looks like the same change, the same Terraform, the same rollout plan, but the outcome depends on the surrounding conditions. And there's no cheap way to explore the space of possible actions, because each attempt depends on a real distributed system, real traffic, and real time.
High-fidelity simulation or digital-twin approaches do work in some adjacent fields, but in production systems the cost of reproducing the traffic mix, dependency state, and partial-failure modes of a real system is enormous, and the simulator's own process model becomes a problem in its own right.
Even with abundant training data, there'd be no automated reward function to make an RLVR feedback loop work. The data scarcity that people usually point to is itself downstream of this: nobody builds large public corpora of "good rollout decisions" because there's no cheap way to label them.
You could try to use RLHF instead and let humans rate the outputs. The trouble is that production engineering decisions don't fit the shape RLHF works well on. A "completion" isn't a paragraph of text the rater can read in seconds; it's a sequence of actions taken over hours against a real system, with outcomes that resolve with significant lag. And the rater pool that can credibly judge whether a particular resharding sequence was the right call given that day's conditions is essentially the on-call engineers themselves, who are simultaneously the most expensive humans in the building and the people whose understanding of the system is necessarily incomplete. You end up with RLHF where the labels are themselves expert judgements under uncertainty, not preferences between two clear options. That can produce some signal, but it likely doesn't scale the way RLHF on text does.
There's no publishing format either
Even setting the verification problem aside, there's a second gap: production engineering doesn't have a standard way to publish itself.
Open source code ships with three co-located artifacts:
- Spec - README, docs, type signatures
- Code - the implementation
- Tests - executable verification
That triple is what fed coding LLMs. Anyone can publish, anyone can consume, the format is universal, and the parts are mechanically linked.
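As a toy illustration of how tightly those three artifacts sit together, here's a single file with an invented function: the docstring plays the spec, the body is the code, and the test is the executable verification. The function itself is made up; the co-location is the point.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Spec: return the rolling mean of `values` over a window of size
    `window`; raise ValueError if the window doesn't fit the input."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Code: the implementation the spec describes.
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def test_moving_average():
    # Tests: executable verification, mechanically linked to the code above.
    assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
```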
Production engineering has fragments of the same ideas, but they don't line up the same way:
- IaC (Terraform, Helm) covers part of "the code", i.e. what gets deployed.
- Runbooks cover part of "the spec" but they're prose, mostly internal, and assume context the reader shares.
- Metrics, dashboards, and SLOs are the runtime evidence but they're snapshots of a specific system, not portable artifacts.
- Post-mortems are the closest thing to "tests that failed" but they're sparse, sanitised, and per-incident.
There is no production-engineering-pattern-of-the-week repo. No LeetCode for "safely roll out a sharded schema change." The format doesn't exist, so the corpus doesn't exist, so the training signal doesn't exist.
It's worth acknowledging what does get published. Google's SRE books are out there, Cloudflare, Stripe and others routinely write detailed post-mortems, the chaos engineering literature is reasonably mature, and there's a steady flow of conference talks and newsletters. This isn't to say that there's no public knowledge; there clearly is, and some of it is good. The point is that none of it composes into something an LLM training pipeline can consume the way it consumes code. A post-mortem is prose about a unique incident, not an executable artifact paired with the conditions that produced it. The pieces are there but they don't snap together.
The incentive to publish is also inverted for operations. Open source code gets shared because publishing implementations creates network effects: more users, more contributors, better libraries. Production engineering knowledge is the opposite: rollout playbooks are competitive advantage, incident patterns reveal architectural weaknesses, and runbooks contain information attackers would love. Companies have strong reasons not to publish, and legal review usually takes care of whatever they were willing to share.
So the corpus gap isn't a data-collection problem waiting for someone to solve it. The format and the incentives both work against it.
STPA: why this isn't fixable
System-Theoretic Process Analysis (Leveson) frames safety as a control problem: every system has a control structure, and accidents happen when controllers issue unsafe control actions or fail to issue needed ones. The key concept is the process model: each controller's current understanding of system state determines its actions, and unsafe actions come from models that are incomplete or wrong.
In a production system, that state is never fully captured in any single artifact. A database might be mid-migration while traffic has been shifted away, a feature flag is half-rolled, a downstream dependency is degraded, and the on-call has just paged for something unrelated. Some of that lives in dashboards, some in human decisions made minutes earlier, and some only in the running system's behaviour.
The 2017 AWS S3 outage is a well-known example. An operator ran a playbook command intended to remove a small number of servers from a subsystem used by S3's billing process. One of the inputs was mistyped, so a much larger set of servers was removed than intended, and those servers turned out to support two other S3 subsystems as well, including the index subsystem that holds metadata and location information for every object in the region. Both subsystems had to be fully restarted, which took hours because they hadn't been restarted at that scale in years.
Two layers of process-model gap show up here, and STPA names them both. The operator's model of the command itself didn't catch the typo before submission. And the operator's model of the wider system didn't anticipate that removing those particular servers would cascade into the index and placement subsystems. The artifacts (the playbook, the command) were correct in isolation; the live state of the system those artifacts were acting on wasn't fully captured anywhere the operator could check.
Correctness is a property of the control loop as a whole, not of the configuration or code in front of you. The right action depends on a live, evolving understanding of the system, which is hard to capture in a way that can be checked automatically, and harder still to turn into a reusable training signal. STPA gives a cleaner way to describe what's going on here: the verification function isn't a property of the artifact, it's a property of the loop, and that isn't something you can scrape.
The contrast in practice: for coding, writing a REST endpoint with input validation and a test suite fits neatly into a cycle where the result can be checked quickly and unambiguously. For operations, safely resharding a live distributed database handling sustained writes is different: the implementation matters, but most of the work sits around it, in sequencing, observability, coordination with other systems, and deciding when to proceed or stop. The first task fits in a pytest run; the second plays out over hours and is judged against broader system and business outcomes, primarily by people.
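For the first task, a minimal, framework-free sketch (all names invented) shows how small that loop really is: a validation function and its tests, with correctness settled by a single pytest run.

```python
import pytest

def validate_signup(payload: dict) -> dict:
    """Validate a hypothetical signup body; return cleaned fields or raise."""
    errors = {}
    email = payload.get("email", "")
    if "@" not in email:
        errors["email"] = "must be a valid email address"
    age = payload.get("age")
    if not isinstance(age, int) or age < 13:
        errors["age"] = "must be an integer of at least 13"
    if errors:
        raise ValueError(errors)
    return {"email": email, "age": age}

def test_rejects_bad_email():
    # The whole verification loop: run pytest, read pass or fail.
    with pytest.raises(ValueError):
        validate_signup({"email": "not-an-email", "age": 30})

def test_accepts_valid_payload():
    assert validate_signup({"email": "a@b.com", "age": 30}) == {"email": "a@b.com", "age": 30}
```

There is no equivalent file you could write that settles, in milliseconds, whether a resharding sequence was the right call that day.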
The verifiable slices
The pieces of production engineering that are verifiable are where AI tooling has already started landing, and where it'll keep landing fastest: SLO maths, error budget arithmetic, capacity forecasting, query optimisation, config linting, and IAM policy diffing. These have testable properties or numeric outputs you can check, which means they fit the RLVR shape that drove the 2025 capability jump. They're not the judgement-heavy core of the role, but they are the bounded, tractable domains where automated tooling has a real foothold. Anomaly detection, alert correlation, and AI-assisted RCA features are already shipping inside major observability platforms; that's the shape of "verifiable slice" automation in production.
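Error budget arithmetic is the clearest case. The helper names below are mine, but the arithmetic is the standard SLO calculation: a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime, and that's a number a machine can check.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime; 20 minutes of
# incidents so far leaves a bit over half the budget.
print(error_budget_minutes(0.999))             # 43.2
print(round(budget_remaining(0.999, 20), 3))   # 0.537
```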
Two caveats matter. The first is that these are the practices of fairly sophisticated production engineering organisations. They're not where teams start, and they're not the bulk of what production engineering work looks like day-to-day. The second is that they depend on the basics being in place: working telemetry across metrics, logs, and traces, and joinable identifiers that let you correlate them. Without that foundation, "verifiable" is just numbers sitting next to each other.
Canary analysis and progressive rollouts are an interesting middle case. They're the industry's existing partial answer to the verification problem in large systems. You take a fundamentally non-verifiable action (a deploy) and approximate verifiability through controlled exposure and statistical comparison. These don't close the gap, but they show what manufactured verifiability looks like in production, and they're the closest thing to an automated reward function the field has built so far.
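A minimal sketch of the statistical comparison underneath that, assuming a single error-rate metric and a two-proportion z-test; real canary analysis compares many metrics with much more care, and the threshold here is an arbitrary placeholder.

```python
from math import sqrt

def canary_verdict(baseline_errors: int, baseline_requests: int,
                   canary_errors: int, canary_requests: int,
                   z_threshold: float = 2.0) -> str:
    """Two-proportion z-test on error rates: 'fail' if the canary looks
    significantly worse than the baseline, else 'pass'. The single metric
    and fixed threshold are simplifying assumptions."""
    p_base = baseline_errors / baseline_requests
    p_canary = canary_errors / canary_requests
    pooled = (baseline_errors + canary_errors) / (baseline_requests + canary_requests)
    stderr = sqrt(pooled * (1 - pooled) * (1 / baseline_requests + 1 / canary_requests))
    if stderr == 0:
        return "pass"
    z = (p_canary - p_base) / stderr
    return "fail" if z > z_threshold else "pass"

# 0.2% baseline error rate vs 0.5% on the canary slice: roll back.
print(canary_verdict(200, 100_000, 50, 10_000))  # fail
```

The deploy itself is still unverifiable; what's been manufactured is a proxy signal cheap enough to act on automatically.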
So when AI tooling shows up usefully in production environments, it shows up here: in the bounded, instrumented, statistically-tractable domains. Not in end-to-end autonomous incident response, which is where the demos always go but where the verifiability gap bites hardest.
One thing that is changing the picture, at least internally to organisations that capture it, is operational context. Incident calls, chat logs, and on-call coordination are increasingly recorded, which creates new internal sources of training signal that weren't really available before. That doesn't change the external picture much (the publishing-format gap remains) but it does mean the gap between organisations that have the data and those that don't is widening, not narrowing. There's a separate question about who actually builds production-engineering AI under these constraints, what kinds of provider end up dominating, and what that means if you're trying to adopt any of it. I'll come back to that in a follow-up post.
Thoughts
Coding has been a good fit for current AI approaches because it is verifiable, widely published, and structured in a way that links intent to outcome. Production engineering doesn't align with those properties in the same way, and the gap isn't really about volume of training material; it's that the work depends on a kind of system-level judgement built up over time through operating real systems, and that isn't something easily reduced to a corpus or a reward function.
The structural mismatch doesn't just limit what AI can do in production engineering, it also shapes who can build the tools at all. The data that matters is mostly internal or vendor-held, so capability follows access in a way that doesn't apply to coding models. The result is likely to look quite different from how coding capability spread.
More on that in another post.
