Dave Beckett

The Verifiability Gap in Production Engineering

2026-05-04 09:00

Coding LLMs can write production code, pass test suites, and refactor entire codebases. That part is familiar enough now that it's easy to forget how recent it is.

Where they still struggle is in places that look superficially similar but behave quite differently in practice. Ask one to safely roll out a schema migration on a sharded database under live traffic, and the gap shows up almost immediately.

The usual explanation is that there isn't enough training data on production engineering, which isn't wrong, but doesn't quite get at what's going on. What matters more is that the work doesn't have the same kind of feedback loop.

Why code works

When people talk about why coding works well for LLMs, they often point to the volume of public code. That's part of it, but it's mostly a surface answer. The more important property is that code carries its own notion of correctness with it. Given a candidate solution, you can usually do something concrete with it: compile it, run the tests, compare the output against what you expected, and get a clear signal back without needing a person to interpret it. Crucially, you can do that cheaply enough that it's practical to repeat at scale. You can even do it synthetically.
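
As a sketch of how small that check can be, here is a toy pass-or-fail verifier, assuming a pytest-style test suite; the details are illustrative, not any particular training setup's reward function:

# run the candidate's tests; emit 1 if they pass, 0 if they don't
if python -m pytest -q >/dev/null 2>&1; then
    echo 1
else
    echo 0
fi

Everything that matters fits in the exit code of a test run, which is exactly what makes it cheap to repeat millions of times.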

That's what made reinforcement learning from verifiable rewards (RLVR) actually work in practice for coding and maths, and it's what drove most of the 2025 capability jump. You don't need a person to score a million attempts if the problem can be reduced to something that returns a clear pass-or-fail signal, and once you have that, the model can iterate.

Andrej Karpathy's framing is the cleanest version of this: the predictive feature for "will AI do this well" used to be specifiability (whether a person can write the algorithm down), and the better predictor now is verifiability (whether the result can be checked or scored automatically). Code happens to satisfy both, as do maths problems and a certain class of logic tasks, which makes the recent progress less surprising once you look at it that way.

The shift: verifiability replaced specifiability as the predictor of which jobs LLMs absorb next.

Production engineering fails the test

Some pieces of production engineering are verifiable, such as SLO maths, error budget arithmetic, and capacity forecasting; I'll come back to those later. The central PE work isn't, and it fails the verifiability bar in a structural way.

Correctness is multi-dimensional and time-dependent: availability, latency, data integrity, blast radius, cost, recovery time. The signal arrives slowly, probabilistically, and often only via incidents. The same action can be correct in one process state and catastrophic in another, such as deploying during steady state versus during a partial regional outage. From the outside it looks like the same change, the same Terraform, the same rollout plan, but the outcome depends on the surrounding conditions. And there's no cheap way to explore the space of possible actions, because each attempt depends on a real distributed system, real traffic, and real time.

High-fidelity simulation or digital-twin approaches do work in some adjacent fields, but in production systems the cost of reproducing the traffic mix, dependency state, and partial-failure modes of a real system is enormous, and the simulator's own process model becomes a problem in its own right.

Even with abundant training data, there'd be no automated reward function to make an RLVR feedback loop work. The data scarcity that people usually point to is itself downstream of this: nobody builds large public corpora of "good rollout decisions" because there's no cheap way to label them.

You could try to use RLHF instead and let humans rate the outputs. The trouble is that production engineering decisions don't fit the shape RLHF works well on. A "completion" isn't a paragraph of text the rater can read in seconds; it's a sequence of actions taken over hours against a real system, with outcomes that resolve with significant lag. And the rater pool that can credibly judge whether a particular resharding sequence was the right call given that day's conditions is essentially the on-call engineers themselves, who are simultaneously the most expensive humans in the building and the people whose understanding of the system is necessarily incomplete. You end up with RLHF where the labels are themselves expert judgements under uncertainty, not preferences between two clear options. That can produce some signal, but it likely doesn't scale the way RLHF on text does.

There's no publishing format either

Even setting the verification problem aside, there's a second gap: production engineering doesn't have a standard way to publish itself.

Open source code ships with three co-located artifacts:

  • Spec - README, docs, type signatures
  • Code - the implementation
  • Tests - executable verification

That triple is what fed coding LLMs. Anyone can publish, anyone can consume, the format is universal, and the parts are mechanically linked.

Production engineering has fragments of the same ideas, but they don't line up the same way:

  • IaC (Terraform, Helm) covers part of "the code", i.e. what gets deployed.
  • Runbooks cover part of "the spec" but they're prose, mostly internal, and assume context the reader shares.
  • Metrics, dashboards, and SLOs are the runtime evidence but they're snapshots of a specific system, not portable artifacts.
  • Post-mortems are the closest thing to "tests that failed" but they're sparse, sanitised, and per-incident.

There is no production-engineering-pattern-of-the-week repo. No LeetCode for "safely roll out a sharded schema change." The format doesn't exist, so the corpus doesn't exist, so the training signal doesn't exist.

It's worth acknowledging what does get published. Google's SRE books are out there, Cloudflare, Stripe and others routinely write detailed post-mortems, the chaos engineering literature is reasonably mature, and there's a steady flow of conference talks and newsletters. This isn't to say that there's no public knowledge; there clearly is, and some of it is good. The point is that none of it composes into something an LLM training pipeline can consume the way it consumes code. A post-mortem is prose about a unique incident, not an executable artifact paired with the conditions that produced it. The pieces are there but they don't snap together.

The incentive to publish is inverted for operations. Open source code gets shared because publishing implementations creates network effects: more users, more contributors, better libraries. Production engineering knowledge is the opposite: rollout playbooks are competitive advantage, incident patterns reveal architectural weaknesses, and runbooks contain information attackers would love. Companies have strong reasons not to publish, and legal review usually takes care of whatever they were willing to share.

So the corpus gap isn't a data-collection problem waiting for someone to solve it. The format and the incentives both work against it.

STPA: why this isn't fixable

System-Theoretic Process Analysis (Leveson) frames safety as a control problem, where every system has a control structure and accidents happen when controllers issue unsafe control actions or fail to issue needed ones. The key concept is the process model: each controller's current understanding of system state determines its actions, and unsafe actions come from models that are incomplete or wrong.

In a production system, that state is never fully captured in any single artifact. A database might be mid-migration while traffic has been shifted away, a feature flag is half-rolled, a downstream dependency is degraded, and the on-call has just paged for something unrelated. Some of that lives in dashboards, some in human decisions made minutes earlier, and some only in the running system's behaviour.

The 2017 AWS S3 outage is an example: an operator ran a playbook command intended to remove a small number of servers from a subsystem used by S3's billing process. One of the inputs was mistyped, so a much larger set of servers was removed than intended, and those servers turned out to support two other S3 subsystems as well, including the index subsystem that holds metadata and location information for every object in the region. Both subsystems had to be fully restarted, which took hours because they hadn't been restarted at that scale in years.

Two layers of process-model gap show up here, and STPA names them both. The operator's model of the command itself didn't catch the typo before submission. And the operator's model of the wider system didn't anticipate that removing those particular servers would cascade into the index and placement subsystems. The artifacts (the playbook, the command) were correct in isolation; the live state of the system that those artifacts were acting on wasn't fully captured anywhere the operator could check against.

Correctness is a property of the control loop as a whole, not of the configuration or code in front of you. The right action depends on a live, evolving understanding of the system, which is hard to capture in a way that can be checked automatically, and harder still to turn into a reusable training signal. STPA gives a cleaner way to describe what's going on here: the verification function isn't a property of the artifact, it's a property of the loop, and that isn't something you can scrape.

The contrast in practice. For coding, writing a REST endpoint with input validation and a test suite fits neatly into a cycle where the result can be checked quickly and unambiguously. For operations, safely resharding a live distributed database handling sustained writes is different: the implementation matters, but most of the work sits around it in sequencing, observability, coordination with other systems and deciding when to proceed or stop. The first task fits in a pytest run; the second plays out over hours and is judged against broader system and business outcomes, primarily by people.

The verifiable slices

The pieces of production engineering that are verifiable are where AI tooling has already started landing, and where it'll keep landing fastest: SLO maths, error budget arithmetic, capacity forecasting, query optimisation, config linting, and IAM policy diffing. These have testable properties or numeric outputs you can check, which means they fit the RLVR shape that drove the 2025 capability jump. They're not the judgement-heavy core of the role, but they are the bounded, tractable domains where automated tooling has a real foothold. Anomaly detection, alert correlation, and AI-assisted RCA features are already shipping inside major observability platforms; that's the shape of "verifiable slice" automation in production.

Two caveats matter. The first is that these are the practices of fairly sophisticated production engineering organisations. They're not where teams start, and they're not the bulk of what production engineering work looks like day-to-day. The second is that they depend on the basics being in place: working telemetry across metrics, logs, and traces, and joinable identifiers that let you correlate them. Without that foundation, "verifiable" is just numbers sitting next to each other.

Canary analysis and progressive rollouts are an interesting middle case. They're the industry's existing partial answer to the verification problem in large systems. You take a fundamentally non-verifiable action (a deploy) and approximate verifiability through controlled exposure and statistical comparison. These don't close the gap, but they show what manufactured verifiability looks like in production, and they're the closest thing to an automated reward function the field has built so far.
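
As a toy illustration of that manufactured verifiability (the metrics file format and the 5% threshold are invented for the example), the decision reduces to comparing two cohorts and returning pass or fail:

# rollout-metrics lines: <cohort> <requests> <errors>
# pass if the canary's error rate is within 5% of the baseline's
awk '{ rate[$1] = $3 / $2 }
     END { exit !(rate["canary"] <= rate["baseline"] * 1.05) }' rollout-metrics &&
  echo promote || echo "roll back"

Real canary systems are statistically far more careful than this, but the shape is the same: turn a deploy into something that answers pass or fail.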

So when AI tooling shows up usefully in production environments, it shows up here: in the bounded, instrumented, statistically-tractable domains. Not in end-to-end autonomous incident response, which is where the demos always go but where the verifiability gap bites hardest.

One thing that is changing the picture, at least internally to organisations that capture it, is operational context. Incident calls, chat logs, and on-call coordination are increasingly recorded, which creates new internal sources of training signal that weren't really available before. That doesn't change the external picture much (the publishing-format gap remains) but it does mean the gap between organisations that have the data and those that don't is widening, not narrowing. There's a separate question about who actually builds production-engineering AI under these constraints, what kinds of provider end up dominating, and what that means if you're trying to adopt any of it. I'll come back to that in a follow-up post.

Thoughts

Coding has been a good fit for current AI approaches because it is verifiable, widely published, and structured in a way that links intent to outcome. Production engineering doesn't align with those properties in the same way, and the gap isn't really about volume of training material; it's that the work depends on a kind of system-level judgement built up over time through operating real systems, and that isn't something easily reduced to a corpus or a reward function.

The structural mismatch doesn't just limit what AI can do in production engineering, it also shapes who can build the tools at all. The data that matters is mostly internal or vendor-held, so capability follows access in a way that doesn't apply to coding models. The result is likely to look quite different from how coding capability spread.

More on that in another post.

Permalink

Text Files as Tables

2026-05-01 09:00

This is a companion post to Twitter Hadoop Breakfix, Four Years On. That post was about the architecture of the system as a whole. This one is about the particular part that did the most work per line of code: sorted text files, manipulated with comm, join, awk, grep, sort -u.

I would claim that this style is a small line-oriented relational algebra. Once you see that, a lot of operational shell stops looking like a pile of text tricks and starts looking like a database you happen to be able to read with cat.

One row per line

The breakfix system in the companion post used a working directory full of text files. Each file held one hostname per line. Sorted. That was the whole schema.

Some of the files the scanner maintained:

all
hadoop-workers
excluded
managed-puppet-failing
responsive
dead-rekick
breakfix-good
maintenance-in-service

Plain newline-separated text files. In most cases the entire line (row) or the first field is the key. That was enough structure for comm, join, sort, awk, and grep to behave like a tiny analytics engine, and the engine ran on anything that had a shell.

Sort first, then think in sets

comm and join expect sorted input. If the input is not sorted they produce silently wrong answers — not an obvious error, just bad data. Sort order is the contract: the one thing a pipeline in this style has to respect.

Once the inputs are sorted, set algebra becomes a one-liner.

  • comm -12 A B — lines in both A and B (intersection)
  • comm -23 A B — lines in A that are not in B (set difference, i.e. an anti-join)

The output is itself sorted, so it composes: you can pipe a comm into another comm or into join without re-sorting.
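
For example, "in A and in B but not in C" is a single pipe (C here is just a third illustrative sorted file), because the intermediate output is already sorted:

comm -12 A B | comm -23 - C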

From the breakfix retrospective, the computation to form candidate hosts to rekick (reinstall, put through the install loop) was roughly:

all
  - excluded                # comm -23
  ∩ responsive              # comm -12
  ∩ puppet_ok               # comm -12
  → candidate-rekick

Three input files, one output file, every step in a file and inspectable. The SQL translation is something like:

SELECT hostname
FROM   hosts
WHERE  hostname NOT IN (SELECT hostname FROM excluded)
  AND  hostname IN     (SELECT hostname FROM responsive)
  AND  hostname IN     (SELECT hostname FROM puppet_ok);

SQL is more declarative and has better names. The shell version materialises every step as a real file on disk, which you can wc -l, head, diff against yesterday's version, attach to a ticket, or hand to another operator without having to export anything. Inspectability in exchange for elegance is the repeating tradeoff in this style.
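
Spelled out as commands, that computation is roughly the following; the intermediate file names are illustrative, the inputs match the algebra above:

comm -23 all excluded > not-excluded
comm -12 not-excluded responsive > responsive-not-excluded
comm -12 responsive-not-excluded puppet_ok > candidate-rekick

Each line is one step of the algebra, and each leaves its result on disk for the next step, or for a human, to pick up.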

Columns arrive later

Once the classifier had done its set algebra on hostnames, the next step was usually to join inventory attributes to a group. The breakfix scanner kept a wide-format dump called all-data with one row per host, generated from the infrastructure query tool (loony), with fields for hostname, role, puppet branch, kernel, service ticket, Wilson lifecycle state, and so on.

awk turned that into a projection and filter:

awk -F'\t' '$6 == "managed" && $5 == "-" { print $1 }' all-data

That is

SELECT hostname FROM all_data
WHERE lifecycle = 'managed' AND service_ticket IS NULL;

with the column names explicit in SQL, and implied in plain text where each field is separated by whitespace.

This is where the style starts being both powerful and fragile. The schema is not written down anywhere, it's a positional API. Column 7 means something important and the maintainer six months later is the person who has to remember what. In practice this was handled by keeping the loony format strings in the scanner's config and treating the wide-dump format as a contract that nobody edited casually. The contract lived in convention but not in code.

join is exactly what it sounds like

The POSIX join command does an equi-join on a shared key. By default the key is the first field of each file, and as with comm, both inputs must be sorted on that key.

Given the hadoop-workers hostname list and the all-data wide dump, joining them gave two fields (host, role) projected from the inventory view, restricted to worker nodes:

join -1 1 -2 1 -o 1.1,1.2 all-data hadoop-workers > hadoop-workers-role

-1 1 -2 1 says "join on field 1 of both files", which is also the default; spelling it out makes the shape of the join obvious. -o 1.1,1.2 is the projection — pick fields 1 and 2 from the first file. When the key is not field one, the -1 N -2 M form is how you say so. It is fiddlier than SQL's ON and SELECT, but the semantics are the same.

That file was then the input to more filters, more joins, more set operations. Intermediate stages had names like managed-puppet-failing-workers and breakfix-good. Every one of them was a file on disk: inspectable, diffable, re-runnable. That's outside the scope of this post, but if you removed an output file the breakfix runner would notice and run only the work to compute the missing output file.

Distinct, union, group by

The other common relational operations have a similar shape.

sort -u is SELECT DISTINCT. Apply it to concatenated inputs and you get UNION:

sort -u hadoop-workers hbase-workers > all-workers

sort | uniq -c is GROUP BY plus COUNT(*):

awk '{print $3}' all-data | sort | uniq -c | sort -nr | head

That one counts hosts per puppet branch, sorts by count descending, takes the top few for a person to review. No database has to be involved. In this model, the database is the directory, the tables are the files and the query plan is the pipeline.

Where it breaks

The weaknesses are real and I lived through most of them.

If the inputs are not sorted on the right key, comm and join produce wrong answers. If a delimiter turns up inside a field, your columns shift. In a wide row, $17 is unmemorable and you have to go and read the script comments.

Empty fields were another quirk. In SQL a missing value is NULL; in whitespace-separated text it is ambiguous with a run of spaces or a collapsed column. The convention I settled on was - as the empty-field placeholder in files on disk: very readable at a glance, easy to grep -v '^-' or spot while scanning output. When a pipeline needed an unambiguous delimiter mid-stream I switched to @, which is illegal in hostnames and rare in filenames, picked up through awk -F@, sort -t@, join -t@, and IFS='@' read for shell loops. @-delimited data never landed in a file though, because it reads as very noisy to a person. Files got optimised for human reading; pipelines got optimised for parsing correctness. A bit messy, but reliable.
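
A sketch of what that looked like mid-stream; the field positions and the filter are illustrative, not the real scanner's:

# project host, puppet branch, lifecycle as @-delimited rows, keep the key sorted,
# restrict to worker nodes, then read it back in a shell loop
awk -F'\t' -v OFS=@ '{ print $1, $3, $6 }' all-data |
  sort -t@ -k1,1 |
  join -t@ - hadoop-workers |
  while IFS=@ read -r host branch state; do
    echo "$host $branch $state"
  done

The @ only ever exists between the pipes; anything written back to disk went back to whitespace and -.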

There is also a maintainability cliff. The retrospective describes the scanner that eventually grew to hundreds of lines of shell — that's the same code I'm talking about here, and the cliff is the same one. The pattern scaled beautifully from 5 lines to 50; somewhere between 50 and 500 it started fighting back. Not because shell was the wrong tool for line-oriented set algebra, but because once a program is big enough to need a mental model, the lack of named columns and typed relations becomes a cost rather than a freedom.

Why it survives anyway

Given all of that, why not just load everything into SQLite? Often, that's the better answer. Load the extracts into tables, name the columns, add indexes, write the query with JOIN, EXCEPT, and GROUP BY. For anything branching or long-lived, that's easier to review and safer to refactor.

The line-oriented style holds its ground because of three properties.

Portability. sh, awk, sort, comm, join, grep, sed are usually already there. No runtime, no driver, no data directory. A bastion host has them; a rescue image has them; a ten-year-old machine has them.

Inspectability. Every intermediate result is a named file on disk. When someone asks "why did we drain these ten hosts?" the answer is in the working directory — in the actual group files the decision was made from. An SRE who had never seen the code before could read the file names in sequence and reconstruct what the scanner had concluded and why.

The filesystem is the program counter. This is the one I underestimated at the time. Because every stage's output is a file, the pipeline's state is the working directory. You can stop a run with ^C, look at what's there, delete a file you don't trust, and resume. Partial progress is free. The companion post talks about "do nothing" as a first-class outcome of the automation; the file-backed pipeline is the operational equivalent at the implementation layer. Stopping is safe. Inspecting is free. Resuming costs nothing.
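
A sketch of the same idea as code, with hypothetical discovery commands standing in for the real ones:

# regenerate a stage only if its output file is missing; delete a file to force a re-run
[ -s responsive ] || fleet-ping all > responsive           # fleet-ping stands in for the real healthcheck fan-out
[ -s puppet_ok ] || puppet-status responsive > puppet_ok   # puppet-status is likewise hypothetical
comm -12 responsive puppet_ok > candidate-live

Stop it anywhere, inspect or delete the files you don't trust, and the next run does only the missing work.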

When to reach for which

Use sorted files with comm, join, and awk when:

  • Inputs are already logs, host lists, inventory dumps, or metric exports. You are not choosing the representation; you are receiving it.
  • You want every intermediate result visible as a file for audit, handoff, or post-hoc review.
  • The environment is minimal or you cannot rely on SQLite being there.
  • The pipeline is short enough to hold in your head.
  • The mental effort of using SQL is just too much for the job at hand.

Reach for SQLite or a real engine when:

  • The logic branches a lot or spans many steps.
  • You need correctness under messy delimiters, typed fields, or larger-than-RAM data with indexes.
  • One declarative query is easier for your team to review than a page of pipelines.
  • The schema is going to be edited by more than one person over more than one year.

The mental model on one screen

Once it clicks, the whole model fits in a screenful:

  • A file is a table.
  • A line is a row.
  • The first field is usually the key.
  • awk is selection and projection.
  • sort -u is distinct.
  • cat | sort -u is union.
  • comm -12 is intersection.
  • comm -23 is set difference.
  • join is an inner join.
  • sort | uniq -c is group by and count.

That is a composable, line-oriented relational algebra. It has sharp edges and almost no dependencies. Respect sort order, keep one row per record, name your intermediate files like tiny tables. What you have then is a query engine: portable, transparent and built out of POSIX tools that are already on the machine.

It's not the right tool for every job. It remains a legitimate tool, and in the right environment it's still the path of least resistance, especially when the inputs are already line-oriented files.

The breakfix post was about the architecture; this one is about the query engine hiding inside it.

Permalink

Twitter Hadoop Breakfix, Four Years On

2026-04-29 09:00

I was tech lead of the Data Platform SRE team at Twitter through 2022, and on the reactive side of my role I spent most of my time on fleet health operations, which we called "breakfix". Tens of thousands of data nodes and 100+ head nodes, across a handful of datacenters, with no single system that could tell you the truth about any of them.

Four years later, I've been going back through the architecture and trying to work out what actually made it work. Not which tools we used — most of those names won't mean anything outside Twitter, and the equivalents exist everywhere — but what the shape of the system was, and why that shape was the right one for the problem.

This is the "why it worked" post. There's a companion post, Text Files as Tables, about the sorted-file query engine hiding inside it; this one is about the Twitter implementation specifically.

Three sources of truth, and none of them winning

The first thing that mattered was recognising that no single system was authoritative. We had three:

  1. Audubon, the machine database of record: host inventory, role membership via Colony, lifecycle state in Wilson (allocated, managed, repair, decommissioned), per-host attributes like the current breakfix ticket or the hadoop_exclude flag. That flag was more than a marker: setting it in Audubon was the intent, and a downstream path (Puppet-driven export, config sync) then regenerated the HDFS exclude file (/etc/...) that HDFS already knew how to read. Within a refresh cycle HDFS and YARN would start draining the node from service. The point is that Audubon wasn't just a passive record; a single attribute change was the wire into the cluster control plane, even if the wire had a couple of hops in it.

  2. Metrics and healthchecks: per-node telemetry on Hadoop was collected by an on-node daemon called vexd, which polled every minute (or slower) and listened on local HTTP for cheap local access to those metrics. These metrics were then pushed every minute into Cuckoo, the central time-series database. Operators queried Cuckoo via CQL — its own query language, unrelated to Cassandra's CQL — for fleet-wide or per-node questions.

  3. The Hadoop cluster control plane: HDFS and YARN themselves, which had their own opinions about whether a node was in service, draining, dead, or in some mixed state that neither of the other two sources could see.

None of the three was authoritative on its own. Each system was correct in isolation and wrong in practice. Audubon lagged reality because humans and slow workflows updated it. The metrics pipeline was timely but didn't know about lifecycle or tickets. The cluster control plane knew about service state but not about whether a host was ticketed for hardware repair or flagged for a firmware update. Any decision worth automating needed information from all three.

[Figure: Venn diagram of the three sources of truth (Audubon inventory, Cuckoo metrics, Hadoop HDFS/YARN state); the intersection is where it's safe to act.]

What worked was treating them as independent observers and reconciling them with set algebra. Not "which one is right?" but "which hosts appear in all three in a way that makes action safe?" The implementation was sorted text files and comm, join, awk, grep with Audubon queried through a CLI called loony and fleet-wide remote execution done through a service called Fleetexec.

Sorted files as the data plane

The dominant idea in the shell layer was: one host per line, sorted, on disk. Named sets — hadoop-workers, all, managed-puppet-failing, dead-reboot, watch-plans, dozens of others — were files in a working directory. Set difference was comm -23 a b. Set intersection was comm -12 a b. Joining inventory attributes to a group was join against a wide-format dump of the machine database, like a SQL join but using text columns instead of table columns.

A typical rekick candidate set, informally:

all
  - hadoop_exclude
  ∩ responsive
  ∩ puppet_ok
  → candidate-rekick

Three lines of pipeline, four files on disk, one decision. Every intermediate set was inspectable after the fact.

The reasons this held up:

  • Partial failure was free. If a run died halfway, the files on disk were still valid as inputs to the next run. The next invocation could skip expensive discovery steps whose outputs already existed.

  • Incident review was just ls and grep. When someone asked "why did we drain these ten hosts?", the answer was in the working directory. Not in a log you had to reconstruct — in the actual group files the decision was made from.

  • Diffing two runs showed drift directly. diff yesterday/dead-rekick today/dead-rekick was often the fastest way to understand what had changed in the fleet overnight.

  • The working directory was logged. Each pass of the outer loop teed its transcript to a file in the day's breakfix directory, so the full sequence of decisions — group sizes, hadoop-admin calls, Fleetexec fan-outs, warnings — was recoverable after the fact. The transcript plus the group files was enough to reconstruct what happened in a pass without having to re-run it.

APIs alone would have failed all of these. The file-backed approach wasn't primitive — it was load-bearing.

The Python / shell split

The automation ran in two languages and the boundary between them was deliberate. Shell did the high-volume set crunching, with one large driver plus a pile of helpers. Python did anything where getting the semantics wrong would cause immediate operational harm; it was a layer wrapping the Hadoop HDFS and YARN CLIs and the Fleetexec CLI.

The rule I ended up using to decide where a new piece of logic went: if violating it could corrupt data or take a cluster down, it belongs in Python and wants tests.

Python
  • Test: could it corrupt data or take a cluster down?
  • Kind of rule: correctness invariants
  • Lives with: tests, type checking, semantic workflows
  • Example: "Don't rekick a worker still in HDFS service"
  • Example: "Drain a host, wait for decommission, verify replication, release the ticket"

Shell
  • Test: would the next stage catch a bad output?
  • Kind of rule: policy budgets, set algebra
  • Lives with: sorted files, comm/join/awk pipelines
  • Example: "This cluster tolerates 6% of workers offline"
  • Example: "Set of failing-puppet ∩ responsive ∩ not-already-ticketed"

Conflating those two kinds of rule, by treating policy caps as if they were safety invariants or by pushing safety checks down into shell glue, was one of the most consistent anti-patterns I saw. Keeping them in different files and different languages made the split visible.

Caps, floors, ceilings, and shuffling

The driver had per-cluster policy caps on every destructive operation: rekicks, burnins, firmware updates, reboots, drains. Percentages scaled to cluster size, with a floor of 1 and a ceiling of hundreds.

The floor and the ceiling did different jobs, which took me longer than I want to admit to articulate.

The floor existed because a percentage calculation on a small cluster rounds to zero, and a cluster that never gets routine maintenance accumulates debt that eventually forces an emergency intervention — which then takes out a much larger fraction than the routine work ever would. Always allowing one node inverts that failure mode: small clusters take proportionally more disruption per pass but stay current. They were also mostly dev clusters, so the alerting was tuned to be coarser for them.

The ceiling existed because the largest clusters could have their 6% budget work out to hundreds of hosts, and the coordination and observation cost of that many simultaneous drains exceeded what the automation and the human supervising it could usefully handle. The ceiling protected the operator, not the cluster. It was also a safety valve.

The caps answered "how many." They did not answer "which ones," and that distinction caused issues before it was fully realised. Sorted host lists correlate with physical topology; adjacent names tend to mean adjacent racks, same top-of-rack switch, same power distribution. head -n 5 on a sorted candidate pool reliably gave you five hosts in the same rack. The count cap was honoured; the blast radius was not.

The fix was simple: shuf -n instead of head -n, applied uniformly across every size-capped operation in the driver. Five hosts sampled from anywhere in the pool, not five adjacent ones. The cap bounded count; shuf bounded concentration. Either alone was insufficient.

The composition with idempotent refresh was a bonus. Over a handful of loop iterations, shuf spread the work across the candidate pool without any stateful round-robin — no cursor, no "last host picked" tracking, nothing for the volatile-file-refresh model to fight with. Randomness did for free what state would have made brittle.
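
Put together, the whole cap-and-sample step is only a few lines of shell; the numbers and file names here are illustrative rather than the real policy:

# 6% of the cluster, but never fewer than 1 host and never more than 200
total=$(wc -l < hadoop-workers)
budget=$(( total * 6 / 100 ))
[ "$budget" -lt 1 ] && budget=1
[ "$budget" -gt 200 ] && budget=200
# sample from anywhere in the pool rather than taking the head of a sorted list
shuf -n "$budget" candidate-rekick > rekick-this-pass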

One particular issue related to draining and data loss: because of HDFS layout, we had to guarantee that maintenance was spread across semantic data groups so that not all three replica copies of a block were taken down at once. This was ensured by tracking drains against an Audubon attribute that recorded those groups. Given our placement constraints, taking all of a single group out at once was data-safe under the HDFS data placement policy.

Policy as a classifier

The per-cluster policy wasn't a table in a configuration file. It was a small classifier in the driver: several percentage archetypes (default, dense, small) plus exceptions for clusters that didn't fit any archetype.

The archetype names encoded the reason, not just the number. "Dense" meant storage-heavy clusters where nodes were fat and jobs were rare, so a slightly higher tolerance was safe. "Small" meant experimental clusters where we wanted to signal conservatism even when the maths would have allowed more. One genuinely pathological cluster was hardcoded to max=1 because no archetype's percentage produced the answer we wanted for a cluster of that size; it was small but it was production critical.

The honest name for this pattern is reviewed-policy-in-code. The numbers were tuned from operating experience, checked into a shell script, version-controlled, change-reviewed like any other code, and blame-able down to the incident that motivated each adjustment. git log -p answered "when did the HBase drain cap change and why?"

The approach gave up three things: runtime adjustability without a deploy, self-service for cluster owners, and machine-readable policy introspection. For a tool operated by the same small group of SREs who maintained the script, those tradeoffs were fine. The co-location of policy and enforcement — the number was in the same file as the code that consumed it — was worth more than any of what we gave up.

"Do nothing" as a first-class outcome

In retrospect, the design choice that mattered most was that the automation was allowed to do nothing.

When the per-cluster drain budget was already consumed by existing breakfix work (dead hosts, failed disks, hosts mid-repair, plans in progress in Wilson) the driver computed zero remaining headroom and proposed zero new drains. When the hosts "ready to come back" group was empty, the outer loop exited its happy path instead of forcing a round of remediation.

This was not a special case. It was the pervasive design.

The scripts computed what was safe, acted within it, and if the safe set was empty, acted on the empty set. Recognising "I can't usefully do anything right now" and stopping turned out to be a stronger guarantee than "always make progress," and it was the property that made the whole system trustable enough to run in a loop between operator check-ins.

The per-pass summary counts were things an operator could scan in ten seconds to decide whether to keep the loop running, kill it, or page someone. A zero drain count on a pass meant the loop had looked at the state and concluded there was nothing safe to add — useful information, not a failure mode.

The human loop

All of the above ran inside a long-running outer loop, typically in a detached tmux session pinned to a particular datacenter. Each iteration invalidated volatile classification files, reran the scanner, fanned out on-node healthchecks over Fleetexec, printed summary counts, optionally submitted a new batch of drain work, then slept and repeated.

The per-datacenter framing mattered. Blast radius stayed geographic — a bad change to the driver running in one datacenter wouldn't simultaneously affect others. Detaching the session meant an operator losing a VPN connection didn't kill a half-completed pass. Keeping the transcript on disk meant the run could be inspected after the fact without having to recreate the state.

The human loop was: start the driver, let it run, read the summary counts once an hour or so, intervene when something looked wrong. "Wrong" was usually a count that was drifting in the unexpected direction — the dead-rekick group growing instead of shrinking, or draining stuck at the same number across three passes. The scripts produced enough signal for a human to notice the patterns that the scripts themselves weren't competent to diagnose.

What I'd do differently

A few things, with four years of distance.

The one large shell driver, the scanner, was hundreds of lines too long by the end. Not because shell was the wrong language for set-algebra-over-files, but because once something is big enough that you can't hold it in your head, moving pieces of it into helpers with clear contracts buys more than the cost of the extra files. We did some of that extraction but not enough.

Commit messages on the policy constants were inconsistent. When the numbers were good, we knew why; when the numbers were old, we had to do archaeology. A one-line convention like "raise HBase drain cap from 3% to 5% — INCIDENT-XXXX showed recovery was bandwidth-bound" would have preserved the institutional memory that otherwise evaporated with the people who tuned the numbers.

The shuf pattern should have been documented as an explicit principle earlier. I arrived at it incrementally, probably after a head-style truncation kept hitting the same nodes. Stating "sort for set operations, shuffle before acting" as a rule up front would have saved us working it out the hard way.

The Python / shell boundary was mostly right but occasionally drifted. A few pieces of semantically-rich logic ended up in shell because that was where the surrounding code was, and a few pieces of set-crunching ended up in Python because that was where the surrounding code was, but that's ok. I don't recall either direction causing an incident, but both directions made the code harder to reason about at the boundary.

I would have likely leaned harder on the per-pass transcripts as teaching material. The logs were there, the group files were there, but walking a new SRE through a real captured run — "here's what the driver decided on this pass, here's why, here's what the counts looked like by the end" — was something we did ad hoc rather than systematically. The documentation was the code, for better or worse.

Thoughts

The patterns that carried the most weight (reconciliation as intersection, sorted files on disk, the safety / policy split, the floor / ceiling asymmetry, shuf before acting, "do nothing" as a legal output) are not specific to Hadoop, or to Twitter, or to the tooling generation the code was written in. They're what fleet reliability looks like when the automation itself becomes part of the system you have to keep reliable.

The broader version of those lessons is probably another post. This one was about what actually happened, in a specific place, at a specific time, with a specific set of constraints. The tools may be gone; the patterns transfer.

Update 2026-05-01: The companion post, Text Files as Tables, is now published.

Permalink

The LLM Gap in Staff+ Interviews

2026-04-20 12:00

I spent the first quarter of 2026 interviewing for Staff+ SRE and platform roles, from recruiter screens through full panels and executive rounds. Across all of it, at companies building AI infrastructure, I was surprised how rarely anyone asked how I use or would use LLMs to do my job.

The assumption layer

I went into the job process assuming everyone was using LLMs, on both sides of the table, and I found that nobody was really saying so out loud.

On the candidate side, I used LLMs for essentially every part of the preparation. The resume refresh was first. My CV had not been seriously updated in years, so I sat down with an LLM in a long interview-and-dictate loop: talk through a role, have it ask clarifying questions, tighten the wording, and cut anything more than five to ten years old.

I built a job-prep skill of my own. The personal part of it was what mattered. It knew the level I was targeting (Senior Staff / Principal IC), knew what that level actually means in terms of strategic and technical scope, and knew what my real background could and could not support.

Per opportunity, I went deeper: company trajectory, market position, competitors, and recent news. I reviewed their values, what their interview loop usually looks like, what kinds of STAR stories I needed ready, and where my experience tied to what they actually needed.

I never used any LLM as a prop during interviews. That was prep only.

On the company side, the signals were harder to read. AI notetakers showed up in almost every Zoom, Meet, and Teams call, transcribing and summarising in the background. That is increasingly just a default feature of meeting tools rather than evidence of anything in particular. Beyond that, I genuinely don't know how much LLM assistance was in the loop on their side, and nobody volunteered it.

So the shape of the exchange was: both sides heavily LLM-assisted, neither side really talking about it.

The live rule

In every loop I went through, the expectation was that I would not use any kind of AI during the interview itself. That is a reasonable rule and I agree with it. Using an LLM as a live prop mid-interview would feel like cheating, and it wouldn't measure anything useful.

What I found more interesting was what did not happen. I was not asked to walk through an LLM-assisted workflow. Nobody asked to see my tool setup. Nobody asked how I decide when to reach for a model and when not to.

As I wrote in The Harness is the Product, the work is no longer typing; it is operating the loop. Yet the interview format of coding screens and system design rounds has not shifted much from a few years ago. We spent 45+ minutes on failure modes and never discussed how those decisions change when some of the code is generated.

To be fair, interviews are designed for consistency and comparability, which makes rapid change difficult. Still, the gap was noticeable.

What they were actually testing

The theme that did come up, repeatedly, at this level was judgment.

Judgment across business and technical trade-offs. Judgment about when to push back on a roadmap. Judgment about what to invest in operationally and what to let slide. Judgment about how to grow a team without being a manager. All IC-shaped, all appropriate for the level.

That is the part that connects to AI, even though the interviewers did not usually frame it that way.

Coding in early 2026 is not mostly typing; it is operating an LLM coding harness: reviewing what it produces, deciding what to keep, what to rewrite, and when not to generate anything at all.

The machine produces working code.

The human decides whether it should exist.

That's the job now.

If, as I've argued, code is becoming the "new assembler," then asking a Staff Engineer to hand-write boilerplate is like asking a structural engineer to do load calculations without software. It's not that we can't; it's just not where the value is. The valuable skill is taste: architectural taste, reliability taste, and a sense for where a generated thing falls short.

The junior engineer gap

The place I did push this into the conversation was around mentoring.

The traditional path from junior to experienced engineer ran through writing a lot of code, breaking things, and slowly developing taste. If much of that code is now generated, then the junior engineer's job shifts toward reviewing and steering model output. They are being asked to provide judgment: the very thing they have the least scaffolding for.

You cannot judge generated code well if you have never written and debugged enough of it yourself to know what "good" looks like. You cannot review a system design if you have never watched one fail in production. You cannot decide when the model is confidently wrong if your own confidence is itself based on model output.

This is not going to fix itself. Senior engineers will need to step in.

Real mentoring, shadowing, pairing, and deliberately putting juniors in front of problems without the harness are required. We have to make them articulate why a design is good or bad before the model does it for them.

None of the interview loops probed this directly either, but it was a theme I kept pulling into the conversation, because I think it is one of the real Staff+ problems of the next few years.

Thoughts

A lot of stages, a lot of companies, a lot of conversations, and the most AI-shaped thing about any of it was that it rarely surfaced explicitly.

From an SRE perspective, silence can be a signal. Either the problem space is moving faster than interview materials can keep up with, or the mismatch is the timeless gap between how people work and how they interview.

Either way, the bar did not move. It was already there. What has changed is that the parts of the job that used to be hidden under a lot of typing are now the only parts left worth testing for, and the interview format has not quite caught up.

Permalink

AI. SF. Crusoe

2026-04-13 19:00

Today I started as a Senior Staff Production Engineer at Crusoe Energy Systems.

I left Google in January, just after the sovereign cloud project I’d spent the last three years on went generally available for S3NS Cloud in December. It felt like the right time to move on. I then spent nearly three months interviewing across different companies, roles, and levels, which gave me the chance to be deliberate about what I wanted to do next.

I kept coming back to the same criteria: infrastructure, AI, San Francisco, and a company with real operational problems to solve. I wanted to work somewhere I could contribute technically, but also help raise the bar more broadly on reliability and production quality.

The timing

AI infrastructure is one of the most important buildouts happening right now, and a lot of the reliability patterns are still being worked out. The systems are getting bigger, the demand is intense, and the usage patterns are changing fast. Inference and software engineering are already shifting from human-rate requests to agent-rate requests, at high sustained volume, all day, every day.

I do not think the AI industry has fully figured out yet how to make that reliable at scale. That is a large part of what makes it interesting to me.

Crusoe also seems to be at an important stage as a company. It is well past the early survival phase and into the harder question of how to scale well. I think it's at the turning point where operational excellence becomes a differentiator and where reliability needs stronger engineering discipline.

The stack

Crusoe owns a much larger part of the stack than most companies do: energy sourcing, data center construction, GPU infrastructure, cloud platform, and managed AI services.

That matters because it means a customer-facing problem can often be traced all the way down to a physical cause: power, cooling, network, or software. Then you can fix it at the right layer instead of working around it from a distance.

I had that kind of environment at Twitter with the Hadoop fleet, and I have missed it. Systems design is interesting on its own, but there is something especially satisfying about reliability problems that also have a physical dimension.

The role

Crusoe is investing seriously in site resiliency and operational excellence. My role is a senior IC position in Production Engineering, focused on operational readiness reviews, reliability architecture reviews, disaster recovery testing, and helping set the bar for production quality.

That is exactly the kind of work I want to be doing. I have spent a long time seeing what works, what breaks, and what tends to get ignored until it hurts. The chance to help build a strong reliability practice is a big part of the appeal.

The people

I knew a few people at Crusoe from Twitter and elsewhere in the industry. The interview conversations felt more like working sessions than performance exercises, which I appreciated. I wanted to work with people who were engaged with the problems, curious about my past experience, and direct about what they needed.

After seven years at Twitter and three at Google, I have spent a decade working on infrastructure at large scale. Crusoe is a chance to do that again in a different setting: in San Francisco, in AI, and at a company where a lot of the practices are still being defined.

Put together, they were good reasons to say yes.

Permalink

Eight Coding LLM Tools, One Configuration

2026-03-16 11:23

I currently use eight coding LLM tools at various times: Claude Code, Codex, Cursor's CLI (agent, formerly cursor-agent), Gemini CLI, Amp, Copilot, OpenCode, and Kilo. Each tool has its own configuration format, its own mechanism for custom commands, and its own opinions about where settings live. I want the same behavior from all of them: no emojis in commit messages, run markdownlint on every markdown file, don't be sycophantic.

Getting that consistency across multiple tools on a dozen development machines turned out to be its own project. I started pulling it together in mid-2025 after the third time I fixed a guideline in one tool's config and forgot to update the others.

The problem

Coding LLM CLI tools are multiplying fast, and none of them have agreed on a configuration standard. Claude wants JSON settings and markdown commands with YAML frontmatter. Cursor wants its own JSON format and plain markdown. Gemini wants markdown files with TOML headers. They all have different mechanisms for custom commands, and different places to put project-level vs global rules. And they keep changing!

If you only use one tool on one machine, this is fine. I use several tools across a bunch of machines running Fedora, Gentoo, Debian, Ubuntu, and macOS. Every time a tool updated its config format or I wanted to change a rule, I found myself editing the same content in multiple places, inevitably getting drift, and getting tired of it.

Single-source guidelines

The fix was straightforward. I keep one file, data/guidelines.md, that defines how I want all my coding LLM assistants to behave. My dotfiles system templates it into each tool's config format on install:

  • Claude gets them in ~/.claude/CLAUDE.md
  • Cursor gets them in ~/.cursor/cli-config.json
  • Gemini gets them in ~/.gemini/GEMINI.md

Change the guidelines once, run make install, and every tool picks them up.

Write once, generate three ways

Custom commands were trickier. I have a git-commit command that tells the LLMs how to structure commit messages: conventional commit format, no emojis, no "Generated by Tool" footers, imperative mood. The content is defined once but each tool wants a different wrapper format.

Claude wants YAML frontmatter:

---
allowed-tools: Bash(git *)
description: Make git commits
---
# Command definition

Look at all the git $ARGUMENTS changes...

Cursor and Codex want plain markdown in different places, while Gemini wants TOML. So there's a template per tool that wraps the shared content in the right format. The install process generates all of them from the single source.

The custom git-commit command is the main thing I use across all the tools to avoid hype and phrases I hate, and it's what keeps LLM-generated commits looking like they were written by a human who cares about their git history.

Not all tools support custom commands yet, or at least I haven't figured it all out. Currently Claude, Cursor, Gemini, and Codex get generated commands. The rest get the shared guidelines but not the command wrappers. The template approach means adding a new tool is just another wrapper when they catch up.

This pattern extends to longer prompt definitions that Claude calls "skills." Skills can be multi-file: the blog-post skill that was used to write this post includes a ~300-line style guide derived from analyzing dozens of posts spanning two decades of my blog, plus the prompt definition that references it. Other skills handle things like analyzing job descriptions or preparing for interviews.

Skills now deploy to both Claude Code and Codex from shared source data. Claude gets symlinks into ~/.claude/skills/; Codex gets real file copies into ~/.codex/skills/ with OpenAI-format metadata for skill discovery. For claude.ai's web interface, the install process builds zip archives directly.
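
The mechanics are nothing fancy; roughly the following, with the source path hypothetical and the target paths as above:

ln -sfn "$PWD/data/skills/blog-post" ~/.claude/skills/blog-post   # Claude: symlink back to the shared source
cp -R data/skills/blog-post ~/.codex/skills/                      # Codex: real copies (the install step also writes the OpenAI-format metadata)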

The settings merge problem

One problem I hadn't anticipated: Claude's settings.json accumulates permission rules as you use it. Every time you approve "allow this tool to run git commands" or "allow writes to this directory" those get saved. If you naively overwrite the settings file with a template on every install, you lose all the permissions you've granted during a session.

The fix was a JSON merge strategy: when installing a templated JSON settings file, the script loads the existing file, unions and deduplicates the permissions.allow, permissions.deny, and permissions.ask arrays with the template's values, preserves any extra top-level keys, and writes the merged result. The template provides the baseline; local usage adds to it.

In practice, those arrays are the tool's "what am I allowed to run and touch?" rules, and preserving them avoids re-approving the same actions after every install.
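
The merge itself lives in the dotfiles install script, but the behaviour is roughly the following jq, shown as an illustration rather than the actual implementation (file names are placeholders):

jq -s '
  .[0] as $tmpl | .[1] as $local
  # local extra top-level keys survive; the template wins on shared keys
  | $local + $tmpl
  # union and dedupe the three permission arrays from both sides
  | .permissions.allow = ((($tmpl.permissions.allow // []) + ($local.permissions.allow // [])) | unique)
  | .permissions.deny  = ((($tmpl.permissions.deny  // []) + ($local.permissions.deny  // [])) | unique)
  | .permissions.ask   = ((($tmpl.permissions.ask   // []) + ($local.permissions.ask   // [])) | unique)
' settings-template.json existing-settings.json > merged-settings.json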

code-aide: installing the tools themselves

Installing coding LLM CLIs on a dozen Linux and macOS machines is its own annoyance. Some need Node.js and npm. Others have their own native installer scripts. Cursor downloads a tarball directly. Each has different prerequisites, each updates on its own schedule, and some have changed their installation method since they launched.

I wrote code-aide to handle this. It's now open source, installable via uv tool install code-aide or pipx, with zero external dependencies and Python 3.11+ stdlib only. Tool definitions live in a JSON config file (tools.json) with a schema_version field, so adding a new tool means adding a JSON entry rather than editing Python code.

code-aide supports three installer definition types:

  1. npm tools get an npm_package name and optional min_node_version (Gemini, Codex, OpenCode, Kilo, Copilot)
  2. script tools get an install_url and install_sha256 (Claude, Amp)
  3. direct download tools get a tarball URL template with platform and architecture substitution (Cursor)

Those types describe how a tool is modeled in tools.json. The VIA column below is the install source detected on this specific machine, which can be brew or cask if that host was provisioned that way.

The snippets below are example output from my setup at the time of writing:

$ code-aide status -c
TOOL      STATE   VERSION                 VIA       PATH
agent     ok      2026.02.27-e7d2ef6      download  /Users/.../.local/bin/agent
claude    newer   2.1.71                  script    /Users/.../.local/bin/claude
gemini    ok      0.32.1                  brew      /opt/homebrew/bin/gemini
amp       ok      0.0.1772734909-g2a936a  script    /Users/.../.local/bin/amp
codex     ok      0.111.0                 cask      /opt/homebrew/bin/codex
copilot   newer   0.0.423                 npm       /opt/homebrew/bin/copilot
opencode  newer   1.2.20                  brew      /opt/homebrew/bin/opencode
kilo      opt-in

Note: The agent row is Cursor's CLI; the binary started out as cursor-agent. The latest-version metadata still uses cursor, which is why the table below shows that name instead.

Some of these tools install via curl | bash and I'd rather not run a script that's changed since I last reviewed it, so for script-type installers the downloaded script is verified against a known SHA256 of a reviewed script, and will not run if it has changed. For direct_download tools like Cursor, the install script changes with every version, so SHA256 verification was dropped in favor of version-string comparison against the cached latest version.
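
The check itself is the standard one; a minimal sketch, with the URL and expected hash as placeholders:

# download the installer, refuse to run it if the hash doesn't match the reviewed copy
curl -fsSL "$INSTALL_URL" -o /tmp/install.sh
echo "$EXPECTED_SHA256  /tmp/install.sh" | sha256sum -c - || exit 1
sh /tmp/install.sh

(On macOS the equivalent check is shasum -a 256 -c.)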

Auto-migration

One thing I didn't anticipate: tools keep changing their installation method. Claude Code started as an npm package (@anthropic-ai/claude-code) and later shipped a native installer script. In my own setup, Cursor also moved from a shell-script install to direct tarball downloads managed by code-aide. If you keep the older install paths around, things often still work, but the fleet drifts and upgrade behavior becomes less predictable. You may also end up with packaged installs alongside self-installs.

code-aide 1.7.0+ detects these deprecated installs and auto-migrates: code-aide status warns you, and code-aide upgrade handles the transition by removing the old install, running the new method, and verifying it worked. If something goes wrong, it tells you what to do manually rather than leaving you with a half-migrated mess.

Keeping up with upstream

The SHA256 hashes go stale, of course. The update-versions command handles that: for npm-installed tools it queries the registry for the latest version and publish date; for script-install tools it downloads the install script, computes the SHA256, and compares it against the stored hash. Version extraction is custom per tool: Cursor embeds YYYY.MM.DD-hash patterns in download URLs, Amp exposes a version endpoint on GCS, and others use VERSION= patterns in the script itself.

$ code-aide update-versions
Checking 8 tool(s) for updates...

Tool      Check         Version                  Date        Status
--------  ------------  -----------------------  ----------  ------
cursor    script-url    2026.02.27-e7d2ef6       2026-02-27  ok
claude    npm-registry  2.1.71                   2026-03-06  ok
gemini    npm-registry  0.32.1                   2026-03-04  ok
amp       script-url    v0.0.1772937800-g3b2e3d  2026-03-08  ok
codex     npm-registry  0.111.0                  2026-03-05  ok
copilot   npm-registry  1.0.2                    2026-03-06  ok
opencode  npm-registry  1.2.21                   2026-03-07  ok
kilo      npm-registry  7.0.40                   2026-03-06  ok

Updated latest version info in ~/.config/code-aide/versions.json.
No installer checksum updates required (latest version metadata was refreshed).
Note: 'update-versions' checks upstream metadata, not your installed binary versions. Use 'code-aide status' and 'code-aide upgrade' for local installs.

code-aide uses a two-layer version model: bundled definitions ship with the package and contain install methods, URLs, and SHA256 checksums. A local cache at ~/.config/code-aide/versions.json stores the latest versions and dates from update-versions. The cache merges into the bundled data at load time, so you can track upstream changes without waiting for a new code-aide release.
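A sketch of that load-time merge; the module name and the internal structure of the cache file are my assumptions, not the real code:

import json
from importlib import resources
from pathlib import Path

def load_tool_definitions() -> dict:
    """Bundled definitions carry install methods and checksums; the local cache
    written by update-versions overlays latest-version metadata on top."""
    bundled = json.loads(resources.files("code_aide").joinpath("tools.json").read_text())
    cache_path = Path.home() / ".config" / "code-aide" / "versions.json"
    if cache_path.exists():
        for name, latest in json.loads(cache_path.read_text()).items():
            bundled.setdefault(name, {}).update(latest)  # cache wins for version/date fields
    return bundled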

When a script-install tool's SHA256 has changed, update-versions flags the mismatch and can write the updated hash back with --yes, or interactively one at a time. Finding the right version endpoint for Amp took a couple of tries: the obvious ampcode.com/version URL returns HTML; the actual version lives at a GCS endpoint buried in the install script. That version string (v0.0.1774123456-gc0ffee) probably also tells you something about Amp's relationship with semantic versioning, if we even care about such things in this fast-moving LLM world.

Prerequisites

code-aide also handles the Node.js dependency problem for npm-based install paths. The minimum Node.js version varies by tool (at time of writing: Gemini wants 20+, Codex and Copilot want 22+ on npm installs). If you use brew / cask / native installers for those tools, Node.js may not be required. code-aide install -p detects your system package manager (apt, dnf, pacman, emerge, and a few others) and installs Node.js and npm if they're missing.
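The detection side is roughly this; the manager list and package names here are examples, not necessarily what code-aide runs:

import shutil
import subprocess

PACKAGE_MANAGERS = {
    "apt":    ["sudo", "apt", "install", "-y", "nodejs", "npm"],
    "dnf":    ["sudo", "dnf", "install", "-y", "nodejs", "npm"],
    "pacman": ["sudo", "pacman", "-S", "--noconfirm", "nodejs", "npm"],
    "emerge": ["sudo", "emerge", "net-libs/nodejs"],
}

def install_node_if_missing() -> None:
    """Find the first available package manager and use it to install Node.js and npm."""
    if shutil.which("node") and shutil.which("npm"):
        return
    for manager, cmd in PACKAGE_MANAGERS.items():
        if shutil.which(manager):
            subprocess.run(cmd, check=True)
            return
    raise RuntimeError("no supported package manager found; install Node.js manually")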

What I'd do differently

The format fragmentation across tools is the real ongoing cost. I've had to update templates multiple times already because a tool changed where it looks for config files or switched its frontmatter format. There's no standard emerging; if anything, each new tool invents another format. The title of this post changed numbers several times before publishing.

The single-source approach helps, but it only works because the semantic overlap between tools is high; they all want roughly the same information, just arranged differently. If the tools diverge in what they support rather than just how they format it, the shared-content model gets harder to maintain.

Testing has improved since the first version. code-aide has a pytest suite now, though some of the harder-to-test operations (upgrade, remove, prerequisite installation) are still on the TODO list. Progress, at least.

Numbers

  • 8 coding LLM tools managed (4 with custom commands, all 8 with guidelines)
  • 3 install types: npm, script, direct download
  • 0 external Python dependencies

Adding a new coding LLM tool is mostly: add a JSON entry to tools.json, and it shows up everywhere on next install. Unless they invent yet another install mechanism!

Thoughts

The interesting problem here isn't dotfiles management; that's a solved problem with many good tools. What coding LLM assistants have created is a new category of configuration that needs to stay synchronized (guidelines, custom commands, skills, and permissions) across tools that don't share any common format. I've written separately about why I think the harness layer matters more than the model; this is the practical side of that argument. The approach I describe here is straightforward enough for readers to replicate by pointing a coding LLM at this post.

There are still gaps, such as whether prompt style should vary by harness and model combination. I haven't tested that systematically yet, including whether it is necessary to SHOUT in one model's skill text for emphasis.

code-aide is at github.com/dajobe/code-aide and installable via uv tool install code-aide.

This follows my earlier post on Redland Forge, which covers using LLMs for the actual development work. A companion post on the dotfiles system that powers the templating is also published.

Permalink

Zero-Dependency Dotfiles for a Homelab

2026-03-16 11:23

I have a dozen development machines in my homelab: a mix of Fedora, Gentoo, Debian (stable and unstable), Ubuntu LTS, macOS, and a few Turing Pi nodes. I got tired of my configurations drifting apart, so I built a dotfiles management system in Python. No external dependencies, just str.format() templates and JSON config files. It manages shell configs, git settings, Kubernetes credentials, and the configuration for eight different coding LLM CLI tools. That last part turned out to be an interesting use case, but the foundation described here is what makes it work.

The problem

The usual dotfiles approach is a git repo full of symlinks and a bash script to wire them up. That works until you need the same .bashrc to behave differently on macOS versus Debian, or you need API keys templated into config files without committing them to git, or you want your coding LLM assistant to follow the same rules regardless of which tool you're using this week.

I wanted one repo, one install command, and consistent configuration everywhere.

The approach

The core started as a single Python script, dotfiles.py, which I began writing in September 2025. It has since been refactored into a package: dotfiles.py remains the CLI entrypoint, and dotfiles_lib/ holds a dozen modules (config, installer, renderer, generators, platform detection, and others) totaling around 3,800 lines. The split was motivated by code review and testing: a 2,500-line monolith is hard to reason about in diffs and hard to unit-test without importing everything.

The bootstrap was straightforward: I copied the shell dotfiles from all my hosts into one tree of per-host files, pointed an LLM at the pile, and had it analyze them for commonalities and generate the initial files and templates. Most of the per-host differences turned out to be PATH entries and tool availability, which mapped cleanly to OS detection variables.

The tool reads a main JSON config that maps target files to their sources:

{
  ".bashrc": {
    "mode": "templated",
    "template": "bash/bashrc.tmpl"
  },
  ".gitignore_global": {
    "mode": "symlink",
    "source": "git/gitignore_global"
  },
  ".claude/settings.json": {
    "mode": "templated",
    "template": "templates/claude-settings.tmpl"
  }
}

There are three installation modes: symlink for files that don't vary, templated for files that need per-machine or per-secret customization, and copy for root user configs where you don't want a symlink back to a regular user's repo. A fourth mode, obsolete, marks files that should be cleaned up during install, which is useful when tools get renamed or configs move (coding LLM tools do this a lot).

Templates use Python's str.format(): no Jinja2, no dependencies. The template data comes from three sources: the JSON config, OS detection at install time, and a ~/.secrets.sh file that holds API keys and credentials. On install, the script parses ~/.secrets.sh, merges it with the computed template data, and renders everything.
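Put together, the render step is roughly this sketch; the variable names (os_name, hostname) are examples of what OS detection might provide, not necessarily the repo's:

from pathlib import Path
import platform

def render_template(template_path: Path, config_vars: dict, secrets: dict) -> str:
    """Merge the three data sources and apply str.format() to the template text."""
    variables = {
        "os_name": platform.system().lower(),  # OS detection at install time
        "hostname": platform.node(),
        **config_vars,   # values from the JSON config
        **secrets,       # parsed from ~/.secrets.sh
    }
    return template_path.read_text().format(**variables)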

Installation on any machine is:

make install

(make clean handles the build artifacts, cache directories, and other generated files.)

Secrets without the complexity

I didn't want a secrets manager dependency. The approach is a ~/.secrets.sh file that's never committed, with a simple KEY=value format. It's also sourced by shells. The script parses it, strips quotes, and makes the values available as template variables:

# ~/.secrets.sh
ANTHROPIC_API_KEY="sk-ant-..."
GEMINI_API_KEY="..."
GIT_EMAIL="dave@dajobe.org"
...

If a key exists as an environment variable, that takes precedence over the file.
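Parsing the file needs only a few lines of stdlib; a sketch with the precedence rule included:

import os
import re
from pathlib import Path

def load_secrets(path: Path = Path.home() / ".secrets.sh") -> dict:
    """Parse KEY="value" lines; an existing environment variable wins over the file."""
    secrets: dict[str, str] = {}
    if not path.exists():
        return secrets
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$", line)
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip().strip('"').strip("'")
        secrets[key] = os.environ.get(key, value)  # environment variable takes precedence
    return secrets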

This is also where things like KUBE_CA_DATA live: base64-encoded certificate authority data that gets templated into kubeconfig files without committing credentials. A separate script pulls the right variables out of a kubeconfig YAML file so I don't have to do it by hand when cluster certificates rotate.

A separate utility copies the secrets file to all dev hosts over SSH. I keep it mode 0600 and only sync it to machines I trust with those credentials.

Recursive config directories

The dotfiles system doesn't just handle flat config files. Some tools such as coding LLM CLIs want directory trees for commands, skills, and agents rather than a single config file, so the installer walks those directories recursively and deploys them the same way it does ordinary dotfiles.

A skill like blog-post isn't a single file; it's a subdirectory containing a prompt definition and supporting reference materials. The install script had to be extended to handle these recursive structures rather than just flat file listings.
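The recursive handling is just a directory walk that reuses the per-file install logic; roughly this, with an illustrative helper name:

from pathlib import Path
import shutil

def install_tree(source_dir: Path, target_dir: Path) -> None:
    """Deploy a skill or command directory by mirroring its structure under the target."""
    for src in source_dir.rglob("*"):
        if src.is_dir():
            continue
        dest = target_dir / src.relative_to(source_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # or symlink/render, depending on the file's configured mode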

Agent definitions use the same pattern. They live in agents-config.json, which specifies the model, allowed tools, and a reference to the markdown prompt content. At install time that metadata is combined with the prompt text to generate tool-specific agent files, with a parity test to make sure the Claude Code and Amp versions stay in sync.

The git-commit agent is a good example of why this is useful. It can run git add, git diff, git commit, and a few other git commands and nothing else, which means I can point it at a messy working tree and trust it not to get creative.

Skills and agents then deploy with whatever packaging each tool expects: symlinks, file copies, and zip archives depending on what the target supports. The tool-specific details are in the companion post.

Splitting out code-aide

Installing, updating, and checking versions of Claude Code, Cursor, Gemini CLI, and the rest eventually outgrew its corner of dotfiles.py and got extracted into a separate open source tool called code-aide. The dotfiles repo still bootstraps it during make install with a best-effort uv tool install, but the upstream-version churn now lives in its own project rather than bulking up the renderer and installer logic here.

Ghostty terminal support

I started playing with the Ghostty terminal emulator and discovered that its xterm-ghostty terminfo entry isn't installed on most of my remote hosts. SSH into a machine without it and you get "missing or unsuitable terminal: xterm-ghostty" from every ncurses-based tool, which is an annoying bump.

The fix: vendor the Ghostty terminfo source file into the dotfiles repo and compile it into ~/.terminfo during make install using tic. The install script checks whether the entry already exists and skips the compilation if so. Ghostty's own config file is also templated and deployed.

Not glamorous, but it's the kind of thing that makes a dotfiles system earn its keep: one fix deployed everywhere instead of manually installing terminfo on each host.

Multi-host deployment

With a dozen or so machines, running ssh host 'cd dev/dotfiles && git pull && make install' on each one gets tedious. Instead I have a deploy-dotfiles utility that reads the hosts list from config and runs the install:

$ deploy-dotfiles
host1 ✔
host2 ✔
host3 ✔
host4 ✔
host5  (connection timed out)
...

This is not done in parallel; I considered it, but even with the SSH overhead each host is fast enough in sequence, and sequential output is easier to read when something fails. It's fine for a small homelab.

After make install, a JSON receipt file is written to ~/.dotfiles-version.json recording the git commit, install timestamp, and hostname. A version subcommand shows the source HEAD alongside the installed receipt, flagging stale installs. The deploy-dotfiles utility has a --check mode that queries receipt files on all remote hosts without deploying, so I can see at a glance which machines are behind.
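The receipt itself is deliberately tiny; a sketch of the write side, with field names assumed from the description above. The read side just compares the recorded commit against git rev-parse HEAD in the source tree:

import json
import socket
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def write_receipt(repo_dir: Path, path: Path = Path.home() / ".dotfiles-version.json") -> None:
    """Record which commit was installed, when, and on which host."""
    commit = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    receipt = {
        "commit": commit,
        "installed_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
    }
    path.write_text(json.dumps(receipt, indent=2) + "\n")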

What I'd do differently next time

The str.format() template engine has its limits. Anything with literal curly braces (JSON templates, for instance) requires doubling every brace that isn't a variable. I have a 245-line JSON config full of doubled braces. A Jinja2-like syntax would be cleaner, but I'd have to either add a dependency or write a minimal template engine. For now, the doubled braces are ugly but functional. They're also a reliable way to make an LLM lose track of what it's editing.
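For example, a templated JSON fragment ends up looking like this, where only {GIT_EMAIL} is substituted and every literal brace has to be doubled (an illustrative fragment, not one from the repo):

{{
  "user": {{
    "email": "{GIT_EMAIL}"
  }}
}}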

Some kind of file-system-convention approach (drop a .symlink suffix on files you want symlinked) might reduce the config overhead, but I haven't hit enough pain to justify the rewrite.

The test suite now covers config validation, template rendering, file installation, agent generation, skill parity, the markdown formatter, version receipts, utils, and the CLI itself with close to 20 test modules, around 4,000 lines. A pre-submit script runs Black, mypy, and pytest through uv, and the Makefile has targets for each. It's not full CI yet (there's no pipeline triggered on push), but the local workflow catches most regressions before they're committed.

Numbers at a glance

  • 30+ dotfiles managed (15+ templated, 14 symlinked)
  • 3 agents generated for Claude Code and Amp
  • 2 skills deployed to Claude Code and Codex (blog-post, job-prep)
  • 8 coding LLM tools configured with code-aide
  • 12+ development hosts deployed to
  • ~20 test modules, ~4,000 lines
  • 0 external Python dependencies

The whole thing runs with make install and takes about a second. Zero dependencies means it works on a fresh machine with just Python 3.11, which every machine in my fleet already has.

Adding a new machine is: clone, create ~/.secrets.sh by hand, run make install.

Adding a new dotfile is: create the template or source file, add one entry to the config JSON, run make install. I usually get an LLM to make those changes with a prompt like "ingest ~/.newdotfile and manage it with dotfiles", then review and approve.

Thoughts

This work is private but the approach is straightforward enough to replicate by pointing an LLM at this blog post. The interesting bits are the template data pipeline (secrets file + OS detection + JSON config merged at install time) and the zero-dependency constraint, not any particularly clever code.

The coding LLM tool configuration is covered in a companion post: Eight Coding LLM Tools, One Configuration.

Permalink

The Harness Is The Product (and Other Hot Takes)

2026-03-06 19:00

I've spent the last nine months using AI coding tools (Claude Code, Cursor, Gemini CLI, Amp, Codex, and others) on my own projects. I'm currently between jobs, which means I have no corporate agenda and no stake in any of these companies.

I have opinions. Some of them might even survive the week.

Part 1: The Harness Is the Product

GPT-5.4 is very good with Cursor. Surprisingly good. I don't even see it showcased this well in ChatGPT, which is OpenAI's own product. That's a tell.

The most interesting thing happening in AI right now isn't the models; it's the harnesses acting as the integration layer: tool calling, UX, agentic orchestration. A great model in a mediocre harness loses to a good model in a great harness. Gemini's models are competitive but feel underwhelming because Google's tooling can't showcase them, which is presumably why they acquired Windsurf and relaunched it as Antigravity. Claude's models shine brightest through Claude Code. The engine matters, but nobody buys an engine.

Hot take: The model is the engine. The harness is the car.

This has implications for where moats form. For a while it looked like scale and training compute were the only defensible positions. That's still true at the absolute frontier, but below that line, models are converging fast enough that harness quality dominates the user experience. Cursor and Claude Code figured this out early. The companies that win will be the ones treating the model as a component and the harness as the product, which is a deeply uncomfortable position for labs that spent billions training the models.

It's worth being specific about what a modern harness actually does, because the shift is easy to miss. Early AI coding tools worked like this:

prompt → completion

You asked a question, got an answer, tried again if it was wrong.

Modern coding systems work like this:

observe repository → plan change → edit files → run tests → inspect errors → iterate

That loop is subtle but it changes everything. The system isn't generating code snippets; it's participating in a continuous cycle over a real project. It reads the codebase, modifies multiple files, runs commands, and adjusts based on results. Less autocomplete, more collaborator.

And here's the thing: a lot of the agentic stuff IS just this loop with different tools plugged in. An agent observes state, generates a command or script, runs it, inspects the output, decides what to do next. Even tasks that aren't obviously programming often reduce internally to "write some python, call an API, parse the result, continue." If you solve the coding harness, you've solved a large chunk of the general agentic problem. This is something Anthropic realized relatively recently and took advantage of with CoWork.

Hot take: Agents are mostly code-writing loops with tool access.
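Stripped to a skeleton, the loop looks something like this toy sketch; the model and tools objects are placeholders, not any particular product's API:

def agent_loop(task: str, model, tools, max_steps: int = 20) -> str:
    """Observe, act through a tool, feed the result back, repeat until the model says done."""
    context = [f"Task: {task}", tools.observe_repository()]
    for _ in range(max_steps):
        action = model.plan_next_action(context)  # e.g. edit a file, run the tests
        if action.kind == "done":
            return action.summary
        result = tools.execute(action)            # run a command, apply an edit, call an API
        context.append(f"{action}: {result}")     # the feedback that drives the next iteration
    return "stopped: step budget exhausted"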

This also means the IDE is quietly becoming an agent runtime. Editors already provide everything agents need: structured projects, deterministic execution environments, version control, feedback loops. It's not a coincidence that the best agent experiences are happening inside coding tools or on CLIs rather than chat windows.

Hot take: The IDE is becoming the operating system for AI agents.

The Google tragedy

Google is the most painful case study: they have the research, the infrastructure, the talent, and arguably the best foundation model team on earth, yet they keep fumbling the integration layer. The Windsurf acquisition and the Antigravity launch tell the story: Google paid $2.4 billion to license Windsurf's code and hire its founders, then launched Antigravity four months later.

That's a strange failure mode for the company that built Gmail, Maps, and Search. Something broke culturally.

I want Google to do well, and honestly they have adjacent AI products that are very good in my experience. NotebookLM is great, AI search is free and genuinely useful, and the whole Google Docs ecosystem works well with AI. Google's strength has always been horizontal platform plays, and those products reflect that.

But the coding-centric agentic future is a vertical integration game and Google keeps losing it. Their model quality isn't the problem; their harness is.

If harnesses are becoming the product, the next question is: who builds them?

Part 2: Open Source and the Harness Layer

The leading harnesses right now are proprietary. Cursor is proprietary. Claude Code is proprietary. Antigravity is a $2.4 billion proprietary fork. So: closed source wins?

Not so fast. It's worth noting that the model layer hasn't been won by open source either, despite the narrative. Open weights models from Meta and others (mostly Chinese labs) are competitive but the frontier is still closed, and Meta's stuff is clearly a strategic weapon against Google and OpenAI dressed up as generosity.

The harness layer is more interesting because it's more contested. OpenClaw blows a hole in the story. Formerly Clawdbot, then Moltbot, it went from 9,000 to 60,000+ GitHub stars in days and now sits above 250,000. It's not a coding harness in the Cursor sense; it's a general agentic harness with message routing across WhatsApp, Telegram, Discord, and dozens of other channels, autonomous task execution, 50+ integrations, and it runs 24/7 on your own hardware. OpenCode is doing something similar for the coding-specific case.

These projects are moving fast, arguably faster than their closed counterparts on raw feature velocity.

The tradeoff is risk. OpenClaw's attack surface is enormous. Security researchers have mapped it against every category in the OWASP Top 10 for Agentic Applications. There are documented cases of agents acting well beyond user intent; one created a dating profile autonomously, which is either impressive or terrifying depending on your perspective. Its creator, Peter Steinberger, joined OpenAI and the project is moving to an open source foundation. That could mean more institutional backing, or it could mean the founder's departure stalls momentum; it's too early to tell.

So the real picture isn't "open source is winning" or "open source is losing." It's that closed harnesses and open source harnesses are optimizing on different axes:

  • Closed (Cursor, Claude Code): safety, polish, tight model integration
  • Open (OpenClaw, OpenCode): extensibility, speed, community velocity, accepting more risk

Both are viable today. The question is which axis matters more as agentic tools move from developers to everyone else. My guess: the closed harnesses win the mainstream because most people don't want to manage their own attack surface. But open source keeps pushing the bleeding edge, and ideas flow from bleeding edge to mainstream on roughly a three-month delay.

Hot take: The open source harness ecosystem is about three months ahead of commercial tools. The ideas show up there first; the polish shows up later in closed products.

Hot take: Models may become commodities. Harnesses are the product.

This might be the first major technology wave where open source doesn't clearly own the infrastructure layer, or it might not. Ask me again in a few weeks, when this take will be outdated.

Part 3: Code Is the New Assembler (and Other Predictions)

Code is becoming the new assembler. Nobody writes assembler anymore, but it didn't disappear; it just got generated, below the surface. Code is heading the same way. The skill is shifting from "can you write code" to "can you specify intent precisely enough that code gets generated correctly." That's closer to systems architecture than traditional programming.

The agentic loop, where a human specifies, a model generates, a harness orchestrates, and a human validates, is the new unit of work. This now applies well beyond coding, as any task that can be decomposed into tool calls and validation steps is fully in agentic territory. Code is just where it showed up first, because code is the easiest thing to validate (it either runs or it doesn't, mostly).

The competence amplifier

Here's something I didn't expect. Over the past nine months I've shipped working tools and apps written in Go, JavaScript, and Postgres. I don't write Go or JavaScript, although I can read them. I've never administered Postgres in anger. But I have 25+ years of systems experience, and it turns out that's enough. I can read the generated code, spot architectural problems, evaluate whether the error handling makes sense, and steer the iteration loop. I can't write idiomatic Go from scratch but I can tell when the AI-generated Go is doing something stupid.

This is the real shape of "code as assembler." The AI handles the syntax and idiom; the human provides the judgment layer. My experience with distributed systems, failure modes, and operational patterns transfers directly even when I don't know the language. The harness doesn't replace expertise, it makes expertise portable across languages and frameworks in a way that wasn't possible before.

This has two implications.

  1. For experienced developers, your value shifts from "I know language X" to "I know how systems work." That's a bigger, more durable moat.
  2. For non-developers (product managers, designers, domain experts) the barrier to building working software just dropped dramatically. They don't need to learn Go or Python. They need to learn how to specify what they want clearly enough that the loop converges. That's a different skill, and a lot of people already have it without realizing.

Hot take: AI coding tools don't replace developers. They make systems thinking portable across any language or framework.

The model picker disappears

One near-term prediction: the model picker goes away. Nobody types http:// anymore. Nobody picks which CDN node serves their webpage. The system picks. Model selection is heading the same way.

The fact that I currently care whether I'm running Sonnet 4.6 or GPT-5.4 is a sign of immaturity, not a feature. In two years, maybe less, the harness routes tasks dynamically:

cheap model      routine edits, boilerplate
reasoning model  planning, architecture
coding model     implementation
verifier model   checking, testing

The user interacts with one interface. The model choice becomes an implementation detail, like which CPU core your thread runs on. That'll be a sign the ecosystem has grown up.

Hot take: The model menu will eventually disappear.

(The model picker sticking around for power users and experts is fine. I'm talking about the default experience.)
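Under the hood, that routing is little more than a classifier in front of a dispatch table; a toy sketch:

MODEL_ROUTES = {
    "routine_edit": "cheap-model",
    "planning":     "reasoning-model",
    "implement":    "coding-model",
    "verify":       "verifier-model",
}

def route(task_kind: str) -> str:
    """The harness picks the model; the user never sees the menu."""
    return MODEL_ROUTES.get(task_kind, "reasoning-model")  # default to the careful option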

The rate of change problem

The uncomfortable corollary to all of this is that the rate of change is stupid fast. Expertise about specific model behavior expires in days to weeks. Any opinion formed about a model's capabilities on a given Tuesday is stale by the following Tuesday. Including the opinions in this post, presumably.

The durable skills are meta-skills: evaluating models, designing harnesses, specifying intent, thinking in systems. The specific knowledge of "Claude is good at X but bad at Y" or "GPT-5.4 handles long context better than..." is transient. It's useful for a week, maybe two, then something ships and the landscape shifts.

This favors a certain kind of engineer. The senior generalist who thinks in systems, evaluates tradeoffs, and adapts fast. Not the specialist who knows one tool deeply. This is convenient for me, I realize, but I think it's true regardless.

Hot take: The most valuable AI skill is no longer prompting. It's building the loop around the model.

Where this lands

I don't have a neat conclusion. These are hot takes and some of them will age badly. But the harness-as-product thesis feels durable to me, the open source picture is genuinely unsettled, and "code as assembler" is more a description of what's already happening than a prediction.

Interesting times.

Permalink

22 Years of Code, 2 Months of LLMs: The Redland Forge Story

2025-09-13 12:34

Twenty-two years ago, I wrote some Perl scripts to test Redland RDF library builds across multiple machines with SSH. Two months ago, I asked an LLM to turn those scripts into a modern Python application. The resulting Redland Forge application evolved from simple automation into a full terminal user interface for monitoring parallel builds - a transformation that shows how LLMs can accelerate development from years into weeks.

The Shell Script Years (2003-2023)

The project originated from the need to build and test Redland, an RDF library with language bindings for C, C#, Lua, Perl, Python, PHP, Ruby, TCL and others. The initial scripts handled the basic workflow: SSH into remote machines, transfer source code, run the autoconf build sequence, and collect results.

Early versions focused on the fundamentals:

  • Remote build execution via SSH
  • Basic timing and status reporting
  • Support for the standard autoconf pattern: configure, make, make check, make install
  • JDK detection and path setup for Java bindings
  • Cross-platform compatibility for various Unix systems and macOS

Over the years, the scripts grew more features:

  • Automatic GNU make detection across different systems
  • Berkeley DB version hunting (supporting versions 2, 3, and 4)
  • CPU core detection for parallel make execution
  • Dynamic library path management for different architectures
  • Enhanced error detection and build artifact cleanup

The scripts were pretty capable of handling everything from config.guess location discovery to compiler output integration into build summaries.

The Python Conversion (2024)

The script remained largely the same until 2024, when I decided to revisit it. It was time to move on from Perl and shell scripts, and it seemed like a good opportunity to use the emerging LLM coding agents to do that with a simple prompt. This was relatively easy to do; I forget which LLM I used, but it was probably Gemini.

The conversion to Python brought:

  • Type hints and modern Python 3 features.
  • Proper argument parsing with argparse instead of manual option handling
  • Pathlib for cross-platform file operations.
  • Structured logging with debug and info levels.
  • Better error handling and user feedback.

The user experience improved as well:

  • Intelligent color support that detects terminal capabilities
  • Host file support with comment parsing
  • Build summaries with success/failure statistics and emojis. I'm not sure if that's absolutely an improvement, but 🤷

Terminal User Interface (2025)

A year later, in July 2025, with LLM technology advancing almost weekly, I was inspired to make a big change: prompt the tool into a full text user interface, with parallel execution of the builds visible interactively in the terminal.

Continuing from the Python foundation, the tool gained a full terminal user interface. The TUI could monitor multiple builds simultaneously, showing real-time progress across different hosts.

One of the first prompts was to identify which existing Python TUI and other libraries should be used, and this quickly led to blessed for the TUI and paramiko for SSH.

A lot of the early work was making the TUI behave properly in a terminal: no scrolling or overflow from the drawn UI, and correct text wrapping and truncation. Once something worked, prompting the LLM to write unit tests for each of these was very helpful to avoid backsliding.

As it grew, the architecture became much more modular:

  • SSH connection management with parallel execution
  • A blessed-based terminal interface for responsive updates
  • Statistics tracking and build step detection
  • Keyboard input handling and navigation

Each of those came from prompting the LLM to refactor large classes, sometimes identifying which ones to attack by prompting it to analyze the state of the code and suggest candidates, and sometimes by running external code complexity tools, in this case Lizard.

The features grew quickly at this stage:

  • Live progress updates based on an event loop
  • Adaptive layouts that resize with the terminal
  • Automatic build phase detection (extract, configure, make, check, install)
  • Color-coded status indicators, both as builds ran and afterwards
  • Host visibility management for large deployments, so if the window was too small you'd see a subset of hosts building in the window

The design used established design patterns such as the observer pattern for state changes, the strategy pattern for layouts, and a manager (factory) pattern for connections. Most of these were picked by the LLM in use at the time, with occasional guidance such as "make a configuration class".

Completing the application (September 2025)

The final phase built the tool into a more complete application, adding release-focused features and additional testing. The tool transformed from an internal development utility into something that could be shared and be useful to anyone with an autoconf project tarball and SSH.

Major additions included:

  • A build timing cache with persistent JSON storage, so previous build times are remembered
  • Intelligent progress estimation based on the cached times (sketched below)
  • Configurable auto-exit functionality with a countdown display
  • Keyboard-based navigation of hosts and logs, with a full-screen host mode and interactive menus
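The progress estimation is worth sketching: given a cached duration for a build step on a host, the estimate is just elapsed time over the previous run's time. The cache path and structure here are illustrative:

import json
import time
from pathlib import Path

CACHE = Path.home() / ".redland-forge-timings.json"  # illustrative path, not the real one

def estimate_progress(host: str, step: str, started_at: float) -> float | None:
    """Return a 0.0-1.0 progress estimate for a build step using cached durations."""
    if not CACHE.exists():
        return None
    timings = json.loads(CACHE.read_text())
    previous = timings.get(host, {}).get(step)  # seconds the step took last time
    if not previous:
        return None
    elapsed = time.time() - started_at
    return min(elapsed / previous, 0.99)        # never claim done until it actually finishes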

The testing at this point was quite substantial:

  • Over 400 unit tests covering all components
  • Mock-based testing for external dependencies
  • Integration tests and edge cases

At this point it was doing the job fully and seemed complete, and of broader use than just Redland dev work.

Learnings

Redland Forge demonstrates how developer tools evolve. What started as pragmatic shell and Perl scripts for a specific need grew into a sophisticated application. Each phase built on the previous one, with the Python conversion serving as the catalyst that enabled the terminal interface.

It also demonstrates how LLMs in 2025 can act as a leverage multiplier for productivity, when used carefully. I did spend a lot of time pasting terminal outputs for debugging the TUI boxes and layout. I used lots of git commits and tags when the tool worked; I even developed a custom command to make the commits in the way I preferred, avoiding the hype that some models tend to add, but that's another story or blog post. When the LLMs made mistakes, I could always go back to the previous working state (git reset --hard), or ask them to try again, which worked more often than you'd expect. Or try a different LLM.

I found that coding LLMs can work on their own somewhat, depending on the LLM in question. Some regularly prompt for permissions or end their turn after some progress, whereas others just keep coding without checking back with me. This allowed some semi-asynchronous development, where a bunch of work was done, then I reviewed it and adjusted. I did review the code, since I know Python well enough.

The skill I learnt the most about was writing prompts, or what is now being called spec-driven development, for the later, larger changes. I described what I wanted to one LLM and had it write the markdown specification, and sometimes asked a different LLM to review it for gaps, before one of them implemented it. I often asked the LLM to update the spec as it worked, since sometimes the LLMs crashed or hung or looped with the same output; if the spec was up to date, the chat could be killed and restarted. Sometimes just telling the LLM it was looping was enough.

I'm happy with the final application, and it's nearly 100% written by the LLMs, including the documentation, tests, and configuration, although it was 100% prompted by me, 100% tested by me, and 100% of the commits were reviewed by me. After all, it is my name in the commit messages.


Disclaimer: I currently work for Google who make Gemini.

Permalink

Production Chaos

2025-02-01 11:00

Chaos happens a lot in production and in the associated roles such as Site Reliability Engineering (SRE). Day to day you can be dealing with a scale of chaos ranging from noise, interruptions, unknowns, and mysteries all the way up to incidents, emergencies and disasters. If you are working in that space, you will have to deal with tradeoffs of risk, time, uncertainty and more. The "unknown unknowns", as Donald Rumsfeld put it, or the 1-in-a-million events, can happen regularly if you are operating a lot of code, data or systems.

If this is going to happen all the time, you need to have support around you, in particular a team, leadership and organization you can trust to support you whatever happens. You have to be able to relax even in a stressful environment, not worrying about your personal safety or career. This leads to the SRE best practice of blamelessness when things are failing: it's the fault of the system, not the person. There is no way you are going to get people working their best if they are going to get blamed for making mistakes. That way leads to hiding things, avoiding responsibility and a negative feedback loop where everyone avoids making things better.

If you have a culture of blame and fear, you are going to get the worst from your people. Which leads me to my experience working at Twitter when Phony Stark aka Space Karen aka Elmo Maga bought it. He did not trust his employees, did not support them, did not communicate with them, and indeed blamed them. He wanted and fostered a culture of fear and uncertainty.

It was so chaotic at the end that I once had two managers message me in the same hour to say they were my new manager. I also didn't know at the time that he was my manager for two days:

[Picture: a table of my Twitter managers and dates, with names redacted except for Elon Musk]

Elon Musk is a negative example of how to manage and how to be a grown human. He has many character flaws and a Character Limit.

He is not an example to copy.

It's nauseating seeing him repeating this again at the US Government: Déjà Vu: Elon Musk Takes His Twitter Takeover Tactics to Washington (Gift link)

Permalink