Dave Beckett

Text Files as Tables

2026-05-01 09:00

This is a companion post to Twitter Hadoop Breakfix, Four Years On. That post was about the architecture of the system as a whole. This one is about the particular part that did the most work per line of code: sorted text files, manipulated with comm, join, awk, grep, sort -u.

I would claim that this style is a small line-oriented relational algebra. Once you see that, a lot of operational shell stops looking like a pile of text tricks and starts looking like a database you happen to be able to read with cat.

One row per line

The breakfix system in the companion post used a working directory full of text files. Each file held one hostname per line. Sorted. That was the whole schema.

Some of the files the scanner maintained:

all
hadoop-workers
excluded
managed-puppet-failing
responsive
dead-rekick
breakfix-good
maintenance-in-service

Plain newline-separated text files. In most cases the entire line (row) or the first field is the key. That was enough structure for comm, join, sort, awk, and grep to behave like a tiny analytics engine, and the engine ran on anything that had a shell.
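
A file like responsive might look like this (hostnames illustrative):

host0001.example.com
host0002.example.com
host0004.example.com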

Sort first, then think in sets

comm and join expect sorted input. If the input is not sorted they produce silently wrong answers — not an obvious error, just bad data. Sort order is the contract: the one invariant every pipeline in this style has to respect.

Once the inputs are sorted, set algebra becomes a one-liner.

  • comm -12 A B — lines in both A and B (intersection)
  • comm -23 A B — lines in A that are not in B (set difference, i.e. an anti-join)

The output is itself sorted, so it composes and you can pipe a comm into another comm or into join without re-sorting.
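
For instance, comm reads - as standard input, so one comm can feed another directly. A sketch, assuming A, B, and C are all sorted under the same collation (LC_ALL=C everywhere is the safe choice):

# lines in A but not B, then intersected with C, no re-sort between stages
comm -23 A B | comm -12 - C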

From the breakfix retrospective, the computation to form candidate hosts to rekick (reinstall, put through the install loop) was roughly:

all
  - excluded                # comm -23
  ∩ responsive              # comm -12
  ∩ puppet_ok               # comm -12
  → candidate-rekick
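
As commands, with every intermediate materialised (the intermediate names are illustrative):

comm -23 all excluded            > not-excluded      # drop the exclusions
comm -12 not-excluded responsive > alive             # keep hosts that answered
comm -12 alive puppet_ok         > candidate-rekick  # keep hosts with a clean puppet run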

Four input files, one output file, every step in a file and inspectable. The SQL translation is something like:

SELECT hostname
FROM   hosts
WHERE  hostname NOT IN (SELECT hostname FROM excluded)
  AND  hostname IN     (SELECT hostname FROM responsive)
  AND  hostname IN     (SELECT hostname FROM puppet_ok);

SQL is more declarative and has better names. The shell version materialises every step as a real file on disk, which you can wc -l, head, diff against yesterday's version, attach to a ticket, or hand to another operator without having to export anything. Inspectability in exchange for elegance is the repeating tradeoff in this style.

Columns arrive later

Once the classifier had done its set algebra on hostnames, the next step was usually to join inventory attributes to a group. The breakfix scanner kept a wide-format dump called all-data with one row per host, generated from the infrastructure query tool (loony), with fields for hostname, role, puppet branch, kernel, service ticket, Wilson lifecycle state, and so on.

awk turned that into a projection and filter:

awk -F'\t' '$6 == "managed" && $5 == "-" { print $1 }' all-data

That is

SELECT hostname FROM all_data
WHERE lifecycle = 'managed' AND service_ticket IS NULL;

with the column names explicit in SQL, and implied by position in the plain text, where a field is whatever sits between one tab and the next.

This is where the style starts being both powerful and fragile. The schema is not written down anywhere, it's a positional API. Column 7 means something important and the maintainer six months later is the person who has to remember what. In practice this was handled by keeping the loony format strings in the scanner's config and treating the wide-dump format as a contract that nobody edited casually. The contract lived in convention but not in code.
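
One mitigation, sketched here with hypothetical names: have the awk program name its own columns once, at the top, so the positional contract is at least written down next to the code that depends on it:

awk -F'\t' '
  BEGIN { HOST = 1; TICKET = 5; LIFECYCLE = 6 }              # the schema, stated once
  $LIFECYCLE == "managed" && $TICKET == "-" { print $HOST }
' all-data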

join is exactly what it sounds like

The POSIX join command does an equi-join on a shared key. By default the key is the first field of each file, and as with comm, both inputs must be sorted on that key.

Given the hadoop-workers hostname list and the all-data wide dump, joining them gave two fields (host, role) projected from the inventory view, restricted to worker nodes:

join -1 1 -2 1 -o 1.1,1.2 all-data hadoop-workers > hadoop-workers-role

-1 1 -2 1 says "join on field 1 of both files", which is also the default; spelling it out makes the shape of the join obvious. -o 1.1,1.2 is the projection — pick fields 1 and 2 from the first file. When the key is not field one, the -1 N -2 M form is how you say so. It is fiddlier than SQL ON and SELECT. The semantics are the same.
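
For example, restricting the inventory to a hypothetical roles file means joining on all-data's second field, and both sides have to be sorted on their join fields first (bash's $'\t' spells the tab):

sort -t$'\t' -k2,2 all-data > all-data.byrole   # sorted on the join key, not field 1
sort roles > roles.sorted
join -t$'\t' -1 2 -2 1 -o 1.1,1.2 all-data.byrole roles.sorted > hosts-in-roles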

That file was then the input to more filters, more joins, more set operations. Intermediate stages had names like managed-puppet-failing-workers and breakfix-good. Every one of them was a file on disk: inspectable, diffable, re-runnable. That's outside the scope of this post, but if you removed an output file, the breakfix runner would notice and do only the work needed to recompute it.

Distinct, union, group by

The other common relational operations have a similar shape.

sort -u is SELECT DISTINCT. Apply it to concatenated inputs and you get UNION:

sort -u hadoop-workers hbase-workers > all-workers

sort | uniq -c is GROUP BY plus COUNT(*):

awk -F'\t' '{print $3}' all-data | sort | uniq -c | sort -nr | head

That one counts hosts per puppet branch, sorts by count descending, takes the top few for a person to review. No database has to be involved. In this model, the database is the directory, the tables are the files and the query plan is the pipeline.
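
Spelled as SQL for comparison, with head's default ten lines becoming a LIMIT:

SELECT puppet_branch, COUNT(*) AS n
FROM   all_data
GROUP  BY puppet_branch
ORDER  BY n DESC
LIMIT  10;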

Where it breaks

The weaknesses are real and I lived through most of them.

If the inputs are not sorted on the right key, comm and join produce wrong answers. If a delimiter turns up inside a field, your columns shift. In a wide row, $17 is unmemorable and you have to go and read the script comments.

Empty fields were another quirk. In SQL a missing value is NULL; in whitespace-separated text it is ambiguous with a run of spaces or a collapsed column. The convention I settled on was - as the empty-field placeholder in files on disk: very readable at a glance, easy to grep -v '^-' or spot while scanning output.

When a pipeline needed an unambiguous delimiter mid-stream I switched to @, which is illegal in hostnames and rare in filenames, picked up through awk -F@, sort -t@, join -t@, and IFS='@' read for shell loops. @-delimited data never landed in a file though, because it reads as very noisy to a person. Files got optimised for human reading; pipelines got optimised for parsing correctness. A bit messy, but reliable.
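
A sketch of the mid-stream shape (the projection here is illustrative):

# project host and puppet branch; @ is the unambiguous seam between fields
awk -F'\t' -v OFS=@ '{ print $1, $3 }' all-data |
  sort -t@ -k2,2 |
  while IFS=@ read -r host branch; do
    # back to tab-separated before it lands in a file a person will read
    printf '%s\t%s\n' "$host" "$branch"
  done > hosts-by-branch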

There is also a maintainability cliff. The retrospective describes the scanner that eventually grew to hundreds of lines of shell — that's the same code I'm talking about here, and the cliff is the same one. The pattern scaled beautifully from 5 lines to 50; somewhere between 50 and 500 it started fighting back. Not because shell was the wrong tool for line-oriented set algebra, but because once a program is big enough to need a mental model, the lack of named columns and typed relations becomes a cost rather than a freedom.

Why it survives anyway

Given all of that, why not just load everything into SQLite? Often, that's the better answer. Load the extracts into tables, name the columns, add indexes, write the query with JOIN, EXCEPT, and GROUP BY. For anything branching or long-lived, that's easier to review and safer to refactor.
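
A sketch of that route using the group files from earlier (the scratch database name is arbitrary):

sqlite3 scratch.db <<'SQL'
CREATE TABLE hosts(hostname TEXT);
CREATE TABLE excluded(hostname TEXT);
CREATE TABLE responsive(hostname TEXT);
CREATE TABLE puppet_ok(hostname TEXT);
.import all hosts
.import excluded excluded
.import responsive responsive
.import puppet_ok puppet_ok
SELECT hostname FROM hosts
WHERE  hostname NOT IN (SELECT hostname FROM excluded)
  AND  hostname IN     (SELECT hostname FROM responsive)
  AND  hostname IN     (SELECT hostname FROM puppet_ok);
SQL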

The line-oriented style holds its ground because of three properties.

Portability. sh, awk, sort, comm, join, grep, sed are usually already there. No runtime, no driver, no data directory. A bastion host has them; a rescue image has them; a ten-year-old machine has them.

Inspectability. Every intermediate result is a named file on disk. When someone asks "why did we drain these ten hosts?" the answer is in the working directory — in the actual group files the decision was made from. An SRE who had never seen the code before could read the file names in sequence and reconstruct what the scanner had concluded and why.

The filesystem is the program counter. This is the one I underestimated at the time. Because every stage's output is a file, the pipeline's state is the working directory. You can stop a run with ^C, look at what's there, delete a file you don't trust, and resume. Partial progress is free. The companion post talks about "do nothing" as a first-class outcome of the automation; the file-backed pipeline is the operational equivalent at the implementation layer. Stopping is safe. Inspecting is free. Resuming costs nothing.
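
make is the classic spelling of that property: a missing or stale target is exactly what triggers work. Whether or not the real runner was make, the shape is the same. The rekick chain from earlier as a sketch of a Makefile (recipes indented with a tab):

candidate-rekick: alive puppet_ok
	comm -12 alive puppet_ok > $@

alive: not-excluded responsive
	comm -12 not-excluded responsive > $@

not-excluded: all excluded
	comm -23 all excluded > $@

Delete alive, run make again, and only alive and candidate-rekick are recomputed.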

When to reach for which

Use sorted files with comm, join, and awk when:

  • Inputs are already logs, host lists, inventory dumps, or metric exports. You are not choosing the representation; you are receiving it.
  • You want every intermediate result visible as a file for audit, handoff, or post-hoc review.
  • The environment is minimal or you cannot rely on SQLite being there.
  • The pipeline is short enough to hold in your head.
  • The mental effort of using SQL is just too much for the job at hand.

Reach for SQLite or a real engine when:

  • The logic branches a lot or spans many steps.
  • You need correctness under messy delimiters, typed fields, or larger-than-RAM data with indexes.
  • One declarative query is easier for your team to review than a page of pipelines.
  • The schema is going to be edited by more than one person over more than one year.

The mental model on one screen

Once it clicks, the whole model fits in a screenful:

  • A file is a table.
  • A line is a row.
  • The first field is usually the key.
  • awk is selection and projection.
  • sort -u is distinct.
  • cat A B | sort -u (or sort -u A B) is union.
  • comm -12 is intersection.
  • comm -23 is set difference.
  • join is an inner join.
  • sort | uniq -c is group by and count.

That is a composable, line-oriented relational algebra. It has sharp edges and almost no dependencies. Respect sort order, keep one row per record, name your intermediate files like tiny tables. What you have then is a query engine: portable, transparent and built out of POSIX tools that are already on the machine.

It's not the right tool for every job, but it remains a legitimate one, and in the right environment it's still the path of least resistance, especially when the inputs are already line-oriented files.

The breakfix post was about the architecture; this one is about the query engine hiding inside it.