AI. SF. Crusoe : Dave Beckett

Today I started as a Senior Staff Production Engineer at Crusoe Energy Systems.

I left Google in January, just after the sovereign cloud project I’d spent the last three years on went generally available for S3NS Cloud in December. It felt like the right time to move on. I then spent nearly three months interviewing across different companies, roles, and levels, which gave me the chance to be deliberate about what I wanted to do next.

I kept coming back to the same criteria: infrastructure, AI, San Francisco, and a company with real operational problems to solve. I wanted to work somewhere I could contribute technically, but also help raise the bar more broadly on reliability and production quality.

The timing

AI infrastructure is one of the most important buildouts happening right now, and a lot of the reliability patterns are still being worked out. The systems are getting bigger, the demand is intense, and the usage patterns are changing fast. Inference and software engineering are already shifting from human-rate requests to agent-rate requests, at high sustained volume, all day, every day.

I do not think the AI industry has yet figured out how to make that reliable at scale. That is a large part of what makes it interesting to me.

Crusoe also seems to be at an important stage as a company. It is well past the early survival phase and into the harder question of how to scale well. I think it's at the turning point where operational excellence becomes a differentiator and where reliability needs stronger engineering discipline.

The stack

Crusoe owns a much larger part of the stack than most companies do: energy sourcing, data center construction, GPU infrastructure, cloud platform, and managed AI services.

That matters because it means a customer-facing problem can often be traced all the way down to its root cause, whether that is power, cooling, network, or software. Then you can fix it at the right layer instead of working around it from a distance.

I had that kind of environment at Twitter with the Hadoop fleet, and I have missed it. Systems design is interesting on its own, but there is something especially satisfying about reliability problems that also have a physical dimension.

The role

Crusoe is investing seriously in site resiliency and operational excellence. My role is a senior IC position in Production Engineering, focused on operational readiness reviews, reliability architecture reviews, disaster recovery testing, and helping set the bar for production quality.

That is exactly the kind of work I want to be doing. I have spent a long time seeing what works, what breaks, and what tends to get ignored until it hurts. The chance to help build a strong reliability practice is a big part of the appeal.

The people

I knew a few people at Crusoe from Twitter and elsewhere in the industry. The interview conversations felt more like working sessions than performance exercises, which I appreciated. I wanted to work with people who were engaged with the problems, curious about my past experience, and direct about what they needed.

After seven years at Twitter and three at Google, I have spent a decade working on infrastructure at large scale. Crusoe is a chance to do that again in a different setting: in San Francisco, in AI, and at a company where a lot of the practices are still being defined.

Put together, those were good reasons to say yes.
