Dave Beckett — Staff / Principal Engineer Reliability and Distributed Systems.
San Francisco, California, USA
dave@dajobe.org | 650-450-8421 | www.dajobe.org | github.com/dajobe | www.linkedin.com/in/dajobe
Summary
Staff-level engineer with deep experience leading reliability, distributed systems, and data infrastructure at global scale. Proven track record designing SRE operating models, SLO frameworks, and partner-operated cloud infrastructure, including sovereign cloud deployments. Experienced at turning ambiguous, cross-organizational problems into durable systems, processes, and teams that scale. Improved SLO compliance and operational readiness across services used by hundreds of engineering teams.
Core Impact Areas
- Site Reliability Engineering: Strategy, governance, SLOs, SLIs, error budgets, incident response
- Distributed systems and data infrastructure at massive scale
- Cross-org technical leadership, mentorship and partner enablement
- Sovereign and regulated cloud platforms (GCP)
- Capacity planning, cost optimization, and operational tooling
Professional Experience
- Mar 2023 — Feb 2026: Google, Sunnyvale, California, USA
- Staff Systems Engineer (Mar 2023 — Feb 2026)
- Staff Systems Engineer (SRE) in the Site Reliability Engineering group of the Google Cloud Platform (GCP) organization, working on Google Cloud Distributed (GCD) sovereign cloud infrastructure independently operated by S3NS in France. Joined the project one year in and worked through to general availability in December 2025 (Announcement).
- Key Contributions and Impact:
- Designed staffing and expertise model for 80+ person partner operations organization covering 300-400 services. Model adopted by Google and partner leadership to guide hiring and readiness.
- Created the High Expertise Services framework to classify services by complexity and operator training requirement, enabling focused training and resource prioritization.
- Closed systemic tooling gaps by driving cross-team ownership, documentation standards, and deployment validation for hundreds of operational tools, improving readiness confidence across partner teams.
- Converted contractual obligations into operational SLOs, built metrics pipelines, and integrated them into GCP's SLO reporting. Delivered Phase 1 covering half the target SLO set, giving partners and Google leadership real operational insight.
- Developed SLA response-time metrics for partner compliance (first response and follow-ups), enabling data-driven compliance tracking and reducing manual reporting effort.
- Led partner enablement through structured office hours and interactive training, accelerating readiness and reducing dependency on lecture-only formats.
- Maintained hands-on operational leadership throughout early partner simulations and rotation support, reducing escalations. Covered critical on-call duties to relieve dev teams.
- May 2016 — Feb 2023: Twitter, San Francisco, California, USA
- Senior Staff Site Reliability Engineer - Data Platform (Jun 2021 — Feb 2023)
- Technical leader responsible for reliability and scalability of Twitter’s data platform, including some of the largest Hadoop clusters in the world, spanning on-prem and cloud environments.
- Key Contributions and Impact:
- Led reliability and automation for one of the largest Hadoop platforms, materially improving stability and scalability while supporting rapid growth.
- Led incident response and postmortem processes across 5+ years of on-call, driving systemic reliability improvements.
- Designed and executed capacity optimization strategies that reduced costs by millions through forecasting and hardware tuning. Delivered a 30% TCO improvement and 50% faster runtime (see white paper).
- Defined target hardware profiles and partnered with vendors to validate production-ready designs.
- Directed the architectural design and rollout of the data platform's Hadoop migration to cloud in partnership with GCP, securing resilient deployment and network patterns. Enabled cloud data platforms to be used in coordination with on-premises systems.
- Built automation for operational toil reduction (upgrades, fault detection, remediation) across bare-metal and hybrid fleets.
- Served as senior technical advisor on platform strategy, reliability posture, and long-term capacity planning.
- Created SRE onboarding education and presented it to hundreds of engineers over 3 years, strengthening SRE culture and practices.
- Mentored 2 senior engineers, with 1 promoted to staff.
Additional Experience
- Rackspace: Senior Software Engineer operating Hadoop analytics over monitoring data with Ansible, HBase, Scala and Cascading.
- Digg: Lead Software Engineer working full stack across Cassandra, Redis, API design and mobile web.
- Yahoo!: Principal Software Architect for large-scale media and data systems across News, Sports, Finance, and Entertainment properties.
- W3C / Open Standards: Co-author and editor of major RDF / Turtle specifications. Co-authored a W3C Recommendation with Tim Berners-Lee.
- Open Source: Founder of Redland RDF; long-standing Debian Project Developer.
- Academia and Research: Multi-year technical leadership in large-scale web infrastructure and metadata systems.
Technical Skills
- Systems and Languages: C (25+ years), Python (10+ years), Go (reading / debugging). Polyglot background across a dozen languages.
- Distributed Systems: Hadoop (HDFS, YARN, Hive, HBase, ZK)
- SRE Practices: SLO / SLI design, governance, error budgets, incident analysis, capacity modeling
- Cloud and Infrastructure: GCP, sovereign cloud, Linux, Kubernetes
- Automation and Operations: Internal observability and monitoring platforms at Google and Twitter (metrics, alerting, dashboards, tracing). Deployment and change management for 10K+ node fleets. Fundamentals transferable to Prometheus, Datadog, Terraform, and similar.
Education
- 1987-1990, University of Bristol
- BSc (Hons) Degree in Computer Science