Head of SRE / Bare Metal Servers - Hosting
1626718
Posted: 08/09/2025
- $450,000 gross per year
- Bay Area, CA or Texas
- Permanent
- Telecoms
- IP Networking & Transmission
Our client is a high-growth startup building a next-generation GPU-powered compute platform. They are targeting 300MW of capacity by 2026, with the first 40MW site going live in January. We are seeking a Head of SRE / Bare Metal Servers to lead reliability, scalability, and automation across their high-performance compute infrastructure, and to build out their technical team from scratch.
If you are interested in this opportunity, we encourage you to apply today!
Responsibilities
- Own site reliability engineering (SRE) strategy across large-scale bare-metal GPU clusters.
- Drive automation for deployment, monitoring, and management of CPU, GPU, and storage systems.
- Partner with infrastructure, network, and data center teams to ensure ultra-high availability.
- Lead design and implementation of disaster recovery, incident response, and performance optimization strategies.
- Build, scale, and mentor a high-performing SRE team with full autonomy on hiring, tooling, and process design.
- Oversee vendor selection and system lifecycle management for bare-metal hardware.
Requirements
- Proven experience leading SRE or infrastructure reliability teams in hyperscale or neocloud environments.
- Hands-on expertise with bare-metal GPU/CPU clusters, HPC workloads, and automation tooling.
- Strong background in monitoring, observability, incident management, and system scaling.
- Familiarity with network automation, virtualization, and HPC orchestration.
- Prior work in organizations that deploy large-scale AI infrastructure is highly preferred.
- Experience with AI/ML infrastructure deployments is highly desirable.
Salary:
- $450,000 gross per year
Ben Davies
Director Global AI Infrastructure