Head of SRE / Bare Metal Servers - Hosting

1626718
  • $450,000 gross per year
  • Bay Area, CA or Texas
  • Permanent
  • Telecoms
  • IP Networking & Transmission


Our client is a high-growth startup building a next-generation GPU-powered compute platform. They are targeting 300MW of capacity by 2026, with the first 40MW site going live in January. We are seeking a Head of SRE / Bare Metal Servers to lead reliability, scalability, and automation across their high-performance compute infrastructure, and to build out their technical team from scratch.

If you are interested in this opportunity, we encourage you to apply today!

Responsibilities 

  • Own site reliability engineering (SRE) strategy across large-scale bare-metal GPU clusters.
  • Drive automation for deployment, monitoring, and management of CPU, GPU, and storage systems.
  • Partner with infrastructure, network, and data center teams to ensure ultra-high availability.
  • Lead design and implementation of disaster recovery, incident response, and performance optimization strategies.
  • Build, scale, and mentor a high-performing SRE team with full autonomy on hiring, tooling, and process design.
  • Oversee vendor selection and system lifecycle management for bare-metal hardware.

Requirements

  • Proven experience leading SRE or infrastructure reliability teams in hyperscale or neocloud environments.
  • Hands-on expertise with bare-metal GPU/CPU clusters, HPC workloads, and automation tooling.
  • Strong background in monitoring, observability, incident management, and system scaling.
  • Familiarity with network automation, virtualization, and HPC orchestration.
  • Prior work in organizations that deploy large-scale AI infrastructure is highly preferred.
  • Experience with AI/ML infrastructure deployments highly desirable.

Salary:

  • $450,000 gross per year

Ben Davies, Director, Global AI Infrastructure
