Head of Platform/AI Cluster Management - System Integrator

1631148
  • $500,000 gross per year (Negotiable)
  • San Francisco, California, United States
  • Permanent
  • Enterprise


Ready to lead innovation at the intersection of platforms and artificial intelligence?

Join a pioneering technology company driving advancements in cloud, AI, and data-driven solutions across global markets. The organisation is recognised for fostering innovation, scalability, and collaboration through cutting-edge platforms that empower enterprises to evolve intelligently.

The team is hiring a Head of Platform/AI Cluster Management to oversee the strategic development, integration, and optimisation of AI and platform initiatives. The role will focus on leading cross-functional teams, enhancing performance and scalability, and aligning technology strategy with long-term business goals.

Shape the future of intelligent platforms and transformative innovation. Apply now!


Responsibilities:

  • Own the scheduler/runtime layer (Slurm, Kubernetes, Ray), including multi-tenancy, quotas, and GPU/host fleet management.
  • Lead cluster operations across images, CI/CD, repair/health, performance/telemetry, and incident response.
  • Deliver platform services that ensure workload SLOs and reliable runtime execution.
  • Define and implement namespace/tenancy design, node health automation, golden images, admission controls, on-call runbooks, and go-live gates.
  • Collaborate closely with infra, SRE, and network teams to optimise workload placement and cluster efficiency.
  • Provide hands-on expertise in NCCL behaviours, placement strategies, and congestion signal management.


Requirements:

  • Deep expertise in cluster management, scheduling, and runtime environments for large-scale compute.
  • Hands-on background with Slurm, Kubernetes, Ray, or similar orchestration platforms.
  • Strong understanding of NCCL performance tuning, workload isolation, and congestion management.
  • Experience scaling multi-tenant, GPU-heavy clusters with strict SLOs.
  • Ability to thrive in a startup environment with full ownership over platform and cluster strategy.


Salary:

  • $500,000 gross per year (Negotiable)
Ben Davies Director Global AI Infrastructure

Apply for this role