Head of Platform/AI Cluster Management - System Integrator
- $500,000 gross per year (Negotiable)
- San Francisco, California, United States
- Permanent
- Enterprise
Ready to lead innovation at the intersection of platforms and artificial intelligence?
Join a pioneering technology company driving advancements in cloud, AI, and data-driven solutions across global markets. The organisation is recognised for fostering innovation, scalability, and collaboration through cutting-edge platforms that empower enterprises to evolve intelligently.
The team is hiring a Head of Platform/AI Cluster Management to oversee the strategic development, integration, and optimisation of AI and platform initiatives. The role will focus on leading cross-functional teams, enhancing performance and scalability, and aligning technology strategy with long-term business goals.
Shape the future of intelligent platforms and transformative innovation. Apply now!
Responsibilities:
- Own the scheduler/runtime layer (Slurm, Kubernetes, Ray), including multi-tenancy, quotas, and GPU/host fleet management.
- Lead cluster operations, covering image management, CI/CD, node repair and health, performance telemetry, and incident response.
- Deliver platform services that meet workload SLOs and provide reliable runtime execution.
- Define and implement namespace/tenancy design, node health automation, golden images, admission controls, on-call runbooks, and go-live gates.
- Collaborate closely with infra, SRE, and network teams to optimise workload placement and cluster efficiency.
- Provide hands-on expertise in NCCL behaviour, workload placement strategies, and the management of congestion signals.
Requirements:
- Deep expertise in cluster management, scheduling, and runtime environments for large-scale compute.
- Hands-on background with Slurm, Kubernetes, Ray, or similar orchestration platforms.
- Strong understanding of NCCL performance tuning, workload isolation, and congestion management.
- Experience scaling multi-tenant, GPU-heavy clusters with strict SLOs.
- Ability to thrive in a startup environment with full ownership over platform and cluster strategy.