AI/HPC System Performance Engineer
Job Description
The AI and HPC System Performance Engineer will join Meta's Network Infrastructure Engineering team to characterize end-to-end performance, identify bottlenecks, and optimize large-scale AI training and inference clusters. The role sits at the intersection of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.
Compensation
USD 154,000 - 217,000 per year
Location
Menlo Park, CA (onsite)
Experience
Minimum 6 years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.
Summary
AI/HPC System Performance Engineer on Meta's Network Infrastructure Engineering team responsible for end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. The role emphasizes the convergence of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.
Responsibilities
- Profile and benchmark AI training and inference workloads across expansive HPC clusters to identify bottlenecks in network, compute, and memory resources.
- Develop and maintain performance analysis frameworks and dashboards to monitor system-level metrics such as GPU utilization, network bandwidth, latency, and the efficiency of collective communication.
- Investigate and resolve performance regressions in distributed AI environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling.
- Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations.
- Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure.
- Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents.
- Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets.
- Lead technical design reviews for network and system architecture changes that affect AI workload performance, communicating trade-offs clearly to engineering and product stakeholders.
- Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices.
- Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack.
Requirements
- Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI.
- Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers.
- Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure.
- Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders.
- 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.
Technologies
- NCCL
- MPI
- PyTorch
- TensorFlow
- C++
Benefits
- Bonus
- Equity
- Benefits