The AI and HPC System Performance Engineer will join Meta's Network Infrastructure Engineering team to characterize end-to-end performance, identify bottlenecks, and optimize large-scale AI training and inference clusters. The role sits at the intersection of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.

Compensation

USD 154,000 - 217,000 per year

Location

Menlo Park, CA (onsite)

Experience

Minimum 6 years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.

Summary

AI/HPC System Performance Engineer on Meta's Network Infrastructure Engineering team responsible for end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. The role emphasizes the convergence of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.

Responsibilities

Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify bottlenecks in network, compute, and memory resources.
Develop and maintain performance analysis frameworks and dashboards to monitor system-level metrics such as GPU utilization, network bandwidth, latency, and collective communication efficiency.
Investigate and resolve performance regressions in distributed AI environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling.
Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations.
Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure.
Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents.
Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets.
Lead technical design reviews for network and system architecture changes that affect AI workload performance, communicating trade-offs clearly to engineering and product stakeholders.
Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices.
Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack.

Requirements

Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI.
Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers.
Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure.
Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders.
6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.

Technologies

NCCL
MPI
PyTorch
TensorFlow
C++

Benefits

Bonus
Equity
Benefits

