EngineerJobs.io
← Back to all jobs

Job Description

The AI and HPC System Performance Engineer will join Meta's Network Infrastructure Engineering team to characterize end-to-end performance, identify bottlenecks, and optimize large-scale AI training and inference clusters. The role sits at the intersection of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.

Compensation

USD 154,000 - 217,000 per year

Location

Menlo Park, CA (onsite)

Experience

Minimum 6 years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.

Summary

AI/HPC System Performance Engineer on Meta's Network Infrastructure Engineering team responsible for end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. The role emphasizes the convergence of network fabric design, distributed computing, and AI workload behavior to maximize throughput and efficiency.

Responsibilities

  • Profile and benchmark AI training and inference workloads across expansive HPC clusters to identify bottlenecks in network, compute, and memory resources.
  • Develop and maintain performance analysis frameworks and dashboards to monitor system-level metrics such as GPU utilization, network bandwidth, latency, and the efficiency of collective communication.
  • Investigate and resolve performance regressions in distributed AI environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling.
  • Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations.
  • Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure.
  • Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents.
  • Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets.
  • Lead technical design reviews for network and system architecture changes that affect AI workload performance, communicating trade-offs clearly to engineering and product stakeholders.
  • Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices.
  • Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack.

Requirements

  • Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI.
  • Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers.
  • Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure.
  • Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders.
  • 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments.

Technologies

  • NCCL
  • MPI
  • PyTorch
  • TensorFlow
  • C++

Benefits

  • Bonus
  • Equity
  • Benefits

Similar Jobs

Get Job Alerts

New jobs delivered to your inbox.