EngineerJobs.io
Annapurna Labs (U.S.) Inc.

Software Engineer I - AI/ML, AWS Neuron Distributed Training

Cupertino, CA | $127k - $185k/yr | Full time | Posted 6d ago

Job Description

Annapurna Labs (U.S.) Inc. is seeking a Software Engineer I with a focus on AI/ML distributed training. The role centers on optimizing large-scale models for AWS Trainium, implementing mixed-precision strategies, and extending training frameworks within the Neuron ecosystem, in collaboration with hardware and AWS teams. This is an onsite position based in Cupertino, CA, offering a compensation range of USD 127,100 to 185,000 per year.

Responsibilities

  • Contribute to designing and implementing distributed training solutions for large-scale ML models running on Trainium instances.
  • Extend and optimize popular distributed training frameworks within the Neuron ecosystem, including FSDP, torchtitan, and Hugging Face libraries.
  • Develop and optimize mixed-precision and low-precision training techniques, covering BF16, FP8, and emerging formats to boost throughput while preserving accuracy and convergence quality.
  • Implement precision-aware training strategies, loss scaling, and careful gradient management to ensure stability across reduced-precision formats.
  • Profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware.
  • Collaborate with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities.
  • Partner with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.

Requirements

  • Bachelor's degree or higher in computer science, computer engineering, or a related field, or an equivalent degree.
  • 1+ years of programming experience in at least one programming language, which may include academic projects, internships, or research.
  • Experience with software development practices such as code reviews, source control, testing, and build processes.
  • Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow).

Technologies

  • FSDP
  • torchtitan
  • Hugging Face libraries
  • PyTorch
  • JAX
  • TensorFlow
  • Trainium
  • AWS Neuron
  • BF16
  • FP8

Benefits

  • Health insurance
  • 401(k) matching
  • Paid time off
  • Parental leave
  • Sign-on payments
  • Restricted stock units (RSUs)
  • Flexible Spending Accounts
  • Employee Assistance Program (EAP)
  • Mental Health Support
  • Adoption and Surrogacy Reimbursement
