Software Engineer I - AI/ML, AWS Neuron Distributed Training
Job Description
Annapurna Labs (U.S.) Inc. is seeking a Software Engineer I with a focus on AI/ML distributed training. The role centers on optimizing large-scale models for AWS Trainium, implementing mixed-precision strategies, and extending training frameworks within the Neuron ecosystem, in collaboration with hardware and AWS teams. This is an onsite position based in Cupertino, CA, offering a compensation range of USD 127,100 to 185,000 per year.
Responsibilities
- Contribute to designing and implementing distributed training solutions for large-scale ML models running on Trainium instances.
- Extend and optimize popular distributed training frameworks within the Neuron ecosystem, including FSDP, torchtitan, and Hugging Face libraries.
- Develop and optimize mixed-precision and low-precision training techniques (BF16, FP8, and emerging formats) to boost throughput while preserving accuracy and convergence quality.
- Implement precision-aware training strategies, loss scaling, and careful gradient management to ensure stability across reduced-precision formats.
- Profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware.
- Collaborate with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities.
- Partner with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.
Requirements
- Bachelor's degree or higher in computer science, computer engineering, or a related field (or an equivalent degree).
- 1+ years of experience in at least one programming language, which may include academic projects, internships, or research.
- Experience with software development practices such as code reviews, source control, testing, and build processes.
- Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow).
Technologies
- FSDP
- torchtitan
- Hugging Face libraries
- PyTorch
- JAX
- TensorFlow
- Trainium
- AWS Neuron
- BF16
- FP8
Benefits
- Health insurance
- 401(k) matching
- Paid time off
- Parental leave
- Sign-on payments
- Restricted stock units (RSUs)
- Flexible Spending Accounts
- Employee Assistance Program (EAP)
- Mental Health Support
- Adoption and Surrogacy Reimbursement