Software Engineer - AI/ML, AWS Neuron Distributed Training - Performance Optimization
Job Description
This is a Software Engineer role on the AWS Neuron Distributed Training team, focused on performance optimization for distributed training on Trainium. The role works across PyTorch, JAX, and the Neuron compiler and runtime to improve throughput and efficiency for large-scale models.
Location: Cupertino, CA (onsite). Salary range: USD 143,700 to 194,400 per year. Minimum experience: 3+ years. Education: Bachelor's degree in computer science or equivalent.
Responsibilities
- Lead efforts to optimize distributed training performance on Trainium, with a primary focus on maximizing training throughput, model FLOPs utilization, and overall efficiency across the Neuron software stack.
- Collaborate across PyTorch, JAX, and the Neuron compiler and runtime to enable and tune large-scale training workloads on the latest Trainium instances.
Requirements
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture experience (design patterns, reliability, and scaling) of new and existing systems
- Experience programming in at least one programming language
- 3+ years of full software development life cycle experience, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent
Technologies
- PyTorch
- JAX
- AWS Neuron
- Neuron compiler
- Neuron runtime
- Trainium
- Trn3/Trn2/Trn1
- Inf2/Inf1 servers
Benefits
- Health insurance
- 401(k) matching
- Paid time off
- Parental leave
- Sign-on payments
- Restricted stock units (RSUs)
- Basic Life & AD&D insurance
- Supplemental life plans
- Employee Assistance Program (EAP)
- Mental health support
- Medical advice line
- Flexible Spending Accounts
- Adoption and surrogacy reimbursement coverage