About this role
The Software Development Engineer will lead the team's technical strategy and the design, build, and operation of infrastructure services, including provisioning and availability of AWS Trainium-based AI servers. This role requires expertise in architecting large-scale systems, building microservices, and collaborating cross-functionally with several other teams, such as capacity management, hardware engineering, and datacenter teams, to manage AI/ML infrastructure.

Key job responsibilities
- Design and develop innovative technologies that power the infrastructure supporting AI workloads on UltraServers.
- Lead technical projects establishing EC2 as the pioneer in cloud computing for AI/ML workloads across diverse applications, including LLMs, multimodal systems, and emerging model architectures.
- Collaborate with various teams to influence the architecture of provisioning systems and improve them to operate efficiently at scale.
- Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels.

About the team
The EC2 UltraServer Provisioning team is a high-performing engineering organization responsible for delivering AWS Trainium-based UltraServer infrastructure at scale. We manage end-to-end provisioning workflows from host ingestion through testing, repair, and recovery.