About this role
<div><div style="padding:10.0px 0.0px;border:1.0px solid transparent"><div style="font-size:16.0px;word-wrap:break-word"><H2 style="font-size:1.0em;margin:0.0px">Job Summary</H2> </div><div><p>We're looking for a Site Reliability Engineer focused on building and operating the data and AI/ML infrastructure platform that powers NetApp's cloud-native data services. You'll work at the intersection of software engineering and infrastructure operations: designing systems for reliability, driving automation, and ensuring our platforms meet the highest availability standards for customers worldwide.</p> <p>This is an infrastructure-focused SRE role. You'll own the reliability of large-scale Kubernetes clusters (including GPU workloads), streaming data pipelines (Kafka), and analytical compute infrastructure (Spark, Dremio) across hybrid-cloud and multi-cloud environments.</p></div></div><div style="padding:10.0px 0.0px;border:1.0px solid transparent"><div style="font-size:16.0px;word-wrap:break-word"><H2 style="font-size:1.0em;margin:0.0px">Job Requirements</H2> </div><div><ul> <li>5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering roles</li> <li>Extensive experience with Linux (RHEL/CentOS), including shells, filesystems, kernel tuning, networking, and performance optimization</li> <li>Deep expertise with Kubernetes at scale, including cluster administration, troubleshooting, networking, storage, RBAC, and lifecycle management (on-premises and Rancher Kubernetes)</li> <li>Hands-on experience operating GPU workloads on Kubernetes, including NVIDIA GPU Operator, device plugins, scheduling, and resource management</li> <li>Strong experience managing Confluent Kafka in production, including operations, monitoring, performance tuning, and disaster recovery</li> <li>Experience operating Apache Spark and/or Dremio, including cluster management, job scheduling, scaling, and performance optimization</li> <li>Proficiency in Infrastructure as Code using 
Terraform, Helm, and GitOps workflows with ArgoCD/FluxCD</li> <li>Proficiency in scripting and automation using Shell, Ansible, and Python, with a strong automation-first mindset</li> <li>Experience with scheduling and orchestration tools such as cron jobs and Apache Airflow</li> <li>Deep familiarity with monitoring and observability tools, including Dynatrace, Grafana, and Prometheus</li> <li>Solid understanding of SQL and NoSQL databases, including operations, backup, and monitoring</li> <li>Experience designing and maintaining CI/CD pipelines and release processes</li> <li>Expertise in AWS cloud platforms and hybrid-cloud integration</li> <li>Strong systems thinking, with an understanding of how infrastructure design choices impact failure modes, scalability, and recovery</li> <li>Strong incident management skills and post-mortem facilitation experience</li> <li>Excellent written communication skills for design documents, runbooks, post-mortems, and operational documentation</li> </ul> <div> <p><strong>Nice to Have</strong></p> <ul> <li>Knowledge of Generative AI tools and frameworks, including the application of AI-based predictive analytics and automation in infrastructure operations</li> <li>Familiarity with ML platforms such as Kubeflow, MLflow, and Ray, as well as AI/ML training infrastructure</li> <li>Experience with Kafka Streams, ksqlDB, or Apache Flink</li> </ul> </div></div></div><div style="padding:10.0px 0.0px;border:1.0px solid transparent"><div style="font-size:16.0px;word-wrap:break-word"><H2 style="font-size:1.0em;margin:0.0px">Education</H2> </div><div><ul type="disc"> <li style="font-family:arial, helvetica, sans-serif">5-8 years of relevant experience.</li> <li style="font-family:arial, helvetica, sans-serif">Bachelor of Science Degree in Computer Science, Electrical Engineering, or a related field; a Master’s Degree is preferred. </li> </ul></div></div></div>