This engineer is expected to lead by example through hands-on contributions, deep technical expertise, and cross-team influence, particularly in the area of infrastructure bootstrap orchestration and automation at scale.

Key Responsibilities:

Platform Ownership & Reliability:

Own the end-to-end lifecycle (design, provisioning, upgrades, and decommissioning) of core platform components, including:

Cloud infrastructure primitives
Kubernetes clusters and cluster services
Networking, ingress, and service discovery
Service Mesh and supporting data-plane components

Ensure platform components are resilient by design, applying SRE principles such as:

Fault isolation and graceful degradation
Capacity planning and saturation control
Reduced operational toil and clear failure modes

Continuously assess and mitigate reliability risks, proactively improving platform stability and operational readiness.

Infrastructure Bootstrap & Automation Leadership:

Lead the design and implementation of infrastructure bootstrap orchestration, including:

Automated cluster and environment provisioning
Deterministic, repeatable platform bring-up and teardown
Dependency-aware orchestration across cloud, network, and Kubernetes layers

Drive a strong Infrastructure-as-Code and GitOps-first approach, ensuring:

Platform components are reproducible and auditable
Changes are automated, testable, and reversible
Manual intervention is minimized or eliminated

Identify automation gaps and lead initiatives that significantly reduce human effort, onboarding time, and operational risk.

SRE Practices & Operational Excellence:

Apply and promote SRE practices across the platform, including:

Clear ownership and runbooks for platform components
Participation in on-call rotation as a platform reliability escalation point
Incident response, post-incident reviews, and problem management

Improve platform operability by:

Simplifying day-2 operations
Standardizing upgrade and rollback strategies
Reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR)

Ensure platform operations align with security, compliance, and internal control requirements.

This is a hybrid position. The expected number of days in the office will be confirmed by your hiring manager.

Strong hands-on experience with:

Public Cloud platforms (AWS preferred, Azure)
Kubernetes at scale, including previous experience administering production Kubernetes environments
Service Mesh technologies (e.g., Istio preferred, App Mesh, Linkerd)

Strong understanding of:

Observability tooling and Golden Signals concepts
Incident management concepts and on-call operations
Infrastructure as Code (e.g., Terraform)
Cloud-native containerized microservices architecture

Strong collaboration and communication skills.

Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.

Sr. Site Reliability Engineer @ Visa
