MLOps Support Team Lead @ CloudFactory

Nairobi, Kenya | Onsite | Full-time | Posted 1 day ago

About this role

At CloudFactory, we are a mission-driven team passionate about unlocking the potential of AI to transform the world. By combining advanced technology with a global network of talented people, we make unusable data usable, driving real-world impact at scale.

More than just a workplace, we’re a global community founded on strong relationships and the belief that meaningful work transforms lives. Our commitment to earning, learning, and serving fuels everything we do as we strive to connect one million people to meaningful work and build leaders worth following.

Our Culture

At CloudFactory, we believe in building a workplace where everyone feels empowered, valued, and inspired to bring their authentic selves to work. We are:

- Mission-Driven: We focus on creating economic and social impact.
- People-Centric: We care deeply about our team’s growth, well-being, and sense of belonging.
- Innovative: We embrace change and find better ways to do things together.
- Globally Connected: We foster collaboration between diverse cultures and perspectives.

If you’re passionate about innovation, collaboration, and making a real impact, we’d love to have you on board!

Role Summary

As the MLOps Operations Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory’s MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.

You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.

Responsibilities:

Service Ownership & Reliability

- Own the operational performance of all production ML systems and pipelines
- Ensure reliability, availability, and supportability across client and internal MLOps workloads
- Establish and enforce SLAs, SLOs, and operational standards
- Act as the escalation point for major incidents and service degradation

Team Leadership & Delivery

- Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
- Define shift patterns, on-call rotations, and coverage models
- Set clear expectations, performance metrics, and development plans
- Foster a strong operational culture focused on accountability and continuous improvement

Incident Management & RCA

- Own incident response processes, including triage, communication, and resolution
- Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
- Drive reduction in repeat incidents through structured problem management
- Improve time to detect (TTD) and time to resolve (TTR) metrics

Monitoring, Observability & MLOps Maturity

- Drive implementation and evolution of monitoring across:
  - pipelines and data flows
  - infrastructure and compute
  - model performance and drift
- Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
- Partner with Engineering to improve instrumentation, logging, and alerting

Support Model & Process Design

- Define and evolve the MLOps support operating model
- Clearly establish boundaries between Support, Engineering, and external partners
- Build and maintain runbooks, playbooks, and escalation paths
- Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)

Stakeholder & Partner Management

- Act as the primary operational interface for:
  - Engineering teams
  - Platform Operations
  - External partners
- Reduce reliance on individuals by formalizing ownership and knowledge sharing
- Provide clear communication during incidents and service updates

Continuous Improvement & Scaling

- Identify trends in incidents and operational inefficiencies
- Drive improvements in:
  - automation
  - alert quality
  - self-healing capabilities
- Support onboarding of new MLOps projects into a standardized support model
- Contribute to building MLOps as a scalable, repeatable service offering

Reporting & Service Health

- Define and track key operational metrics:
  - incident volume and severity
  - SLA adherence
  - system uptime and reliability
- Support regular service reviews and model health reporting
- Provide leadership visibility into risks, trends, and improvement areas

Requirements

Must Have skills (required)

- Proven experience in operations leadership, SRE, DevOps, or platform support environments
- Strong understanding of production support models, incident management, and escalation frameworks
- Experience leading or mentoring technical support or operations teams
- Working knowledge of ML systems in production, including:
  - pipelines and batch processing
  - model lifecycle and deployment
  - common failure modes
- Strong analytical and troubleshooting skills in complex environments
- Experience with monitoring and observability tools
- Proficiency in:
  - SQL
  - Python or scripting (Bash)
- Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
- Strong stakeholder management and communication skills

Nice To Have skills (Preferred)

- Experience supporting AI/ML platforms at scale
- Familiarity with tools such as:
  - Databricks
  - MLflow
  - Grafana
  - Power BI
  - New Relic
- Exposure to model monitoring (drift, bias, performance validation)
- Experience working with external partners or vendors in delivery models
- Understanding of cloud platforms (AWS, GCP, Azure)
- Experience with containerized environments (Docker / Kubernetes)
- Background in building or scaling support functions from early-stage to maturity

General Requirements

- Strong service ownership mindset: takes accountability for outcomes, not just activity
- Calm, structured, and decisive during incidents
- Ability to balance operational delivery with strategic improvement
- Passion for building reliable, trustworthy AI/ML systems
- Highly collaborative across Engineering, Platform, and Delivery teams
- Focus on reducing risk related to:
  - model performance
  - bias
  - data integrity
- Commitment to documentation, knowledge sharing, and eliminating single points of failure

Benefits

At CloudFactory, we believe that work should be more than just a job; it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!

Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work. We can’t wait to meet you!

Skills

Operations
