Incident Management Lead @ Forward

Austin | Onsite | Full-time | Posted 8 days ago

About this role

Incidents are inevitable. How fast you detect them, how quickly you act, and whether the organization actually learns from them - that is what separates payments companies that scale from ones that spiral.

Forward processes payments for thousands of merchants across dozens of partner platforms. When something breaks - a submission failure blocks merchants mid-onboarding, a processing outage hits a partner's book, a compliance flag freezes accounts at scale - the impact lands on real businesses in real time. The question is not whether incidents will happen. It is whether Forward detects them in minutes or hours, resolves them with coordination or chaos, and fixes the root cause or patches the symptom.

The Incident Management Lead owns that answer.

This is a modern role for a modern problem. You will build an AI-assisted incident intelligence layer that gives Forward signal before issues become incidents, run coordinated response when they do, and drive the post-incident work that makes the organization genuinely more resilient - not just less embarrassed. You will own the closed loop between reactive resolution and proactive prevention: the governance function that ensures Forward gets faster and smarter after every incident rather than repeating the same failures.

This role sits at the intersection of Engineering, GTM, Support, and Operations, and is directly accountable to the CRCO. When things go wrong at Forward, you are the named owner - before, during, and after.

Key Responsibilities

Build the Detection Layer

Design and operate a proactive monitoring and alerting infrastructure: SLO burn-rate alerting, synthetic health checks, deployment risk scoring, and real-time anomaly detection across submission, processing, and compliance pipelines.

Build and maintain AI-assisted signal intelligence: use AIOps platforms (PagerDuty, Incident.io, or equivalent) to correlate alerts, suppress noise, and surface high-confidence incident precursors before they manifest as partner escalations.

Own the governance loop over Support: review support ticket themes and escalation patterns on a weekly cadence to identify systemic issues before they cross into incident territory.

Establish alert-to-noise discipline: define what a true signal looks like for each incident type, tune alerting thresholds, and drive Alert-to-Noise Ratio above 80% - the team acts on signals, not volume.

Build and maintain the runbook library: pre-written, AI-augmented playbooks for the most common incident classes - submission failures, processing outages, ACH return spikes, TM system failures, compliance freezes - so the first 15 minutes of every incident are not spent figuring out who does what.

Run Incident Response

Serve as the named incident owner when an incident is declared - responsible for coordinating Engineering, Support, GTM, and Operations from detection through resolution.

Declare incidents using a consistent severity framework (Sev-1 through Sev-3) with defined, documented SLAs for each tier.

Drive MTTD (Mean Time to Detect) and MTTA (Mean Time to Acknowledge) toward the Sev-1 targets: detection under 5 minutes and acknowledgment under 15 minutes.

Manage communications during incidents on a defined cadence: internal stakeholder updates, partner-facing status, and merchant-level communications where required - proactive, not reactive.

Classify merchant and partner impact in real time: GPV-at-risk, number of affected merchants, partner SLA implications, and any regulatory reporting obligations under DORA or card network rules.

Use AI-assisted investigation tooling to compress diagnosis time: automated root cause hypothesis generation, timeline reconstruction, and runbook suggestion reduce the first 30 minutes of investigation to seconds.

Drive Post-Incident Learning

Own the post-incident review (PIR) for every Sev-1 and Sev-2: complete root cause analysis, contributing factor mapping, timeline reconstruction, and remediation item ownership - delivered within 48 hours.

Track remediation commitments to closure - architectural fixes, tooling gaps, process changes, and partner education. Not just documented: done. Verified. Closed.

Produce partner-facing incident summaries for high-impact events: clear, factual, and accountable. Where DORA or card network reporting obligations apply, own those submissions on deadline.

Build and maintain the incident knowledge base: a searchable, AI-indexed record of every incident, RCA, and remediation action that the full team can learn from and reference.

Track Incident Recurrence Rate as a primary quality signal. A repeated incident is a failed post-incident review.

Quantify Partner and Merchant Impact

Develop and maintain a GPV-at-risk classification framework: when an incident fires, the team knows immediately which partners and merchants are affected, what volume is at risk per hour, and what the SLA clock looks like.

Build per-partner SLA attainment reporting: monthly scorecards showing incident frequency, MTTR, and resolution quality by partner - inputs into GTM conversations and partner health reviews.

Own merchant reachability during incidents: ensure that communication channels, escalation paths, and support routing remain operational when the systems they depend on are not.

Prepare and maintain required regulatory reporting: DORA major incident classification (4-hour reporting threshold), card network operational incident disclosures, and FCA Operational Resilience documentation where applicable.

Build the Tooling and AI Layer

Own and evolve the incident management tooling stack: AIOps platform (PagerDuty AIOps, Incident.io, FireHydrant, or equivalent), incident communication tooling, and integration into Forward's engineering observability layer.

Build AI-assisted incident workflows: automated triage, runbook execution, stakeholder notification, and post-incident report generation - reducing manual coordination overhead measurably within 90 days.

Partner with Engineering to integrate deployment risk scoring and change failure rate monitoring into the release process - so high-risk deployments trigger elevated alerting before they reach production.

Maintain SLO dashboards and error budget tracking for Forward's core merchant-facing surfaces: submission flow, payment processing, bank linking, and TM decisioning.

Required Qualifications

4+ years in incident management, site reliability engineering, technical program management, or engineering operations - with direct ownership of production incident response.

Experience running cross-functional incident response: coordinating Engineering, Support, and business stakeholders under pressure with clear, structured communication.

Hands-on experience with modern incident management platforms: PagerDuty, Incident.io, FireHydrant, Rootly, Blameless, or equivalent AIOps tooling.

Strong analytical mindset: comfortable with dashboards, error logs, SQL-based data pulls, and identifying patterns across support ticket data to find systemic signals.

Strong written communication: produces clear, concise incident summaries, RCAs, and partner communications on a tight timeline.

Experience building incident management infrastructure from scratch - severity frameworks, runbooks, post-mortem templates, SLO definitions.

Familiarity with SLO/SLA frameworks, error budget concepts, and alert fatigue management.

Preferred Qualifications

Payments, fintech, or financial services domain experience: processing failures, compliance flags, card network reporting, and partner escalation dynamics.

Familiarity with DORA regulatory incident reporting obligations or equivalent financial services operational resilience frameworks.

Experience integrating AI tooling into incident workflows: automated RCA, alert correlation, or runbook execution.

SQL proficiency for pulling operational data to support incident diagnosis and post-incident analysis.

Background in platform engineering or SRE at a payments or financial services company.

Experience managing merchant or partner communications during production outages at scale.

What Success Looks Like

Detection and Response

MTTD (Mean Time to Detect): under 5 minutes for Sev-1, under 15 minutes for Sev-2.

MTTA (Mean Time to Acknowledge): under 15 minutes for Sev-1, under 30 minutes for Sev-2.

MTTR (Mean Time to Resolve): under 1 hour for Sev-1, under 3 hours for Sev-2.

Incidents are declared within 15 minutes of first confirmed signal.

Quality and Recurrence

Incident Recurrence Rate below 10%: repeated incidents are a process failure, not a norm.

Change Failure Rate below 5%: incidents caused by deployments trend toward the Elite tier of the DORA (DevOps Research and Assessment) delivery metrics, which are distinct from the DORA regulatory reporting referenced elsewhere in this posting.

Alert-to-Noise Ratio above 80%: every alert the team responds to is a real signal.

Every Sev-1 and Sev-2 has a completed post-incident review within 48 hours.

Remediation items close on schedule.

Partner and Merchant Impact

GPV-at-risk is quantified in real time during every incident.

Per-partner MTTR trends down quarter-over-quarter.

Partner communications go out within 15 minutes of incident declaration.

DORA and card network reporting obligations are met with zero missed deadlines.

Prevention

Support ticket pattern reviews happen weekly.

At least one systemic prevention initiative is in-flight at all times.

The incident log shows a declining trend in repeat incident types quarter-over-quarter.

AI tooling is embedded in at least two core incident workflows within 90 days, with measurable reduction in coordination overhead.

What We Offer

Competitive salary and equity package.

Comprehensive health, dental, and vision benefits.

Flexible work arrangements and generous PTO.

Learning & development budget for conferences, courses, and certifications.

A direct line to building the incident management function at a high-growth payments company from the ground up.

Skills

Operations