Site Reliability Engineering Manager

Lead, NOC & Incident Management – Data Center Operations

Location: Austin, TX

Employment Type: Full-Time

Salary: $200,000 – $300,000 base + equity

Benefits: Full Benefits Package

Our client is a fast-growing AI infrastructure company building next-generation data centers and cloud-scale systems. They are seeking a Lead, NOC & Incident Management to establish and lead a cross-functional network operations center (NOC) and incident management function. The role ensures reliable monitoring and response across the company’s infrastructure portfolio, including datacenter facilities, network backbone, and platform services.

This is a hands-on operational leadership role combining strategic process development with technical credibility. The successful candidate will build 24/7 monitoring and triage capabilities, operationalize incident management frameworks, and drive a culture of proactive, consistent operational excellence.

Key Responsibilities

NOC Build & Operations

Stand up a cross-functional operations center from scratch, including staffing models, handoff processes, KPIs, and quality standards
Select and onboard MSP partners for Tier 1 coverage
Ensure qualified 24/7 monitoring coverage for all critical alerts

Incident Management Execution

Create, deploy, and operationalize structured incident management frameworks
Manage on-call rotations, run incident bridges for SEV0/SEV1 events, and lead post-incident reviews
Partner with internal teams to continuously refine incident response processes

Operational Readiness

Maintain runbook quality assurance and run tabletop exercises for new infrastructure domains
Onboard new domains (Facilities, Network, Systems) into NOC coverage aligned with datacenter launches

Cross-Functional Orchestration

Build operational partnerships across Network Ops, DC Ops, Systems/Platform, and Security teams
Define clear Tier 1 → Tier 2 escalation criteria and ensure the NOC acts as a force multiplier for engineering teams

Vendor & Carrier Ticket Management

Establish processes for full lifecycle management of carrier and vendor tickets
Track and enforce SLAs, escalate as needed, and maintain documentation for all vendor interactions

Metrics & Continuous Improvement

Define and track operational metrics (MTTA, MTTR, escalation rate, false positives, runbook coverage)
Produce operational reports and use data to reduce alert noise, improve runbooks, and shorten incident response times

Qualifications

5+ years in network operations, infrastructure operations, or site reliability roles, including NOC leadership experience
Deep experience with structured incident response: severity classification, escalations, incident bridges, post-incident reviews, and RCA workflows
Technical breadth across infrastructure domains: network, facilities, and platform services
Proven ability to build operational processes, runbooks, and training programs from scratch
Strong cross-team influence without direct authority
Customer SLA mindset with focus on reliable 24/7 operations
Comfortable operating in a fast-paced, high-growth environment

Preferred Experience

Experience at hyperscale or large-scale infrastructure companies or telcos
Hands-on experience with incident management tools (incident.io, PagerDuty, Opsgenie, ServiceNow)
MSP/vendor selection, onboarding, and management experience
Familiarity with datacenter facilities operations, BMS/SCADA alerts, and carrier/ISP processes
Startup experience in high-growth environments

Compensation & Benefits