Lead, NOC & Incident Management – Data Center Operations
Location: Austin, TX
Employment Type: Full-Time
Salary: $200,000 – $300,000 base + equity
Benefits: Full Benefits Package
Our client is a fast-growing AI infrastructure company building next-generation data centers and cloud-scale systems. They are seeking a Lead, NOC & Incident Management to establish and lead a cross-functional operations center (NOC) and incident management function, ensuring reliable monitoring and response across the company’s infrastructure portfolio, including datacenter facilities, network backbone, and platform services.
This is a hands-on operational leadership role combining strategic process development with technical credibility. The successful candidate will build 24/7 monitoring and triage capabilities, operationalize incident management frameworks, and drive a culture of proactive, consistent operational excellence.
Key Responsibilities
NOC Build & Operations
- Stand up a cross-functional operations center from scratch, including staffing models, handoff processes, KPIs, and quality standards
- Select and onboard MSP partners for Tier 1 coverage
- Ensure qualified monitoring coverage 24/7 for all critical alerts
Incident Management Execution
- Create, deploy, and operationalize structured incident management frameworks
- Manage on-call rotations, run incident bridges for SEV0/SEV1 events, and lead post-incident reviews
- Partner with internal teams to continuously refine incident response processes
Operational Readiness
- Maintain runbook quality assurance and run tabletop exercises for new infrastructure domains
- Onboard new domains (Facilities, Network, Systems) into NOC coverage aligned with datacenter launches
Cross-Functional Orchestration
- Build operational partnerships across Network Ops, DC Ops, Systems/Platform, and Security teams
- Define clear Tier 1 → Tier 2 escalation criteria and ensure the NOC acts as a force multiplier for engineering teams
Vendor & Carrier Ticket Management
- Establish processes for full lifecycle management of carrier and vendor tickets
- Track and enforce SLAs, escalate as needed, and maintain documentation for all vendor interactions
Metrics & Continuous Improvement
- Define and track operational metrics (MTTA, MTTR, escalation rate, false positives, runbook coverage)
- Produce operational reports and use data to reduce alert noise, improve runbooks, and shorten incident response times
Qualifications
- 5+ years in network operations, infrastructure operations, or site reliability roles, including NOC leadership experience
- Deep experience with structured incident response: severity classification, escalations, incident bridges, post-incident reviews, and RCA workflows
- Technical breadth across infrastructure domains: network, facilities, and platform services
- Proven ability to build operational processes, runbooks, and training programs from scratch
- Strong ability to influence across teams without direct authority
- Customer SLA mindset with a focus on reliable 24/7 operations
- Comfortable operating in a fast-paced, high-growth environment
Preferred Experience
- Experience at hyperscale or large-scale infrastructure companies or telcos
- Hands-on experience with incident management tools (incident.io, PagerDuty, Opsgenie, ServiceNow)
- MSP/vendor selection, onboarding, and management experience
- Familiarity with datacenter facilities operations, BMS/SCADA alerts, and carrier/ISP processes
- Startup experience in high-growth environments
Compensation & Benefits
- Base salary of $200,000 – $300,000
- Equity participation from day one
- Health, dental, and vision insurance
- Retirement plan aligned with U.S. norms
- Generous PTO policy