Qureos

Site Reliability Engineering Manager

Lead, NOC & Incident Management – Data Center Operations

Location: Austin, TX

Employment Type: Full-Time

Salary: $200,000 – $300,000 base + equity

Benefits: Full Benefits Package


Our client is a fast-growing AI infrastructure company building next-generation data centers and cloud-scale systems. They are seeking a Lead, NOC & Incident Management to establish and lead a cross-functional network operations center (NOC) and incident management function. The role ensures reliable monitoring and response across the company’s infrastructure portfolio, including datacenter facilities, network backbone, and platform services.


This is a hands-on operational leadership role combining strategic process development with technical credibility. The successful candidate will build 24/7 monitoring and triage capabilities, operationalize incident management frameworks, and drive a culture of proactive, consistent operational excellence.


Key Responsibilities


NOC Build & Operations

  • Stand up a cross-functional operations center from scratch, including staffing models, handoff processes, KPIs, and quality standards
  • Select and onboard MSP partners for Tier 1 coverage
  • Ensure 24/7 monitoring coverage by qualified staff for all critical alerts

Incident Management Execution

  • Create, deploy, and operationalize structured incident management frameworks
  • Manage on-call rotations, run incident bridges for SEV0/SEV1 events, and lead post-incident reviews
  • Partner with internal teams to continuously refine incident response processes

Operational Readiness

  • Maintain runbook quality assurance and run tabletop exercises for new infrastructure domains
  • Onboard new domains (Facilities, Network, Systems) into NOC coverage aligned with datacenter launches

Cross-Functional Orchestration

  • Build operational partnerships across Network Ops, DC Ops, Systems/Platform, and Security teams
  • Define clear Tier 1 → Tier 2 escalation criteria and ensure the NOC acts as a force multiplier for engineering teams

Vendor & Carrier Ticket Management

  • Establish processes for full lifecycle management of carrier and vendor tickets
  • Track tickets, enforce SLAs, escalate as needed, and maintain documentation for all vendor interactions

Metrics & Continuous Improvement

  • Define and track operational metrics (MTTA, MTTR, escalation rate, false positives, runbook coverage)
  • Produce operational reports and use data to reduce alert noise, improve runbooks, and shorten incident response times


Qualifications

  • 5+ years in network operations, infrastructure operations, or site reliability roles, including NOC leadership experience
  • Deep experience with structured incident response: severity classification, escalations, incident bridges, post-incident reviews, and RCA workflows
  • Technical breadth across infrastructure domains: network, facilities, and platform services
  • Proven ability to build operational processes, runbooks, and training programs from scratch
  • Strong ability to influence across teams without direct authority
  • Customer SLA mindset with focus on reliable 24/7 operations
  • Comfortable operating in a fast-paced, high-growth environment


Preferred Experience

  • Experience at hyperscale or large-scale infrastructure companies or telcos
  • Hands-on experience with incident management tools (incident.io, PagerDuty, Opsgenie, ServiceNow)
  • MSP/vendor selection, onboarding, and management experience
  • Familiarity with datacenter facilities operations, BMS/SCADA alerts, and carrier/ISP processes
  • Startup experience in high-growth environments


Compensation & Benefits

  • Base salary of $200,000 – $300,000
  • Equity participation from day one
  • Health, dental, and vision insurance
  • Retirement plan aligned with U.S. norms
  • Generous PTO policy

© 2026 Qureos. All rights reserved.