The Operations Architect defines and governs the operational model for enterprise platform capabilities delivered by multiple vendors, ensuring solutions are production-ready, observable, secure, and supportable at scale. The role designs end-to-end service management practices (SLOs/SLAs, monitoring, incident/change/problem management, DR, and capacity/cost controls) and ensures operational requirements are embedded from design through delivery.
Working with platform/cloud, security, and solution architects, as well as vendor teams and operations teams, the architect drives operations readiness reviews, creates runbooks and support processes, and enables a consistent, efficient operating model across cloud-agnostic deployments.
Duties & Responsibilities
- Define operational architecture and service management model across capabilities (ITIL-aligned where applicable).
- Establish observability standards: metrics/logs/traces/audits, OpenTelemetry instrumentation, dashboarding, alerting, and anomaly detection.
- Define SLOs/SLAs/OLAs, error budgets, and operational KPIs; ensure vendors deliver evidence and meet acceptance gates.
- Design incident management workflows (triage, escalation, RCA), integrate with ITSM, and standardize runbooks/playbooks.
- Define change and release management practices (CAB inputs, deployment rings, canary/rollback, feature-flag coordination).
- Establish resiliency and DR requirements: backup/restore patterns, RPO/RTO targets, DR testing cadence, and failover runbooks.
- Define capacity, performance, and availability engineering processes (load testing, scaling policies, GPU/TPU capacity planning).
- Implement security operations integration: SIEM/SOAR alignment, alert routing, vulnerability/patch management SLAs.
- Define FinOps operational controls: tagging standards, showback/chargeback, budgets, anomaly detection, cost optimization playbooks.
- Lead operational readiness and handover: L1/L2/L3 training, reverse-shadowing, SOPs, and post-go-live stabilization plans.
Skills & Abilities
- Strong expertise in operating cloud-native platforms: SRE/ITIL practices, reliability engineering, and service management.
- Ability to turn NFRs into measurable SLOs, monitoring, and operational acceptance criteria.
- Solid understanding of observability stacks and telemetry design (OTel, APM, SIEM integration).
- Experience designing DR/BCP, backup strategies, and operational test plans in regulated environments.
- Proven capability to drive operational standardization across multiple vendors and teams.
Education & Background
- Bachelor’s degree in Computer Science, Information Technology, Cybersecurity, or related field; Master’s degree highly preferred.
- 8+ years in operations architecture, SRE, DevOps leadership, or service management for enterprise platforms.
- Experience running production systems on Azure plus exposure to at least one other cloud (GCP/AWS) and hybrid setups.
- Experience with ITSM tooling and processes (incident/change/problem, CMDB), including KPI/SLA reporting.
- Proven experience with monitoring/APM and security operations integration (SIEM, vulnerability management).
- Certifications desirable: ITIL, SRE-related training, Azure/AWS/GCP ops certs, Kubernetes CKA/CKS (optional).
Preferred Tools / Soft Skills
Preferred Tools
- Observability/APM: OpenTelemetry, Dynatrace/Datadog, Prometheus/Grafana/Loki/Tempo (as applicable)
- ITSM & operations: ServiceNow (or equivalent), CMDB, PagerDuty/Opsgenie-style on-call tooling
- Security & cloud ops: Microsoft Sentinel, Defender for Cloud, Azure Monitor/Log Analytics, Kubernetes tooling
Soft Skills
- Calm, structured leadership during incidents and high-pressure escalations
- Strong facilitation skills for readiness reviews, RCAs, and cross-vendor alignment
- Clear documentation and operational discipline (runbooks, SOPs, checklists)
- Continuous improvement mindset and ability to drive measurable reliability gains
- Strong collaboration and influencing skills across engineering, security, and vendor teams