The IT Operations Manager is responsible for the stability, availability, performance, and security of the company’s cloud environments. This role leads day-to-day IT operations across infrastructure, cloud platforms, end-user services, monitoring, incident management, and vendor relationships. The ideal candidate is a hands-on leader with strong technical depth, proven people management experience, and the ability to translate operational metrics into actionable insights while continuously improving service delivery.
Job Responsibilities
Operational Leadership
- Ensure high availability and performance of the cloud and on-prem environments.
- Establish and enforce operational standards, runbooks, and escalation procedures.
- Drive continuous improvement in reliability, automation, and operational efficiency.
- Manage RouteOne’s managed services provider to ensure that service level agreement (SLA) commitments are met, including those for uptime, resource availability, incident response, change control, and redundancy.
Incident & Problem Management
- Lead incident response for high-severity outages; ensure rapid restoration and clear communication.
- Facilitate root cause analysis (RCA) and drive corrective and preventive actions.
- Oversee change management to reduce risk and unplanned downtime.
Team Leadership & Development
- Manage and mentor IT Operations engineers, database administrators, and on‑call resources.
- Build a culture of accountability, documentation, and knowledge sharing.
- Conduct performance reviews and lead career development planning and skills growth initiatives.
- Coordinate on‑call rotations and workload balancing.
Monitoring, Automation & Tooling
- Own monitoring, alerting, and observability platforms (e.g., CloudWatch, New Relic, OEM, Grafana/Prometheus, LogicMon).
- Partner with Cloud, DevOps, and Security teams to support scalable and secure architectures.
- Ensure proactive detection of performance, capacity, and security issues.
Security, Compliance & Risk
- Partner with Security teams to support vulnerability management, patching, and audit readiness.
- Ensure operational compliance with internal policies and external regulations by maintaining safety, security, and privacy standards throughout all areas of responsibility.
- Ensure backup, disaster recovery, and business continuity plans are tested and maintained.
- Participate in security incidents and post‑incident remediation activities.
Knowledge
- Strong operational experience supporting AWS production environments.
- Strong understanding of Windows and Linux server administration concepts.
- Strong understanding of AWS operational models, shared responsibility, and regional availability concepts.
- High availability and resiliency concepts: Multi-AZ, failover, storage, backups, DR.
- Networking fundamentals: DNS, DHCP, TCP/IP, load balancers, firewalls, VPNs.
- Identity and access management (AD, SSO, MFA).
- Proven ability to lead teams supporting 24×7 cloud operations.
Skills
- Proficient in Microsoft Office products, including but not limited to: Word, PowerPoint, Excel, Outlook, and Visio.
- Excellent verbal and written communication skills.
- Disciplined, detail-oriented, and well organized with a strong background in operational methodology.
- Solid analytical and troubleshooting skills to quickly determine root causes of problems and drive towards solutions.
Abilities
- Ability to foster a collaborative and collegial atmosphere within a dynamic and fast-paced work environment.
- Leading SEV1 / major incidents calmly and decisively.
- Assessing operational risk of changes.
- Understanding blast radius and dependencies.
- Knowing when to stop a change or roll back.
- Identifying systemic risk patterns.
- Preventing repeat incidents, not just fixing symptoms.
- Ability to manage time and multiple priorities.
Other Essential Requirements
- Bachelor's degree in Computer Science, Information Systems, or other related field, or equivalent work experience.
- 8+ years’ experience in management, operations, and leadership.