
Engineer, Systems Reliability

System Reliability Engineer (SRE)

The System Reliability Engineer (SRE) is responsible for ensuring the availability, performance, scalability, and reliability of Client's customer‑facing digital web platforms. This role partners closely with Digital Web, Platform, and Product teams to support high‑traffic web experiences, proactively prevent incidents, and continuously improve operational excellence.

The SRE applies engineering practices to operations, focusing on automation, monitoring, incident management, and resiliency across modern cloud‑native environments.

Key Responsibilities

Platform Reliability & Operations

Ensure 24x7 availability and performance of Client's Digital Web applications
Monitor system health using tools such as Grafana, Splunk, and AppDynamics
Proactively identify and remediate reliability risks before customer impact
Support high‑volume traffic events, releases, and promotions

Incident Management

Own the end‑to‑end incident lifecycle: detection, triage, mitigation, and resolution
Lead or participate in major incident calls with clear communication and accountability
Perform root cause analysis (RCA) and drive corrective and preventive actions
Document post‑incident reviews and track follow‑up actions to closure

Automation & Engineering

Build and maintain automation to reduce manual operational work
Support CI/CD pipelines and safe deployment practices
Partner with development teams to improve resiliency, scalability, and fault tolerance
Apply SRE principles such as error budgets, SLIs, and SLOs where applicable

Cloud & Container Platforms

Support cloud‑native platforms including Kubernetes (TKE), AWS, and PCF‑based services
Assist with platform migrations, upgrades, and performance tuning
Validate deployments across non‑prod and production environments

Cross‑Functional Collaboration

Work closely with Digital Web, API, Platform, and Infrastructure teams
Participate in design reviews to ensure operational readiness
Provide reliability guidance during feature development and releases

Required Qualifications

Technical Skills

Experience supporting web‑scale, customer‑facing digital platforms
Strong knowledge of Linux, networking fundamentals, and distributed systems
Hands‑on experience with monitoring and alerting tools (Grafana, Splunk, AppDynamics)
Experience with Kubernetes and containerized applications
Familiarity with CI/CD pipelines and deployment automation

Experience

3+ years of experience in Site Reliability Engineering, DevOps, or Production Support
Experience supporting mission‑critical systems with strict SLAs
Proven ability to handle high‑severity production incidents

Soft Skills

Clear and calm communication during incidents
Strong problem‑solving and troubleshooting skills
Comfortable working in a fast‑paced, always‑on digital environment
Ability to collaborate across engineering, product, and operations teams

Preferred Qualifications

Experience supporting telecom or large‑scale consumer digital platforms
Exposure to AEM or modern web frameworks
Experience with infrastructure as code (Terraform or similar)
Prior experience supporting digital platforms

The base salary for this position is $92,250-$128,000, plus incentives that align with individual and company performance. Actual salaries will vary based on work location, qualifications, skills, education, experience, and competencies. Benefits available to eligible employees in this role include medical, dental, and vision insurance, a comprehensive employee assistance program, a 401(k) retirement plan, paid time off and holidays, and paid learning days.

The deadline to apply for this position is 3/16/2026. This position is for an existing, immediate vacancy, and we are seeking an individual who can start as soon as possible.
