Qureos

Engineer, Systems Reliability

System Reliability Engineer (SRE)

The System Reliability Engineer (SRE) is responsible for ensuring the availability, performance, scalability, and reliability of Client's customer‑facing digital web platforms. This role partners closely with Digital Web, Platform, and Product teams to support high‑traffic web experiences, proactively prevent incidents, and continuously improve operational excellence.

The SRE applies engineering practices to operations, focusing on automation, monitoring, incident management, and resiliency across modern cloud‑native environments.

Key Responsibilities

Platform Reliability & Operations

  • Ensure 24x7 availability and performance of Client's Digital Web applications
  • Monitor system health using tools such as Grafana, Splunk, and AppDynamics
  • Proactively identify and remediate reliability risks before customer impact
  • Support high‑volume traffic events, releases, and promotions

Incident Management

  • Own the end‑to‑end incident lifecycle: detection, triage, mitigation, and resolution
  • Lead or participate in major incident calls with clear communication and accountability
  • Perform root cause analysis (RCA) and drive corrective and preventive actions
  • Document post‑incident reviews and track follow‑up actions to closure

Automation & Engineering

  • Build and maintain automation to reduce manual operational work
  • Support CI/CD pipelines and safe deployment practices
  • Partner with development teams to improve resiliency, scalability, and fault tolerance
  • Apply SRE principles such as error budgets, SLIs, and SLOs where applicable

Cloud & Container Platforms

  • Support cloud‑native platforms including Kubernetes (TKE), AWS, and PCF‑based services
  • Assist with platform migrations, upgrades, and performance tuning
  • Validate deployments across non‑prod and production environments

Cross‑Functional Collaboration

  • Work closely with Digital Web, API, Platform, and Infrastructure teams
  • Participate in design reviews to ensure operational readiness
  • Provide reliability guidance during feature development and releases

Required Qualifications

Technical Skills

  • Experience supporting web‑scale, customer‑facing digital platforms
  • Strong knowledge of Linux, networking fundamentals, and distributed systems
  • Hands‑on experience with monitoring and alerting tools (Grafana, Splunk, AppDynamics)
  • Experience with Kubernetes and containerized applications
  • Familiarity with CI/CD pipelines and deployment automation

Experience

  • 3+ years of experience in Site Reliability Engineering, DevOps, or Production Support
  • Experience supporting mission‑critical systems with strict SLAs
  • Proven ability to handle high‑severity production incidents

Soft Skills

  • Clear and calm communication during incidents
  • Strong problem‑solving and troubleshooting skills
  • Comfortable working in a fast‑paced, always‑on digital environment
  • Ability to collaborate across engineering, product, and operations teams

Preferred Qualifications

  • Experience supporting telecom or large‑scale consumer digital platforms
  • Exposure to AEM or modern web frameworks
  • Experience with infrastructure as code (Terraform or similar)
  • Prior experience supporting digital platforms

The base salary for this position is $92,250-$128,000 plus incentives that align with individual and company performance. Actual salaries will vary based on work location, qualifications, skills, education, experience, and competencies. Benefits available to eligible employees in this role include medical, dental, and vision insurance; a comprehensive employee assistance program; a 401(k) retirement plan; paid time off and holidays; and paid learning days.

The deadline to apply for this position is 3/16/2026. This position is for an existing, immediate vacancy, and we are seeking an individual who can start as soon as possible.
