System Reliability Engineer (SRE)
The SRE applies engineering practices to operations, focusing on automation, monitoring, incident management, and resiliency across modern cloud‑native environments.
- Ensure 24x7 availability and performance of the client's Digital Web applications
- Monitor system health using tools such as Grafana, Splunk, and AppDynamics
- Proactively identify and remediate reliability risks before customer impact
- Support high‑volume traffic events, releases, and promotions
- Own the end‑to‑end incident lifecycle: detection, triage, mitigation, and resolution
- Lead or participate in major incident calls with clear communication and accountability
- Perform root cause analysis (RCA) and drive corrective and preventive actions
- Document post‑incident reviews and track follow‑up actions to closure
- Build and maintain automation to reduce manual operational work
- Support CI/CD pipelines and safe deployment practices
- Partner with development teams to improve resiliency, scalability, and fault tolerance
- Apply SRE principles such as error budgets, SLIs, and SLOs where applicable
- Support cloud‑native platforms including Kubernetes (TKE), AWS, and PCF‑based services
- Assist with platform migrations, upgrades, and performance tuning
- Validate deployments across non‑prod and production environments
- Work closely with Digital Web, API, Platform, and Infrastructure teams
- Participate in design reviews to ensure operational readiness
- Provide reliability guidance during feature development and releases
- Experience supporting web‑scale, customer‑facing digital platforms
- Strong knowledge of Linux, networking fundamentals, and distributed systems
- Hands‑on experience with monitoring and alerting tools (Grafana, Splunk, AppDynamics)
- Experience with Kubernetes and containerized applications
- Familiarity with CI/CD pipelines and deployment automation
- 3+ years of experience in Site Reliability Engineering, DevOps, or Production Support
- Experience supporting mission‑critical systems with strict SLAs
- Proven ability to handle high‑severity production incidents
- Clear and calm communication during incidents
- Strong problem‑solving and troubleshooting skills
- Comfortable working in a fast‑paced, always‑on digital environment
- Ability to collaborate across engineering, product, and operations teams
- Experience supporting telecom or large‑scale consumer digital platforms
- Exposure to AEM or modern web frameworks
- Experience with infrastructure as code (Terraform or similar)
- Prior experience supporting digital platforms
The base salary for this position is $92,250-$128,000, plus incentives that align with individual and company performance. Actual salaries will vary based on work location, qualifications, skills, education, experience, and competencies. Benefits available to eligible employees in this role include medical, dental, and vision insurance; a comprehensive employee assistance program; a 401(k) retirement plan; paid time off and holidays; and paid learning days.
The deadline to apply for this position is 3/16/2026. This position is for an existing, immediate vacancy. We are seeking to fill this role with an individual who can start as soon as possible.