Qureos

Job Title: Site Reliability & Observability Engineer

Job Requirements

Hires in: Not specified
Employment Type: Not specified
Company Location: Not specified
Salary: Not specified

Zeta Corp is looking for a Site Reliability & Observability Engineer for one of our US clients in the eLearning industry.

  • Design, implement, and manage the enterprise-wide observability stack (APM, metrics, logs, and traces) across Azure and containerized workloads.
  • Deploy and maintain monitoring tools to ensure full-stack visibility.
  • Build standardized dashboards, alerts, and KPIs for key services and business applications.
  • Develop and maintain automation for telemetry data collection, alert configuration, and dashboard provisioning.
  • Ensure coverage for application, infrastructure, and end-user experience monitoring across all environments.

Reliability Engineering

  • Define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Error Budgets in partnership with DevOps and Development teams.
  • Implement automated incident detection, alerting, and response playbooks to reduce MTTR.
  • Analyze recurring incidents and drive permanent fixes and reliability improvements.
  • Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages.

Performance & Cost Optimization

  • Establish performance baselines and track resource utilization across cloud and container infrastructure.
  • Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations.
  • Monitor and optimize metrics ingestion, Azure Log Analytics usage, and storage costs to balance visibility with cost efficiency.

Incident Management & Postmortems

  • Serve as a key responder during major incidents, providing data-driven insights and remediation coordination.
  • Lead root cause analysis (RCA) and ensure postmortem action items are implemented.
  • Build dashboards and analytics to identify leading indicators of failure and performance degradation.
  • Improve operational playbooks to accelerate detection and recovery.

Automation & Continuous Improvement

  • Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring.
  • Continuously evaluate emerging observability tools and practices for adoption.
  • Advocate for reliability and monitoring best practices across engineering teams.

Required Skills

  • 5+ years of experience in Site Reliability, Observability, or DevOps Engineering roles.
  • Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK, or equivalent.
  • Deep understanding of metrics, tracing, and logging concepts and their correlation across distributed systems.
  • Experience implementing synthetic monitoring and real user monitoring (RUM) for frontend performance.
  • Experience defining and managing SLOs, SLIs, and Error Budgets.
  • Solid grasp of Azure infrastructure, Kubernetes (AKS), and container monitoring.
  • Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows.
  • Excellent analytical and communication skills; able to translate complex data into actionable insights.

Job Type: Full-time

Application Question(s):

  • What is your experience in Site Reliability, Observability, or DevOps Engineering roles?
  • Which observability tools do you have experience with?
  • Do you have experience with Azure infrastructure, Kubernetes (AKS), and container monitoring?
  • What is your expected salary?

Work Location: Remote