Zeta Corp is looking for a Site Reliability & Observability Engineer for one of our US clients in the eLearning industry.
- Design, implement, and manage the enterprise-wide observability stack (APM, metrics, logs, and traces) across Azure and containerized workloads.
- Deploy and maintain monitoring tools to ensure full-stack visibility.
- Build standardized dashboards, alerts, and KPIs for key services and business applications.
- Develop and maintain automation for telemetry data collection, alert configuration, and dashboard provisioning.
- Ensure coverage for application, infrastructure, and end-user experience monitoring across all environments.
Reliability Engineering
- Define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Error Budgets in partnership with DevOps and Development teams.
- Implement automated incident detection, alerting, and response playbooks to reduce MTTR.
- Analyze recurring incidents and drive permanent fixes and reliability improvements.
- Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages.
Performance & Cost Optimization
- Establish performance baselines and track resource utilization across cloud and container infrastructure.
- Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations.
- Monitor and optimize the costs of metrics ingestion, Azure Log Analytics, and storage to balance visibility with efficiency.
Incident Management & Postmortems
- Serve as a key responder during major incidents, providing data-driven insights and remediation coordination.
- Lead root cause analysis (RCA) and ensure postmortem action items are implemented.
- Build dashboards and analytics to identify leading indicators of failure and performance degradation.
- Improve operational playbooks to accelerate detection and recovery.
Automation & Continuous Improvement
- Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring.
- Continuously evaluate emerging observability tools and practices for adoption.
- Advocate for reliability and monitoring best practices across engineering teams.
Required Skills
- 5+ years of experience in Site Reliability, Observability, or DevOps Engineering roles.
- Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK, or equivalent.
- Deep understanding of metrics, tracing, and logging concepts and their correlation across distributed systems.
- Experience implementing synthetic monitoring and Real User Monitoring (RUM) for frontend performance.
- Experience defining and managing SLOs, SLIs, and Error Budgets.
- Solid grasp of Azure infrastructure, Kubernetes (AKS), and container monitoring.
- Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows.
- Excellent analytical and communication skills; able to translate complex data into actionable insights.
Job Type: Full-time
Application Question(s):
- How many years of experience do you have in Site Reliability, Observability, or DevOps Engineering roles?
- Which observability tools do you have experience with?
- Do you have experience with Azure infrastructure, Kubernetes (AKS), and container monitoring?
- What is your expected salary?
Work Location: Remote