Job Title: Site Reliability & Observability Engineer

Hires in: Not specified
Employment Type: Not specified
Company Location: Not specified
Salary: Not specified

Zeta Corp is looking for a Site Reliability & Observability Engineer for one of our USA clients in the eLearning industry.

Design, implement, and manage the enterprise-wide observability stack (APM, metrics, logs, and traces) across Azure and containerized workloads.

Deploy and maintain monitoring tools to ensure full-stack visibility.
Build standardized dashboards, alerts, and KPIs for key services and business applications.
Develop and maintain automation for telemetry data collection, alert configuration, and dashboard provisioning.
Ensure coverage for application, infrastructure, and end-user experience monitoring across all environments.

Reliability Engineering

Define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Error Budgets in partnership with DevOps and Development teams.
Implement automated incident detection, alerting, and response playbooks to reduce MTTR.
Analyze recurring incidents and drive permanent fixes and reliability improvements.
Support the transition toward zero-downtime deployments by validating performance and stability during rollout stages.

Performance & Cost Optimization

Establish performance baselines and track resource utilization across cloud and container infrastructure.
Work with DevOps and Development teams to identify performance bottlenecks and recommend optimizations.
Monitor and optimize the costs of metrics ingestion, Azure Log Analytics, and storage to balance visibility with efficiency.

Incident Management & Postmortems

Serve as a key responder during major incidents, providing data-driven insights and remediation coordination.
Lead root cause analysis (RCA) and ensure postmortem action items are implemented.
Build dashboards and analytics to identify leading indicators of failure and performance degradation.
Improve operational playbooks to accelerate detection and recovery.

Automation & Continuous Improvement

Contribute to CI/CD pipeline integrations for instrumentation validation and canary monitoring.
Continuously evaluate emerging observability tools and practices for adoption.
Advocate for reliability and monitoring best practices across engineering teams.

Required Skills

5+ years of experience in Site Reliability, Observability, or DevOps Engineering roles.
Strong hands-on experience with observability tools such as Datadog, New Relic, Grafana, ELK/EFK, or equivalent.
Deep understanding of metrics, tracing, and logging concepts and their correlation across distributed systems.
Experience implementing synthetic monitoring and real user monitoring (RUM) for frontend performance.
Experience defining and managing SLOs, SLIs, and Error Budgets.
Solid grasp of Azure infrastructure, Kubernetes (AKS), and container monitoring.
Familiarity with CI/CD pipelines and integrating monitoring into deployment workflows.
Excellent analytical and communication skills; able to translate complex data into actionable insights.

Job Type: Full-time

Work Location: Remote
