Site Reliability Engineer (SRE)

Summary

Design, build, and operate scalable, reliable, and observable systems by applying software engineering practices to infrastructure and operations; reduce toil, improve system reliability, and enable rapid, safe deployments.

Key responsibilities

Reliability engineering: Define and maintain service-level objectives (SLOs), service-level indicators (SLIs), and error budgets; drive reliability improvements.
Incident management: Lead incident response, perform on-call duties, coordinate root-cause analysis (RCA), and implement post-incident fixes to prevent recurrence.
Automation & tooling: Automate operational tasks (provisioning, deployments, scaling, failover), build runbooks, and develop internal tools to eliminate manual toil.
Monitoring & observability: Implement metrics, logging, tracing, and alerting; ensure effective dashboards and alerts with actionable thresholds.
Capacity & performance: Plan capacity, conduct load/stress testing, tune performance, and forecast resource needs to meet SLOs.
CI/CD & release engineering: Build and maintain CI/CD pipelines, deployment strategies (blue/green, canary), and rollback procedures to enable safe releases.
Infrastructure as Code (IaC): Define and manage infrastructure using IaC tools (Terraform, CloudFormation, Pulumi) and configuration management (Ansible, Chef).
Cloud & platform engineering: Design and operate cloud-native infrastructure (AWS, GCP, Azure) and platform services (Kubernetes, service meshes); optimize costs and resilience.
Security & compliance: Collaborate with security teams to implement secure configurations, secrets management, and compliance controls in production systems.
On-call & runbook maintenance: Maintain runbooks, run periodic game days/chaos testing, and participate in on-call rotation to ensure rapid incident resolution.
Collaboration & mentorship: Work with developers, QA, and product teams to embed reliability practices, mentor engineers, and improve deployment and observability culture.

Qualifications

Education: Bachelor’s degree in Computer Science or Engineering, or equivalent experience or relevant certifications.
Experience: 3–7+ years in SRE, DevOps, systems engineering, or related roles, depending on seniority.
Technical skills: Strong Linux/Unix administration, networking fundamentals, scripting (Python, Go, Bash), and familiarity with containers and orchestration (Docker, Kubernetes).
Tools & platforms: Experience with cloud providers (AWS/GCP/Azure), IaC (Terraform/CloudFormation), CI/CD (Jenkins/GitHub Actions/GitLab CI), monitoring/observability (Prometheus, Grafana, ELK, Jaeger), and incident management tools (PagerDuty).
Soft skills: Strong troubleshooting, communication, and collaboration skills; ability to perform under incident pressure.

Competencies & attributes

Systems-thinking and automation-first mindset.
Strong analytical and debugging abilities.
Proactive, ownership-driven, and customer-focused.
Ability to balance reliability, velocity, and cost.
Continuous learner with an interest in tooling and process improvement.

Pay: QAR121.16 - QAR324.88 per hour

Work Location: In person
