Qureos

Site Reliability Engineer (SRE) — Job Description

Summary

Design, build, and operate scalable, reliable, and observable systems by applying software engineering practices to infrastructure and operations; reduce toil, improve system reliability, and enable rapid, safe deployments.

Key responsibilities

  • Reliability engineering: Define and maintain service-level objectives (SLOs), service-level indicators (SLIs), and error budgets; drive reliability improvements.
  • Incident management: Lead incident response, perform on-call duties, coordinate root-cause analysis (RCA), and implement post-incident fixes to prevent recurrence.
  • Automation & tooling: Automate operational tasks (provisioning, deployments, scaling, failover), build runbooks, and develop internal tools to eliminate manual toil.
  • Monitoring & observability: Implement metrics, logging, tracing, and alerting; ensure effective dashboards and alerts with actionable thresholds.
  • Capacity & performance: Plan capacity, conduct load/stress testing, tune performance, and forecast resource needs to meet SLOs.
  • CI/CD & release engineering: Build and maintain CI/CD pipelines, deployment strategies (blue/green, canary), and rollback procedures to enable safe releases.
  • Infrastructure as Code (IaC): Define and manage infrastructure using IaC tools (Terraform, CloudFormation, Pulumi) and configuration management (Ansible, Chef).
  • Cloud & platform engineering: Design and operate cloud-native infrastructure (AWS, GCP, Azure) and platform services (Kubernetes, service meshes); optimize costs and resilience.
  • Security & compliance: Collaborate with security teams to implement secure configurations, secrets management, and compliance controls in production systems.
  • On-call & runbook maintenance: Maintain runbooks, run periodic game days/chaos testing, and participate in on-call rotation to ensure rapid incident resolution.
  • Collaboration & mentorship: Work with developers, QA, and product teams to embed reliability practices, mentor engineers, and improve deployment and observability culture.

Qualifications

  • Education: Bachelor’s degree in Computer Science, Engineering, or a related field; equivalent experience or relevant certifications also considered.
  • Experience: 3–7+ years in SRE, DevOps, systems engineering, or related roles, depending on seniority.
  • Technical skills: Strong Linux/Unix administration, networking fundamentals, scripting (Python, Go, Bash), and familiarity with containers and orchestration (Docker, Kubernetes).
  • Tools & platforms: Experience with cloud providers (AWS/GCP/Azure), IaC (Terraform/CloudFormation), CI/CD (Jenkins/GitHub Actions/GitLab CI), monitoring/observability (Prometheus, Grafana, ELK, Jaeger), and incident management tools (PagerDuty).
  • Soft skills: Strong troubleshooting, communication, and collaboration skills; ability to perform under incident pressure.

Competencies & attributes

  • Systems-thinking and automation-first mindset.
  • Strong analytical and debugging abilities.
  • Proactive, ownership-driven, and customer-focused.
  • Ability to balance reliability, velocity, and cost.
  • Continuous learner with an interest in tooling and process improvement.

Pay: QAR 121.16 – QAR 324.88 per hour

Work Location: In person

© 2026 Qureos. All rights reserved.