Qureos

Site Reliability Engineer (SRE)

Summary

  • Build and run highly available, scalable, and observable services by applying software engineering practices to operations and infrastructure.

Core responsibilities

  • Reliability engineering: Design, implement, and maintain systems to meet defined SLAs/SLOs/SLIs; reduce toil through automation and engineering solutions.
  • Monitoring & observability: Build and maintain metrics, logging, tracing, and alerting (Prometheus, Grafana, ELK, Jaeger); define meaningful alerts and reduce alert fatigue.
  • Incident management: Lead incident response, conduct post‑mortems and root‑cause analysis, and drive remediation to prevent recurrence.
  • Automation & tooling: Automate deployments, provisioning, scaling, recovery, and routine operational tasks using IaC and CI/CD.
  • Capacity & performance: Perform capacity planning, load testing, performance tuning, and cost optimization for services and infrastructure.
  • Platform engineering: Develop and maintain self‑service platform components (k8s clusters, service mesh, CI/CD pipelines) to enable developer productivity.
  • Reliability-focused development: Collaborate with dev teams to design fault‑tolerant systems that use retries, timeouts, circuit breakers, and graceful degradation.
  • Security & compliance: Implement secure configurations, manage secrets, and ensure compliance controls for production systems.
  • Chaos engineering & resilience testing: Run experiments to validate system resilience and improve failure recovery processes.
  • Documentation & runbooks: Maintain runbooks and on‑call playbooks, document operational procedures, and automate runbook steps.
  • Mentorship & culture: Advocate SRE best practices, mentor engineers on reliability, and drive reliability KPIs across the org.
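
For candidates unfamiliar with the fault‑tolerance patterns named above, the following is a minimal Python sketch of retries with exponential backoff and a simple circuit breaker. All names (`retry`, `CircuitBreaker`, `CircuitOpenError`) and default values are hypothetical illustrations, not part of any tooling used in this role, and timeouts are left out for brevity.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock              # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In practice the two are often combined, for example `retry(lambda: breaker.call(fetch))`, where `fetch` stands in for any call to a dependency that may fail.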

Typical duties (day‑to‑day)

  • Triage alerts, resolve incidents, and perform root‑cause analysis.
  • Build automation for deployments, infra provisioning, and remediation.
  • Implement and iterate SLOs, track error budgets, and influence release decisions.
  • Improve observability dashboards and alerting thresholds.
  • Run capacity reviews and optimize resource usage/costs.
  • Review designs and code for reliability and operational readiness.
  • Participate in on‑call rotation and conduct post‑mortems after incidents.
  • Create and maintain runbooks and automation playbooks.
  • Collaborate with developers to reduce toil and improve service quality.
  • Lead platform improvements (k8s upgrades, tooling, CI/CD enhancements).
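
As an illustration of the error‑budget tracking mentioned in the duties above, here is a small Python sketch for a request‑based availability SLO. The function name, the returned fields, and the rule of freezing releases once the budget is spent are assumptions made for the example; real policies vary by team.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,         # 1.0 means the budget is fully spent
        "budget_remaining": 1.0 - consumed,  # negative once the SLO is violated
        "freeze_releases": consumed >= 1.0,  # hypothetical policy for this sketch
    }
```

For example, with a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failed requests would consume about a quarter of the budget.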

Required qualifications

  • Education/experience: Bachelor’s in CS or equivalent; 3–5+ years in systems, SRE, DevOps, or site ops roles (senior roles typically require more).
  • Technical skills: Strong Linux systems knowledge, networking fundamentals, an understanding of distributed systems concepts, and experience debugging production issues.
  • Tools & platforms: Experience with Kubernetes, Docker, Terraform/CloudFormation, CI/CD (GitHub Actions/Jenkins), monitoring (Prometheus, Grafana), logging/tracing (ELK/Fluentd, Jaeger), and service meshes (Istio/Linkerd) as relevant.
  • Cloud & infra: Proven experience with AWS/Azure/GCP, managed k8s (EKS/GKE/AKS), load balancers, autoscaling, and storage systems.
  • Scripting & programming: Proficient in at least one language (Python, Go, Ruby, or Bash) for automation and tooling.
  • Practices: Familiar with SRE concepts: SLAs/SLOs/SLIs, error budgets, runbooks, chaos engineering, and blameless post‑mortems.
  • Soft skills: Strong troubleshooting, communication, incident leadership, and collaboration abilities.

Preferred attributes

  • Experience with large‑scale distributed systems, service-oriented architectures, and performance tuning.
  • Background in platform engineering, developer tooling, or building internal developer platforms.
  • Familiarity with security tooling, compliance standards, and secrets management (Vault).
  • Certifications or training in cloud platforms or Kubernetes are beneficial.

Pay: QAR160.21 - QAR321.10 per hour

Work Location: In person

© 2026 Qureos. All rights reserved.