Site Reliability Engineer (SRE): Job Description

Summary

Build and run highly available, scalable, and observable services by applying software engineering practices to operations and infrastructure.

Core responsibilities

Reliability engineering: Design, implement, and maintain systems that meet defined SLAs and SLOs, measured through SLIs; reduce toil through automation and engineering solutions.
Monitoring & observability: Build and maintain metrics, logging, tracing, and alerting (Prometheus, Grafana, ELK, Jaeger); define meaningful alerts and reduce alert fatigue.
Incident management: Lead incident response using runbooks, run post‑mortems and root‑cause analysis, and drive remediation to prevent recurrence.
Automation & tooling: Automate deployments, provisioning, scaling, recovery, and routine operational tasks using IaC and CI/CD.
Capacity & performance: Perform capacity planning, load testing, performance tuning, and cost optimization for services and infrastructure.
Platform engineering: Develop and maintain self‑service platform components (k8s clusters, service mesh, CI/CD pipelines) to enable developer productivity.
Reliability‑focused development: Collaborate with dev teams to design fault‑tolerant systems that use retries, timeouts, circuit breakers, and graceful degradation.
Security & compliance: Implement secure configurations, manage secrets, and ensure compliance controls for production systems.
Chaos engineering & resilience testing: Run experiments to validate system resilience and improve failure recovery processes.
Documentation & runbooks: Maintain runbooks and runbook automation; document operational procedures and on‑call playbooks.
Mentorship & culture: Advocate SRE best practices, mentor engineers on reliability, and drive reliability KPIs across the org.
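
As an illustration of the retry pattern named under reliability‑focused development, here is a minimal sketch in Python, assuming a callable that may fail transiently. The function name `call_with_retries` and its parameters (`max_attempts`, `base_delay`, `retry_on`, `sleep`) are hypothetical and not tied to any particular library; a production version would usually also enforce a per‑call timeout and sit behind a circuit breaker.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff and full jitter.

    All names here are illustrative assumptions. In a real service,
    `retry_on` should be narrowed to errors known to be transient.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Backoff ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = base_delay * 2 ** (attempt - 1)
            # Full jitter spreads retries out to avoid synchronized bursts.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so that the backoff can be skipped in tests.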

Typical duties (day‑to‑day)

Triage alerts, resolve incidents, and perform root‑cause analysis.
Build automation for deployments, infra provisioning, and remediation.
Implement and iterate SLOs, track error budgets, and influence release decisions.
Improve observability dashboards and alerting thresholds.
Run capacity reviews and optimize resource usage/costs.
Review designs and code for reliability and operational readiness.
Participate in on‑call rotation and conduct post‑mortems after incidents.
Create and maintain runbooks and automation playbooks.
Collaborate with developers to reduce toil and improve service quality.
Lead platform improvements (k8s upgrades, tooling, CI/CD enhancements).
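
To make the error‑budget duty above concrete, the following is a minimal sketch of the arithmetic, assuming a request‑based availability SLO. The function name `error_budget_remaining` and its inputs are hypothetical; real SLO tooling typically computes this over a rolling window from monitoring data.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget.

    slo_target is a fraction such as 0.999. A result of 1.0 means the
    budget is untouched, 0.0 means it is exhausted, and a negative
    value means the SLO has been missed for this window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window, so nothing has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For example, with a 99.9% target and 1,000,000 requests, the budget is about 1,000 failed requests, so 250 failures leaves roughly three quarters of the budget. A team might slow or pause releases when the remaining fraction approaches zero.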

Required qualifications

Education/experience: Bachelor’s in CS or equivalent; 3–5+ years in systems, SRE, DevOps, or site ops roles (senior roles typically require more).
Technical skills: Strong Linux systems knowledge, networking fundamentals, distributed systems concepts, and experience debugging production issues.
Tools & platforms: Experience with Kubernetes, Docker, Terraform/CloudFormation, CI/CD (GitHub Actions/Jenkins), monitoring (Prometheus, Grafana), logging/tracing (ELK/Fluentd, Jaeger), and service meshes (Istio/Linkerd) as relevant.
Cloud & infra: Proven experience with AWS/Azure/GCP, managed k8s (EKS/GKE/AKS), load balancers, autoscaling, and storage systems.
Scripting & programming: Proficient in at least one language (Python, Go, Ruby, or Bash) for automation and tooling.
Practices: Familiar with SRE concepts: SLAs/SLOs/SLIs, error budgets, runbooks, chaos engineering, and blameless post‑mortems.
Soft skills: Strong troubleshooting, communication, incident leadership, and collaboration abilities.

Preferred attributes

Experience with large‑scale distributed systems, service-oriented architectures, and performance tuning.
Background in platform engineering, developer tooling, or building internal developer platforms.
Familiarity with security tooling, compliance standards, and secrets management (Vault).
Certifications or training in cloud platforms or Kubernetes are beneficial.

Pay: QAR160.21 - QAR321.10 per hour

Work Location: In person
