Site Reliability Engineer (SRE) — Job Description
Summary
- Build and run highly available, scalable, and observable services by applying software engineering practices to operations and infrastructure.
Core responsibilities
- Reliability engineering: Design, implement, and maintain systems that meet defined SLAs and SLOs, measured through SLIs; reduce toil through automation and engineering solutions.
- Monitoring & observability: Build and maintain metrics, logging, tracing, and alerting (Prometheus, Grafana, ELK, Jaeger); define meaningful alerts and reduce alert fatigue.
- Incident management: Lead incident response, apply and refine runbooks, conduct post‑mortems and root‑cause analysis, and drive remediation to prevent recurrence.
- Automation & tooling: Automate deployments, provisioning, scaling, recovery, and routine operational tasks using IaC and CI/CD.
- Capacity & performance: Perform capacity planning, load testing, performance tuning, and cost optimization for services and infrastructure.
- Platform engineering: Develop and maintain self‑service platform components (k8s clusters, service mesh, CI/CD pipelines) to enable developer productivity.
- Reliability-focused development: Collaborate with dev teams to design fault‑tolerant systems and implement retries, timeouts, circuit breakers, and graceful degradation.
- Security & compliance: Implement secure configurations, manage secrets, and ensure compliance controls for production systems.
- Chaos engineering & resilience testing: Run experiments to validate system resilience and improve failure recovery processes.
- Documentation & runbooks: Maintain runbooks; document operational procedures, on‑call playbooks, and runbook automation.
- Mentorship & culture: Advocate SRE best practices, mentor engineers on reliability, and drive reliability KPIs across the org.
Typical duties (day‑to‑day)
- Triage alerts, resolve incidents, and perform root‑cause analysis.
- Build automation for deployments, infra provisioning, and remediation.
- Implement and iterate SLOs, track error budgets, and influence release decisions.
- Improve observability dashboards and alerting thresholds.
- Run capacity reviews and optimize resource usage/costs.
- Review designs and code for reliability and operational readiness.
- Participate in on‑call rotation and conduct post‑mortems after incidents.
- Create and maintain runbooks and automation playbooks.
- Collaborate with developers to reduce toil and improve service quality.
- Lead platform improvements (k8s upgrades, tooling, CI/CD enhancements).
Required qualifications
- Education/experience: Bachelor’s degree in CS or equivalent; 3–5+ years of experience in systems, SRE, DevOps, or site ops roles (senior roles typically require more).
- Technical skills: Strong knowledge of Linux systems, networking fundamentals, and distributed systems concepts; skilled at debugging production issues.
- Tools & platforms: Experience with Kubernetes, Docker, Terraform/CloudFormation, CI/CD (GitHub Actions/Jenkins), monitoring (Prometheus, Grafana), logging/tracing (ELK/Fluentd, Jaeger), and service meshes (Istio/Linkerd) as relevant.
- Cloud & infra: Proven experience with AWS/Azure/GCP, managed k8s (EKS/GKE/AKS), load balancers, autoscaling, and storage systems.
- Scripting & programming: Proficient in at least one language (Python, Go, Ruby, or Bash) for automation and tooling.
- Practices: Familiar with SRE concepts: SLAs/SLOs/SLIs, error budgets, runbooks, chaos engineering, and blameless post‑mortems.
- Soft skills: Strong troubleshooting, communication, incident leadership, and collaboration abilities.
Preferred attributes
- Experience with large‑scale distributed systems, service-oriented architectures, and performance tuning.
- Background in platform engineering, developer tooling, or building internal developer platforms.
- Familiarity with security tooling, compliance standards, and secrets management (Vault).
- Certifications or training in cloud platforms or Kubernetes are beneficial.
Pay: QAR160.21 - QAR321.10 per hour
Work Location: In person