Site Reliability Engineer (SRE) — Job Description
Summary
Design, build, and operate scalable, reliable, and observable systems by applying software engineering practices to infrastructure and operations; reduce toil, improve system reliability, and enable rapid, safe deployments.
Key responsibilities
- Reliability engineering: Define and maintain service-level objectives (SLOs), service-level indicators (SLIs), and error budgets; drive reliability improvements.
- Incident management: Lead incident response, perform on-call duties, coordinate root-cause analysis (RCA), and implement post-incident fixes to prevent recurrence.
- Automation & tooling: Automate operational tasks (provisioning, deployments, scaling, failover), build runbooks, and develop internal tools to eliminate manual toil.
- Monitoring & observability: Implement metrics, logging, tracing, and alerting; ensure effective dashboards and alerts with actionable thresholds.
- Capacity & performance: Plan capacity, conduct load/stress testing, tune performance, and forecast resource needs to meet SLOs.
- CI/CD & release engineering: Build and maintain CI/CD pipelines, deployment strategies (blue/green, canary), and rollback procedures to enable safe releases.
- Infrastructure as Code (IaC): Define and manage infrastructure using IaC tools (Terraform, CloudFormation, Pulumi) and configuration management (Ansible, Chef).
- Cloud & platform engineering: Design and operate cloud-native infrastructure (AWS, GCP, Azure) and platform services (Kubernetes, service meshes); optimize costs and resilience.
- Security & compliance: Collaborate with security teams to implement secure configurations, secrets management, and compliance controls in production systems.
- On-call & runbook maintenance: Maintain runbooks, run periodic game days/chaos testing, and participate in on-call rotation to ensure rapid incident resolution.
- Collaboration & mentorship: Work with developers, QA, and product teams to embed reliability practices, mentor engineers, and improve deployment and observability culture.
Qualifications
- Education: Bachelor’s degree in Computer Science or Engineering, or equivalent experience or relevant certifications.
- Experience: 3 to 7 or more years in SRE, DevOps, systems engineering, or related roles, depending on seniority.
- Technical skills: Strong Linux/Unix administration, networking fundamentals, scripting (Python, Go, Bash), and familiarity with containers and orchestration (Docker, Kubernetes).
- Tools & platforms: Experience with cloud providers (AWS/GCP/Azure), IaC (Terraform/CloudFormation), CI/CD (Jenkins/GitHub Actions/GitLab CI), monitoring/observability (Prometheus, Grafana, ELK, Jaeger), and incident management tools (PagerDuty).
- Soft skills: Strong troubleshooting, communication, and collaboration skills; ability to perform under incident pressure.
Competencies & attributes
- Systems-thinking and automation-first mindset.
- Strong analytical and debugging abilities.
- Proactive, ownership-driven, and customer-focused.
- Ability to balance reliability, velocity, and cost.
- Continuous learner with an interest in tooling and process improvement.
Pay: QAR121.16 - QAR324.88 per hour
Work Location: In person