A Site Reliability Engineer applies software engineering principles to infrastructure and operations, ensuring system reliability, scalability, and performance. SREs bridge development and operations, automating workflows, managing incidents, and maintaining uptime across production environments at scale.
- Monitor and maintain reliability of critical production systems.
- Automate infrastructure tasks to eliminate operational toil.
- Lead incident response and conduct post-incident reviews.
- Define and track SLIs, SLOs, and error budgets.
- Build and maintain CI/CD pipelines and deployment strategies.
- Implement observability using metrics, logs, and traces.
- Collaborate with developers to embed reliability in design.
- Conduct chaos engineering experiments to identify system weaknesses.
- Proficiency in Python, Go, or Bash.
- Expertise in AWS, GCP, or Azure cloud platforms.
- Strong knowledge of Kubernetes, Docker, and containerization.
- Experience with monitoring tools such as Prometheus, Grafana, and Datadog.
- Infrastructure as Code skills using Terraform or CloudFormation.
- Strong problem-solving, communication, and cross-team collaboration skills.
Note: Salary depends on experience and skills and is paid in local currency.
Date Posted: March 26, 2026
Offered Salary: 2515000 - 3655000 / year
Expiration Date: December 12, 2028
Qualification: Bachelor's Degree