Site Reliability Engineer Remote 6 Months
We are seeking a Site Reliability Engineer to apply software engineering principles to automating IT operations and to ensure system reliability, scalability, and uptime. The role bridges development and operations: managing infrastructure, monitoring performance against SLIs/SLOs, responding to incidents, and conducting postmortems to prevent future failures.
Core Responsibilities:
- Automation: Writing code to automate manual, repetitive tasks such as system provisioning, configuration management, and patching.
- Incident Response & On-Call: Managing, troubleshooting, and resolving production outages, often participating in on-call rotations.
- Monitoring & Observability: Setting up tools to monitor system health, performance, and user experience, ensuring critical alerts are actionable.
- Capacity Planning & Scaling: Ensuring infrastructure can handle traffic growth and implementing redundancy to maintain high availability.
- Blameless Postmortems: Analyzing the root causes of failures to learn and improve systems, rather than assigning blame.
- Collaboration: Working with developers to ensure code is reliable, performant, and deployable, often advocating for quality over speed.
Required Skills and Qualifications:
- Technical Expertise: Proficiency in coding/scripting (e.g., Python, Go) and familiarity with CI/CD tools.
- Infrastructure Skills: Strong knowledge of cloud platforms (AWS, Google Cloud Platform, Azure), Linux, networking, and container orchestration (Kubernetes).
Preferred Experience:
- 2-3 years in SRE, DevOps, or Software Engineering roles
For applications and inquiries, contact: hirings@openkyber.com