We are seeking a Senior Site Reliability Engineer to join our team and ensure the reliability, scalability, and efficiency of our systems.
You will work closely with developers and operations teams to support seamless user experiences and meet client expectations. In this role, you will deploy, maintain, and automate infrastructure and application environments while driving continuous improvement in operational practices. If you have a strong background in SRE and cloud technologies, we encourage you to apply and contribute to our mission of delivering high-quality solutions.
Responsibilities
- Collaborate with development, security, quality, and operations teams to apply site reliability engineering practices
- Define and maintain reliability, availability, and performance targets for services and applications
- Troubleshoot and resolve infrastructure and application issues promptly
- Implement monitoring systems to track infrastructure and application reliability
- Manage service level objectives, error budgets, and incident management processes
- Automate repetitive tasks to reduce operational toil
- Support capacity planning and performance optimization efforts
- Conduct postmortem analyses to identify and address root causes of incidents
- Drive continuous improvement initiatives in operational procedures and reliability engineering
Requirements
- Bachelor’s degree in computer science, engineering, or a related field
- Proven experience with cloud platforms such as AWS, GCP, or Azure
- Experience implementing site reliability engineering practices including SLO/SLI, error budgets, postmortems, toil reduction, capacity planning, and incident management
- Knowledge of Python or similar scripting/programming languages
- Strong background in monitoring tools and techniques
- Proficiency with continuous integration and continuous delivery tools, infrastructure as code, and configuration management
- Solid knowledge of containerization and orchestration technologies such as Docker and Kubernetes
- Strong written and verbal English communication skills (B2+)
Nice to have
- Expertise in deployment and management of large language models including retrieval-augmented generation (RAG)
- Certifications in Kubernetes, AWS, GCP, Azure, or related technologies
- Experience in DevOps practices and tools
- Knowledge of AI/ML model management including deployment, monitoring, and maintenance
We offer
- CONTINUOUS UPSKILLING, LEARNING & DEVELOPMENT
  - Diversity of tasks and projects
  - Assessment center for objective review of competency level
  - Personal development plan
  - Mentoring programs and leadership development
  - Certification and professional development support
  - Access to learning platforms including more than 2,500 internal courses and the LinkedIn Learning library with 20,000+ courses
  - English courses taught by certified teachers
- CORPORATE BENEFITS
  - Extra leave days
  - Referral bonuses
- COMPENSATION PACKAGE
  - Competitive compensation paid in USD
  - Regular salary and performance reviews
- MEDICAL & HEALTHCARE
  - Private health insurance
  - Well-being events
- WORKING ENVIRONMENT
  - Recreation areas and kitchens
  - Tea, coffee, and snacks
  - Well-being events
  - Sports equipment and game consoles
  - IT Equipment
  - Microsoft's Software Assurance Home Use Program (HUP)
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to learn and grow continuously. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.