
Senior Site Reliability Engineer

🌍 About SAQAYA

SAQAYA is a fast-growing international technology consultancy operating across the UK, Spain, and Egypt.

We partner with leading technology companies to build high-impact digital platforms by connecting exceptional engineers with ambitious product teams.

Our mission is simple: build outstanding technology by empowering outstanding people.

💼 About the Product Environment

Our client operates within the financial market data space, delivering mission-critical production systems that demand high availability, stability, and performance.

The engineering culture emphasizes automation, reliability, and continuous improvement. You will work in a collaborative environment where infrastructure quality directly impacts product success and user trust.

This is a high-impact role where you will help shape production environments, improve deployment processes, and champion reliability best practices across teams.

🧩 The Role

We’re looking for a Senior Site Reliability Engineer who is passionate about automation, infrastructure design, and maintaining highly available production systems.

You will play a key role in ensuring system stability, improving deployment workflows, and collaborating closely with development and data teams to support evolving product requirements.

This role reports to the Head of Engineering.

If you enjoy solving complex operational challenges, automating everything possible, and driving long-term reliability improvements, this role is for you.

🔧 What You’ll Be Working On

🚀 Production Reliability & Monitoring

Monitor and support production environments to ensure high availability and performance
Proactively identify operational issues before they impact users
Handle incident response, troubleshooting, and root cause analysis
Participate in an on-call rotation to ensure uptime

⚙️ Infrastructure & Automation

Build and maintain automation scripts for cloud-based deployments
Design and implement production infrastructure
Apply infrastructure-as-code practices (Terraform, Ansible, Puppet)
Improve deployment processes to promote stable and regular releases
Focus on automating repetitive operational tasks

📊 Observability & Performance

Implement and manage monitoring and logging solutions (Prometheus, Grafana, Nagios)
Define and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
Manage capacity planning and performance optimization
Apply disaster recovery strategies, including backups and redundancy planning
Work with error budgets to balance innovation and reliability

🤝 Cross-Functional Collaboration