End Date
Sunday 24 May 2026
We Support Flexible Working
Flexible Working Options
Hybrid Working
Job Description Summary
We are looking for an experienced Senior Site Reliability Engineer to join our Cloud Enabling team, playing a key role in strengthening the resiliency, availability, and security of large-scale platforms. You will support high‑throughput, Kubernetes‑based systems serving millions of customers, while driving improvements in cloud infrastructure, monitoring, and CI/CD practices. This role requires strong hands-on expertise in SRE, cloud-native architectures, and automation across hybrid and public cloud environments. You will act as a technical leader, defining SLAs/SLOs, improving incident management, and enabling operational excellence at scale. The position offers an opportunity to innovate using modern technologies, including AI-driven tooling, within a large and complex enterprise environment.
Job Description
Job Title: Senior Site Reliability Engineer
Location: Hyderabad
Position: Full time
Years of experience: 6 to 14
About this opportunity
We're seeking an experienced Site Reliability Engineer to join the Cloud Enabling team within the Personalised Experiences and Communication Platform. This role is crucial in maturing our SRE capability and contributing to the resiliency, availability and security of our infrastructure and software. The ideal candidate will have a strong background in one or more fields, including SRE, software engineering, data engineering or AI/MLOps. In addition, the candidate will have experience supporting high-throughput applications at scale, and will have built and supported complex hybrid-cloud architectures. The candidate is also expected to have worked extensively with Kubernetes-based workloads, networking and monitoring/logging solutions. An engineering mindset and experience of working within large, complex organisations are preferable.
What you’ll do:
- Support systems that serve millions of customers and billions of requests monthly, ensuring their availability, scalability and resiliency
- Act as a key technical individual contributor within PEC and liaise with SRE guilds, driving improvements to our cloud deployments, monitoring solutions and CI/CD pipelines, and optimising cost
- Drive innovation by exploring new technologies and methodologies to improve our SRE capabilities, including AI tooling and automation opportunities
- Manage high-throughput systems in production, delivering customer value that extends beyond proofs of concept
- Apply hands-on technical expertise to implement SLAs/SLOs/SLIs for a range of software and data teams
- Implement tooling that lets the business triage incidents more efficiently, with more granular alerting, well-defined runbooks and auto-resolving mechanisms
- Act as a subject matter expert in engineering conversations relating to site reliability engineering, fostering a culture of continued learning and development within and across our lab
Why Lloyds Banking Group
We're on an exciting transformation journey and there could not be a better time to join us. The investments we're making in our people, data, and technology are leading to innovative projects, fresh possibilities and countless new ways for our people to work, learn, and thrive.
What you’ll need
- Proven hands-on experience of software development, testing, monitoring and operational stability at scale
- Production experience with k8s and monitoring tools such as Datadog or Dynatrace
- Proven experience and knowledge of automation, CI/CD and the associated best practices
- Proven experience of running postmortems, defining SLAs/SLIs/SLOs and participating in support rotas
- Coding/scripting experience developed in a commercial/industry setting (Python/Bash)
- Knowledge of databases, streaming and batch operations, and API design
- Proficiency with Kubernetes (ideally microservice architectures using the Istio service mesh)
- Extensive experience of cloud-native solutions (ideally Google Cloud)
- Good understanding of cloud storage, networking and resource provisioning