Overview
We are looking for a hands-on Lead Site Reliability Engineer to own the reliability, observability, and automation of our Azure and hybrid (Azure Stack / on-prem) platforms. You will lead SRE practices for our AI, data, and application services, drive a cloud-agnostic DevSecOps toolchain, and partner with engineering, data, and security teams to ensure our platforms are secure, scalable, and cost-efficient. This role is ideal for a senior engineer with 10+ years of experience who can combine deep technical expertise with strong leadership and coaching skills.
Inception, a G42 company, is the region's leading innovator of AI-powered, domain-specific and industry-agnostic products, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer, transforming data and compute infrastructure into real-world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact.
Responsibilities
- Own SLOs/SLIs and overall reliability for key Azure and on-prem platforms (data, AI/ML, and business-critical applications).
- Plan and optimise capacity, performance, and cost for compute, storage, networking, and GPU/accelerator workloads.
- Build and maintain observability (metrics, logs, traces, dashboards, alerts) using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms.
- Lead automation of infrastructure and operations using Terraform, Bicep, Ansible, and scripting (Python, PowerShell, Bash/Go); drive self-healing and runbook-driven operations.
- Operate Azure, Azure Stack, and on-prem Kubernetes/AKS clusters; ensure secure, resilient hybrid connectivity, identity, and access across environments.
- Lead P0/P1 incident response, on-call rotations, communication, and blameless postmortems; drive long-term fixes and reliability improvements.
- Use ITSM and DevSecOps tools (e.g., cloud-agnostic CI/CD, ServiceNow, Jira, ManageEngine, security scanning, and policy-as-code) to manage change, incidents, and compliance.
- Provide technical leadership and mentoring to SREs and platform engineers; collaborate with data, AI/ML, application, and security teams to design for reliability and security from day one.
Qualifications, Skills & Experience
- 10+ years in SRE/DevOps/platform engineering roles, including 5+ years designing and running workloads on Microsoft Azure at scale.
- Strong experience with Azure Data and AI services, including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services.
- Deep hands-on skills with containers and Kubernetes (AKS or equivalent), including autoscaling, upgrades, and production operations.
- Proficiency with Infrastructure as Code (Terraform, Bicep, Ansible) and scripting/programming in Python and/or PowerShell (Go/Bash a plus).
- Solid understanding of observability practices and tools (metrics, logs, traces) and experience implementing monitoring and alerting in production.
- Proven track record implementing SRE practices (SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation).
- Familiarity with hybrid networking, identity, and security (ExpressRoute/VPN, private endpoints, Azure AD (now Microsoft Entra ID), key management).
- Experience working within Agile/Scrum and ITIL processes; exposure to ISO 27001 and external audits is an advantage.
- Excellent communication and stakeholder management skills, with a proven ability to lead, mentor, and influence cross-functional teams.
What Success Looks Like
- 99.9%+ availability for core platforms and customer-facing services.
- Fast and predictable incident handling (MTTD and MTTR targets met).
- End-to-end observability with meaningful, low-noise alerting across Azure and on-prem environments.
- Significant reduction in manual toil through automation and self-service (target 50% reduction over time).
- Documented and tested DR/BCP for key AI, data, and application platforms.
What We Look For
Performance-driven, inquisitive minds are a strong fit for Inception: people who adapt readily to ambiguity, build meaningful collaborations with stakeholders, show a bias for action, and bring a passion for the AI space.
What Working At Inception Offers
- Culture: Open, diverse and inclusive environment with a global vision encouraging personal growth and focusing on groundbreaking, industry-first innovations.
- Career: Outstanding learning, development & growth opportunities via structured training programs and innovative, high-tech projects.
- Rewards: Competitive remuneration package with a host of perks including healthcare, education support, leave benefits and more.
Job Details
Role Level: Mid Level
Work Type: Full Time
Country: United Arab Emirates
City: Abu Dhabi
Job Function: Information Technology (IT)
Industry: Technology, Information and Internet