Application Infrastructure Observability Engineer (Banking Only)

Abu Dhabi, United Arab Emirates

Job Description: Application, Microservices, and Infrastructure Observability Engineer

Overall Objectives

Ensure comprehensive, end-to-end visibility into the health, performance, and reliability of applications, microservices, and infrastructure across on-premise and cloud environments.
Implement and manage modern observability tools to support real-time insights, distributed tracing, and predictive analytics for early issue detection and resolution.
Drive incident prevention, reduce Mean Time to Resolution (MTTR), and enhance system resilience through data-driven monitoring, automated alerts, and root cause analysis.
Collaborate with DevOps, Development, and Infrastructure teams to foster a performance-centric culture in high-transaction environments.

Role-Specific Responsibilities

Design, implement, and maintain observability solutions across applications, microservices, and infrastructure using tools such as Prometheus, Grafana, Dynatrace, and OpenTelemetry.
Leverage telemetry data (logs, metrics, traces) to identify and troubleshoot issues across compute, network, storage, and application layers.
Enable distributed tracing and service mapping to diagnose performance bottlenecks and inter-service dependencies in microservices architectures.
Support performance engineering by optimizing code-level performance, transaction processing, and infrastructure scalability during peak loads or major releases.
Define and implement automated remediation triggers and escalation paths to minimize manual intervention and improve incident response times.

General Functional Responsibilities

Ensure compliance with enterprise standards and regulatory frameworks (e.g., GDPR, PSD2) for monitoring and data collection.
Collaborate with infrastructure, application, and security teams to enhance data ingestion, correlation, and observability maturity (progressing from reactive to predictive monitoring).
Participate in post-incident reviews and performance retrospectives to identify trends, reduce MTTR, and improve overall reliability.
Provide out-of-hours support (L1/L2) for critical incidents as part of a rotating on-call schedule.

Required Skills & Qualifications

Strong expertise in observability platforms: Prometheus, Grafana, Dynatrace, OpenTelemetry, ELK/EFK Stack.
Proficiency in cloud platforms: AWS, Azure, or GCP, including cloud-native monitoring services.
Hands-on experience with Kubernetes, Docker, and containerized microservices environments.
Solid understanding of CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, Azure DevOps).
Strong knowledge of infrastructure monitoring (compute, storage, network) and application performance monitoring (APM).
Familiarity with scripting and automation: Python, Bash, PowerShell, or Go.
Experience with incident management tools (PagerDuty, Opsgenie, ServiceNow) and alerting frameworks.
Good understanding of ITIL processes, incident response, and root cause analysis.
Strong communication and collaboration skills to work effectively with cross-functional teams.
Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).