Job Description: Application, Microservices, and Infrastructure Observability Engineer
Overall Objectives
- Ensure comprehensive, end-to-end visibility into the health, performance, and reliability of applications, microservices, and infrastructure across on-premises and cloud environments.
- Implement and manage modern observability tools to support real-time insights, distributed tracing, and predictive analytics for early issue detection and resolution.
- Drive incident prevention, reduce Mean Time to Resolution (MTTR), and enhance system resilience through data-driven monitoring, automated alerts, and root cause analysis.
- Collaborate with DevOps, Development, and Infrastructure teams to foster a performance-centric culture in high-transaction environments.
Role-Specific Responsibilities
- Design, implement, and maintain observability solutions across applications, microservices, and infrastructure using tools such as Prometheus, Grafana, Dynatrace, and OpenTelemetry.
- Leverage telemetry data (logs, metrics, traces) to identify and troubleshoot issues across compute, network, storage, and application layers.
- Enable distributed tracing and service mapping to diagnose performance bottlenecks and inter-service dependencies in microservices architectures.
- Support performance engineering by optimizing code-level performance, transaction processing, and infrastructure scalability during peak loads or major releases.
- Define and implement automated remediation triggers and escalation paths to minimize manual intervention and improve incident response times.
General Functional Responsibilities
- Ensure compliance with enterprise standards and regulatory frameworks (e.g., GDPR, PSD2) for monitoring and data collection.
- Collaborate with infrastructure, application, and security teams to enhance data ingestion, correlation, and observability maturity (progressing from reactive to predictive monitoring).
- Participate in post-incident reviews and performance retrospectives to identify trends, reduce MTTR, and improve overall reliability.
- Provide out-of-hours support (L1/L2) for critical incidents as part of a rotating on-call schedule.
Required Skills & Qualifications
- Strong expertise in observability platforms: Prometheus, Grafana, Dynatrace, OpenTelemetry, ELK/EFK Stack.
- Proficiency in cloud platforms: AWS, Azure, or GCP, including cloud-native monitoring services.
- Hands-on experience with Kubernetes, Docker, and containerized microservices environments.
- Solid understanding of CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, Azure DevOps).
- Strong knowledge of infrastructure monitoring (compute, storage, network) and application performance monitoring (APM).
- Familiarity with scripting and automation: Python, Bash, PowerShell, or Go.
- Experience with incident management tools (PagerDuty, Opsgenie, ServiceNow) and alerting frameworks.
- Good understanding of ITIL processes, incident response, and root cause analysis.
- Strong communication and collaboration skills to work effectively with cross-functional teams.
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Key Tools & Technologies
- Prometheus | Grafana | Dynatrace | OpenTelemetry
- AWS | Azure | GCP
- Kubernetes | Docker
- CI/CD (Jenkins, GitLab CI, GitHub Actions, Azure DevOps)
- Scripting (Python, Bash, Go, PowerShell)
- APM, Telemetry (Logs, Metrics, Traces), Distributed Tracing
Job Type: Contract
Contract length: 12 months