Job Purpose
The Observability Engineer (SRE) is responsible for ensuring the reliability, availability, scalability, and performance of critical systems and applications. The role works closely with cross-functional teams to build comprehensive digital monitoring and tracing capabilities that provide actionable, proactive insight into system performance, detect anomalies, and reduce MTTR.
Key Result Areas
• Collaborate with engineering, operations, and other stakeholders to understand the enterprise architecture, monitoring requirements, and performance goals.
• Define key performance indicators (KPIs) and metrics, diagnose issues, and proactively identify areas for optimization.
• Develop and implement observability frameworks, tools, and processes to enable comprehensive monitoring, logging, and tracing of systems and applications.
• Ensure the availability, scalability, and reliability of infrastructure and deployment environments.
• Implement and manage monitoring and observability tools (AppDynamics, Datadog, Splunk, ELK, Sentry, etc.) to gain insights into system performance and health.
• Provide timely and accurate reports on application performance, highlighting key insights and trends.
• Collaborate with digital squads to implement performance improvements, including code optimizations and infrastructure adjustments.
• Offer guidance and training to end users and internal teams on best practices for application performance monitoring (APM) and performance optimization.
Operating Environment, Framework and Boundaries, Working Relationships
• Member of the Digital team, responsible for bringing the latest tools and innovations in the observability domain
• Provide recommendations on monitoring systems, logging frameworks, and distributed tracing platforms
• Manage and deliver KPI metrics across the enterprise architecture and perform trend analysis
• Deliver a proactive monitoring framework for the infrastructure and digital experience monitoring domains
Problem Solving
• Provide proactive approaches to monitoring problems by utilizing existing observability tools and domain expertise.
• Apply in-depth knowledge of application performance metrics, monitoring, and troubleshooting.
• Provide expertise in problem detection, isolation, and root cause analysis (RCA) during incident management, supported by relevant data and artifacts from observability tools and corresponding systems.
Decision Making Authority & Responsibility
• Provide recommendations for observability and SRE best practices
• Design and implement monitoring solutions
Knowledge, Skills and Experience
• 8+ years of overall experience with IT infrastructure and applications
• 3-5 years of hands-on experience in observability and continuous integration
• 2 years of programming experience in Java or related technologies
• Knowledge of cloud infrastructure (Azure) and cluster management tools such as Kubernetes
• Strong communication skills, with the ability to align the organization on complex technical decisions
• Bachelor's or master's degree in Information Technology, Computer Science, or a related quantitative discipline