The associate should be proficient in creating and managing interactive Grafana dashboards for real-time visualization of key performance indicators (KPIs) and system health metrics. Strong knowledge of building Prometheus queries for troubleshooting and metrics collection is required, along with expertise in PromQL and LogQL for querying, analyzing, and aggregating time-series data.
Strong knowledge of Unix and Python scripting is required. Knowledge of or experience with Thanos is a plus.
Installation and Configuration: Install, configure, and scale Prometheus instances, including related exporters, and set up Grafana for visualization and data source management.
Metrics Collection and Integration: Integrate Prometheus exporters with existing applications and systems to collect relevant metrics from various sources.
Dashboard and Query Development: Design and develop custom Grafana dashboards and write PromQL queries to visualize key metrics and system performance.
Alerting: Design and manage alerting rules, escalation workflows, and alert thresholds to enable proactive issue identification.
Data Analysis and Visualization: Analyze trends and data to identify performance bottlenecks, optimize systems, and present meaningful insights and reports to stakeholders.
System Reliability: Ensure the accuracy, reliability, and performance of metrics collection and alerts; conduct testing and validate metrics across services.
Collaboration: Work with DevOps and infrastructure teams to ensure seamless integration and scalability of monitoring systems.
Maintenance and Optimization: Troubleshoot issues, optimize query performance, and contribute to data retention and archiving strategies.
Monitoring Coverage: Continuously improve and expand monitoring coverage to meet evolving system and business needs.
Salary Range: $105,000-$115,000 a year