Job Description: Service Reliability Analyst (Network Infrastructure)
Position: Service Reliability Analyst
Experience: 3–6 Years
Contract Duration: 6 Months
Budget: Up to 13 LPA
Location: Bangalore
Shift: 24/7 Rotational Shifts
Job Overview
We are seeking a highly skilled Service Reliability Analyst to ensure the stability, performance, and observability of network systems. This role blends traditional NOC responsibilities with modern AI Ops practices. The ideal candidate will work in a 24/7 operational model, proactively detect anomalies, leverage AI/ML-driven insights, and contribute to continuous service availability and reliability improvements.
Key Responsibilities
- Incident Management & Troubleshooting: Handle incidents across LAN, WAN, VPN, SD-WAN, Data Center, and Cloud Networks (AWS, Azure, GCP).
- AI Ops & Observability: Implement and optimize AI Ops tools such as Dynatrace, LogicMonitor, Datadog, and Splunk for proactive anomaly detection, alert correlation, and automation.
- Collaboration: Work closely with engineering teams to onboard services, fine-tune alerts, and improve end-to-end observability.
- Root Cause Analysis (RCA): Lead deep-dive RCAs and incident reviews, and implement preventive measures.
- Metrics & Dashboards: Build dashboards and track KPIs such as MTTR, latency, packet loss, and service availability.
- Automation & Infrastructure as Code: Contribute to automation initiatives using Ansible and Terraform; develop and maintain remediation playbooks.
- Predictive Reliability: Optimize AI/ML models for telemetry analysis and predictive issue detection.
- Operational Flexibility: Work in a 24/7 rotational shift pattern with readiness to support critical escalations.
Required Skills & Experience
- Network Expertise: Strong hands-on skills in TCP/IP, BGP, OSPF, DNS, DHCP, and SD-WAN.
- Cloud Networking: Experience with AWS, Azure, GCP networking services.
- Observability & Monitoring: Proficiency with Dynatrace, LogicMonitor, Datadog, Splunk.
- Programming & Scripting: Hands-on experience with Python, Java, .NET, Node.js, JavaScript, and Ansible.
- Infrastructure Automation: Experience with Ansible (playbooks) and Terraform.
- ITSM Tools: Familiarity with ServiceNow, Jira Service Management.
- Incident Leadership: Ability to lead incident response, conduct RCAs, and write RCA reports.
- Certifications (Preferred): Cisco CCNA/CCNP, CompTIA Network+ (DevNet certification is a plus).
Nice to Have
- Exposure to high-performance computing (HPC) or cloud-native services.
- Experience in telemetry/observability data analysis.
- Strong interest in automation, AI Ops, and DevOps practices.
Job Types: Full-time, Permanent
Work Location: In person