Dear Candidate,
Please find the job description (JD) below.
Location - Chennai, Bangalore, Hyderabad, Pune and Noida
Experience - 7+ years
Primary Monitoring & Incident Response
- Provide 24×7 monitoring of Azure infrastructure (compute, network, storage) using tools such as Azure Monitor, Splunk, DynaTrace, and custom dashboards.
- Respond to alerts and triage P1/P2 escalations via ServiceNow war rooms, performing initial diagnosis and remediation where possible.
- Adhere to the incident, change, and exception management processes.
Capacity & Availability Management
- Identify scaling opportunities for virtual machines or services as required, and identify zone-redundancy patterns for performance.
- Track capacity forecasts and proactively identify performance bottlenecks.
Backup & Restore Operations
- Execute frequent backups (Azure Backup, NetApp Snapshots) and perform basic restore tasks to ensure business continuity.
- Conduct routine backup verifications/tests to confirm data integrity.
Access & Permissions Management
- Maintain Azure/NetApp file shares, setting up and adjusting access controls and AD group permissions according to organizational policy.
- Perform periodic identity and access reviews to enforce the principle of least privilege.
Logging & Metrics Oversight
- Oversee monitoring agents (e.g., Splunk, DynaTrace, Azure Alerts, SystemPulse), ensuring they are up to date and generating the right alerts/metrics for L2 to act upon.
- Collaborate with L3 to fine-tune alert thresholds and logging when chronic issues emerge.
Basic Performance Testing
- Execute routine performance checks (e.g., load or stress tests) in coordination with L3 teams when potential service degradation is suspected.
- Document and escalate persistent performance anomalies.
SKILL SET & STAFFING CONSIDERATIONS
- Comfortable reading and troubleshooting logs/metrics (Splunk, DynaTrace, Azure Monitor).
- Familiar with Azure Backup services, basic restore procedures, and file share permissions.
- Proficient in ticketing systems (ServiceNow) and in collaborating with other technical teams on escalations.
- Sufficient knowledge to follow runbooks and standard operating procedures (SOPs).
- Able to keep documentation of standard operating procedures and IaC changes continuously updated in a central repository (e.g., Git repos).
- Familiar with Epic implementations (on-prem / cloud).