Job Title
L3 Support Engineer – Agentic AI, Automation & Reliability (Full‑Stack Support)
Role Overview
As an L3 Support Engineer – Agentic AI, Automation & Reliability, you will play a critical role in ensuring the stability, performance, and continuous improvement of AIM’s cloud‑based and distributed systems. Operating as a senior escalation point, you will own high‑severity (P1/P2) production incidents end to end—driving rapid troubleshooting, remediation, root cause analysis, and long‑term prevention across applications, integrations, and cloud infrastructure.
This role goes beyond traditional support. You will actively design, operate, and improve AI‑driven and automated support workflows, including agent‑based ticket triage, LLM‑assisted diagnostics, and self‑healing runbooks. Working closely with global teams and North American stakeholders, you will combine deep technical expertise with strong communication skills to lead major incident bridges, produce clear RCAs, and mentor L1/L2 engineers in adopting automation‑first and AI‑assisted operating practices.
Location
Remote (Pakistan)
Work Hours
8:00 AM – 5:00 PM Eastern Time, with participation in a global on‑call rotation for critical incidents.
About AIM
AIM is a Canadian technology company that helps organizations modernize their systems through advanced API management, cloud engineering, security solutions, and full-stack software development. Our teams work across North America and globally, delivering stable, scalable, and secure digital platforms for enterprise clients.
We take pride in being hands-on, collaborative, and focused on delivering real results for our clients. As we grow, we are expanding our support engineering team to strengthen platform reliability and support our next stage of growth.
Core Technical Skills
- Strong troubleshooting skills across applications, infrastructure, and integrations, with ownership of P1/P2 incidents end‑to‑end (detection, mitigation, RCA, and prevention).
- Solid understanding and practical application of ITIL processes (Incident, Problem, Change Management) in an ITSM tool such as Jira Service Management, ServiceNow, or ManageEngine.
- Scripting and automation skills in at least one of: Python (preferred), PowerShell, or Bash, with examples of automating repetitive operational tasks (ticket handling, health checks, log analysis, etc.).
- Experience working with APIs (REST, Graph API) and using them, together with webhooks, to integrate systems and workflows.
- Working knowledge of a major cloud platform, preferably Microsoft Azure (compute, storage, networking, identity, monitoring/alerts). Experience with AWS or GCP is acceptable if you are willing to ramp up on Azure.
Agentic AI & Automation Skills
Must‑Have
- Practical experience designing, configuring, or operating AI‑driven or agent‑based workflows (e.g., autonomous ticket triage, virtual agents, or LLM‑assisted runbooks).
- Understanding of prompt engineering basics, how AI agents call tools/APIs, and how context/memory is managed in such systems.
- Awareness of AI risks (hallucinations, unsafe actions) and how to implement guardrails, human‑in‑the‑loop controls, and governance policies.
Nice‑to‑Have
- Familiarity with Retrieval‑Augmented Generation (RAG), vector databases, semantic search, or multi‑agent orchestration frameworks.
Technology Stack (Exposure Expected)
- Cloud: Microsoft Azure (preferred), and/or AWS/GCP.
- ITSM: Jira Service Management (preferred), ManageEngine, ServiceNow, or similar.
- Observability: Azure Monitor, Datadog, Splunk, Prometheus, or equivalent tools for logs, metrics, traces, and alerting.
- Bonus: Containers and orchestration (Docker, Kubernetes); an asset but not mandatory.
Soft Skills & Operating Expectations
- Excellent written and verbal English communication, able to lead major incident bridges and produce clear incident reports and RCAs for North American stakeholders.
- Strong ownership mindset; comfortable operating across L1/L2/L3 when needed, while driving automation and self‑healing to reduce manual workload.
- Ability to mentor L1/L2 engineers in using AI‑driven tools and adopting automation‑first practices.
- Comfortable working permanently 8:00 AM – 5:00 PM Eastern Time from Pakistan and participating in an on‑call rotation for after‑hours incidents as part of a global support model.
Minimum Experience
- 5–8 years in Production Support, Support Engineering, or Site Reliability Engineering, including at least 3 years handling L2/L3 escalations in cloud or distributed systems.
- Proven experience working with international customers (North America or Europe) and operating in shift‑based or evening/night schedules.
- Hands‑on experience in environments where AI‑driven or automated workflows are used for support, operations, or reliability.
Preferred Qualifications
- Certifications in ITIL, Azure/AWS/GCP, or AI/ML disciplines.
- Experience in managed services or SaaS environments with multi‑tenant architectures.
- Familiarity with compliance and security frameworks such as SOC 2 and ISO 27001.