We are looking for a seasoned AI/AIOps Engineer with deep expertise in building intelligent operational systems powered by modern LLMs, data engineering, and automation frameworks. The ideal candidate will lead the design and deployment of AI-driven solutions for event correlation, anomaly detection, incident prediction, and operational insights. This role requires strong technical leadership, hands-on development capability, and a proven track record of delivering measurable improvements in IT operations.
Responsibilities:
- Design, develop, and deploy AI/ML solutions for event correlation, log analysis, root cause prediction, and observability enhancement.
- Build and optimize LLM-powered systems, including RAG pipelines, MoE models, and vector-search architectures.
- Implement multi-model orchestration layers or MCP frameworks to manage diverse AI components and workflows.
- Collaborate with engineering, DevOps, and SRE teams to integrate AIOps capabilities into operational environments.
- Develop scalable APIs and backend services using Python and FastAPI.
- Leverage statistics, ML algorithms, and analytical techniques to generate actionable operational insights.
- Work with MLOps and AIOps toolchains to automate model deployment, monitoring, and maintenance.
- Evaluate system performance, drive continuous improvement initiatives, and deliver quantifiable operational benefits.
- Prepare documentation, architecture diagrams, and best practices for AI implementations.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Data Science, or a related technical field.
- 5+ years of experience in AI/ML, Data Engineering, or AIOps domains.
- Demonstrated success implementing LLM-based solutions for log/event correlation, incident prediction, or operational intelligence.
- Strong understanding of AIOps concepts, including event correlation, anomaly detection, incident prediction, and topology mapping.
- Hands-on experience with RAG workflows, MoE models, and vector databases (FAISS, Pinecone, Milvus).
- Proficiency in Python, FastAPI, and AI integration frameworks such as LangChain, LlamaIndex, or Transformers.
- Experience building MCP or multi-model orchestration layers.
- Solid grounding in data science, statistics, and core ML algorithms.
- Familiarity with AIOps platforms like Moogsoft, BigPanda, Dynatrace, or IBM Watson AIOps.
- Proven impact with measurable outcomes (e.g., reduced alert noise, faster incident resolution).