Position Summary
The AI/ML Support Automation Analyst will be a key member of the KSL AI Support Team, focusing on MLOps
infrastructure, container orchestration, and workflow automation at supercomputing scale. Working under the
AI/ML Support Team Lead, this role is responsible for developing and maintaining secure, OCI-compliant container
images, robust CI/CD pipelines, and cloud-native MLOps workflows that enable researchers to efficiently deploy and
manage AI/ML workloads. The Analyst will bridge the gap between cutting-edge Kubernetes-based infrastructure
and the diverse needs of the research community, contributing to governance, technical enablement, and
community development initiatives.
Major Responsibilities
1 MLOps and Container Development
- Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions
for all types of inquiries.
- Maintain high customer service standards in dealing with and responding to user issues and questions.
- Develop and maintain secure, OCI-compliant, and HPC-ready AI/ML and data science software container
images
- Design and implement robust MLOps workflows and pipelines at supercomputing scale
- Develop and maintain CI/CD pipelines for reproducible infrastructure and workflow deployment
- Design and deploy APIs for AI/ML services and inference endpoints
- Implement and manage Kubernetes-based orchestration, including the configuration and optimization of
CNI, CSI, and service mesh components
- Deploy and maintain container registries (Harbor) and model registries (MLflow, Kubeflow Model
Registry)
2 Governance and Compliance Support
- Assist in computational readiness reviews for AI research projects
- Assist in AI model and artifact control reviews to ensure compliance with institutional standards
- Provide consultation to users on efficient resource usage for AI/ML and MLOps workflows
- Ensure container images and workflows comply with security policies and best practices
- Support the implementation of usage monitoring and reporting systems
3 Performance and Benchmarking
- Perform performance debugging and tuning of MLOps and cloud-native workflows
- Develop and maintain AI/ML and MLOps workload benchmarks for procuring new systems
- Create and maintain regression testing workloads for existing clusters
- Deploy and maintain observability and resource monitoring stacks using Prometheus, Grafana, NVIDIA
DCGM, and Grafana Loki
- Contribute to technology evaluation and benchmarking exercises for future infrastructure investments
4 Training and Documentation
- Create comprehensive training content for users on MLOps platforms, Kubernetes, and containerization
- Develop and maintain high-quality user documentation for automation tools and workflows
- Support the delivery of workshops on CI/CD, container orchestration, and MLOps best practices
- Contribute to knowledge transfer initiatives within the KAUST research community
- Provide one-on-one consultation to researchers on efficient use of automation infrastructure
Personal Requirements
Competencies
- Demonstrated experience developing robust and complex MLOps pipelines
- Hands-on experience with API design and deployment
- Experience developing robust and portable CI/CD pipelines for reproducible infrastructure and workflow
deployment
- Experience supporting researchers or working in academic/research computing settings preferred
Technical Skills - Essential
- Kubernetes: Strong expertise in Kubernetes, Container Network Interface (CNI), Container Storage
Interface (CSI), and Service Mesh
- MLOps: Experience developing and maintaining MLOps pipelines and workflows
- CI/CD: Proficiency in building CI/CD pipelines for infrastructure and application deployment
- Containerization: Experience building secure, OCI-compliant container images
- API Development: Experience in API design, development, and deployment
- Programming: Proficiency in Python; experience with Go, Bash scripting
- Linux: Strong Linux/Unix systems administration skills
Technical Skills - Desired
- Experience with Argo CD, Airflow, Dask, and Spark for workflow orchestration
- Experience with Kubeflow, KServe, and Seldon for ML serving and pipelines
- Experience deploying and maintaining observability stacks (Prometheus, Grafana, NVIDIA DCGM, Grafana
Loki)
- Knowledge of Model Context Protocol (MCP) and agentic frameworks
- Experience deploying inference services at scale
- Experience deploying and maintaining container registries (Harbor) and model registries (MLflow,
Kubeflow Model Registry, Artifact Hub)
- Experience with GitOps practices and Infrastructure as Code (Terraform, Ansible)
- Experience with HPC schedulers (Slurm) and HPC-cloud integration
Soft Skills
- Strong problem-solving and analytical abilities
- Excellent written and verbal communication skills in English
- Customer service mindset with patience for supporting diverse skill levels
- Ability to work independently and as part of a collaborative team
- Strong documentation and knowledge-sharing practices
- Cultural sensitivity for working in an international environment
Preferred Qualifications
- Experience in national laboratories or major research computing facilities
- Experience with GPU scheduling and resource management in Kubernetes
- Background in DevOps or Site Reliability Engineering (SRE)
- Contributions to open-source cloud-native or MLOps projects
- Publications or presentations on MLOps, Kubernetes, or automation topics
- Knowledge of Saudi Arabia's Vision 2030 and national AI initiatives
- Additional certifications: AWS/Azure/GCP, Terraform, NVIDIA DLI
Qualifications
- Bachelor's or Master's degree in Computer Science, Data Science, Computational Science, Artificial
Intelligence, or a related field
- Certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application
Developer), CKS (Certified Kubernetes Security Specialist), or CNPE (Certified Cloud Native Platform
Engineer) are highly valued
Experience
- Minimum of 2 years of relevant experience