About the Role
The AI/ML Support Automation Analyst will be a key member of the KSL AI Support Team, focusing on MLOps infrastructure, container orchestration, and workflow automation at supercomputing scale.
This role is responsible for developing and maintaining secure, OCI-compliant container images, robust CI/CD pipelines, and cloud-native MLOps workflows that enable researchers to efficiently deploy and manage AI/ML workloads. The Analyst will bridge the gap between cutting-edge Kubernetes-based infrastructure and the diverse needs of the research community, contributing to governance, technical enablement, and community development initiatives.
Responsibilities
MLOps and Container Development
- Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions for all types of inquiries.
- Maintain high customer service standards when responding to user issues and questions.
- Develop and maintain secure, OCI-compliant, and HPC-ready AI/ML and data science software container images.
- Design and implement robust MLOps workflows and pipelines at supercomputing scale.
- Develop and maintain CI/CD pipelines for reproducible infrastructure and workflow deployment.
- Design and deploy APIs for AI/ML services and inference endpoints.
- Implement, manage, and optimize Kubernetes-based orchestration, including CNI, CSI, and service mesh configurations.
- Deploy and maintain container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry).
Governance and Compliance Support
- Assist in computational readiness reviews for AI research projects.
- Assist in AI model and artifact control reviews to ensure compliance with institutional standards.
- Provide consultation to users on efficient resource usage for AI/ML and MLOps workflows.
- Ensure container images and workflows comply with security policies and best practices.
- Support the implementation of usage monitoring and reporting systems.
Performance and Benchmarking
- Perform performance debugging and tuning of MLOps and cloud-native workflows.
- Develop and maintain AI/ML and MLOps workload benchmarks for procuring new systems.
- Create and maintain regression testing workloads for existing clusters.
- Deploy and maintain observability and resource monitoring stacks using Prometheus, Grafana, NVIDIA DCGM, and Grafana Loki.
- Contribute to technology evaluation and benchmarking exercises for future infrastructure investments.
Training and Documentation
- Create comprehensive training content for users on MLOps platforms, Kubernetes, and containerization.
- Develop and maintain high-quality user documentation for automation tools and workflows.
- Support the delivery of workshops on CI/CD, container orchestration, and MLOps best practices.
- Contribute to knowledge transfer initiatives within the KAUST research community.
- Provide one-on-one consultation to researchers on efficient use of automation infrastructure.
Qualifications
- Bachelor's or Master's degree in Computer Science, Data Science, Computational Science, Artificial Intelligence, or a related field.
- Certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), CKS (Certified Kubernetes Security Specialist), or CNPE (Certified Cloud Native Platform Engineer) are highly valued.
Required Skills
- Demonstrated experience developing robust and complex MLOps pipelines.
- Hands-on experience with API design and deployment.
- Experience developing robust and portable CI/CD pipelines for reproducible infrastructure and workflow deployment.
- Experience supporting researchers or working in academic/research computing settings is preferred.
Technical Skills - Essential
- Kubernetes: Strong expertise in Kubernetes, Container Network Interface (CNI), Container Storage Interface (CSI), and service mesh.
- MLOps: Experience developing and maintaining MLOps pipelines and workflows.
- CI/CD: Proficiency in building CI/CD pipelines for infrastructure and application deployment.
- Containerization: Experience building secure, OCI-compliant container images.
- API Development: Experience in API design, development, and deployment.
- Programming: Proficiency in Python; experience with Go and Bash scripting.
- Linux: Strong Linux/Unix systems administration skills.
Technical Skills - Desired
- Experience with Argo CD, Airflow, Dask, and Spark for workflow orchestration.
- Experience with Kubeflow, KServe, and Seldon for ML serving and pipelines.
- Experience deploying and maintaining observability stacks (Prometheus, Grafana, NVIDIA DCGM, Grafana Loki).
- Knowledge of Model Context Protocol (MCP) and agentic frameworks.
- Experience deploying inference services at scale.
- Experience deploying and maintaining container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry, Artifact Hub).
- Experience with GitOps practices and Infrastructure as Code (Terraform, Ansible).
- Experience with HPC schedulers (Slurm) and HPC-cloud integration.