Qureos

About the Role


The AI/ML Support Automation Analyst will be a key member of the KSL AI Support Team, focusing on MLOps infrastructure, container orchestration, and workflow automation at a supercomputing scale.

This role is responsible for developing and maintaining secure, OCI-compliant container images, robust CI/CD pipelines, and cloud-native MLOps workflows that enable researchers to efficiently deploy and manage AI/ML workloads. The Analyst will bridge the gap between cutting-edge Kubernetes-based infrastructure and the diverse needs of the research community, contributing to governance, technical enablement, and community development initiatives.


Responsibilities


MLOps and Container Development

  • Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions for all types of inquiries.
  • Maintain high customer service standards when responding to user issues and questions.
  • Develop and maintain secure, OCI-compliant, and HPC-ready AI/ML and data science software container images.
  • Design and implement robust MLOps workflows and pipelines at supercomputing scale.
  • Develop and maintain CI/CD pipelines for reproducible infrastructure and workflow deployment.
  • Design and deploy APIs for AI/ML services and inference endpoints.
  • Implement and manage Kubernetes-based orchestration, including CNI, CSI, and service mesh configurations and optimization.
  • Deploy and maintain container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry).


Governance and Compliance Support

  • Assist in computational readiness reviews for AI research projects.
  • Assist in AI model and artifact control reviews to ensure compliance with institutional standards.
  • Provide consultation to users on efficient resource usage for AI/ML and MLOps workflows.
  • Ensure container images and workflows comply with security policies and best practices.
  • Support the implementation of usage monitoring and reporting systems.


Performance and Benchmarking

  • Perform performance debugging and tuning of MLOps and cloud-native workflows.
  • Develop and maintain AI/ML and MLOps workload benchmarks for procuring new systems.
  • Create and maintain regression testing workloads for existing clusters.
  • Deploy and maintain observability and resource monitoring stacks using Prometheus, Grafana, NVIDIA DCGM, and Grafana Loki.
  • Contribute to technology evaluation and benchmarking exercises for future infrastructure investments.


Training and Documentation

  • Create comprehensive training content for users on MLOps platforms, Kubernetes, and containerization.
  • Develop and maintain high-quality user documentation for automation tools and workflows.
  • Support the delivery of workshops on CI/CD, container orchestration, and MLOps best practices.
  • Contribute to knowledge transfer initiatives within the KAUST research community.
  • Provide one-on-one consultation to researchers on efficient use of automation infrastructure.


Qualifications


  • Bachelor's or Master's degree in Computer Science, Data Science, Computational Science, Artificial Intelligence, or a related field.
  • Certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), CKS (Certified Kubernetes Security Specialist), or CNPE (Certified Cloud Native Platform Engineer) are highly valued.


Required Skills


  • Demonstrated experience developing robust and complex MLOps pipelines.
  • Hands-on experience with API design and deployment.
  • Experience developing robust and portable CI/CD pipelines for reproducible infrastructure and workflow deployment.
  • Experience supporting researchers or working in academic/research computing settings is preferred.


Technical Skills - Essential

  • Kubernetes: Strong expertise in Kubernetes, Container Network Interface (CNI), Container Storage Interface (CSI), and Service Mesh.
  • MLOps: Experience developing and maintaining MLOps pipelines and workflows.
  • CI/CD: Proficiency in building CI/CD pipelines for infrastructure and application deployment.
  • Containerization: Experience building secure, OCI-compliant container images.
  • API Development: Experience in API design, development, and deployment.
  • Programming: Proficiency in Python; experience with Go and Bash scripting.
  • Linux: Strong Linux/Unix systems administration skills.


Technical Skills - Desired

  • Experience with Argo CD, Airflow, Dask, and Spark for workflow orchestration.
  • Experience with Kubeflow, KServe, and Seldon for ML serving and pipelines.
  • Experience deploying and maintaining observability stacks (Prometheus, Grafana, NVIDIA DCGM, Grafana Loki).
  • Knowledge of Model Context Protocol (MCP) and agentic frameworks.
  • Experience deploying inference services at scale.
  • Experience deploying and maintaining container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry, Artifact Hub).
  • Experience with GitOps practices and Infrastructure as Code (Terraform, Ansible).
  • Experience with HPC schedulers (Slurm) and HPC-cloud integration.

© 2026 Qureos. All rights reserved.