Senior Infrastructure / HPC

Date Posted: 11 June, 2026

Industry: IT Services and IT Consulting

Company: VaporVM

Job Description:

Senior HPC / Infrastructure Engineer

Location: Riyadh, Saudi Arabia

Employment Type: Full-Time

Experience: 10+ Years (Hands-On)

Position Overview

We are seeking a highly experienced Senior HPC / Infrastructure Engineer with proven expertise in designing, deploying, and operating enterprise-scale High-Performance Computing (HPC) and AI infrastructure environments. This role is ideal for a hands-on technical leader who has built and managed production-grade HPC platforms, GPU clusters, Kubernetes ecosystems, and AI infrastructure from the ground up.

The successful candidate will play a key role in architecting, optimizing, and maintaining mission-critical compute environments that support advanced AI/ML, data science, and high-performance workloads.

Required Certifications

RHCE – Red Hat Certified Engineer (Active)
CKA – Certified Kubernetes Administrator (Active)

Core Technical Expertise

HPC & NVIDIA AI Ecosystem

NVIDIA Base Command Manager (BCM)
NVIDIA AI Enterprise
NVIDIA GPU Operator & Network Operator
NVIDIA NIM Inference Services
NVIDIA AI Blueprints
CUDA, GPU Drivers, and Performance Optimization

Compute & Container Platforms

Kubernetes (Architecture, Operations & Scaling)
Slurm Workload Manager
Distributed Computing Environments

Operating Systems

Red Hat Enterprise Linux (RHEL)
Ubuntu LTS (Canonical)

Automation & DevOps

CI/CD Pipeline Design & Implementation
Infrastructure Automation
Platform Lifecycle Management
Configuration Management & Orchestration

Key Responsibilities

Design, deploy, and operate large-scale HPC and AI infrastructure environments from bare metal through workload orchestration.
Architect and manage NVIDIA GPU platforms, including BCM, AI Enterprise, GPU Operator, and AI service enablement.
Configure, optimize, and maintain Slurm scheduling environments for high-throughput and GPU-intensive workloads.
Design and operate highly available Kubernetes clusters supporting AI/ML, analytics, and containerized workloads.
Enable and support NVIDIA NIM services and AI Blueprint deployments for enterprise AI initiatives.
Administer and optimize RHEL and Ubuntu environments, ensuring stability, security, and performance.
Develop and maintain infrastructure automation frameworks and CI/CD pipelines for platform and application deployment.
Optimize performance across compute, GPU, storage, networking, and cluster resources.
Implement monitoring, observability, alerting, capacity planning, and operational best practices.
Enforce security, patch management, access controls, and compliance standards across the infrastructure stack.
Lead troubleshooting, root cause analysis, and resolution of complex infrastructure and platform issues.

Candidate Profile

10+ years of hands-on experience in HPC, Linux infrastructure, and enterprise platform engineering.
Proven track record of building and operating production-scale HPC, GPU, or AI infrastructure environments.
Deep expertise in Kubernetes, Slurm, Linux administration, and NVIDIA AI technologies.
Strong understanding of distributed systems, workload scheduling, cluster management, and performance optimization.
Experience supporting AI/ML, data science, and high-performance computing workloads at scale.
Strong analytical, troubleshooting, and problem-solving skills.
Ability to work across infrastructure, platform, automation, and AI enablement domains.
Demonstrated ownership mindset with a history of delivering reliable, scalable, and high-performing solutions.
