Location: Riyadh, Saudi Arabia
Employment Type: Full-Time
Experience: 10+ Years (Hands-On)
We are seeking a highly experienced Senior HPC / Infrastructure Engineer with proven expertise in designing, deploying, and operating enterprise-scale High-Performance Computing (HPC) and AI infrastructure environments. This role is ideal for a hands-on technical leader who has built and managed production-grade HPC platforms, GPU clusters, Kubernetes ecosystems, and AI infrastructure from the ground up.
The successful candidate will architect, optimize, and maintain mission-critical compute environments that support advanced AI/ML, data science, and high-performance workloads.
- RHCE – Red Hat Certified Engineer (Active)
- CKA – Certified Kubernetes Administrator (Active)
- NVIDIA Base Command Manager (BCM)
- NVIDIA AI Enterprise
- NVIDIA GPU Operator & Network Operator
- NVIDIA NIM Inference Services
- NVIDIA AI Blueprints
- CUDA, GPU Drivers, and Performance Optimization
- Kubernetes (Architecture, Operations & Scaling)
- Slurm Workload Manager
- Distributed Computing Environments
- Red Hat Enterprise Linux (RHEL)
- Ubuntu LTS (Canonical)
- CI/CD Pipeline Design & Implementation
- Infrastructure Automation
- Platform Lifecycle Management
- Configuration Management & Orchestration
- Design, deploy, and operate large-scale HPC and AI infrastructure environments from bare metal through workload orchestration.
- Architect and manage NVIDIA GPU platforms, including BCM, AI Enterprise, GPU Operator, and AI service enablement.
- Configure, optimize, and maintain Slurm scheduling environments for high-throughput and GPU-intensive workloads.
- Design and operate highly available Kubernetes clusters supporting AI/ML, analytics, and containerized workloads.
- Enable and support NVIDIA NIM services and AI Blueprint deployments for enterprise AI initiatives.
- Administer and optimize RHEL and Ubuntu environments, ensuring stability, security, and performance.
- Develop and maintain infrastructure automation frameworks and CI/CD pipelines for platform and application deployment.
- Optimize performance across compute, GPU, storage, networking, and cluster resources.
- Implement monitoring, observability, alerting, capacity planning, and operational best practices.
- Enforce security, patch management, access controls, and compliance standards across the infrastructure stack.
- Lead troubleshooting, root cause analysis, and resolution of complex infrastructure and platform issues.
- 10+ years of hands-on experience in HPC, Linux infrastructure, and enterprise platform engineering.
- Proven track record of building and operating production-scale HPC, GPU, or AI infrastructure environments.
- Deep expertise in Kubernetes, Slurm, Linux administration, and NVIDIA AI technologies.
- Strong understanding of distributed systems, workload scheduling, cluster management, and performance optimization.
- Experience supporting AI/ML, data science, and high-performance computing workloads at scale.
- Strong analytical, troubleshooting, and problem-solving skills.
- Ability to work across infrastructure, platform, automation, and AI enablement domains.
- Demonstrated ownership mindset with a history of delivering reliable, scalable, and high-performing solutions.