Job Title: AI Infrastructure Consultant
Job Type: Permanent
Job Location: Riyadh, Saudi Arabia
Job Summary:
We are seeking a seasoned AI Infrastructure Consultant to lead the design, implementation, and optimization of our high-performance computing environment. This role is critical for bridging the gap between raw hardware capabilities (GPUs) and scalable AI/ML model deployment. You will be responsible for ensuring our infrastructure is robust, cost-effective, and capable of supporting complex machine learning workloads at scale.
Roles and Responsibilities:
- Architecture & Design
  - Assess AI/ML workload requirements to design end-to-end compute, storage, and networking architectures.
  - Architect specialized GPU clusters (NVIDIA A100/H100 or similar) tailored for training and inference.
  - Define high-speed networking requirements (e.g., InfiniBand, RoCE) and low-latency storage solutions for massive datasets.
- Containerization & Orchestration
  - Implement and manage Docker containerization for consistent model environments.
  - Deploy and scale AI workloads using Kubernetes (or managed services like EKS/GKE/AKS), ensuring high availability and seamless resource scheduling.
- MLOps & CI/CD Integration
  - Build and maintain robust CI/CD pipelines specifically for AI models, automating the journey from code to production.
  - Integrate automated testing, versioning for models/data, and deployment strategies (Canary, Blue-Green).
- Monitoring & Cost Optimization
  - Establish comprehensive monitoring frameworks to track infrastructure utilization and GPU health.
  - Analyze performance bottlenecks and implement strategies to optimize cost-performance, ensuring maximum ROI on expensive compute resources.
Required Qualifications & Skills:
- Total Experience: 10+ years in IT Infrastructure, Systems Engineering, or DevOps.
- AI Specialization: 2-3 years of hands-on experience specifically in AI/ML infrastructure.
- GPU Expertise: Proven track record in GPU setup, CUDA configurations, and managing hardware acceleration for deep learning.
- Orchestration: Expert-level knowledge of Kubernetes and the CNCF ecosystem.
- Cloud & Hybrid: Proficiency with major cloud providers (AWS/Azure/GCP) and on-premises data center environments.
- Soft Skills: Strong consultancy mindset with the ability to translate complex technical requirements into actionable architectural roadmaps.