Qureos


Lead Platform Engineer



Company Description

Open Innovation AI is a global technology company that specializes in developing advanced solutions for managing AI workloads. Its flagship product, the Open Innovation Cluster Manager (OICM), orchestrates complex AI tasks efficiently across diverse infrastructures. The platform is hardware-agnostic, optimized for a range of GPUs and accelerator hardware, and facilitates seamless integration and scalability for enterprise AI applications. Open Innovation AI focuses on optimizing and simplifying AI workload management and making AI technologies accessible to organizations of all sizes. With its innovative solutions, companies can reduce operational costs, accelerate time to value, and maximize their return on investment, ensuring that their AI strategies contribute directly to enhanced business outcomes.

About the Role

We're looking for a Lead Platform Engineer to design and build OICM (Open Innovation Cluster Manager), our AI/ML orchestration platform for distributed computing. You'll work on systems that manage GPU workloads across cloud and on-premises infrastructure, focusing on reliability, performance, and scalability. This role involves building distributed systems, implementing resource scheduling algorithms, and creating fault-tolerant services that operate across multiple environments. You'll need strong systems architecture skills and experience solving complex engineering problems at scale.

What You'll Do:

  • Build distributed systems that handle large-scale AI/ML workloads with high availability requirements
  • Develop APIs and microservices that process high request volumes with low latency
  • Implement/enhance scheduling algorithms for efficient GPU resource allocation and load balancing
  • Drive adoption of clean architecture principles and engineering best practices
  • Mentor senior engineers and lead technical initiatives
  • Analyze and optimize system performance across distributed environments

Technical Requirements:

  • 8+ years of experience in building distributed systems and platform infrastructure
  • Expert proficiency in Python and Go
  • Advanced Kubernetes experience: Custom operators, CRDs, networking, service mesh, and multi-cluster management
  • GPU computing expertise: MIG/vGPU, scheduling, and ML framework integration
  • Distributed systems knowledge: Consensus algorithms, caching, message queues, and fault tolerance
  • Performance engineering: System profiling, benchmarking, and optimization
  • Security practices: Security by design and secrets management
  • Leadership experience: Leading technical projects and mentoring teams

Preferred Experience:

  • Ray, Kubeflow, PyTorch, or similar distributed computing frameworks
  • Open source contributions in Kubernetes or ML infrastructure
  • Custom hardware integration and bare metal provisioning
  • Knowledge of networking is a plus
  • ML/AI model deployment and serving infrastructure
  • Infrastructure automation: Terraform, Ansible, GitOps workflows

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology, Product Management, and Consulting

Industries

IT Services and IT Consulting, Technology, Information and Media, and Information Services

Location: Abu Dhabi, United Arab Emirates

© 2025 Qureos. All rights reserved.