Qureos

Senior Lead SysOps/DevOps Engineer

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while co-creating opportunities with our commercial teams.


This role combines SysOps and HPC depth with DevOps breadth. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development

  • Partner with sales and solution teams to identify and qualify new opportunities
  • Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations
  • Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients
  • Prepare high-quality technical materials
  • Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals


In-Account Delivery — SysOps & DevOps Execution

  • Operate directly within client accounts as a senior SysOps/DevOps engineer
  • Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on
  • Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling
  • Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems
  • Serve as the senior escalation point for complex operational incidents within accounts


Architecture & Solution Design

  • Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments
  • Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms
  • Recommend and validate technology choices aligned to client scale, budget, and team maturity
  • Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design

  • Design production-grade multi-cluster Kubernetes platforms:
      • RKE2, EKS (AWS), AKS (Azure) at enterprise scale
      • GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools
      • Hybrid cloud + on-premises HPC infrastructure
  • Define and document:
      • Workload isolation: namespaces, MIG partitioning, multi-tenancy models
      • Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)
      • Storage: Longhorn, Ceph, distributed and high-throughput file systems


2. Platform Engineering & GitOps Strategy

  • Define and enforce platform standards across the delivery lifecycle
  • GitOps tooling: ArgoCD, Fleet — declarative cluster management
  • CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote
  • Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible
  • Standardize cluster bootstrapping, app deployment lifecycle, and environment promotion (Dev → QA → Prod)


3. AI / GPU Infrastructure Architecture (Priority Competency)

  • Design and operate GPU compute platforms at scale:
      • GPU Operator deployment and lifecycle management
      • MIG (Multi-Instance GPU) partitioning for multi-tenant workloads
      • Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)
  • Understand AI workload classes and their infrastructure implications:
      • Distributed training workloads (data/model/pipeline parallelism)
      • Inference pipelines: NVIDIA Triton Inference Server, TensorRT optimization
  • Align infrastructure to the full AI stack:
      • CUDA stack, cuDNN, NCCL collective communication libraries
      • High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA
      • GPUDirect RDMA / GPUDirect Storage for low-latency data paths


4. Observability & Reliability Engineering

  • Define and implement full-stack observability:
      • Metrics: Prometheus, Thanos (long-term retention, multi-cluster)
      • Logs: Loki, Fluent Bit
      • GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems
  • Build operational frameworks:
      • SLO / SLA definitions and error budget tracking
      • Alerting strategy: noise reduction, severity routing
      • Incident response playbooks and on-call runbooks


5. Security & Multi-Tenancy Architecture

  • Design zero-trust security postures for multi-tenant platforms
  • Secret management: HashiCorp Vault, External Secrets Operator
  • Identity and access: IAM, RBAC, SSO/OIDC integration
  • Network isolation: NetworkPolicy, micro-segmentation, mTLS
  • Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement


6. HPC, Data & Storage Architecture (Priority Competency)

  • Understand high-performance storage for AI/HPC workloads:
      • GPUDirect Storage: bypassing the CPU for GPU-native I/O
      • Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)
      • Storage tiering, caching strategies, and data lifecycle management
  • Size and validate storage architectures against workload I/O profiles


7. Operational Leadership & Linux Systems

  • Lead incident response and root cause analysis (RCA) for critical production issues
  • Define upgrade strategies, change management procedures, and disaster recovery plans
  • Write and maintain runbooks, operational playbooks, and knowledge base content
  • Integrate organizational processes, compliance requirements, and security policies into operational frameworks
  • Deep Linux expertise:
      • Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)
      • Storage I/O scheduling, NVMe optimization
      • Network stack tuning for RDMA / InfiniBand
      • System performance profiling and bottleneck analysis


Candidate Profile — Who You Are

  • You are comfortable running production systems
  • You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity
  • You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment
  • You communicate technical complexity clearly — to engineers and to C-level stakeholders
  • You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations
  • You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions
  • You thrive in ambiguity and can scope both short POCs and long-horizon platform programs

Requirements


Required

  • 10+ years in platform/infrastructure engineering, with at least 2 years in an architect-level role
  • Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)
  • Significant Linux systems administration experience — kernel, networking, storage at a low level
  • HPC and/or GPU infrastructure experience — physical GPU servers, NCCL, InfiniBand, or high-speed fabrics
  • Demonstrable presales or client-facing experience
  • IaC experience: Terraform and/or Ansible in production environments
  • Strong understanding of GitOps and CI/CD pipelines in enterprise settings


Strongly Preferred

  • Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling
  • Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective
  • Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines
  • Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments
  • Hands-on experience with Vault, External Secrets, and zero-trust network architectures
  • Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)


Certifications (Advantageous)

  • CKA / CKS (Certified Kubernetes Administrator / Security Specialist)
  • RHCE / RHCA (Red Hat Certified Engineer / Architect)
  • AWS Solutions Architect / Azure Solutions Architect Expert
  • HashiCorp Terraform Associate or Vault Associate
  • NVIDIA DLI certifications (GPU computing, AI infrastructure)

© 2026 Qureos. All rights reserved.