We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while co-creating opportunities with our commercial teams.
The role demands both SysOps/HPC depth and DevOps breadth. You are expected to spend at least 60% of your time on implementation and technical execution.
What You Will Do
Presales & Business Development
- Partner with sales and solution teams to identify and qualify new opportunities
- Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations
- Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients
- Prepare high-quality technical materials
- Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals
In-Account Delivery — SysOps & DevOps Execution
- Operate directly within client accounts as a senior SysOps/DevOps engineer
- Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on
- Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling
- Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems
- Serve as the senior escalation point for complex operational incidents within accounts
Architecture & Solution Design
- Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments
- Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms
- Recommend and validate technology choices aligned to client scale, budget, and team maturity
- Produce architecture decision records (ADRs), solution blueprints, and technical runbooks
Technical Competencies & Requirements
1. Architecture & System Design
- Design production-grade multi-cluster Kubernetes platforms:
- RKE2, EKS (AWS), AKS (Azure) at enterprise scale
- GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools
- Hybrid cloud + on-premises HPC infrastructure
- Define and document:
- Workload isolation: namespaces, MIG partitioning, multi-tenancy models
- Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)
- Storage: Longhorn, Ceph, distributed and high-throughput file systems
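The isolation primitives listed above (namespaces, multi-tenancy boundaries, network policy) ultimately reduce to a handful of Kubernetes objects per tenant. A minimal sketch in Python, building the manifests as plain dicts; the tenant name and resource limits are hypothetical placeholders:

```python
def tenant_manifests(tenant: str, cpu_limit: str, gpu_limit: int) -> list[dict]:
    """Namespace, ResourceQuota, and same-tenant-only NetworkPolicy for one tenant."""
    ns = {"apiVersion": "v1", "kind": "Namespace",
          "metadata": {"name": tenant, "labels": {"tenant": tenant}}}
    quota = {"apiVersion": "v1", "kind": "ResourceQuota",
             "metadata": {"name": f"{tenant}-quota", "namespace": tenant},
             "spec": {"hard": {"limits.cpu": cpu_limit,
                               "requests.nvidia.com/gpu": str(gpu_limit)}}}
    # Ingress restricted to namespaces carrying the same tenant label.
    netpol = {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
              "metadata": {"name": f"{tenant}-isolate", "namespace": tenant},
              "spec": {"podSelector": {},
                       "policyTypes": ["Ingress"],
                       "ingress": [{"from": [{"namespaceSelector":
                                              {"matchLabels": {"tenant": tenant}}}]}]}}
    return [ns, quota, netpol]

manifests = tenant_manifests("team-a", cpu_limit="64", gpu_limit=8)
```

In practice these objects would be rendered to YAML and managed declaratively via GitOps rather than built in code; the sketch only shows the shape of the boundary.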
2. Platform Engineering & GitOps Strategy
- Define and enforce platform standards across the delivery lifecycle
- GitOps tooling: ArgoCD, Fleet — declarative cluster management
- CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote
- Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible
- Standardize cluster bootstrapping, app deployment lifecycle, and environment promotion (Dev → QA → Prod)
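The promotion discipline above can be stated as a single invariant: an artifact may only advance to an environment after it is running in every earlier one. A hedged sketch, with hypothetical environment names and image tags:

```python
PROMOTION_ORDER = ["dev", "qa", "prod"]   # hypothetical environment names

def can_promote(image_tag: str, deployed: dict[str, str], target: str) -> bool:
    """A tag may promote to `target` only if it already runs in every earlier env."""
    idx = PROMOTION_ORDER.index(target)
    return all(deployed.get(env) == image_tag for env in PROMOTION_ORDER[:idx])

state = {"dev": "v1.4.2", "qa": "v1.4.2"}
ok = can_promote("v1.4.2", state, "prod")    # True: tag verified in dev and qa
bad = can_promote("v1.5.0", state, "qa")     # False: not yet running in dev
```

In a real GitOps setup this check lives in the pipeline or in ArgoCD sync policy, not in ad-hoc code; the point is that promotion is a verifiable rule, not a manual step.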
3. AI / GPU Infrastructure Architecture (Priority Competency)
- Design and operate GPU compute platforms at scale:
- GPU Operator deployment and lifecycle management
- MIG (Multi-Instance GPU) partitioning for multi-tenant workloads
- Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)
- Understand AI workload classes and their infrastructure implications:
- Distributed training workloads (data/model/pipeline parallelism)
- Inference pipelines — NVIDIA Triton Inference Server, TensorRT optimization
- Align infrastructure to the full AI stack:
- CUDA stack, cuDNN, NCCL collective communication libraries
- High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA
- GPUDirect RDMA / GPUDirect Storage for low-latency data paths
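As a rough illustration of why fabric bandwidth shapes distributed training: in a ring all-reduce each GPU moves 2(N-1)/N of the gradient payload over its link, which gives a simple bandwidth-only lower bound on step communication time (latency and compute overlap ignored; the numbers below are hypothetical):

```python
def ring_allreduce_seconds(num_gpus: int, payload_bytes: float, link_gbps: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU sends and receives 2*(N-1)/N of the payload; latency terms
    are ignored. link_gbps is per-GPU link bandwidth in gigabits per second.
    """
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return bytes_on_wire / (link_gbps * 1e9 / 8)

# 1 GB of gradients across 8 GPUs on a 400 Gb/s NDR-class link:
t = ring_allreduce_seconds(8, 1e9, 400)   # ~35 ms per all-reduce
```

The same estimate run at 100 Gb/s is 4x slower, which is the back-of-envelope argument for InfiniBand/RoCE fabrics and NCCL-aware topology design.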
4. Observability & Reliability Engineering
- Define and implement full-stack observability:
- Metrics: Prometheus, Thanos (long-term retention, multi-cluster)
- Logs: Loki, Fluent Bit
- GPU telemetry and profiling: DCGM Exporter, NVIDIA Nsight Systems
- Build operational frameworks:
- SLO / SLA definitions and error budget tracking
- Alerting strategy — noise reduction, severity routing
- Incident response playbooks and on-call runbooks
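Error budget tracking starts from simple arithmetic: the availability SLO fixes the downtime allowance over the measurement window. A minimal sketch:

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO over a window."""
    total_min = window_days * 24 * 60
    return total_min * (1 - slo_pct / 100)

budget = error_budget_minutes(99.9, 30)   # ~43.2 minutes over 30 days
```

Burn-rate alerting then compares actual downtime consumed against this allowance, which is what makes "noise reduction and severity routing" tractable: page only when the budget is burning faster than it can be replenished.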
5. Security & Multi-Tenancy Architecture
- Design zero-trust security postures for multi-tenant platforms
- Secret management: HashiCorp Vault, External Secrets Operator
- Identity and access: IAM, RBAC, SSO/OIDC integration
- Network isolation: NetworkPolicy, micro-segmentation, mTLS
- Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement
6. HPC, Data & Storage Architecture (Priority Competency)
- Understand high-performance storage for AI/HPC workloads:
- GPUDirect Storage — bypassing CPU for GPU-native I/O
- Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)
- Storage tiering, caching strategies, and data lifecycle management
- Size and validate storage architectures against workload I/O profiles
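Sizing storage against workload I/O profiles usually starts with back-of-envelope arithmetic like the following; the per-GPU ingest rate and headroom factor here are hypothetical assumptions, not vendor figures:

```python
def required_read_gbps(num_gpus: int, per_gpu_gbs: float, headroom: float = 1.3) -> float:
    """Aggregate sequential-read throughput (GB/s) a training cluster needs.

    per_gpu_gbs: sustained data ingest per GPU in GB/s (workload-dependent);
    headroom covers checkpoint restores and bursty dataloader patterns.
    """
    return num_gpus * per_gpu_gbs * headroom

# e.g. 64 GPUs each streaming 2 GB/s of training data:
need = required_read_gbps(64, 2.0)   # ~166 GB/s aggregate
```

Validating the number against the actual I/O profile (small random reads vs. large sequential, metadata rates, checkpoint write bursts) is what separates a sized architecture from a guessed one.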
7. Operational Leadership & Linux Systems
- Lead incident response and root cause analysis (RCA) for critical production issues
- Define upgrade strategies, change management procedures, and disaster recovery plans
- Write and maintain runbooks, operational playbooks, and knowledge base content
- Integrate organizational processes, compliance requirements, and security policies into operational frameworks
- Deep Linux expertise:
- Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)
- Storage I/O scheduling, NVMe optimization
- Network stack tuning for RDMA / InfiniBand
- System performance profiling and bottleneck analysis
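As a small example of the low-level inspection this work involves: hugepage reservations can be read straight out of /proc/meminfo. A sketch over sample text (the values shown are illustrative):

```python
SAMPLE_MEMINFO = """\
MemTotal:       528280252 kB
HugePages_Total:    2048
HugePages_Free:     2048
Hugepagesize:       2048 kB
"""

def hugepage_bytes(meminfo_text: str) -> int:
    """Total memory reserved as hugepages, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = rest.split()
    total_pages = int(fields["HugePages_Total"][0])
    page_kb = int(fields["Hugepagesize"][0])       # Hugepagesize is in kB
    return total_pages * page_kb * 1024

reserved = hugepage_bytes(SAMPLE_MEMINFO)          # 2048 pages * 2 MiB = 4 GiB
```

On a live node the same function would be fed `open("/proc/meminfo").read()`; whether 4 GiB of 2 MiB pages is right for a given NUMA layout is exactly the kind of tuning judgment the role calls for.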
Candidate Profile — Who You Are
- You are comfortable running production systems
- You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity
- You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment
- You communicate technical complexity clearly — to engineers and to C-level stakeholders
- You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations
- You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions
- You thrive in ambiguity and can scope both short POCs and long-horizon platform programs
Requirements
Required
- 10+ years in platform/infrastructure engineering, with at least 2 years in an architect-level role
- Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)
- Significant Linux systems administration experience — kernel, networking, storage at a low level
- HPC and/or GPU infrastructure experience — physical GPU servers, NCCL, InfiniBand, or high-speed fabrics
- Demonstrable presales or client-facing experience
- IaC experience: Terraform and/or Ansible in production environments
- Strong understanding of GitOps and CI/CD pipelines in enterprise settings
Strongly Preferred
- Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling
- Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective
- Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines
- Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments
- Hands-on experience with Vault, External Secrets, and zero-trust network architectures
- Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)
Certifications (Advantageous)
- CKA / CKS (Certified Kubernetes Administrator / Security Specialist)
- RHCE / RHCA (Red Hat Certified Engineer / Architect)
- AWS Solutions Architect / Azure Solutions Architect Expert
- HashiCorp Terraform Associate or Vault Associate
- NVIDIA DLI certifications (GPU computing, AI infrastructure)