Key Responsibilities
- Design and implement Azure DevOps pipelines for automating deployment of GPU-enabled workloads, containers, and ML models.
- Integrate security scanning tools (SAST, DAST, dependency analysis) within CI/CD workflows.
- Automate SUSE Linux system hardening, patching, and compliance enforcement for GPU nodes.
- Build and manage Infrastructure as Code (IaC) using Terraform and Bicep for hybrid GPU infrastructure provisioning.
- Implement zero-trust architecture and enforce RBAC across hybrid workloads.
- Configure and manage Azure Key Vault, Azure Policy, and Defender for Cloud to enforce secure configurations.
- Monitor GPU utilization and costs using Azure Monitor and NVIDIA DCGM integrations.
- Manage and secure Kubernetes GPU workloads on AKS/Arc clusters using the NVIDIA GPU Operator and device plugin DaemonSets.
- Drive continuous improvements in compliance, automation, and observability across cloud and on-prem environments.
Technical Expertise
- Azure DevOps: Repos, Pipelines, Boards, Artifacts
- Operating Systems: SUSE Linux Enterprise Server (SLES) administration using zypper and YaST
- GPU Management: NVIDIA GPU Operator, CUDA runtime, Kubernetes GPU workloads
- Azure Security Suite: Defender for Cloud, Azure Policy, Key Vault, Sentinel
- IaC & Automation: Terraform, Terragrunt, Bicep, Python scripting
- Monitoring & Logging: Azure Monitor, Grafana, Prometheus
Preferred Skills
- 8+ years of experience in DevOps/DevSecOps, with proven experience across hybrid cloud and on-prem infrastructure.
- Hands-on expertise managing GPU-enabled SUSE Linux systems.
- In-depth knowledge of NVIDIA H200 GPU operations, CUDA libraries, and GPU lifecycle management.
- Experience integrating security and compliance into AI/ML CI/CD pipelines.
- Familiarity with CIS Benchmarks, ISO 27001, and NIST frameworks for hybrid environments.
Technical Screening Rubric (Hands-On Tasks)
Candidates may be assessed on:
- Creating an Azure DevOps pipeline for GPU-enabled container deployment on AKS/Arc clusters.
- Automating SUSE Linux hardening and compliance reporting within CI/CD.
- Deploying and validating NVIDIA GPU Operator in Kubernetes clusters.
- Developing Terraform IaC for provisioning hybrid GPU infrastructure.
- Integrating Defender for Cloud and Key Vault for security compliance validation.