DevOps Engineer – AI Infrastructure

We are hiring an experienced DevOps Engineer with strong expertise in AI infrastructure, GPU environments, Kubernetes, and enterprise-grade platform engineering. This role is ideal for professionals who have deployed and managed production-grade LLM environments in secure on-premise infrastructure.

Key Responsibilities

Lead deployment and management of Linux-based AI infrastructure environments.
Configure and maintain KVM virtualization platforms and GPU-enabled systems.
Deploy and manage containerized LLM serving environments in production.
Design scalable and secure Kubernetes-based infrastructure for AI workloads.
Implement CI/CD pipelines, automation frameworks, and infrastructure-as-code practices.
Apply security hardening, RBAC, encryption, secrets management, and audit-ready controls.
Monitor GPU utilization, infrastructure health, and system performance.
Support high availability, disaster recovery, backup, and failover strategies.
Troubleshoot infrastructure, GPU runtime, networking, and platform stability issues.
Prepare technical documentation, architecture diagrams, and operational runbooks.

Required Skills and Experience

5+ years of experience in DevOps, Platform Engineering, or Infrastructure Engineering.
Strong Linux administration experience including networking, storage, security hardening, and performance tuning.
Hands-on experience with NVIDIA GPU infrastructure such as H100, A100, H200, or equivalent.
Strong experience with CUDA drivers, GPU runtimes, GPU scheduling, and AI inference optimization.
Experience deploying production-grade LLM serving platforms using technologies like vLLM, TensorRT-LLM, or Triton Inference Server.
Strong hands-on experience with Docker, Kubernetes, and containerized AI workloads.
Experience with Infrastructure as Code and automation tools such as Ansible.
Strong understanding of DevSecOps practices, secure CI/CD pipelines, SAST integration, and secrets management.
Experience working with PostgreSQL, vector databases such as Qdrant, and observability tools.
Experience working in on-premise, air-gapped, or regulated enterprise environments.

Preferred

Experience designing scalable AI infrastructure environments.
Knowledge of HA/DR strategies and enterprise backup solutions.
Strong troubleshooting and documentation skills.