Job Summary:
We’re looking for an experienced Senior DevOps Engineer who loves working with Kubernetes and AI-driven applications. In this role, you’ll be responsible for designing, implementing, and maintaining scalable cloud infrastructure while supporting MLOps pipelines for AI workloads.
What You’ll Be Doing:
- Building Scalable Infrastructure: You’ll design, implement, and maintain cloud infrastructure using Kubernetes to handle AI and non-AI workloads efficiently.
- Developing CI/CD & MLOps Pipelines: Help us automate AI/ML workflows using tools like Kubeflow, MLflow, or Argo Workflows, ensuring seamless deployment and monitoring of AI models.
- Optimizing AI Model Deployments: Work with ML engineers to fine-tune LLMs, AI-driven applications, and containerized environments for smooth operation.
- Monitoring & Performance Tuning: Keep an eye on Kubernetes clusters and AI workloads, using tools like Prometheus, Grafana, and Loki to ensure high availability and performance.
- Automating Everything: Whether it’s infrastructure provisioning (Terraform, Helm) or Kubernetes security best practices, you’ll help drive efficiency and compliance.
- Staying Ahead of the Curve: You’ll have the opportunity to explore and implement emerging AI infrastructure trends, including KServe, Ray, and Triton Inference Server.
What We’re Looking For:
- 8+ years of experience in a DevOps, SRE, or Platform Engineering role, with expertise in Kubernetes and cloud-native DevOps.
- Strong knowledge of Kubernetes fundamentals (deployments, services, ingress, storage, GPU scheduling, multi-cluster management).
- Proficiency in scripting & automation with Python, Bash, or Go, particularly for AI-related workflows.
- Hands-on experience with AWS, Azure, or GCP, especially in Kubernetes-based AI/ML infrastructure (e.g., Amazon SageMaker, GKE with AI, Azure ML).
- Hands-on experience with model deployment frameworks (NVIDIA Triton, vLLM, TGI, etc.).
- Experience with distributed computing and multi-GPU training on Kubernetes and on-prem GPU clusters.
- Experience managing resource allocation and autoscaling for large training/inference workloads (e.g., KEDA, HPA).
- Experience with CI/CD & MLOps tools such as Jenkins, Argo CD, Kubeflow, MLflow, or Tekton.
- Familiarity with GenAI model deployment, including fine-tuning, inference optimization, and A/B testing.
- Hands-on experience with managed ML services (AWS Bedrock, Vertex AI, etc.).
- Strong problem-solving skills and a mindset of automating repetitive tasks.
- Excellent communication skills to collaborate with ML engineers, data scientists, and software teams.
Bonus Points If You Have:
- Experience with LLMOps (Large Language Model Operations) and deploying LLM-based applications at scale.
- Knowledge of Vector Databases (FAISS, Weaviate, Qdrant) for AI-driven applications.