Company Description
We're Nagarro
We are a Digital Product Engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale, across all devices and digital mediums, and our people are everywhere in the world (17,500+ experts across 39 countries). Our work culture is dynamic and non-hierarchical. We are looking for great new colleagues. That is where you come in!
Job Description
Requirements:
- Experience: 5+ years
- Strong experience in DevOps or Site Reliability Engineering (SRE) roles.
- Strong knowledge of Docker, Kubernetes, Terraform, and CI/CD pipelines.
- Hands-on experience with AWS, Azure, or other cloud platforms.
- Familiarity with GPU infrastructure and ML workloads is a plus.
- Good understanding of monitoring and logging systems (Prometheus, Grafana).
- Ability to collaborate with ML teams for optimized inference and deployment.
- Strong troubleshooting and problem-solving skills in high-scale environments.
- Knowledge of infrastructure security best practices, cost optimization, and performance tuning.
- Exposure to vector databases and AI/ML deployment pipelines is highly desirable.
Responsibilities:
- Maintain and manage Kubernetes clusters, AWS/Azure environments, and GPU infrastructure for high-performance workloads.
- Design and implement CI/CD pipelines for seamless deployments and faster release cycles.
- Set up and maintain monitoring and logging systems using Prometheus and Grafana to ensure system health and reliability.
- Support vector database scaling and model deployment for AI/ML workloads.
- Collaborate with ML engineering teams to optimize inference performance and resource utilization.
- Ensure high availability, security, and scalability of infrastructure across multiple environments.
- Automate infrastructure provisioning and configuration using Terraform and other IaC tools.
- Troubleshoot production issues and implement proactive measures to prevent downtime.
- Continuously improve deployment processes and infrastructure reliability through automation and best practices.
- Participate in architecture reviews, capacity planning, and disaster recovery strategies.
- Drive cost optimization initiatives for cloud resources and GPU utilization.
- Stay updated with emerging technologies in cloud-native, AI infrastructure, and DevOps automation.
Qualifications
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.