Location: Delhi/NCR (Work From Home / Remote / Hybrid)
Experience: 5+ Years
Employment Type: Full-time
We are looking for an experienced MLOps Engineer who can design, build, and manage scalable cloud infrastructure to support our machine learning workflows. The ideal candidate will have a strong DevOps background, hands-on expertise in AWS services, and a solid understanding of ML operationalization. This role requires strong ownership, problem-solving skills, and the ability to drive end-to-end infrastructure automation.
Design, implement, and manage AWS-based cloud infrastructure for ML and data workloads.
Build and maintain CI/CD pipelines for ML model training, testing, and deployment.
Work extensively with Docker, Kubernetes, and container orchestration platforms.
Manage and optimize container runtime environments; exposure to ROSA (Red Hat OpenShift Service on AWS) is an added advantage.
Support large-scale ML operations, including GPU-enabled workloads.
Develop automation and tooling using Python, Go, Bash, or Ruby.
Implement and manage Infrastructure as Code (IaC) using Terraform, Terragrunt, and Chef.
Set up and monitor systems using Prometheus, Grafana, and the ELK stack.
Assist in scaling infrastructure for distributed systems such as Elasticsearch, Kafka, HBase, and Spark.
Collaborate closely with data scientists, engineers, and product teams to streamline deployment workflows.
Troubleshoot complex issues across the infrastructure, environment, and ML stack.
B.Tech (or equivalent) in Computer Science or a related field.
5+ years of hands-on experience in DevOps or MLOps.
Strong expertise in AWS cloud services.
Proficiency with Docker, Kubernetes, and the container ecosystem.
Good understanding of ML techniques, Linux systems, GPU workloads, and large-scale deployments.
Strong scripting/programming skills in Python, Go, Bash, or Ruby.
Experience with Terraform, Terragrunt, and Chef for infrastructure automation.
Familiarity with monitoring tools such as Prometheus, Grafana, and the ELK stack.
Knowledge of scaling data infrastructure such as Elasticsearch, Kafka, HBase, and Spark.
Strong analytical, problem-solving, communication, and collaboration skills.
Remote-first work culture.
Opportunity to work on cutting-edge ML infrastructure.
Collaborative and growth-oriented team environment.
Freedom to innovate and implement your ideas.