We are looking for a passionate MLOps & Infrastructure Engineer to take the innovative models developed by our growing data science team and move them reliably, scalably, and efficiently into live production environments. If you enjoy automating complex systems, if Kubernetes is your playground, and if the operational challenges of the machine learning world excite you, let's meet!
Position Summary
In this role, working closely with Data Scientists and Software Engineers, you will design and build the infrastructure that manages the entire lifecycle of ML models, from the research environment to production. You will not only deploy models but also act as the architect of the systems that monitor their performance, trigger retraining processes, and optimize the cost and performance of the underlying infrastructure.
Responsibilities
- Infrastructure Management: Designing, setting up, securing, and maintaining cloud-based (AWS, Azure, or GCP) and/or on-premise infrastructure for machine learning workloads (especially those requiring GPUs).
- ML Pipeline Automation: Building robust end-to-end CI/CD/CT (Continuous Training) pipelines that automate data preparation, model training, testing, and deployment.
- Container Orchestration: Creating Docker containers and managing them in Kubernetes (K8s) environments according to high-availability and scalability principles.
- Model Serving: Optimizing the APIs and service architectures required for serving models in real-time or batch mode (e.g., using TensorFlow Serving, TorchServe, FastAPI).
- Monitoring & Observability: Establishing monitoring and alerting systems that track both system metrics (CPU, RAM, latency) and ML-specific metrics (model drift, data drift, accuracy drops), using tools such as Prometheus, Grafana, or the ELK Stack.
- Infrastructure as Code (IaC): Managing all infrastructure using tools like Terraform, Ansible, or CloudFormation under the "Infrastructure as Code" principle.
- Collaboration & Standardization: Standardizing development environments for the data science team, defining MLOps best practices, and guiding the team in these areas.
Qualifications
- Bachelor's degree in Computer Engineering, Software Engineering, or a related technical field.
- 3+ years of experience in DevOps, Systems Engineering, or MLOps.
- Advanced proficiency in Python programming (Bash/Shell scripting knowledge is a plus).
- Deep knowledge and experience in Linux/Unix system administration.
- Production experience with at least one major cloud provider (AWS, GCP, or Azure).
- Solid hands-on experience with Docker and Kubernetes (ability to write Helm charts and manage K8s objects).
- Mastery of CI/CD tools (Jenkins, GitLab CI, GitHub Actions, etc.) and processes.
- Basic understanding of the machine learning lifecycle (data collection, training, validation, deployment). You don't need to develop models, but you are expected to understand a data scientist's needs.
- Analytical thinking, problem-solving ability, and strong communication skills.
Preferred Qualifications
- Experience with MLOps platforms and tools (e.g., Kubeflow, MLflow, Airflow, Seldon Core).
- Familiarity with Feature Store concepts (e.g., Feast).
- Knowledge of big data technologies (Spark, Kafka, etc.).
- Experience managing GPU-based compute infrastructure (NVIDIA drivers, CUDA, etc.).
What We Offer
- Competitive salary and benefits package.
- Flexible working hours and [Remote/Hybrid] working options.
- Opportunity to work with state-of-the-art tools and large-scale data.
- Continuous learning and development budget (training, certifications, conference attendance).