Job Responsibilities
The Manager will:
- Lead and manage a team of engineers who deploy and monitor machine learning models in production.
- Work with data engineers to design data pipelines and build robust ETL processes that ensure reliable, high-quality data for analytics and ML workloads.
- Collaborate with cross-functional teams, including data science, engineering, and operations, to understand business requirements and translate them into scalable ML solutions.
- Architect and implement end-to-end machine learning pipelines for model training, testing, deployment, and monitoring.
- Establish best practices and standards for model versioning, deployment, and monitoring to ensure reliability, scalability, and performance.
- Implement automated processes for model training, hyperparameter tuning, and model evaluation using tools such as Weights & Biases, MLflow, Kubeflow, or similar.
- Design and implement infrastructure for scalable and efficient model serving and inference, leveraging technologies such as Kubernetes, Docker, and serverless computing.
- Develop and maintain monitoring and alerting systems to detect model drift, performance degradation, and other issues in production.
- Provide technical leadership and mentorship to team members, fostering their professional growth and development.
- Stay current with emerging technologies and industry trends in machine learning engineering, and evaluate their potential impact on our processes and infrastructure.
- Collaborate with product management to define requirements and priorities for machine learning model deployments and validation, ensuring alignment with business goals and objectives.
- Implement monitoring and logging solutions to track model performance metrics, resource utilization, and system health, enabling proactive issue detection and resolution.
- Lead efforts to optimize resource utilization and cost-effectiveness of machine learning infrastructure, including compute resources, storage, and data transfer.
- Assess how advances in machine learning technologies apply to our AI Operations strategy and roadmap.
- Foster a culture of innovation, collaboration, and continuous improvement within the AI Operations team, encouraging experimentation and learning from failures.