Overview
We are seeking a highly skilled and proactive AI Solutions SRE Lead to oversee the maintenance, optimization, and ongoing performance of deployed AI/ML systems and solutions. In this role, you will act as the bridge between innovation and operations, ensuring our AI solutions consistently deliver value and operate reliably in real-world environments. You will lead efforts to monitor deployments, troubleshoot issues, and define best practices for sustaining AI systems throughout their lifecycle.
Responsibilities
Monitoring & Sustenance:
- Lead the post-deployment lifecycle of AI solutions, ensuring continued functionality, reliability, and scalability.
- Establish monitoring frameworks to oversee system performance, usage, and metrics for AI/ML models and APIs.
- Detect anomalies in AI systems, troubleshoot operational issues, and initiate timely corrective actions.
Performance Optimization:
- Continuously assess and optimize the performance of AI models to maintain efficiency and accuracy in production environments.
- Collaborate with data scientists and engineers to refine algorithms, retrain models, and update solutions as needed.
- Implement automation where possible to streamline maintenance processes.
Stakeholder Collaboration:
- Work with cross-functional teams (engineering, product, operations, etc.) to ensure alignment of AI sustainment activities with business goals.
- Communicate effectively with stakeholders to provide updates on system health, risks, and improvements.
Governance & Best Practices:
- Define and implement best practices for sustaining AI solutions, including documentation, testing protocols, and version control.
- Ensure compliance with ethical AI standards, regulatory guidelines, and established governance frameworks.
- Manage and mitigate risks associated with model drift, data shifts, and system vulnerabilities.
Incident Management:
- Lead responses to critical incidents involving AI systems by performing root cause analysis and deploying solutions for quick resolution.
- Advocate for proactive risk prevention and early detection strategies.
- Mentor and develop junior team members, fostering their skills in AI observability and domain-specific knowledge in ML, Computer Vision, and Generative AI.
Qualifications
Required:
- Bachelor's degree in Computer Science, Engineering, Data Science, or a related field; advanced degree preferred.
- 9+ years of experience in machine learning, data science, or software engineering roles, with significant exposure to Computer Vision and Generative AI projects.
- 4+ years of experience specifically focused on developing and sustaining AI/ML applications and solutions.
- Strong programming skills in languages such as Python, Java, or Go.
- Extensive experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, scikit-learn) and cloud platforms (e.g., AWS, Azure, GCP).
- Proficiency in data visualization tools and techniques (e.g., Grafana, Tableau, D3.js).
- Deep understanding of AI/ML concepts, including model training, evaluation, and deployment, with specific knowledge of Computer Vision and Generative AI techniques.
- Experience with monitoring and observability tools such as Prometheus, ELK stack, or similar systems.
- Excellent problem-solving skills and ability to troubleshoot complex AI systems across various domains.
- Proven track record of mentoring and developing junior team members in AI-related roles.
Preferred:
- Experience with MLOps practices and tools, particularly for large-scale AI systems.
- Familiarity with AI ethics and responsible AI principles, especially as they relate to Generative AI.
- Knowledge of relevant AI regulations and compliance requirements, including those specific to Computer Vision applications.
- Experience with distributed systems and large-scale data processing for AI applications.
- Contributions to open-source projects or research publications on AI solutions at production scale.
- Previous experience with large-scale AI/ML solutions in production environments.
- Knowledge of DevOps principles and CI/CD pipelines specific to AI/ML systems.
Key Competencies
- Strong analytical and critical thinking skills
- Excellent communication and collaboration abilities
- Proactive and self-motivated work ethic
- Ability to explain complex technical concepts to both technical and non-technical audiences
- Adaptability and willingness to learn in a rapidly evolving field
- Strong mentorship and leadership skills
- Deep curiosity and passion for AI, particularly in ML, Computer Vision, and Generative AI domains
We are looking for a passionate and innovative individual who can help us build robust, transparent, and reliable AI systems while nurturing the growth of our team. If you have a strong background in AI/ML, with specific expertise in Computer Vision and Generative AI, and a keen interest in observability and system reliability, we encourage you to apply.