Role: Lead/ Senior ML Data Engineer
Experience: 8+ years
Work Mode: Remote
We are seeking a highly autonomous and experienced Lead/ Senior ML Data Engineer to drive the critical data foundation for our AI analytics and Generative AI platforms. This is a specialized hybrid position, focusing on designing, building, and optimizing scalable data pipelines (ETL/ELT) that transform complex, messy clinical and healthcare data into high-quality, production-ready feature stores for Machine Learning and NLP models.
The successful candidate will own technical work streams end-to-end, ensuring data quality, governance, and low-latency delivery in a cloud-native environment.
Key Responsibilities & Focus Areas:
- ML Data Pipeline Ownership (70-80% Focus): Design and implement high-performance, scalable ETL/ELT pipelines using PySpark and a Lakehouse architecture (such as Databricks) to ingest, clean, and transform large-scale healthcare datasets.
- AI Data Preparation: Specialize in Feature Engineering and data preparation for complex ML workloads, including transforming unstructured clinical data (e.g., medical notes) for Generative AI and NLP model training.
- Cloud Architecture & Orchestration: Deploy, manage, and optimize data workflows using Airflow in a production AWS environment.
- Data Governance & Compliance: Mandatorily implement pipelines with robust data masking, pseudonymization, and security controls to ensure continuous adherence to HIPAA and other relevant health data privacy regulations.
- Technical Leadership: Lead and define technical requirements from ambiguous business problems, acting as a key contributor to the data architecture strategy for the core AI platform.
Non-Negotiable Requirements (The "Must-Haves"):
- Releevant 5+ years of progressive experience as a Data Engineer, with a clear focus on ML/AI support.
- Deep expertise in PySpark/Python for distributed data processing.
- Mandatory proficiency with Lakehouse platforms (e.g., Databricks) in an AWS production environment.
- Proven experience handling complex clinical/healthcare data (EHR, Claims), including unstructured text.
- Hands-on experience with HIPAA/GDPR compliance in data pipeline design.
Job Type: Full-time
Pay: ₹2,000,000.00 - ₹3,000,000.00 per year
Work Location: Remote