Job Summary
The Senior AI Data Engineer is responsible for designing, building, and optimizing enterprise-scale data and AI infrastructure to support machine learning models, generative AI applications, and real-time analytics. The role drives the development of end-to-end data pipelines, from ingestion to production-ready AI data products, ensuring scalability, performance, and compliance across multi-cloud environments.
Accountability & Responsibilities
- Design, build, and maintain scalable ETL/ELT data pipelines using modern data engineering tools (e.g., Apache Spark, dbt).
- Architect and implement Lakehouse data platforms (Delta Lake, Apache Iceberg, Apache Hudi) following the Medallion architecture (Bronze/Silver/Gold).
- Develop real-time streaming pipelines using Apache Kafka, Apache Flink, and Spark Structured Streaming.
- Build and optimize AI/GenAI data pipelines for LLM training, fine-tuning, and inference (tokenization, dataset curation, prompt engineering datasets).
- Design and implement Retrieval-Augmented Generation (RAG) pipelines, including embedding workflows and vector database integration.
- Manage feature stores for real-time and batch machine learning use cases.
- Integrate data pipelines with AI/ML platforms (Databricks MLflow, Azure ML, AWS SageMaker, Vertex AI, OpenAI/Azure OpenAI).
- Implement data orchestration workflows using Apache Airflow or similar tools with CI/CD pipelines.
- Ensure data quality, governance, and security using frameworks such as Great Expectations and data catalog tools.
- Deploy and manage infrastructure using Infrastructure-as-Code tools (Terraform, Bicep, CDK).
- Collaborate with Data Scientists, ML Engineers, and Solution Architects to deliver production-ready AI solutions.
- Lead technical design decisions, mentor junior engineers, and contribute to data platform strategy.
- Maintain documentation, data contracts, and operational runbooks for all pipelines.
Requirements
1 – Required Experience
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
- 4–5 years of experience in data engineering, with strong exposure to AI/ML data infrastructure.
- Proven experience building scalable data pipelines and working with large-scale datasets.
- Hands-on experience with AI/ML platforms and modern data architectures.
- Experience in regulated industries (e.g., Banking, Telecom, Healthcare) is a plus.
- Strong problem-solving, analytical thinking, and communication skills.
- Experience working in cross-functional teams and agile environments.
2 – Technical Skills
- Strong SQL and advanced data modeling techniques
- Apache Spark (PySpark, Spark SQL, Streaming)
- Python (pandas, PySpark, data processing libraries)
- Data pipeline orchestration (Apache Airflow)
- CI/CD for data pipelines (GitHub Actions / Azure DevOps)
- Lakehouse architectures (Delta Lake / Iceberg / Hudi)
- Streaming technologies (Kafka, Flink)
- Cloud platforms (AWS / Azure / GCP)
- Vector databases (Pinecone, Weaviate, pgvector, OpenSearch)
- RAG pipeline design and LLM data processing
- Infrastructure-as-Code (Terraform / Bicep / CDK)
- Containers (Docker, Kubernetes)
- Data quality & governance tools (e.g., Great Expectations, data catalogs)