Responsible for designing, building, and maintaining data pipelines and infrastructure to support data-driven decisions and analytics. The individual is responsible for the following tasks:
- Design, develop, and maintain data pipelines and extract, transform, load (ETL) processes to collect, process, and store structured and unstructured data
- Build data architecture and storage solutions, including data lakehouses, data lakes, data warehouses, and data marts, to support analytics and reporting
- Develop data reliability, efficiency, and quality checks and processes
- Prepare data for data modeling
- Monitor and optimize data architecture and data processing systems
- Collaborate with multiple teams to understand requirements and objectives
- Administer testing and troubleshooting related to performance, reliability, and scalability
- Create and update documentation
Hands-On Data Pipeline Development
- Design, code, and deploy ETL/ELT pipelines across bronze, silver, and gold layers of the Data Lakehouse.
- Build ingestion pipelines for structured (SQL), semi-structured (JSON, XML), and unstructured data in Python/PySpark on AWS Glue or EMR.
- Implement incremental loads, deduplication, error handling, and data validation.
- Actively troubleshoot, debug, and optimize pipelines for scalability and cost efficiency.
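
As an illustration of the incremental-load, deduplication, error-handling, and validation work listed above, the following is a minimal sketch in plain Python. It is not a prescribed implementation: the field names (`id`, `updated_at`, `amount`), the function names, and the watermark approach are all assumptions chosen for the example. In a Glue or EMR job the same steps would typically be expressed as PySpark DataFrame operations.

```python
from datetime import datetime

# Hypothetical required fields; a real pipeline would take these from its schema.
REQUIRED_FIELDS = ("id", "updated_at", "amount")


def validate(records, required=REQUIRED_FIELDS):
    """Split records into valid rows and rejected rows with a reason."""
    valid, rejected = [], []
    for record in records:
        missing = [f for f in required if record.get(f) is None]
        if missing:
            rejected.append({"record": record, "reason": "missing: " + ", ".join(missing)})
        else:
            valid.append(record)
    return valid, rejected


def incremental_batch(records, watermark):
    """Keep only records updated after the last successful load."""
    return [r for r in records if r["updated_at"] > watermark]


def deduplicate(records, key="id"):
    """Keep the most recently updated record for each key."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record["updated_at"] > current["updated_at"]:
            latest[record[key]] = record
    return list(latest.values())


def run_load(records, watermark):
    """Validate, filter to new rows, deduplicate, and advance the watermark."""
    valid, rejected = validate(records)
    rows = deduplicate(incremental_batch(valid, watermark))
    new_watermark = max((r["updated_at"] for r in rows), default=watermark)
    return rows, rejected, new_watermark


# Made-up sample input, for illustration only.
sample = [
    {"id": 1, "updated_at": datetime(2024, 1, 1), "amount": 10},
    {"id": 2, "updated_at": datetime(2024, 1, 3), "amount": 20},
    {"id": 2, "updated_at": datetime(2024, 1, 4), "amount": 25},
    {"id": 3, "updated_at": datetime(2024, 1, 5), "amount": None},
]
rows, rejected, new_watermark = run_load(sample, datetime(2024, 1, 2))
```

Rejected rows are returned rather than dropped so that they can be routed to an error table or quarantine location for review.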
EDW & Data Lake Implementation
- Develop dimensional data models (Star Schema, Snowflake Schema) for analytics and reporting.
- Build and maintain tables in Apache Iceberg, Delta Lake, or an equivalent open table format (OTF).
- Optimize partitioning, indexing, and metadata for fast query performance.
Healthcare Data Integration
- Build ingestion and transformation pipelines for EDI X12 transactions (837, 835, 278, etc.).
- Implement mapping and transformation between EDI data and the FHIR and HL7 standards.
- Work hands-on with AWS HealthLake (or equivalent) to store and query healthcare data.
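
To give a sense of the first step in EDI X12 ingestion, the following is a rough sketch of splitting an X12 payload into segments and elements. It assumes the common default delimiters (`~` to end a segment, `*` between elements); a production parser would read the actual delimiters from the ISA header instead. The function names and the sample payload are made up for illustration.

```python
def parse_segments(raw, segment_terminator="~", element_separator="*"):
    """Split a raw X12 string into a list of segments, each a list of elements.

    Assumes default delimiters; real interchanges declare theirs in the ISA segment.
    """
    segments = []
    for chunk in raw.split(segment_terminator):
        chunk = chunk.strip()
        if chunk:
            segments.append(chunk.split(element_separator))
    return segments


def transaction_set_ids(segments):
    """Return the transaction set identifier (ST01) of each ST segment, e.g. '837'."""
    return [seg[1] for seg in segments if seg[0] == "ST" and len(seg) > 1]


# Made-up, heavily simplified payload: not a complete or valid 837 claim.
sample = "ST*837*0001~CLM*ABC123*100~SE*3*0001~"
segments = parse_segments(sample)
set_ids = transaction_set_ids(segments)
```

The parsed segments can then be mapped to typed records per transaction set (837, 835, 278) before loading into the lakehouse.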
Data Quality, Security & Compliance
- Develop automated validation scripts to enforce data quality and integrity.
- Implement IAM roles, encryption, and auditing to meet HIPAA and CMS compliance standards.
- Maintain lineage and governance documentation for all pipelines.
Collaboration & Delivery
- Work closely with the Lead Data Engineer, analysts, and data scientists to deliver pipelines that support enterprise-wide analytics.
- Actively contribute to CI/CD pipelines, Infrastructure-as-Code (IaC), and automation.
- Continuously improve pipelines and adopt new technologies where appropriate.