The Project: Build an enterprise-grade, AI-driven data masking and synthetic data framework from scratch on the Databricks platform. You will solve complex data privacy challenges while preserving referential integrity across heterogeneous systems.
Key Responsibilities
- Architecture & Security: Design the core framework on the Databricks Lakehouse. Use Unity Catalog for metadata management, audit logging, and role-based access control (RBAC).
- AI-Driven Masking: Implement Named Entity Recognition (NER) and classification models that drive context-aware, adaptive masking.
- Synthetic Data Generation: Build an AI engine capable of generating high-volume synthetic datasets and handling complex edge cases while ensuring cross-environment consistency.
- Core Data Engineering: Develop Delta Lake pipelines, optimize PySpark transformations, and manage workflow orchestration, with MLflow for model tracking.
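To make the referential integrity requirement concrete, here is a minimal sketch of deterministic masking in plain Python. It assumes a keyed HMAC-SHA256 pseudonym truncated to 16 hex characters; the function name mask_value, the sample rows, and the demo key are all hypothetical, and this is keyed pseudonymization, not Format-Preserving Encryption. The point is only that the same input always maps to the same masked value, so joins between systems still line up after masking.

```python
import hashlib
import hmac


def mask_value(value: str, key: bytes, length: int = 16) -> str:
    """Return a deterministic keyed pseudonym for `value` (hypothetical helper)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical rows from two systems that share a customer identifier.
customers = [
    {"customer_id": "C-1001", "segment": "retail"},
    {"customer_id": "C-1002", "segment": "corporate"},
]
orders = [
    {"order_id": 1, "customer_id": "C-1001"},
    {"order_id": 2, "customer_id": "C-1002"},
    {"order_id": 3, "customer_id": "C-1001"},
]

# Assumption: a real key would come from a secure key management service,
# never from source code.
key = b"demo-key-not-for-production"

masked_customers = [
    {**row, "customer_id": mask_value(row["customer_id"], key)} for row in customers
]
masked_orders = [
    {**row, "customer_id": mask_value(row["customer_id"], key)} for row in orders
]

# Every masked order still points at a masked customer, so the join survives.
known_ids = {row["customer_id"] for row in masked_customers}
orphans = [row for row in masked_orders if row["customer_id"] not in known_ids]
print(f"orphaned orders after masking: {len(orphans)}")
```

Because the mapping depends only on the key and the input, reusing the same key across environments is one way to keep masked values consistent between them. The trade-off is that the key then becomes the sensitive asset to protect.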
Technical Focus
- Databricks Ecosystem: Lakehouse, Unity Catalog, Delta Lake, MLflow.
- Data Security & Privacy: Format-Preserving Encryption (FPE), Test Data Management (TDM), Secure Key Management.
- AI/ML: PySpark, NLP/NER modeling, Data Classification.
Phased Deliverables
- Phase 1 (Assessment): Future-state architecture, gap analysis, and implementation roadmap.
- Phase 2 (Build): Configured Databricks framework, AI masking engine, synthetic data module, and self-service APIs.
- Phases 3 & 4 (Deploy & Handover): System Integration Testing (SIT) support, automated deployment to vendor environments, and knowledge transfer to the internal team.
Profile Requirements
- Solid Databricks implementation expertise (Unity Catalog, Lakehouse, PySpark).
- Hands-on AI/ML engineering experience (NLP, NER, classification models).
- Strong background in Test Data Management (TDM) and enterprise encryption strategies.
- Preferred: Financial services experience or deep understanding of regulatory compliance.
Pay: Up to $120,000.00 per year
Work Location: Remote