Experienced Machine Learning / NLP Engineer with strong expertise in entity resolution, record linkage, and large-scale text matching systems. Skilled in designing end-to-end ML pipelines using sentence-transformers, vector similarity search, and cross-encoder reranking models. Hands-on experience with the Azure ecosystem (AI Foundry, Databricks, Key Vault, Blob Storage) and with scalable data engineering using Snowflake, PySpark, and Python. Adept at building production-grade NLP solutions for classification, deduplication, and information extraction across large datasets.
Responsibilities:
- Design and implement entity resolution, record linkage, and deduplication systems at scale.
- Build and optimize NLP pipelines using sentence-transformer models (e.g., all-mpnet-base-v2, all-MiniLM-L6-v2).
- Implement vector similarity search solutions using FAISS, Annoy, or ScaNN.
- Apply string similarity techniques such as Levenshtein, Jaro-Winkler, TF-IDF, and phonetic algorithms (Soundex, Double Metaphone).
- Develop bi-encoder and cross-encoder architectures for retrieval and reranking systems.
- Perform prompt engineering for LLM-based classification and validation tasks.
- Deploy and manage ML models using Azure AI Foundry and inference endpoints.
- Develop distributed data pipelines using Azure Databricks and PySpark.
- Manage secure data workflows using Azure Key Vault and Azure Active Directory.
- Design and optimize data storage and processing in Snowflake and Azure Data Lake.
- Handle large-scale batch processing (1M+ records) using parallelization techniques.
- Build APIs and ML services using Python and FastAPI.
- Document ML experiments, model performance, and system architecture.
Qualifications:
- 5+ years of experience in Data Science, Machine Learning, or NLP Engineering.
- 3+ years of experience in large-scale text matching, entity resolution, or knowledge graph systems.
- Strong proficiency in Python (pandas, numpy, scikit-learn, sentence-transformers, FastAPI).
- Experience with PySpark for distributed data processing and optimization.
- Hands-on experience with vector search frameworks (FAISS, Annoy, ScaNN).
- Solid understanding of string similarity techniques (Levenshtein, Jaro-Winkler, TF-IDF, phonetic algorithms).
- Experience with cross-encoder and bi-encoder model architectures.
Strong knowledge of Azure ecosystem:
-
Azure AI Foundry
-
Azure Databricks
-
Azure Blob Storage / ADLS Gen2
-
Azure Key Vault
-
Azure Active Directory (Entra ID)
- Experience with Snowflake data warehousing and SQL optimization.
- Understanding of LLM prompt engineering for structured tasks.
- Bachelor’s or Master’s degree in Computer Science, Data Science, Mathematics, or a related field.