Data Engineering Lead

Type: Contractual / Project-Based (leading to a permanent position upon project extension)

The Data Engineering Lead owns how the platform sources, ingests, and maintains roughly 200 distinct external data integrations, and keeps them flowing reliably at production scale. These span well-structured federal APIs, around 120 state regulatory portal scrapers (each with its own page structure, format, and coverage), commercial data APIs, and a website-scraping system operating across tens of thousands of operator sites. The role covers not just planning and documentation but also the initial build, the format-change maintenance model, and the operational load that continues indefinitely after launch. This is a hands-on lead role: you will set the integration architecture and also write production code.

Key Responsibilities

Source feasibility assessment: As the first step before any integration is built, evaluate each of the roughly 200 sources for technical viability, access method (API, bulk file, or scrape), data quality and coverage, refresh availability, legal and terms-of-service constraints, and rate limits. Produce a feasibility report that classifies each source as build-now, build-with-caveats, or defer/replace, and use it to sequence the build.

Integration architecture: Design the source-bucket framework that organizes all external integrations across origination categories: federal APIs, state-level regulatory scrapers, commercial APIs, and large-scale website scraping. Decide where a single shared framework applies and where per-source custom code is unavoidable.

Complexity tiering and effort estimation: Build and maintain a complexity-tiered model for every integration, ranging from trivial flat-file pulls and simple APIs, through multi-page scrapes, to JavaScript single-page applications and complex narrative parsing. Translate that distribution into realistic build-hour and maintenance budgets that leadership can plan against.

Scraping at scale: Build resilient scrapers with anti-bot handling, proxy rotation, schema-drift detection, and format-change monitoring. Roughly 120 of the integrations are state regulatory portals concentrated in the harder complexity tiers; you own the strategy that keeps them running as upstream sites change.

Vendor consolidation: Identify and prioritize opportunities to replace many per-source scrapers with consolidated commercial feeds (for example, multi-state licensing or credentialing aggregators). Quantify the trade-off in engineering hours and ongoing maintenance for each.

Maintenance and reliability model: Own the post-launch operational model, covering drift detection, triage (distinguishing breaks that block downstream scoring from display-only failures), remediation cadence, and the ongoing staffing load required to sustain it.

Pipeline orchestration: Implement the ingestion scheduling and queueing model so that one slow or failed source never blocks the others, with per-source isolation, retry-with-backoff, dead-letter handling, and priority lanes for time-sensitive data.

Data quality and provenance: Establish source-conflict resolution rules (which source wins when two disagree) and maintain clear data provenance, which matters both for downstream scoring accuracy and for legal defensibility.

Collaboration: Partner closely with the platform architect on the scraper framework and queueing model, and with the AI / ML Lead so that ingested data lands in a form the scoring models can consume.

Required Qualifications

● 3+ years of production data-engineering experience, including a lead or senior-IC role on data-intensive systems.

● Deep, hands-on web-scraping expertise: anti-bot evasion, proxy/IP rotation, headless-browser and JavaScript-heavy site handling, and resilient parsing of inconsistent HTML.

● Strong experience integrating heterogeneous third-party data: REST APIs, bulk file ingestion, and undocumented or semi-structured government data sources.

● Proficiency in Python (the data and ML pipeline standard) and SQL, with strong data-modeling fundamentals; familiarity with a PHP/Laravel application backend a plus.

● Experience with workflow orchestration and queue-based pipelines (for example, Airflow, Dagster, Prefect, or equivalent) and message/queue systems.

● Demonstrated ownership of long-lived pipelines: monitoring, alerting, schema-drift detection, and maintenance of scrapers or integrations that break as upstream sources change.

● Ability to estimate and communicate engineering effort credibly to non-technical stakeholders.

Preferred Qualifications

● Experience scraping or integrating government, regulatory, or healthcare data sources.

● Familiarity with entity-resolution and record-linkage problems across disparate sources.

● Exposure to cloud data infrastructure (AWS preferred) and infrastructure-as-code.

● Awareness of the legal and compliance considerations around public-data scraping.

What Success Looks Like

● A completed source feasibility report covering all integrations, classifying each as build-now, build-with-caveats, or defer/replace, used to sequence the build.

● A complete, complexity-tiered catalog of every integration with a defensible build-and-maintenance hour budget, delivered as the Data Engineering & Integration Assessment.

● A reliable ingestion platform where individual source failures are isolated, detected automatically, and triaged by downstream impact.

● A documented maintenance model with a realistic ongoing staffing estimate, so pipeline upkeep is a planned cost rather than a recurring surprise.

Bahria Town, Lahore
Mon-Fri, 4pm-1am

Work Location: In person
