Data Engineering Lead
Type: Contractual / Project-Based (leading to a permanent position upon project extension)
The Data Engineering Lead owns how the platform sources, ingests, and maintains roughly 200 distinct external data integrations and keeps them flowing reliably at production scale. This spans well-structured federal APIs, around 120 state regulatory portal scrapers (each with its own page structure, format, and coverage), commercial data APIs, and a website-scraping system operating across tens of thousands of operator sites. The role is responsible not just for planning and documentation but also for the initial build, the format-change maintenance model, and the operational load that continues indefinitely after launch. This is a hands-on lead role: you will set the integration architecture and also write production code.
Key Responsibilities
Source feasibility assessment: As the first step before any integration is built, evaluate each of the roughly 200 sources for technical viability, access method (API, bulk file, or scrape), data quality and coverage, refresh availability, legal and terms-of-service constraints, and rate limits. Produce a feasibility report that classifies each source as build-now, build-with-caveats, or defer/replace, and use it to sequence the build.
Integration architecture: Design the source-bucket framework that organizes all external integrations across origination categories: federal APIs, state-level regulatory scrapers, commercial APIs, and large-scale website scraping. Decide where a single shared framework applies and where per-source custom code is unavoidable.
Complexity tiering and effort estimation: Build and maintain a complexity-tiered model for every integration, ranging from trivial flat-file pulls and simple APIs through multi-page scrapes to JavaScript single-page applications and complex narrative parsing. Translate that distribution into realistic build-hour and maintenance budgets that leadership can plan against.
Scraping at scale: Build resilient scrapers with anti-bot handling, proxy rotation, schema-drift detection, and format-change monitoring. Roughly 120 of the integrations are state regulatory portals concentrated in the harder complexity tiers; you own the strategy that keeps them running as upstream sites change.
Vendor consolidation: Identify and prioritize opportunities to replace many per-source scrapers with consolidated commercial feeds (for example, multi-state licensing or credentialing aggregators). Quantify the trade-off in engineering hours and ongoing maintenance for each.
Maintenance and reliability model: Own the post-launch operational model, covering drift detection, triage (distinguishing breaks that block downstream scoring from display-only failures), remediation cadence, and the ongoing staffing load required to sustain it.
Pipeline orchestration: Implement the ingestion scheduling and queueing model so that one slow or failed source never blocks the others, with per-source isolation, retry-with-backoff, dead-letter handling, and priority lanes for time-sensitive data.
Data quality and provenance: Establish source-conflict resolution rules (which source wins when two disagree) and maintain clear data provenance, which matters both for downstream scoring accuracy and for legal defensibility.
Collaboration: Partner closely with the platform architect on the scraper framework and queueing model, and with the AI / ML Lead so that ingested data lands in a form the scoring models can consume.
Required Qualifications
● 3+ years of production data-engineering experience, including a lead or senior-IC role on data-intensive systems.
● Deep, hands-on web-scraping expertise: anti-bot evasion, proxy/IP rotation, headless-browser and JavaScript-heavy site handling, and resilient parsing of inconsistent HTML.
● Strong experience integrating heterogeneous third-party data: REST APIs, bulk file ingestion, and undocumented or semi-structured government data sources.
● Proficiency in Python (the data and ML pipeline standard) and SQL, with strong data-modeling fundamentals; familiarity with a PHP/Laravel application backend is a plus.
● Experience with workflow orchestration and queue-based pipelines (for example, Airflow, Dagster, Prefect, or equivalent) and message/queue systems.
● Demonstrated ownership of long-lived pipelines: monitoring, alerting, schema-drift detection, and maintenance of scrapers or integrations that break as upstream sources change.
● Ability to estimate and communicate engineering effort credibly to non-technical stakeholders.
Preferred Qualifications
● Experience scraping or integrating government, regulatory, or healthcare data sources.
● Familiarity with entity-resolution and record-linkage problems across disparate sources.
● Exposure to cloud data infrastructure (AWS preferred) and infrastructure-as-code.
● Awareness of the legal and compliance considerations around public-data scraping.
What Success Looks Like
● A completed source feasibility report covering all integrations, classifying each as build-now, build-with-caveats, or defer/replace, used to sequence the build.
● A complete, complexity-tiered catalog of every integration with a defensible build-and-maintenance hour budget, delivered as the Data Engineering & Integration Assessment.
● A reliable ingestion platform where individual source failures are isolated, detected automatically, and triaged by downstream impact.
● A documented maintenance model with a realistic ongoing staffing estimate, so pipeline upkeep is a planned cost rather than a recurring surprise.
Location: Bahria Town, Lahore
Hours: 4pm-1am, Mon-Fri
Work Location: In person
© 2026 Qureos. All rights reserved.