Qureos

Data Engineering Internship – Airflow, NLP & Web Scraping (Pre-Junior / Production-Ready Only)

Important — Read Before Applying

This is NOT a beginner role.
If you have not built real pipelines using Airflow, processed text with NLP, or scraped data at scale, do not apply.

You are expected to deliver production-level work from week 1.

Required Experience (Mandatory)

You will be automatically rejected if you do not have:

  • At least 1 real project combining scraping + processing + storage
  • Hands-on experience with:
      • Apache Airflow (DAGs, scheduling, retries, dependencies)
      • Python (advanced level)
      • Scrapy (spiders, pipelines, middleware)
      • spaCy (NER, text preprocessing, custom pipelines)
      • SQL (advanced queries + optimization)
  • Strong understanding of:
      • Data pipelines (ETL/ELT)
      • Handling messy / unstructured data
      • Writing scalable, fault-tolerant scraping systems

What You Will Do

  • Build automated data pipelines using Airflow (DAG-based workflows)
  • Develop and maintain web scraping systems using Scrapy
  • Process and enrich text data using spaCy (NER, classification, cleaning)
  • Store and structure scraped data into usable formats (DB / warehouse)
  • Handle anti-bot challenges, rate limits, and scraping failures
  • Ensure pipelines are resilient, scheduled, and production-ready
  • Debug broken DAGs, failed jobs, and corrupted datasets
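The spaCy part of the work above (NER, custom pipelines, cleaning) can be sketched without a trained model by attaching a rule-based EntityRuler to a blank pipeline; the patterns and labels below are purely illustrative:

```python
# Minimal spaCy sketch: a blank English pipeline with a rule-based
# EntityRuler standing in for a trained NER model. Patterns are
# illustrative only.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")          # custom pipeline component
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apache Airflow"},
    {"label": "ORG", "pattern": "Scrapy"},
])

doc = nlp("We schedule Scrapy jobs with Apache Airflow.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Scrapy', 'ORG'), ('Apache Airflow', 'ORG')]
```

In a real pipeline the ruler would typically complement (or bootstrap training data for) a statistical NER component rather than replace it.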

Strict Performance Rules

  • Zero hand-holding. You are expected to debug pipelines independently
  • Deadlines are absolute. Miss once → warning. Repeat → termination
  • Daily reporting required (actual outputs, logs, issues)
  • Pipelines must be:
      • Repeatable
      • Fault-tolerant
      • Scalable
  • “It works on my machine” is unacceptable
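As one concrete illustration of what "fault-tolerant" means in practice, a flaky pipeline step can be wrapped in retries with exponential backoff before the run is allowed to fail. The helper and function names here are hypothetical, not from the posting:

```python
# Retry a flaky step with exponential backoff; re-raise only after the
# final attempt so the scheduler sees a real failure.
import time


def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying on any exception with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)


calls = {"n": 0}


def flaky():
    # Simulates a step that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = with_retries(flaky)
print(result)  # ok
```

Airflow provides this at the task level (the `retries` default arg), but step-level retries like this are still useful inside a task, e.g. around individual HTTP requests.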

Disqualification Triggers

Immediate removal if:

  • Your Airflow DAGs fail repeatedly and you can’t fix them
  • Your scrapers get blocked and you don’t adapt
  • You cannot explain your NLP pipeline (spaCy usage)
  • You build fragile pipelines that break on real data
  • You disappear or stop reporting

Selection Process (Aggressive Filtering)

  1. GitHub Review (must include scraping + pipeline work)
  2. Technical Task: build a scraper, process the text, and schedule it via Airflow
  3. Live Debugging Session (pipeline failure scenario)
  4. Final acceptance

Most applicants will fail before step 2.

What You Get (If You Pass)

  • Real-world experience in:
      • Data pipelines
      • NLP processing
      • Web scraping at scale
  • Strong portfolio with production-grade systems
  • Fast-track to a paid data engineering / NLP role
  • Experience working under real constraints and pressure

Application Requirements

Submit:

  • GitHub with:
      • Scrapy project
      • Airflow DAGs
      • NLP / spaCy usage
  • Description of one pipeline you built (architecture + challenges)
  • Answer:

“How would you design a system to scrape, clean, and extract entities from thousands of web pages daily?”

Final Note

If you haven’t already worked with Airflow, scraping, and NLP in real scenarios, this role is not for you.

Pay: E£1,000.00 - E£8,000.00 per month

Application Question(s):

  • Are you familiar with Airflow?
  • Are you familiar with spaCy?

Language:

  • English (Required)

Work Location: Remote


© 2026 Qureos. All rights reserved.