Important — Read Before Applying
This is NOT a beginner role.
If you have not built real pipelines using Airflow, processed text with NLP, or scraped data at scale, do not apply.
You are expected to deliver production-level work from week 1.
Required Experience (Mandatory)
You will be automatically rejected if you do not have:
- At least 1 real project combining scraping + processing + storage
- Hands-on experience with:
  - Apache Airflow (DAGs, scheduling, retries, dependencies)
  - Python (advanced level)
  - Scrapy (spiders, pipelines, middleware)
  - spaCy (NER, text preprocessing, custom pipelines)
  - SQL (advanced queries + optimization)
- Strong understanding of:
  - Data pipelines (ETL/ELT)
  - Handling messy / unstructured data
  - Writing scalable, fault-tolerant scraping systems
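To make the SQL bar concrete: "advanced queries + optimization" means, at minimum, indexed filtering and grouped aggregation over scraped records. A minimal sketch using Python's stdlib sqlite3 (the `pages` schema and domains are illustrative, not part of this posting):

```python
import sqlite3

# In-memory database standing in for a scraped-data store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url TEXT PRIMARY KEY,
        domain TEXT NOT NULL,
        word_count INTEGER NOT NULL
    )
""")
# An index on the grouping column keeps aggregate queries fast at scale.
conn.execute("CREATE INDEX idx_pages_domain ON pages (domain)")

rows = [
    ("https://a.example/1", "a.example", 120),
    ("https://a.example/2", "a.example", 300),
    ("https://b.example/1", "b.example", 80),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)

# Aggregate per domain, filtering groups with HAVING.
result = conn.execute("""
    SELECT domain, COUNT(*) AS n, AVG(word_count) AS avg_words
    FROM pages
    GROUP BY domain
    HAVING COUNT(*) > 1
""").fetchall()
print(result)  # [('a.example', 2, 210.0)]
```

Candidates unable to explain why the index and the HAVING clause are there will not clear the screening.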
What You Will Do
- Build automated data pipelines using Airflow (DAG-based workflows)
- Develop and maintain web scraping systems using Scrapy
- Process and enrich text data using spaCy (NER, classification, cleaning)
- Store and structure scraped data into usable formats (DB / warehouse)
- Handle anti-bot challenges, rate limits, and scraping failures
- Ensure pipelines are resilient, scheduled, and production-ready
- Debug broken DAGs, failed jobs, and corrupted datasets
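"Handle scraping failures" in practice means wrapping every network call in retries with exponential backoff. A minimal, library-free sketch of the pattern (the `flaky_fetch` stub is hypothetical, standing in for a real HTTP request):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=0.01):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the scheduler
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical flaky endpoint: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")
    return f"<html>body of {url}</html>"

page = fetch_with_retries(flaky_fetch, "https://example.com")
print(page)  # <html>body of https://example.com</html>
```

Note the final re-raise: a resilient pipeline does not swallow failures silently; it lets the orchestrator (e.g. Airflow's own retry mechanism) see them.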
Strict Performance Rules
- Zero hand-holding. You are expected to debug pipelines independently
- Deadlines are absolute. Miss once → warning. Repeat → termination
- Daily reporting required (actual outputs, logs, issues)
- Pipelines must be:
  - Repeatable
  - Fault-tolerant
  - Scalable
- “It works on my machine” is unacceptable
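"Repeatable" means a re-run must never duplicate or corrupt data. One common pattern is an idempotent upsert keyed on a natural ID; a sketch with stdlib sqlite3 (the `items` schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT)")

def load(records):
    # ON CONFLICT makes the load idempotent: re-running the same batch
    # updates rows in place instead of inserting duplicates.
    conn.executemany(
        "INSERT INTO items (url, title) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
        records,
    )

batch = [("https://example.com/a", "First title")]
load(batch)
load(batch)  # second run: no duplicate row
load([("https://example.com/a", "Updated title")])  # re-scrape: row updated

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
title = conn.execute("SELECT title FROM items").fetchone()[0]
print(count, title)  # 1 Updated title
```

A pipeline built this way can be killed and restarted mid-run without special cleanup, which is the standard that will be enforced here.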
Disqualification Triggers
Immediate removal if:
- Your Airflow DAGs fail repeatedly and you can’t fix them
- Your scrapers get blocked and you don’t adapt
- You cannot explain your NLP pipeline (spaCy usage)
- You build fragile pipelines that break on real data
- You disappear or stop reporting
Selection Process (Aggressive Filtering)
1. GitHub review (must include scraping + pipeline work)
2. Technical task: build a scraper, process the text, and schedule it via Airflow
3. Live debugging session (pipeline failure scenario)
4. Final acceptance
Most applicants will fail before step 2.
What You Get (If You Pass)
- Real-world experience in:
  - Data pipelines
  - NLP processing
  - Web scraping at scale
- Strong portfolio with production-grade systems
- Fast-track to a paid data engineering / NLP role
- Experience working under real constraints and pressure
Application Requirements
Submit:
- GitHub with:
  - Scrapy project
  - Airflow DAGs
  - NLP / spaCy usage
- Description of one pipeline you built (architecture + challenges)
- Answer: “How would you design a system to scrape, clean, and extract entities from thousands of web pages daily?”
Final Note
If you haven’t already worked with Airflow + scraping + NLP in real scenarios, this role will not work for you.
Pay: E£1,000.00 - E£8,000.00 per month
Application Question(s):
- Are you familiar with Airflow?
- Are you familiar with spaCy?
Work Location: Remote