Important — Read Before Applying
This is NOT a beginner role.
If you have not built real pipelines using Airflow, processed text with NLP, or scraped data at scale, do not apply.
You are expected to deliver production-level work from week 1.
Required Experience (Mandatory)
You will be automatically rejected if you do not have:
- At least 1 real project combining scraping + processing + storage
- Hands-on experience with:
  - Apache Airflow (DAGs, scheduling, retries, dependencies)
  - Python (advanced level)
  - Scrapy (spiders, pipelines, middleware)
  - spaCy (NER, text preprocessing, custom pipelines)
  - SQL (advanced queries + optimization)
- Strong understanding of:
  - Data pipelines (ETL/ELT)
  - Handling messy / unstructured data
  - Writing scalable, fault-tolerant scraping systems
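To make the SQL bar concrete: "advanced queries + optimization" means, at minimum, indexed filtering and grouped aggregation over scraped records. A minimal sketch using Python's stdlib sqlite3 (the `pages` schema and domains are illustrative, not part of this posting):

```python
import sqlite3

# In-memory database standing in for a scraped-data store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url TEXT PRIMARY KEY,
        domain TEXT NOT NULL,
        word_count INTEGER NOT NULL
    )
""")
# An index on the grouping column keeps aggregate queries fast at scale.
conn.execute("CREATE INDEX idx_pages_domain ON pages (domain)")

rows = [
    ("https://a.example/1", "a.example", 120),
    ("https://a.example/2", "a.example", 300),
    ("https://b.example/1", "b.example", 80),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)

# Aggregate per domain, filtering groups with HAVING.
result = conn.execute("""
    SELECT domain, COUNT(*) AS n, AVG(word_count) AS avg_words
    FROM pages
    GROUP BY domain
    HAVING COUNT(*) > 1
""").fetchall()
print(result)  # [('a.example', 2, 210.0)]
```

Candidates unable to explain why the index and the HAVING clause are there will not clear the screening.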
What You Will Do
- Build automated data pipelines using Airflow (DAG-based workflows)
- Develop and maintain web scraping systems using Scrapy
- Process and enrich text data using spaCy (NER, classification, cleaning)
- Store and structure scraped data into usable formats (DB / warehouse)
- Handle anti-bot challenges, rate limits, and scraping failures
- Ensure pipelines are resilient, scheduled, and production-ready
- Debug broken DAGs, failed jobs, and corrupted datasets
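"Handle scraping failures" in practice means wrapping every network call in retries with exponential backoff. A minimal, library-free sketch of the pattern (the `flaky_fetch` stub is hypothetical, standing in for a real HTTP request):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=0.01):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the scheduler
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical flaky endpoint: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")
    return f"<html>body of {url}</html>"

page = fetch_with_retries(flaky_fetch, "https://example.com")
print(page)  # <html>body of https://example.com</html>
```

Note the final re-raise: a resilient pipeline does not swallow failures silently; it lets the orchestrator (e.g. Airflow's own retry mechanism) see them.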
Strict Performance Rules
- Zero hand-holding. You are expected to debug pipelines independently
- Deadlines are absolute. Miss once → warning. Repeat → termination
- Daily reporting required (actual outputs, logs, issues)
- Pipelines must be:
  - Repeatable
  - Fault-tolerant
  - Scalable
- “It works on my machine” is unacceptable
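"Repeatable" means a re-run must never duplicate or corrupt data. One common pattern is an idempotent upsert keyed on a natural ID; a sketch with stdlib sqlite3 (the `items` schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT)")

def load(records):
    # ON CONFLICT makes the load idempotent: re-running the same batch
    # updates rows in place instead of inserting duplicates.
    conn.executemany(
        "INSERT INTO items (url, title) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
        records,
    )

batch = [("https://example.com/a", "First title")]
load(batch)
load(batch)  # second run: no duplicate row
load([("https://example.com/a", "Updated title")])  # re-scrape: row updated

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
title = conn.execute("SELECT title FROM items").fetchone()[0]
print(count, title)  # 1 Updated title
```

A pipeline built this way can be killed and restarted mid-run without special cleanup, which is the standard that will be enforced here.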
Disqualification Triggers
Immediate removal if:
- Your Airflow DAGs fail repeatedly and you can’t fix them
- Your scrapers get blocked and you don’t adapt
- You cannot explain your NLP pipeline (spaCy usage)
- You build fragile pipelines that break on real data
- You disappear or stop reporting
Selection Process (Aggressive Filtering)
1. GitHub review (must include scraping + pipeline work)
2. Technical task: build a scraper, process the text, and schedule it via Airflow
3. Live debugging session (pipeline failure scenario)
4. Final acceptance
Most applicants will fail before step 2.
What You Get (If You Pass)
- Real-world experience in:
  - Data pipelines
  - NLP processing
  - Web scraping at scale
- Strong portfolio with production-grade systems
- Fast-track to a paid data engineering / NLP role
- Experience working under real constraints and pressure
Application Requirements
Submit:
- GitHub with:
  - Scrapy project
  - Airflow DAGs
  - NLP / spaCy usage
- Description of one pipeline you built (architecture + challenges)
- Answer: “How would you design a system to scrape, clean, and extract entities from thousands of web pages daily?”
Final Note
If you haven’t already worked with Airflow + scraping + NLP in real scenarios, this role will not work for you.
Pay: E£1,000.00 - E£8,000.00 per month
Application Question(s):
- Are you familiar with Airflow?
- Are you familiar with spaCy?
Work Location: Remote