Qureos

Senior Data Engineer - Data Pipelines

Job Requirements

Hires in: Not specified
Employment Type: Not specified
Company Location: Not specified
Salary: Not specified

Introduction to role:
Are you ready to build high-throughput, reproducible data pipelines that turn complex bioinformatics and scientific data into decisions that speed medicines to patients? This role sits at the heart of how we generate trusted evidence, bringing together scientists, data scientists and engineers to move from experiment to insight with reliability and pace.

You will design and operate end-to-end pipelines across Unix/Linux HPC and cloud environments, using technologies such as Nextflow, Snakemake and AWS to deliver robust, auditable data flows. You will shape our standards for metadata, lineage and governance, create reusable components and mentor others to raise engineering standards across the product teams. Can you see yourself setting the patterns others will reuse across studies and lines of work to scale impact?

Accountabilities:
Pipeline Design and Delivery: Design, implement and operate fit-for-purpose pipelines from ingestion to consumption for bioinformatics and other scientific data, ensuring they are robust, auditable and easy to evolve.

Workflow Orchestration: Build reproducible workflows with Nextflow or Snakemake; integrate with schedulers and HPC/cloud resources to deliver timely, scalable execution.

Data Platform Engineering: Develop data models, warehousing layers, and end-to-end metadata and lineage; apply data quality, reliability and governance controls to enable trusted analytics and evidence generation.

Scalability and Cost Optimisation: Optimise for throughput, reliability and cost across Unix/Linux HPC and AWS; implement observability, alerting and SLOs to keep pipelines healthy at scale.

Collaboration and Translation: Translate scientific and business requirements into technical designs; co-create solutions with CPSS stakeholders, R&D IT and DS&AI partners to deliver measurable outcomes.

Engineering Excellence: Embed version control, CI/CD, automated testing, code review and proven design patterns to meet maintainability and compliance standards.

Reusability and Enablement: Produce clear documentation, templates and reusable modules; mentor peers and champion guidelines in data engineering and scientific computing.

Security and Compliance: Apply appropriate data security, privacy and regulatory controls for sensitive scientific datasets.

Continuous Improvement: Identify bottlenecks, measure pipeline performance and shorten cycle times from experiment to decision, informing architectural roadmaps across the portfolio.

Essential Skills/Experience:
  • Pipeline engineering: Design, implement, and operate fit-for-purpose data pipelines for bioinformatics and scientific data, from ingestion to consumption.
  • Workflow orchestration: Build reproducible pipelines using frameworks such as Nextflow (preferred) or Snakemake; integrate with schedulers and HPC/cloud resources.
  • Data platforms: Develop data models, warehousing layers, and metadata/lineage; ensure data quality, reliability, and governance.
  • Scalability and performance: Optimise pipelines for throughput and cost across Unix/Linux HPC and cloud environments (AWS preferred); implement observability and reliability practices.
  • Collaboration: Translate scientific and business requirements into technical designs; partner with CPSS stakeholders, R&D IT, and DS&AI to co-create solutions.
  • Engineering excellence: Establish and maintain version control, CI/CD, automated testing, code review, and design patterns to ensure maintainability and compliance.
  • Enablement: Produce documentation and reusable components; mentor peers and promote guidelines in data engineering and scientific computing.

Desirable Skills/Experience:
  • Strong programming in Python and Bash; familiarity with software packaging and environments (conda/mamba).
  • Deep experience with Nextflow (and Tower) or Snakemake on HPC and AWS Batch; containerisation with Docker/Singularity.
  • Distributed data processing and lakehouse technologies (Spark, Databricks, Parquet/Delta); data warehousing (Redshift, Snowflake) and SQL.
  • Infrastructure as Code and platform tooling (Terraform, CloudFormation, Kubernetes, Argo/Airflow).
  • Observability and reliability engineering (Prometheus, Grafana, OpenTelemetry); defining and tracking SLOs/SLIs.
  • Data security, privacy and compliance practices for scientific/regulated data (e.g., GxP, 21 CFR Part 11).
  • Experience with bioinformatics tools and scientific file formats; understanding of HPC schedulers (SLURM, LSF).
  • Proven ability to translate scientific use cases into technical architectures and to influence standards across teams.
  • Clear written and verbal communication, with a mentor approach and a bias for action.

When we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That's why we work, on average, a minimum of three days per week from the office. But that doesn't mean we're not flexible. We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world.

Why AstraZeneca:
At AstraZeneca you will engineer where impact is immediate and visible: your pipelines will shape evidence, accelerate decisions and help bring new treatments to people sooner. We bring experts from different fields together to solve hard problems quickly, backed by modern platforms across HPC and public cloud so your work runs at scale. Leaders remove barriers, teams share knowledge openly and we value kindness alongside ambition, giving you room to innovate while staying grounded in real patient outcomes.

Call to Action:
If you are ready to architect the data flows that move science into the clinic, send us your CV and tell us about the toughest pipeline you have built and scaled.

© 2025 Qureos. All rights reserved.