The Senior Data Engineer will be responsible for the architecture, design, development, and maintenance of our data platforms, with a strong focus on leveraging Python and PySpark for data processing and transformation. This role requires a technical leader who can work both independently and as part of a team, contributing to the overall data strategy and helping to drive data-driven decision-making across the organization.
Key Responsibilities
- Data Architecture & Design: Design, develop, and optimize data architectures, pipelines, and data models to support various business needs, including analytics, reporting, and machine learning.
- ETL/ELT Development (Python/PySpark Focus): Build, test, and deploy highly scalable and efficient ETL/ELT processes using Python and PySpark to ingest, transform, and load data from diverse sources into data warehouses and data lakes. Develop and optimize complex data transformations using PySpark.
- Data Quality & Governance: Implement best practices for data quality, data governance, and data security to ensure the integrity, reliability, and privacy of our data assets.
- Performance Optimization: Monitor, troubleshoot, and optimize data pipeline performance, ensuring data availability and timely delivery, particularly for PySpark jobs.
- Infrastructure Management: Collaborate with DevOps and MLOps teams to manage and optimize data infrastructure, including cloud resources (AWS, Azure, GCP), databases, and data processing frameworks, ensuring efficient operation of PySpark clusters.
- Mentorship & Leadership: Provide technical guidance, mentorship, and code reviews to junior data engineers, particularly in Python and PySpark best practices, fostering a culture of excellence and continuous improvement.
- Collaboration: Work closely with data scientists, analysts, product managers, and other stakeholders to understand data requirements and deliver solutions that meet business objectives.
- Innovation: Research and evaluate new data technologies, tools, and methodologies to enhance our data capabilities and stay ahead of industry trends.
- Documentation: Create and maintain comprehensive documentation for data pipelines, data models, and data infrastructure.
Qualifications
Education
- Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or a related quantitative field.
Experience
- 5+ years of professional experience in data engineering, with a strong emphasis on building and maintaining large-scale data systems.
- Extensive hands-on experience with Python for data engineering tasks.
- Proven experience with PySpark for big data processing and transformation.
- Proven experience with cloud data platforms (e.g., AWS Redshift, S3, EMR, Glue; Azure Data Lake, Databricks, Synapse; Google BigQuery, Dataflow).
- Strong experience with SQL and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB, Cassandra).
- Extensive experience with distributed data processing frameworks, especially Apache Spark.
Technical Skills
- Programming Languages: Expert proficiency in Python is mandatory. Strong SQL mastery is essential. Familiarity with Scala or Java is a plus.
- Big Data Technologies: In-depth knowledge and hands-on experience with Apache Spark (PySpark) for data processing, including Spark SQL, Spark Streaming, and the DataFrame API. Experience with Apache Kafka, Apache Airflow, Delta Lake, or similar technologies.
- Data Warehousing: In-depth knowledge of data warehousing concepts, dimensional modeling, and ETL/ELT processes.
- Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, GCP) and their data services, particularly those supporting Spark/PySpark workloads.
- Containerization: Familiarity with Docker and Kubernetes is a plus.
- Version Control: Proficient with Git and CI/CD pipelines.
Soft Skills
- Excellent problem-solving and analytical abilities.
- Strong communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.
- Ability to work effectively in a fast-paced, agile environment.
- Proactive and self-motivated with a strong sense of ownership.
Preferred Qualifications
- Experience with real-time data streaming and processing using PySpark Structured Streaming.
- Knowledge of machine learning concepts and MLOps practices, especially integrating ML workflows with PySpark.
- Familiarity with data visualization tools (e.g., Tableau, Power BI).
- Contributions to open-source data projects.
- Job Family Group: Technology
- Job Family: Data Analytics
- Time Type: Full time
- Most Relevant Skills: Please see the requirements listed above.
- Other Relevant Skills: For complementary skills, please see above and/or contact the recruiter.
Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.
If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.