Job Title: Data Engineer
Job Description:
We are seeking a highly skilled and motivated Data Engineer to play a pivotal role in designing, building, and optimizing our next-generation scalable data pipelines. This position requires expertise in processing massive datasets using cutting-edge technologies like Apache Spark, PySpark, and Hive within a dynamic cloud environment. Your primary objective will be to ensure the utmost data reliability, speed, and efficiency, providing a robust foundation for downstream business intelligence and advanced analytics initiatives.
Roles & Responsibilities:
- Data Pipeline Development & Maintenance: Design, build, and maintain highly scalable and efficient ETL/ELT data pipelines utilizing PySpark and Spark SQL for complex data transformations.
- Cloud Data Infrastructure Management: Deploy, manage, and scale critical data infrastructure components on leading cloud platforms such as Amazon Web Services (AWS) (e.g., EMR, Glue), Microsoft Azure (e.g., Databricks, Synapse), or Google Cloud Platform (GCP).
- Data Warehousing & Storage Optimization: Strategically manage data layout, partitioning, and indexing within Apache Hive and various cloud data lake solutions to optimize performance and accessibility.
- Performance Tuning & Optimization: Proactively identify and resolve performance bottlenecks in Spark jobs, using the Spark UI for in-depth analysis, mitigating data skew, and optimizing memory utilization.
- Diverse Data Integration: Develop robust solutions for ingesting high-volume and diverse datasets from both structured relational databases and unstructured flat files into our data ecosystem.
- Automated Workflow Orchestration: Implement and manage automated data workflows using industry-standard scheduling tools like Apache Airflow or platform-native schedulers, ensuring timely and reliable data delivery.
- Strategic Collaboration: Partner closely with data scientists, business analysts, and cross-functional enterprise teams to translate complex business requirements into technically sound and efficient data solutions.
Qualifications:
- Big Data Frameworks Expertise: Demonstrated high proficiency in Apache Spark architecture, including a deep understanding of drivers, executors, and Directed Acyclic Graphs (DAGs).
- Advanced Programming: Exceptional coding skills in Python and extensive experience with the PySpark API for developing intricate data transformations and processing logic.
- Querying & Schema Management: Strong command of HiveQL and ANSI SQL, coupled with expertise in data partitioning techniques and effective schema definition.
- Optimized Storage Formats: In-depth understanding and practical experience with optimized big data storage file formats such as Parquet, ORC, and Avro.
- Cloud Ecosystem Development: Hands-on development experience using cloud-native big data services (e.g., AWS EMR, Azure Databricks) within major cloud platforms.
- Data Warehousing Fundamentals: Solid foundation in Dimensional Data Modeling, including Star and Snowflake schemas, and practical experience with Data Lakes concepts and implementation.
Preferred Qualifications:
- CI/CD & DevOps Automation: Experience with Continuous Integration/Continuous Deployment (CI/CD) practices and automation tools like Git, Jenkins, or Ansible.
- NoSQL Database Integration: Exposure to and experience with NoSQL databases such as HBase, Cassandra, or MongoDB.
- Professional Cloud Certifications: Relevant professional cloud certifications (e.g., AWS Certified Data Engineer - Associate, Microsoft Certified: Azure Data Engineer Associate) are highly valued.
Salary Range: $125,000 to $140,000 per year
Desired Candidate Profile
Qualifications: Bachelor of Computer Science