
Responsibilities:
- Design, implement, and optimize data pipelines for batch and real-time data processing using Cloudera (Hadoop, Hive, Spark, Impala) and Informatica (PowerCenter, Cloud Data Integration).
- Build data extraction, transformation, and loading (ETL) workflows using Informatica PowerCenter for large-scale data integration from source systems (e.g., relational databases, flat files, APIs) into Cloudera Data Lake or data warehouse environments.
- Implement Spark jobs on Cloudera for distributed data processing and optimization of data workflows.
- Leverage Informatica for orchestrating ETL workflows, including data extraction, cleansing, transformation, and loading into data repositories (HDFS, Hive, SQL databases, etc.).
- Optimize Informatica workflows to minimize runtime, ensure smooth data integration, and maintain high data quality.
- Utilize Hadoop and Spark on Cloudera to process large datasets and implement data transformations using MapReduce, Spark SQL, and PySpark.
- Leverage Impala for low-latency SQL queries on Hadoop, ensuring real-time access to processed data.
- Implement partitioning, bucketing, and indexing strategies in Hive and HBase to improve query performance on large datasets.
- Implement and enforce data quality rules within Informatica workflows, ensuring that all transformations meet the required standards for completeness, consistency, and accuracy.
- Ensure compliance with data governance and security protocols (e.g., encryption, masking, access control) in accordance with industry best practices.
- Automate and schedule ETL workflows using Informatica Server, integrating with Airflow, NiFi, or other workflow orchestration tools for scheduling and monitoring jobs.
- Utilize Cloudera Navigator for monitoring and auditing data processes within the Hadoop ecosystem.
- Perform regular tuning of the ETL pipelines, data flows, and SQL queries to ensure optimal performance.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 6+ years of experience in the same field.
- Proven experience with the Cloudera Distribution of Hadoop (CDH), including expertise in HDFS, Hive, Impala, Spark, and HBase.
- Strong hands-on experience with Informatica PowerCenter (ETL), EDC, IDQ, B2B, and Axon.
- Deep understanding of ETL best practices, data pipelines, and distributed computing technologies such as Spark, MapReduce, PySpark, and Hadoop ecosystem components.
- Advanced SQL skills for data manipulation, aggregation, optimization, and reporting across relational and non-relational data stores (e.g., SQL Server, MySQL, PostgreSQL, Hive, Impala).
- Experience in Python and SQL.
- Strong background in data warehousing principles and data modeling, including dimensional modeling (star schema, snowflake schema) and OLAP/OLTP considerations.