Job Description
The successful candidate will work on batch and real-time data pipelines, leveraging technologies such as Scala, Java, and Apache Spark. They will manage distributed systems and cloud resources, ensure security compliance using HashiCorp Vault, orchestrate workflows with Apache Airflow, and investigate production issues by analyzing logs and monitoring metrics.
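As a rough illustration of the batch side of this work, the sketch below shows a minimal Spark job in Scala that reads raw files from Amazon S3, applies a simple aggregation, and writes curated Parquet output. The bucket names, paths, and column names are placeholders for illustration only, not details of the actual environment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyOrdersBatch {
  def main(args: Array[String]): Unit = {
    // Illustrative job name; on EMR this would typically be submitted via spark-submit on YARN.
    val spark = SparkSession.builder()
      .appName("daily-orders-batch")
      .getOrCreate()

    // Hypothetical raw input landing zone in S3.
    val raw = spark.read
      .option("header", "true")
      .csv("s3://example-raw-bucket/orders/dt=2024-01-01/")

    // Typical batch transformation: drop bad records and aggregate spend per customer.
    val summary = raw
      .filter(col("order_total").isNotNull)
      .groupBy(col("customer_id"))
      .agg(sum(col("order_total").cast("double")).as("total_spend"))

    // Hypothetical curated output location, partitioned by date in the path.
    summary.write
      .mode("overwrite")
      .parquet("s3://example-curated-bucket/orders_summary/dt=2024-01-01/")

    spark.stop()
  }
}
```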
Responsibilities
- Develop and maintain batch and streaming data pipelines using Scala and Java.
- Write and execute shell scripts and YARN commands to submit and manage Spark jobs.
- Manage and optimize big data environments using Apache Spark, EMR, Hadoop, and YARN.
- Utilize cloud storage and container platforms such as Amazon S3 and EKS.
- Implement and manage security features, including HashiCorp Vault and tokenization/encryption protocols (a minimal Vault read sketch follows this list).
- Schedule and orchestrate batch workflows using Apache Airflow.
- Conduct root cause analysis and handle production incidents.
- Analyze application, Spark executor, and Dynatrace logs.
- Monitor jobs and EMR clusters using Dynatrace metrics.
- Validate data and set up production alerts (see the validation sketch after this list).
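For the data validation and alerting responsibility, the example below shows one simple way a Spark job in Scala might check curated output (row count and null checks) and exit non-zero so downstream alerting can fire. The input path, column name, and checks are illustrative assumptions, not the team's actual validation rules.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object OrdersSummaryCheck {
  def main(args: Array[String]): Unit = {
    // The input path and checks are placeholders; real validations would be configuration-driven.
    val spark = SparkSession.builder()
      .appName("orders-summary-validation")
      .getOrCreate()

    val curated = spark.read.parquet("s3://example-curated-bucket/orders_summary/dt=2024-01-01/")

    val rowCount  = curated.count()
    val nullSpend = curated.filter(col("total_spend").isNull).count()

    spark.stop()

    // A non-zero exit code lets the scheduler (e.g. Airflow) mark the task failed
    // and trigger the production alert path.
    if (rowCount == 0 || nullSpend > 0)
      sys.error(s"Validation failed: rowCount=$rowCount, nullTotals=$nullSpend")
  }
}
```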
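For the HashiCorp Vault item above, the sketch below assumes secrets live in Vault's KV version 2 engine and are read over Vault's HTTP API from Scala. The Vault address, token source, mount name ("secret"), and secret path are all placeholder assumptions; in practice the token would come from the platform's auth method rather than an environment variable.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ReadPipelineSecret {
  def main(args: Array[String]): Unit = {
    // VAULT_ADDR / VAULT_TOKEN are the conventional Vault environment variables;
    // the address shown here is an illustrative default.
    val vaultAddr  = sys.env.getOrElse("VAULT_ADDR", "https://vault.example.internal:8200")
    val vaultToken = sys.env("VAULT_TOKEN")

    // KV version 2 read endpoint: /v1/<mount>/data/<path> (mount and path are hypothetical).
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$vaultAddr/v1/secret/data/pipelines/warehouse-db"))
      .header("X-Vault-Token", vaultToken)
      .GET()
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // For KV v2 the credentials sit under data.data in the JSON body;
    // parse with the team's preferred JSON library before wiring into a job.
    println(s"Vault returned HTTP ${response.statusCode()}")
  }
}
```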