Job Purpose
Analysing, troubleshooting, and designing vital services, platforms, and infrastructure on GCP while always thinking about reliability, scalability, resilience, security, and performance.
Job Responsibilities (JR):
- Help build a Site Reliability Engineering culture by sharing best practices, approaches, documentation, and code with other engineering teams
- Apply automation and software to any tasks or parts of the system that are performed manually
- Troubleshoot complex cross-platform issues spanning OS, networking, and database layers in a cloud-based SaaS environment, and handle live production incidents
- Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation
- Design, write, ship, and motivate the creation of software and systems to increase observability, product reliability, and organizational efficiency
- Conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability
Key Skills:
- Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools
- Demonstrable experience in containerization (Docker) and orchestration (Kubernetes)
- Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible)
- Knowledge of and proven hands-on experience with large-scale databases and distributed technologies, such as Kafka and Confluent Platform
- Basic programming and scripting skills