DevOps & Site Reliability Engineer (GCP)

Full time

Remote

Islamabad, Pakistan

Job Requirements

Company Location

United Arab Emirates

Salary

Not specified

As a DevOps & Site Reliability Engineer, you will design, implement, and maintain the infrastructure that supports our applications and services and, just as importantly, keep that infrastructure reliable in production. You will work closely with development, QA, and IT teams to automate and streamline operations, build the observability that lets us catch problems before customers do, and lead the response when incidents happen. This is a hands-on role in which you own the health of production, not only its build-out. We are hiring at mid level (3–5 years of experience) for someone with strong production instincts and the judgment to grow into a senior reliability owner as we scale globally.

Key Responsibilities

Infrastructure Management

Design, deploy, and manage cloud infrastructure on Google Cloud Platform (GCP).
Build and maintain containerized workloads with Docker, and orchestrate them at scale using Kubernetes (GKE).

CI/CD Pipeline

Develop and maintain continuous integration/continuous deployment (CI/CD) pipelines that automate building, testing, and deployment, from GitHub through to Google Cloud Run.
Collaborate with development teams to integrate new features into the CI/CD process.

Monitoring & Observability

Set up and manage monitoring, logging, and alerting systems using tools like Elasticsearch, Kibana, Prometheus, and New Relic.
Build dashboards, metrics, and actionable alerts so that engineering detects degradations before customers do, and so that no critical signal (e.g. a database running hot for hours) goes unnoticed.
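
For illustration only, the "running hot for hours" case comes down to alerting on a sustained breach rather than on a single spike. The sketch below is a minimal example of that idea, not a description of our stack. The function name, the 90% threshold, and the sample values are all hypothetical. In Prometheus, the `for` clause of an alerting rule plays a similar role.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`.

    Hypothetical helper: `samples` is assumed to be ordered oldest to newest,
    for example one CPU-utilization reading per scrape interval.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough data yet to call the breach sustained.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Hypothetical readings (percent CPU): one dip, then three hot samples in a row.
cpu = [50.0, 91.0, 92.0, 95.0]
print(sustained_breach(cpu, threshold=90.0, min_consecutive=3))  # True
print(sustained_breach(cpu, threshold=90.0, min_consecutive=4))  # False
```

Requiring several consecutive samples trades a little detection delay for fewer pages on short-lived spikes, which is one way to keep alerts actionable.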

Reliability & Incident Response (SRE)

Define, measure, and own service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services.
Lead incident response end to end: triage by severity, form a hypothesis and confirm it with metrics before taking action, mitigate, and drive to resolution. Act as incident commander on major incidents and coordinate communication across stakeholders.
Run blameless postmortems, identify true root causes, and track corrective actions to closure.
Participate in an on-call rotation; continuously reduce toil and mean time to recovery (MTTR) through automation.
Plan for capacity, performance, and cost as we scale toward multi-region, global traffic.
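
As a rough illustration of the error-budget arithmetic referred to above: an availability SLO of 99.9% over 30 days leaves 0.1% of the window, about 43.2 minutes, as budget. The sketch below assumes a simple availability and request-based model. The function names, the 30-day window, and the request counts are hypothetical and are not our actual targets.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers: a 99.9% SLO, 1,000,000 requests, 250 failures.
print(round(error_budget_minutes(0.999), 1))               # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))   # 0.75
```

A team might use the remaining fraction to decide when to slow feature releases in favor of reliability work, though the exact policy is a judgment call for the service owner.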

Database Management

Administer and optimize MongoDB instances, ensuring data integrity, performance, and security.
Implement backup, recovery, and disaster recovery strategies for MongoDB and other databases.

Security & Compliance

Implement security best practices across infrastructure, applications, and data.
Ensure compliance with industry standards and internal policies.

Automation & Scripting

Automate infrastructure provisioning, configuration management, and system operations using GCP services.
Develop custom scripts as needed to enhance automation and operational efficiency.

Collaboration & Support

Work closely with development and QA teams to support the software development lifecycle.
Provide technical guidance and support to resolve infrastructure-related issues.

Qualifications

Education

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).

Experience

3–5 years of experience as a DevOps Engineer, Site Reliability Engineer, or in a similar role.
Strong experience with Docker and with container orchestration using Kubernetes.
Hands-on experience with Google Cloud Platform (GCP) services, including Compute Engine, Cloud Storage, Cloud Run, and GKE.
Experience with Elasticsearch for monitoring, logging, and search.
Proficiency in administering and optimizing MongoDB databases.
Demonstrated experience operating production systems and responding to incidents — not only building infrastructure. You can point to real production incidents you owned, quantify their impact concretely (users affected, duration, consequence), and describe what you changed afterward.

Skills

Strong scripting skills in Python, Bash, or similar languages.
Proficiency in infrastructure as code (IaC) tools such as Terraform or Ansible.
Experience with CI/CD tools such as GitHub Actions or CircleCI.
Sound production-debugging methodology: observe and form a hypothesis before acting, reason about how components fail together (e.g. how a queue backlog interacts with the database), and update your approach when new information appears.
Familiarity with defining alerts, dashboards, and SLOs; comfort reasoning about availability, latency, saturation, and error budgets.
Excellent problem-solving and troubleshooting skills.
Strong communication skills and ability to work collaboratively across teams.
Strong ownership and a blameless, collaborative posture under pressure — focused on diagnosing and fixing problems rather than assigning blame.