Production Operations / SRE

Requirements and responsibilities

SRE

Key Responsibilities

  • Deployment & Automation
    • Understand, deploy, and maintain Helm charts and CI/CD workflows for AKS, EKS, and on-prem Kubernetes (K3s or RKE2) in customer environments.
    • Standardize customer deployments (private cloud / air-gapped) using reproducible manifests and configuration validation tooling.
    • Maintain our single-node and multi-node install processes; improve installer packaging.
  • Environment Reliability
    • Monitor uptime, capacity, and performance across distributed clusters (migration, scan, OLAP DB node groups).
    • Implement proactive alerting (Prometheus, Grafana, Azure Monitor, CloudWatch) and ensure runbooks exist for all major services.
    • Coordinate with customer IT/security teams to handle firewall, proxy, and credential configurations safely and consistently.
  • Release & Incident Management
    • Participate in release-readiness and hardening cycles; validate new images and Helm charts before customer rollout.
    • Lead incident response for production issues: triage, communicate status, and drive post-incident reviews and root-cause documentation.
    • Track reliability metrics (MTTR, deployment success rate, change-failure rate) and feed insights back into engineering planning.
  • Security & Compliance
    • Integrate static/dynamic security scanning (GitHub Advanced Security / CodeQL / Dependabot) and image-signing pipelines.
    • Ensure secrets, credentials, and certificates are rotated and stored per corporate security standards.
    • Support ISO / SOC 2 audit evidence collection (CCR change control, deployment logs, access reviews).
  • Tooling & Observability
    • Extend monitoring to include customer-facing telemetry where allowed; maintain log shipping and retention policies.
    • Contribute to internal dashboards showing environment health, install duration, and customer success metrics.
  • Collaboration & Enablement
    • Work closely with Dev / QA / Support to reproduce issues in controlled environments and publish fixes or workarounds.
    • Provide training and documentation for Services and Support engineers deploying or maintaining on-prem instances.
    • Champion a “build-to-run” culture: drive automation, resiliency testing, and feedback loops between engineering and field ops.


Required Experience

  • 5+ years in SRE, DevOps, or Production Ops roles supporting hybrid or on-prem software delivery.
  • Minimum 3 years working with Fortune 500 companies implementing or maintaining enterprise software.
  • Expertise with Kubernetes, Helm, and Docker in mixed cloud environments (Azure AKS, AWS EKS, on-prem K3s).
  • Solid understanding of network security (proxies, TLS, VPN, firewalls) and Linux administration.
  • Strong scripting and automation skills (Bash, Python, PowerShell, YAML / Terraform), especially as they relate to Kubernetes.
  • Familiarity with CI/CD pipelines (GitHub Actions, TeamCity, Argo CD or Flux).
  • Experience supporting distributed systems (e.g., Apache Pulsar, Postgres, ClickHouse, Redis, MinIO).
  • Comfort working directly with enterprise customer admins and security teams.


Success Indicators (How you will be measured)

  • 95%+ installation success on first attempt across customer environments
  • Measurable reduction in install/upgrade time and in Customer Raised Issues (CRIs) related to configuration or infrastructure
  • Clear, actionable runbooks for all critical services
  • Improved observability and automation coverage across all deployment models
