Requirements and responsibilities
Key Responsibilities
- Deployment & Automation
  - Understand, deploy, and maintain Helm charts and CI/CD workflows for AKS, EKS, and on-prem Kubernetes (K3s or RKE2) in customer environments.
  - Standardize customer deployments (private cloud / air-gapped) using reproducible manifests and configuration validation tooling.
  - Maintain our single-node and multi-node install processes; improve installer packaging.
- Environment Reliability
  - Monitor uptime, capacity, and performance across distributed clusters (migration, scan, OLAP DB node groups).
  - Implement proactive alerting (Prometheus, Grafana, Azure Monitor, CloudWatch) and ensure runbooks exist for all major services.
  - Coordinate with customer IT/security teams to handle firewall, proxy, and credential configurations safely and consistently.
- Release & Incident Management
  - Participate in release-readiness and hardening cycles; validate new images and Helm charts before customer rollout.
  - Lead incident response for production issues: triage, communicate status, and drive post-incident reviews and root-cause documentation.
  - Track reliability metrics (MTTR, deployment success rate, change-failure rate) and feed insights back into engineering planning.
- Security & Compliance
  - Integrate static/dynamic security scanning (GitHub Advanced Security / CodeQL / Dependabot) and image-signing pipelines.
  - Ensure secrets, credentials, and certificates are rotated and stored per corporate security standards.
  - Support ISO / SOC 2 audit evidence collection (CCR change control, deployment logs, access reviews).
- Tooling & Observability
  - Extend monitoring to include customer-facing telemetry where allowed; maintain log shipping and retention policies.
  - Contribute to internal dashboards showing environment health, install duration, and customer success metrics.
- Collaboration & Enablement
  - Work closely with Dev / QA / Support to reproduce issues in controlled environments and publish fixes or workarounds.
  - Provide training and documentation for Services and Support engineers deploying or maintaining on-prem instances.
  - Champion a “build-to-run” culture: drive automation, resiliency testing, and feedback loops between engineering and field ops.
Required Experience
- 5+ years in SRE, DevOps, or Production Ops roles supporting hybrid or on-prem software delivery.
- Minimum 3 years working with Fortune 500 companies implementing or maintaining enterprise software.
- Expertise with Kubernetes, Helm, and Docker in mixed cloud environments (Azure AKS, AWS EKS, on-prem K3s).
- Solid understanding of network security (proxies, TLS, VPN, firewalls) and Linux administration.
- Strong scripting and automation skills (Bash, Python, PowerShell, YAML / Terraform), especially as they relate to Kubernetes.
- Familiarity with CI/CD pipelines (GitHub Actions, TeamCity, Argo CD or Flux).
- Experience supporting distributed systems (e.g., Apache Pulsar, Postgres, ClickHouse, Redis, MinIO).
- Comfort working directly with enterprise customer admins and security teams.
Success Indicators (How you will be measured)
- 95%+ installation success on first attempt across customer environments
- Measurable reduction in install/upgrade time and in Customer Raised Issues (CRIs) related to configuration or infrastructure
- Clear, actionable runbooks for all critical services
- Improved observability and automation coverage across all deployment models