Production Operations / SRE

Requirements and responsibilities

Key Responsibilities

Deployment & Automation
- Understand, deploy, and maintain Helm charts and CI/CD workflows for AKS, EKS, and on-prem Kubernetes (K3s or RKE2) in customer environments.
- Standardize customer deployments (private cloud / air-gapped) using reproducible manifests and configuration validation tooling.
- Maintain our single-node and multi-node install processes; improve installer packaging.
Environment Reliability
- Monitor uptime, capacity, and performance across distributed clusters (migration, scan, OLAP DB node groups).
- Implement proactive alerting (Prometheus, Grafana, Azure Monitor, CloudWatch) and ensure runbooks exist for all major services.
- Coordinate with customer IT/security teams to handle firewall, proxy, and credential configurations safely and consistently.
Release & Incident Management
- Participate in release-readiness and hardening cycles; validate new images and Helm charts before customer rollout.
- Lead incident response for production issues: triage, communicate status, and drive post-incident reviews and root-cause documentation.
- Track reliability metrics (MTTR, deployment success rate, change-failure rate) and feed insights back into engineering planning.
Security & Compliance
- Integrate static/dynamic security scanning (GitHub Advanced Security / CodeQL / Dependabot) and image-signing pipelines.
- Ensure secrets, credentials, and certificates are rotated and stored per corporate security standards.
- Support ISO / SOC 2 audit evidence collection (CCR change control, deployment logs, access reviews).
Tooling & Observability
- Extend monitoring to include customer-facing telemetry where allowed; maintain log shipping and retention policies.
- Contribute to internal dashboards showing environment health, install duration, and customer success metrics.
Collaboration & Enablement
- Work closely with Dev / QA / Support to reproduce issues in controlled environments and publish fixes or workarounds.
- Provide training and documentation for Services and Support engineers deploying or maintaining on-prem instances.
- Champion a “build-to-run” culture: drive automation, resiliency testing, and feedback loops between engineering and field ops.

Required Experience

5+ years in SRE, DevOps, or Production Ops roles supporting hybrid or on-prem software delivery.
Minimum 3 years working with Fortune 500 companies implementing or maintaining enterprise software.
Expertise with Kubernetes, Helm, and Docker in mixed cloud environments (Azure AKS, AWS EKS, on-prem K3s).
Solid understanding of network security (proxies, TLS, VPN, firewalls) and Linux administration.
Strong scripting and automation skills (Bash, Python, PowerShell, YAML / Terraform), especially as applied to K8s.
Familiarity with CI/CD pipelines (GitHub Actions, TeamCity, Argo CD or Flux).
Experience supporting distributed systems (e.g., Apache Pulsar, Postgres, ClickHouse, Redis, MinIO).
Comfort working directly with enterprise customer admins and security teams.

Success Indicators (How you will be measured)

95%+ installation success on first attempt across customer environments
Measurable reduction in install/upgrade time and CRI (Customer Raised Issues) related to configuration or infrastructure
Clear, actionable runbooks for all critical services
Improved observability and automation coverage across all deployment models
