Experience: 10+ years
Location: Remote (Open to traveling to KSA or Turkey for 1 year)
Position Overview:
We are hiring an SRE (Site Reliability Engineer) Infrastructure Support engineer with deep expertise in Linux, Kubernetes, and hardware infrastructure management for our enterprise-grade, high-performance supercomputing platform. We help enterprises and service providers build AI inference platforms for their end users, powered by our state-of-the-art RDU (Reconfigurable Dataflow Unit) hardware architecture. This is a high-impact, high-visibility role: the ideal candidate will play a pivotal part in supporting and maintaining our enterprise infrastructure stack, ensuring high availability and optimal performance across mission-critical AI & ML environments. The role involves close collaboration with global SRE and Platform teams to manage and troubleshoot enterprise systems and clusters.
Key Responsibilities:
- Linux Administration: Manage, configure, and optimize Linux servers (RHEL, Ubuntu, or similar), including patching, security hardening, and performance tuning
- Kubernetes Administration: Deploy, manage, and troubleshoot Kubernetes clusters, ensuring reliability and scalability
- Hardware Infrastructure Management: Oversee physical data center infrastructure, including servers, storage, and networking hardware
- Security & Compliance: Apply security patches and upgrades for Linux-based Kubernetes environments and ensure compliance with organizational policies
- Collaboration & Support: Work closely with SRE and Platform teams worldwide to support enterprise systems and clusters
- Ticket-Based Case Management: Handle tickets efficiently using tools such as Salesforce or ServiceNow
Required Qualifications:
- Strong hands-on experience with Linux system administration (RHEL, Ubuntu, or similar); RHCSA/RHCE certification is a plus
- Solid understanding of Kubernetes administration; CKA/CKS certification is a plus
- Hands-on experience with bare-metal and hardware infrastructure (servers, storage, networking)
- Good understanding of networking concepts (TCP/IP, DNS, load balancers, firewalls); knowledge of Juniper OS is a plus
- Strong troubleshooting skills across hardware, OS, and Kubernetes environments
- Knowledge of automation tools and scripting languages such as Ansible, Python, or Bash is a plus
- Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK) is a plus
Soft Skills:
- Strong communication, problem-solving, and collaboration abilities
- Ability to work effectively in fast-paced, dynamic environments and adapt to evolving AI & ML technologies
- Proactive mindset with a focus on automation, scalability, and operational excellence
Why Join Us:
- Work on cutting-edge AI & ML infrastructure supporting mission-critical applications
- Collaborate with global teams and gain exposure to advanced cloud-native and enterprise technologies
- Opportunity to grow your expertise in Linux, Kubernetes, and data center operations