We are looking for an Infrastructure Architect (AI & Data Center) - Remote / Telecommute for our client in San Jose, CA.
Job Title: Infrastructure Architect (AI & Data Center) - Remote / Telecommute
Job Location: San Jose, CA
Job Type: Contract
Job Overview:
Pay Range: $71.16/hr - $74.90/hr
Requirements/Must Have:
- Bachelor's degree in Information Technology, Business, or a related field.
- 5+ years of experience in Data Center projects in an enterprise environment.
- Knowledge of Cisco, Dell, HPE, Supermicro hardware.
- Deep knowledge of Cisco HW, NVIDIA GPU architectures (H100, B200, RTX 6000 Pro) and high-speed interconnects (RoCE v2, InfiniBand).
- Extensive knowledge and experience with Data Center infrastructure.
- Proficiency with asset management and automation tools (Netbox, ServiceNow, Terraform, or OpenTofu).
- Experience in Data Center lifecycle management, DC HW capacity planning, decommissioning, defragmentation, and building complex financial showback models for shared infrastructure.
- Proven expertise in Kubernetes (NKP preferred) and NVIDIA AI Enterprise stacks (GPU Operator, DCGM, Triton, vLLM).
Responsibilities:
- Lead the architectural design and refinement of the client GPU-as-a-Service (GPUaaS) platform, ensuring a seamless experience for internal R&D, QA, and Sales teams.
- Provide technical leadership in key initiatives such as client Validated Designs (NVD) for the AI Factory, incorporating NVIDIA MGX/HGX architectures and high-density Cisco nodes (e.g., UCS 845A).
- Architect the Management Cluster control plane (NKP, Prism Central, NuDeploy) to ensure it is decoupled from GPU compute nodes for maximum efficiency.
- Implement policy-driven placement of workloads across on-prem and cloud-burst environments.
- Design a solution for a centralized Data Center Asset Inventory system, ensuring real-time visibility into all hardware assets, including CPUs, GPUs, Virtual Machines, and networking.
- Develop a comprehensive Hardware Lifecycle Management strategy, including procurement forecasting, 'rack and stack' operationalization, and decommissioning of legacy systems (G3/G4/G5).
- Lead 'Tiger Team' initiatives to navigate supply chain constraints, ensuring critical release milestones are not delayed by hardware shortages.
- Enforce strict Security Standards for Data Center HW Provisioning.
- Implement network segmentation for all critical applications.
- Ensure all infrastructure meets SOC 2 and ISO 27001 compliance objectives while maintaining low-latency performance.
- Provide required architecture and designs during the project intake process. Review all demands and guide the teams toward the right architecture before they become approved projects.
- Partner with the security team and provide guidelines for upcoming projects.
- Participate in and lead special projects as an architect.
Nice to Have:
- Experience managing (as an architect) massive-scale data center environments (1,000+ nodes).
- Knowledge of client Cloud Infrastructure (NCI), AHV, and Prism Central.
- Strong background in MLOps and automated pipeline integration (Kubeflow/MLflow).
For applications and inquiries, contact: hirings@openkyber.com