Job requirements
Hires in
Not specified
Employment Type
Not specified
Company Location
Not specified
Salary
Not specified
OCI is driving development of next-generation hyperscale GPU data centers built on Nvidia and AMD GPUs. OCI enables popular AI services such as OpenAI on GPU compute servers. We are looking for engineers experienced in working with GPU device drivers and runtime libraries (CUDA and ROCm). You must understand GPU architectural concepts such as UVM and host-to-device and device-to-host interactions, and be able to quantify performance issues in all such interactions. We are looking for strong experience in building GPU drivers and in debugging issues that occur in the GPU drivers and Linux kernels that interact with the GPU stack, including functional and performance issues when running GPU AI/ML/inference workloads. The candidate should be able to use all standard performance and stress-testing tools, such as DCGM and the NCCL and RCCL test suites. In addition, we are looking for experience debugging and diagnosing system issues reported via RAS events from the GPU BMC and other monitoring agents. The candidate should have breadth of knowledge in BIOS, CPU, and GPU BMC, and must show strong proficiency in C programming and working knowledge of Python or another scripting language used in AI/GPU environments.
As a member of the software engineering division, you will be required to have in-depth knowledge of Nvidia and AMD GPU architecture and to work in a fast-paced development environment on projects critical to OCI's success. You must demonstrate good knowledge of GPU drivers, including building them and debugging issues related to them. You will regularly debug issues seen during new product bring-up and at data centers running customer workloads, including driving those issues to resolution with GPU vendors. All OCI engineers are expected to be on call periodically to handle OCI data center escalations. You must be comfortable using CI/CD pipelines to take vendor software drops, build customized drivers against Oracle Linux and Ubuntu distributions, unit test functionality, and run GPU workloads to validate performance using standard benchmarks. In addition, you should have working knowledge of the entire boot process, including touch points with the BIOS and BMC subsystems. We need engineers who show strong technical and communication skills as they engage with cross-functional teams, such as the hardware and firmware teams, to debug issues and ultimately drive OCI's success.
© 2025 Qureos. All rights reserved.