Architect, AI Operations, OCI, NA

Here at Oracle Cloud Infrastructure (OCI), we’re building the world’s largest AI clusters, and we’re the fastest at bringing them to market. OCI AI Infrastructure is at the forefront of building a cutting-edge, ultra-high-performance GPU platform designed to support AI/ML/HPC workloads. This is your chance to be part of the AI revolution, working with systems that allow customers to scale from tens to thousands of GPUs without compromising performance.

Our team is responsible for designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services. You will have the opportunity to work with cutting-edge technologies and make a significant impact on our organization's success.


  • Play a pivotal role in ensuring AI infrastructure continues to meet the rapidly evolving demands of both Enterprise and AI/ML customers.
  • Ensure reliability and customer satisfaction through proactive issue management and recurring pattern resolution.
  • Engage with Enterprise and AI/ML customers to understand their specific requirements for uninterrupted workloads.
  • Advance the organization’s goals by pursuing opportunities that make AI infrastructure more efficient.
  • Partner and collaborate with organization leaders to help improve the performance of the team and organization.
  • Communicate major incidents to customers at both a technical and a strategic level.
  • Identify recurring issues and drive organizational improvements to prevent future incidents.
  • Perform hands-on debugging and log analysis (this is an individual contributor role).
  • Collaborate with internal teams to maintain uptime, performance, and customer growth.
  • Implement monitoring and optimization frameworks for AI workloads, including latency, throughput, and cost efficiency.
  • Lead incident response and post-mortem processes for infrastructure issues impacting AI services.
