Qureos

AI Infrastructure Architect

What you will do (Key responsibilities)

1) Architect and deliver customer AI infrastructure (end-to-end)

  • Lead architecture and implementation for secure, scalable AI/ML/LLM platforms based on customer requirements and constraints.
  • Produce implementation-ready artifacts: HLD/LLD, reference architectures, network/topology diagrams, deployment plans, runbooks, and operational handover packs.
  • Translate business and technical requirements into a scalable target state, and guide delivery teams through build, rollout, and production readiness.

2) Solve real enterprise constraints (network + access + topology)

  • Design enterprise network topologies with segmentation/isolation: private subnets, route tables, security policies, egress control, private endpoints, controlled ingress patterns.
  • Work within common enterprise constraints:
    • Fixed network address plans (pre-approved CIDR ranges), IP allowlists/deny-lists, and limited routing flexibility
    • Private connectivity requirements (VPN/Direct Connect/FastConnect/ExpressRoute), no public endpoints, and restricted DNS resolution
    • Controlled administrative access (bastion/jump host, privileged access management, session recording, time-bound access)
    • Restricted egress (proxy-only outbound, firewall-controlled destinations, egress allowlists, DNS filtering, no direct internet)
    • Secure data movement and integration patterns for AI workloads (east-west and north-south traffic)
    • Customer-managed encryption and key custody (KMS/HSM, BYOK/HYOK, key rotation, certificate lifecycle)
    • Strict TLS policies (mTLS, approved ciphers, enterprise PKI, certificate pinning where required)
    • Identity and access controls (SSO/SAML/OIDC, RBAC/ABAC, least privilege, break-glass accounts)
    • Data governance constraints (PII/PHI handling, residency/sovereignty, retention, audit evidence requirements)
    • Secure software supply chain (approved base images, artifact signing, SBOM, vulnerability scanning, patch SLAs)
    • Endpoint controls (EDR agents, OS hardening standards, restricted packages, golden images)
    • Change management gates (CAB approvals, limited maintenance windows, separation of duties)
    • Observability restrictions (logs can’t leave tenant, redaction/masking, approved collectors/forwarders only)
    • Multi-tenant isolation and policy boundaries (namespace isolation, network policies, runtime sandboxing)
    • High availability & DR expectations (multi-zone patterns, backup/restore, failover runbooks, RTO/RPO)
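As a small illustration of working inside a fixed network address plan, the check below validates that a planned subnet fits entirely within pre-approved CIDR ranges before any infrastructure is provisioned. This is a minimal sketch: the CIDR values and function names are hypothetical, not part of any customer's actual plan.

```python
import ipaddress

# Hypothetical pre-approved CIDR ranges handed down by a customer network team.
APPROVED_CIDRS = [ipaddress.ip_network(c) for c in ("10.64.0.0/16", "10.65.0.0/16")]

def subnet_is_approved(subnet: str) -> bool:
    """Return True if the planned subnet fits entirely inside an approved range."""
    net = ipaddress.ip_network(subnet)
    return any(net.subnet_of(approved) for approved in APPROVED_CIDRS)

# A planned private subnet for inference nodes must fit the approved address plan.
print(subnet_is_approved("10.64.8.0/24"))   # inside 10.64.0.0/16
print(subnet_is_approved("172.16.0.0/24"))  # outside every approved range
```

In practice a check like this would run as a policy gate in the IaC pipeline, rejecting plans that drift outside the approved ranges.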

3) Security-by-design, InfoSec approvals, and guardrails for AI platforms

  • Lead InfoSec engagement: threat modeling, control mapping, evidence collection, remediation plans, and security signoffs for AI infrastructure.
  • Implement security controls and platform guardrails:
    • TLS/SSL-only communication patterns; encryption-in-transit and encryption-at-rest
    • API security: OAuth2/JWT/mTLS, gateway policies, request signing patterns where required
    • Secrets management using vault/key management services, rotation and lifecycle controls
    • IAM and least-privilege access models; tenant/project isolation
    • VM hardening (CIS-aligned baselines), patching strategy, secure images
    • “Kill switches” / emergency stop mechanisms for agents (tool-disable, egress cut-off, policy stop, rollback runbooks)
    • AI infra guardrails: controlled tool execution, outbound allowlists, boundary policies, audit-ready logging
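To make the "kill switch" and controlled-tool-execution ideas concrete, here is a minimal sketch of an agent guardrail: tool calls pass through an allowlist gate, and an emergency-stop flag refuses all further execution. Class and tool names are illustrative assumptions, not a prescribed design.

```python
import threading

class AgentGuardrail:
    """Sketch: gate agent tool calls behind an allowlist and a kill switch."""

    def __init__(self, allowed_tools):
        self.allowed_tools = set(allowed_tools)
        self._stopped = threading.Event()  # flipped by the emergency-stop path

    def emergency_stop(self):
        """Policy stop: all subsequent tool calls are refused."""
        self._stopped.set()

    def run_tool(self, name, fn, *args, **kwargs):
        if self._stopped.is_set():
            raise RuntimeError("kill switch engaged: tool execution disabled")
        if name not in self.allowed_tools:
            raise PermissionError(f"tool {name!r} is not on the allowlist")
        return fn(*args, **kwargs)

guard = AgentGuardrail(allowed_tools={"search_docs"})
print(guard.run_tool("search_docs", lambda q: f"results for {q}", "RTO"))
guard.emergency_stop()
try:
    guard.run_tool("search_docs", lambda q: q, "x")
except RuntimeError as exc:
    print(exc)
```

A production version would also cut egress at the network layer and emit audit-ready logs for every allow/deny decision, matching the guardrail bullets above.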

4) LLM hosting, GPU infrastructure, and scale

  • Architect LLM hosting patterns: managed endpoints, self-hosted inference, multi-model routing, and workload isolation.
  • Design and operationalize GPU-based inference at scale:
    • Capacity planning, GPU node pools, scaling policies, cost/performance optimization
    • Performance profiling and reliability patterns for inference services
  • Build container/Kubernetes-based AI platforms (OKE/EKS/AKS/GKE as applicable):
    • Secure cluster designs, namespaces/tenancy, node isolation, secrets, and safe rollout strategies
    • Support AI frameworks and application runtimes on Kubernetes for scale and portability
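The capacity-planning work above often starts with back-of-envelope arithmetic like the following sketch: given peak request rate, tokens per request, and sustained per-GPU throughput, how many replicas are needed? All figures here are illustrative assumptions, not benchmarks of any particular model or GPU.

```python
import math

def replicas_needed(peak_rps: float, tokens_per_request: int,
                    gpu_tokens_per_sec: float, headroom: float = 0.7) -> int:
    """Back-of-envelope GPU capacity plan (illustrative numbers only).

    headroom keeps each node below full saturation, leaving slack for
    latency spikes and single-replica failover.
    """
    demand_tokens_per_sec = peak_rps * tokens_per_request
    usable_per_gpu = gpu_tokens_per_sec * headroom
    return math.ceil(demand_tokens_per_sec / usable_per_gpu)

# e.g. 40 req/s at ~500 generated tokens each, GPUs sustaining ~2,500 tokens/s
print(replicas_needed(peak_rps=40, tokens_per_request=500, gpu_tokens_per_sec=2500))
```

Estimates like this feed node-pool sizing and autoscaling policies, then get validated against real load tests before production.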

5) Observability, reliability engineering, and operational readiness

  • Define and implement observability across AI systems:
    • Metrics, logs, traces, audit trails, and network call tracing
    • Integration with enterprise observability tools (customer standard platforms)
  • Define SLIs/SLOs for AI services:
    • Latency, throughput, error rates, saturation, GPU utilization, queue depth, retry behavior
  • Execute load testing and capacity validation for inference endpoints, vector stores, agent runtimes, and integration services.
  • Build reliable ops workflows: incident response, runbooks, dashboards, alerting, and proactive health checks.
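The SLI/SLO bullets above can be grounded with a simple error-budget calculation: an availability SLO implies a fixed number of allowed failures per window, and alerting is driven by how fast that budget burns. The function below is a minimal sketch with hypothetical numbers.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the window's error budget left (1.0 = untouched, <0 = blown).

    Illustrative only; real SLO targets and windows come from the
    customer's standard observability platform.
    """
    budget = (1.0 - slo_target) * total_requests  # allowed failures in the window
    return 1.0 - (failed_requests / budget)

# A 99.9% availability SLO over 1,000,000 requests allows 1,000 failures.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 4))  # 0.75 of the budget left
```

In practice, dashboards track this value per service, and fast burn rates page the on-call before the SLO is breached.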

6) Disaster recovery and resilience for AI platforms

  • Design DR strategies for AI solutions:
    • Multi-AD / multi-region patterns, backup/restore for critical stores, IaC-based rebuilds
    • Failover runbooks, RTO/RPO alignment, and validation exercises
  • Ensure production-grade resilience and safe rollback for platform and application layers.
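A quick way to sanity-check RPO alignment during DR design is to compare worst-case data loss (backup interval plus replication lag) against the agreed objective. The sketch below uses hypothetical intervals; actual RTO/RPO targets are set with the customer and validated in failover exercises.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval: timedelta,
                         replication_lag: timedelta) -> timedelta:
    """Data lost if the primary fails just before the next backup completes."""
    return backup_interval + replication_lag

def meets_rpo(backup_interval: timedelta, replication_lag: timedelta,
              rpo: timedelta) -> bool:
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo

# 4-hourly backups with ~5 min replication lag against two candidate RPOs:
print(meets_rpo(timedelta(hours=4), timedelta(minutes=5), rpo=timedelta(hours=6)))  # True
print(meets_rpo(timedelta(hours=4), timedelta(minutes=5), rpo=timedelta(hours=2)))  # False
```

When the check fails, the options are a tighter backup schedule, continuous replication, or renegotiating the RPO, each with a cost trade-off.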

7) Red teaming and risk mitigation for AI infrastructure

  • Drive security validation for AI infrastructure and agent deployments:
    • Attack surface review, secrets leakage paths, egress abuse scenarios
    • Prompt/tool misuse impact assessment at infrastructure level
  • Implement mitigations and hardening measures with measurable controls.

8) Consulting leadership and stakeholder management

  • Act as a trusted technical advisor to customer platform, network, and security teams.
  • Communicate clearly with diverse stakeholders (CIO/CTO, Security, Infra, App teams) and drive decisions under ambiguity.
  • Mentor engineers/architects, conduct design reviews, and build reusable delivery accelerators and blueprints.

Required experience and qualifications

  • 15+ years of experience in infrastructure architecture, cloud engineering, or platform consulting, with proven ownership of end-to-end architecture and delivery.
  • Strong fundamentals in networking, operating systems, distributed systems, and enterprise security.
  • Proven experience delivering secure, highly available platforms in regulated or enterprise environments.
  • Deep hands-on experience with:
    • Cloud infrastructure (OCI preferred; AWS/Azure/GCP acceptable)
    • Enterprise network design (VPC/VCN, VPNs, routing, firewalls, proxies, private endpoints, DNS)
    • Kubernetes/container platforms (OKE/EKS/AKS/GKE), secure cluster patterns, and scaling strategies
    • Infrastructure-as-Code (Terraform strongly preferred) and automation (Python/shell)
    • Observability stacks (logs/metrics/traces) and integration with enterprise monitoring tools
    • IAM, vault/key management, secrets handling, encryption standards, and audit controls
  • Strong customer-facing skills: requirements discovery, architecture documentation, and delivery leadership.

Preferred (nice-to-have) skills

  • LLM inference serving (open models and/or managed endpoints), multi-model routing, and AI workload isolation.
  • GPU platform engineering: scheduling, node pool design, performance tuning, and cost controls.
  • Experience implementing agentic AI runtime patterns with safe tool execution and enterprise guardrails.
  • Hybrid and multi-cloud deployments, including on-prem connectivity and enterprise integration patterns.
  • Familiarity with data platforms relevant to AI (vector stores, metadata stores, object storage patterns).

Core competencies (what we value)

  • Systems thinking and security-first architecture mindset
  • Strong problem solving in constrained enterprise environments
  • Crisp documentation and executive-ready communication
  • Hands-on delivery orientation (not just advisory)
  • Ownership, urgency, and accountability for production outcomes

Scope and impact (IC4 expectations)

  • Independently leads complex customer AI infrastructure programs from discovery through production and handover.
  • Unblocks security/network constraints and drives approvals with clear evidence and mitigations.
  • Establishes reusable, referenceable blueprints (secure AI landing zones, LLM hosting patterns, DR templates, observability baselines).
  • Raises the quality bar by mentoring teams and institutionalizing guardrails, reliability practices, and delivery accelerators.

