About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet with the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (Specific to This Role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM’s AI infrastructure powering mission-critical, secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps so that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3–4x larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles
- Security is integrated into every decision, from architecture to deployment.
- Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
- Quality is measurable, enforced, and automated at every stage.
- All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
- Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
- Define the MLOps architecture and governance framework across products
- Design secure, scalable AI platform blueprints covering the data, training, serving, and monitoring layers
- Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments
- Lead architectural designs and reviews for AI pipelines
- Design and maintain LLM inference infrastructure (a minimal serving sketch follows this list)
- Manage model registries and versioning (MLflow, Weights & Biases)
- Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM)
- Optimize model performance and cost (quantization, caching, batching)
- Build and maintain vector databases (Pinecone, Weaviate, Chroma)
- Maintain awareness of hardware and inference optimization techniques
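To make the serving layer concrete, here is a minimal offline-inference sketch using vLLM, one of the serving solutions named above. The checkpoint name, sampling parameters, and GPU setting are illustrative assumptions; an air-gapped deployment would point at a locally mirrored, signed artifact rather than a Hugging Face model ID.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; in an air-gapped environment this would be a
# local filesystem path to a vetted, signed model artifact.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    gpu_memory_utilization=0.90,  # assumed memory headroom target
)

params = SamplingParams(temperature=0.2, max_tokens=256)

# vLLM applies continuous batching to incoming prompts internally, which
# is one of the batching cost levers referred to above.
outputs = llm.generate(
    ["Summarize the deployment policy for signed model artifacts."],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```

Quantized variants (e.g., AWQ) can be loaded through vLLM's `quantization` argument, which is typically where the quantization and batching optimizations named above are applied.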
Agent & Tool Development (25%)
- Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection)
- Build AI-assisted DevSecOps utilities that automatically enforce compliance, logging, and audit policies
- Build tool integrations for LLM agents (function calling, APIs)
- Implement retrieval-augmented generation (RAG) pipelines (see the sketch after this list)
- Create prompt management and versioning systems
- Monitor and optimize agent performance
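A minimal RAG retrieval step might look like the following, using Chroma from the vector databases listed earlier. The collection name, documents, and prompt template are placeholders, and the final completion call is left to whichever provider or locally hosted model the pipeline targets.

```python
import chromadb

# In-memory client for illustration; a production deployment would use a
# persistent, access-controlled instance. Chroma's default embedding
# function is used for brevity; an air-gapped deployment would supply a
# locally hosted one.
client = chromadb.Client()
collection = client.create_collection("product_docs")

# Index a few placeholder documents.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Devices enforce hardware-backed key storage.",
        "All model artifacts must be signed before deployment.",
    ],
)

# Retrieve the chunks most relevant to the user's question.
question = "How are model artifacts protected?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# The retrieved context is placed into the prompt sent to the LLM; the
# completion call itself is omitted here as provider-specific.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```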
CI/CT/CD Pipelines (20%)
- Build continuous integration pipelines for models and code
- Implement continuous training (CT) workflows
- Automate model deployment with rollback capabilities
- Create staging and production deployment strategies
- Integrate AI-assisted code review into CI/CD
- Build a continuous evaluation loop (sketched below)
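One shape such an evaluation gate could take is sketched below. The thresholds are assumptions that echo the success metrics at the end of this posting; the evaluation run and the registry promote/rollback calls would be wired to the team's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float
    p95_latency_ms: float

# Assumed thresholds; tune per product.
MIN_ACCURACY_DELTA = -0.01   # tolerate at most a one-point regression
MAX_P95_LATENCY_MS = 500.0

def may_promote(candidate: EvalResult, baseline: EvalResult) -> bool:
    """Gate run after continuous training: True means promote the
    candidate, False means keep (or roll back to) the baseline model."""
    if candidate.p95_latency_ms > MAX_P95_LATENCY_MS:
        return False
    return (candidate.accuracy - baseline.accuracy) >= MIN_ACCURACY_DELTA

if __name__ == "__main__":
    # In a real pipeline these numbers come from the hold-out evaluation
    # suite; the registry promote/rollback calls are left hypothetical.
    baseline = EvalResult(accuracy=0.91, p95_latency_ms=420.0)
    candidate = EvalResult(accuracy=0.93, p95_latency_ms=390.0)
    print("promote" if may_promote(candidate, baseline) else "rollback")
```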
Infrastructure & Automation (15%)
- Manage cloud infrastructure (Kubernetes, serverless)
- Implement Infrastructure as Code (Terraform, Pulumi)
- Build monitoring and observability systems such as Prometheus, Grafana, and DataDog (see the instrumentation sketch after this list)
- Automate operational tasks with AI agents
- Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls
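As an example of the instrumentation side, a service can expose inference metrics to Prometheus with the official Python client. The metric and model names below are assumptions to be aligned with the team's actual dashboard conventions.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Assumed metric names; align with the Grafana dashboards in use.
INFERENCES = Counter(
    "inference_requests_total", "Total inference requests", ["model"]
)
LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds", ["model"]
)

def infer(model: str) -> None:
    INFERENCES.labels(model=model).inc()
    with LATENCY.labels(model=model).time():
        time.sleep(random.uniform(0.05, 0.2))  # stand-in for a real model call

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        infer("summarizer-v2")
```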
Developer Enablement (10%)
- Provide tools and libraries for engineers to adopt AI-augmented workflows securely (an example wrapper follows this list)
- Document AI/ML best practices and patterns
- Conduct training on MLOps tools and workflows
- Support engineers with AI integration challenges
- Maintain development environment parity
- Champion AI privacy, governance, and compliance practices
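One such enablement library could be as small as a provider-agnostic wrapper that gives every AI-assisted call an audit trail, in line with the traceability principle above. Everything here (function names, logged fields) is a hypothetical sketch, not an existing KATIM API.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_audit")

def audited_completion(call, prompt: str, **kwargs):
    """Wrap any provider's completion function so each AI-assisted call
    leaves a structured, reviewable audit record."""
    request_id = str(uuid.uuid4())
    started = time.time()
    response = call(prompt, **kwargs)
    logger.info(json.dumps({
        "request_id": request_id,
        "duration_s": round(time.time() - started, 3),
        "prompt_chars": len(prompt),
        "response_chars": len(str(response)),
    }))
    return response

def stub_llm(prompt: str, **kwargs) -> str:
    # Stand-in for whichever client function a team standardizes on.
    return "stubbed completion"

if __name__ == "__main__":
    print(audited_completion(stub_llm, "Explain artifact signing."))
```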
Education and Minimum Qualifications
- BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; a Master's degree is preferred
- 8+ years in DevOps, SRE, or platform engineering
- 5+ years of hands-on experience with ML/AI systems in production
- Deep understanding of LLMs and their operational requirements
- Experience building and maintaining CI/CD pipelines
- Strong Linux/Unix systems knowledge
- Cloud platform expertise (AWS, GCP, or Azure)
- Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
- LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
- Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
- Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
- Model Registries: MLflow, Kubeflow, AWS SageMaker
- Vector Databases: Pinecone, Weaviate, Chroma, Milvus
- Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
- Fine-tuning: LoRA, QLoRA, prompt tuning
Data Engineering:
- Pipelines: Airflow, Prefect, Dagster
- Processing: Spark, Dask, Ray
- Streaming: Kafka, Pulsar, Kinesis
- Data Quality: Great Expectations, dbt
- Feature Stores: Feast, Tecton
DevOps & Infrastructure:
- Containers: Docker, Kubernetes, Helm
- Cloud Platforms: AWS (SageMaker, Lambda, ECS), GCP (Vertex AI, Cloud Run), or Azure (ML Studio)
- IaC: Terraform, Pulumi, CloudFormation
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
- Metrics: Prometheus, Grafana, CloudWatch
- Logging: ELK Stack, Loki, CloudWatch Logs
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Alerting: PagerDuty, Opsgenie
- Model Monitoring: Arize, Fiddler, Evidently
Programming:
- Python: Primary language for ML/AI
- Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
- Serving: FastAPI, Flask (see the endpoint sketch after this list)
- Go: For high-performance services and tooling
- Shell Scripting: Bash and Python for automation
- SQL: Advanced queries, optimization
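For the serving skills above, a minimal FastAPI endpoint sketch is shown below; the route, schema, and scoring logic are placeholders for an actual model call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-serving-sketch")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Placeholder scoring logic; a real service would invoke the loaded model.
    score = min(len(req.text) / 100.0, 1.0)
    return PredictResponse(label="ok" if score > 0.5 else "low", score=score)

# Run with: uvicorn serving_sketch:app --port 8080  (module name assumed)
```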
AI-Assisted Operations:
- Autonomous agents for incident response
- AI-powered log analysis and anomaly detection
- Automated root cause analysis
- Intelligent alerting and noise reduction
Other Highly Desirable Skills:
- Experience with LLM fine-tuning and deployment at scale
- Background in data engineering or ML engineering
- Startup or high-growth environment experience
- Security certifications (CISSP, AWS Security)
- Contributions to open source MLOps projects
- Experience with multi-cloud or hybrid cloud
- Prior software engineering experience
Success Metrics
- Uptime: 99.9%+ availability for AI services
- Deployment Frequency: Daily or on-demand deployments
- Model Performance: Latency (p95 < 500 ms), accuracy tracking
- Cost Efficiency: Cost per inference, infrastructure utilization
- Developer Velocity: Time to deploy new models, AI feature adoption rate
- Incident Response: MTTD (mean time to detect), MTTR (mean time to resolve)
#KATIM