About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet with the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (Specific to This Role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM’s AI infrastructure powering mission-critical, secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps so that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3–4x larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles
- Security is integrated into every decision, from architecture to deployment.
- Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
- Quality is measurable, enforced, and automated at every stage.
- All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
- Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
- Define the MLOps architecture and governance framework across products
- Design secure, scalable AI platform blueprints covering the data, training, serving, and monitoring layers
- Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments
- Lead architectural designs and reviews for AI pipelines
- Design and maintain LLM inference infrastructure (a minimal serving sketch follows this list)
- Manage model registries and versioning (MLflow, Weights & Biases)
- Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM)
- Optimize model performance and cost (quantization, caching, batching)
- Build and maintain vector databases (Pinecone, Weaviate, Chroma)
- Maintain awareness of hardware and inference optimization techniques
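To make the serving layer concrete, here is a minimal offline-inference sketch using vLLM, one of the serving solutions named above. The checkpoint name, sampling parameters, and GPU setting are illustrative assumptions; an air-gapped deployment would point at a locally mirrored, signed artifact rather than a Hugging Face model ID.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; in an air-gapped environment this would be a
# local filesystem path to a vetted, signed model artifact.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    gpu_memory_utilization=0.90,  # assumed memory headroom target
)

params = SamplingParams(temperature=0.2, max_tokens=256)

# vLLM applies continuous batching to incoming prompts internally, which
# is one of the batching cost levers referred to above.
outputs = llm.generate(
    ["Summarize the deployment policy for signed model artifacts."],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```

Quantized variants (e.g., AWQ) can be loaded through vLLM's `quantization` argument, which is typically where the quantization and batching optimizations named above are applied.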
Agent & Tool Development (25%)
- Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection)
- Build AI-assisted DevSecOps utilities that automatically enforce compliance, logging, and audit policies
- Build tool integrations for LLM agents (function calling, APIs)
- Implement retrieval-augmented generation (RAG) pipelines (see the sketch after this list)
- Create prompt management and versioning systems
- Monitor and optimize agent performance
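A minimal RAG retrieval step might look like the following, using Chroma from the vector databases listed earlier. The collection name, documents, and prompt template are placeholders, and the final completion call is left to whichever provider or locally hosted model the pipeline targets.

```python
import chromadb

# In-memory client for illustration; a production deployment would use a
# persistent, access-controlled instance. Chroma's default embedding
# function is used for brevity; an air-gapped deployment would supply a
# locally hosted one.
client = chromadb.Client()
collection = client.create_collection("product_docs")

# Index a few placeholder documents.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Devices enforce hardware-backed key storage.",
        "All model artifacts must be signed before deployment.",
    ],
)

# Retrieve the chunks most relevant to the user's question.
question = "How are model artifacts protected?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# The retrieved context is placed into the prompt sent to the LLM; the
# completion call itself is omitted here as provider-specific.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```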
CI/CT/CD Pipelines (20%)
- Build continuous integration pipelines for models and code
- Implement continuous training (CT) workflows
- Automate model deployment with rollback capabilities
- Create staging and production deployment strategies
- Integrate AI-assisted code review into CI/CD
- Build a continuous evaluation loop (sketched below)
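One shape such an evaluation gate could take is sketched below. The thresholds are assumptions that echo the success metrics at the end of this posting; the evaluation run and the registry promote/rollback calls would be wired to the team's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float
    p95_latency_ms: float

# Assumed thresholds; tune per product.
MIN_ACCURACY_DELTA = -0.01   # tolerate at most a one-point regression
MAX_P95_LATENCY_MS = 500.0

def may_promote(candidate: EvalResult, baseline: EvalResult) -> bool:
    """Gate run after continuous training: True means promote the
    candidate, False means keep (or roll back to) the baseline model."""
    if candidate.p95_latency_ms > MAX_P95_LATENCY_MS:
        return False
    return (candidate.accuracy - baseline.accuracy) >= MIN_ACCURACY_DELTA

if __name__ == "__main__":
    # In a real pipeline these numbers come from the hold-out evaluation
    # suite; the registry promote/rollback calls are left hypothetical.
    baseline = EvalResult(accuracy=0.91, p95_latency_ms=420.0)
    candidate = EvalResult(accuracy=0.93, p95_latency_ms=390.0)
    print("promote" if may_promote(candidate, baseline) else "rollback")
```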
Infrastructure & Automation (15%)
- Manage cloud infrastructure (Kubernetes, serverless)
- Implement Infrastructure as Code (Terraform, Pulumi)
- Build monitoring and observability systems such as Prometheus, Grafana, and DataDog (see the instrumentation sketch after this list)
- Automate operational tasks with AI agents
- Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls
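As an example of the instrumentation side, a service can expose inference metrics to Prometheus with the official Python client. The metric and model names below are assumptions to be aligned with the team's actual dashboard conventions.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Assumed metric names; align with the Grafana dashboards in use.
INFERENCES = Counter(
    "inference_requests_total", "Total inference requests", ["model"]
)
LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds", ["model"]
)

def infer(model: str) -> None:
    INFERENCES.labels(model=model).inc()
    with LATENCY.labels(model=model).time():
        time.sleep(random.uniform(0.05, 0.2))  # stand-in for a real model call

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        infer("summarizer-v2")
```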
Developer Enablement (10%)
- Provide tools and libraries for engineers to adopt AI-augmented workflows securely (an example wrapper follows this list)
- Document AI/ML best practices and patterns
- Conduct training on MLOps tools and workflows
- Support engineers with AI integration challenges
- Maintain development environment parity
- Champion AI privacy, governance, and compliance practices
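One such enablement library could be as small as a provider-agnostic wrapper that gives every AI-assisted call an audit trail, in line with the traceability principle above. Everything here (function names, logged fields) is a hypothetical sketch, not an existing KATIM API.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_audit")

def audited_completion(call, prompt: str, **kwargs):
    """Wrap any provider's completion function so each AI-assisted call
    leaves a structured, reviewable audit record."""
    request_id = str(uuid.uuid4())
    started = time.time()
    response = call(prompt, **kwargs)
    logger.info(json.dumps({
        "request_id": request_id,
        "duration_s": round(time.time() - started, 3),
        "prompt_chars": len(prompt),
        "response_chars": len(str(response)),
    }))
    return response

def stub_llm(prompt: str, **kwargs) -> str:
    # Stand-in for whichever client function a team standardizes on.
    return "stubbed completion"

if __name__ == "__main__":
    print(audited_completion(stub_llm, "Explain artifact signing."))
```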
Education and Minimum Qualifications
- BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; a Master's degree is preferred
- 8+ years in DevOps, SRE, or platform engineering
- 5+ years of hands-on experience with ML/AI systems in production
- Deep understanding of LLMs and their operational requirements
- Experience building and maintaining CI/CD pipelines
- Strong Linux/Unix systems knowledge
- Cloud platform expertise (AWS, GCP, or Azure)
- Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
- LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
- Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
- Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
- Model Registries: MLflow, Kubeflow, AWS SageMaker
- Vector Databases: Pinecone, Weaviate, Chroma, Milvus
- Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
- Fine-tuning: LoRA, QLoRA, prompt tuning
Data Engineering:
- Pipelines: Airflow, Prefect, Dagster
- Processing: Spark, Dask, Ray
- Streaming: Kafka, Pulsar, Kinesis
- Data Quality: Great Expectations, dbt
- Feature Stores: Feast, Tecton
DevOps & Infrastructure:
- Containers: Docker, Kubernetes, Helm
- Cloud Platforms: AWS (SageMaker, Lambda, ECS), GCP (Vertex AI, Cloud Run), or Azure (ML Studio)
- IaC: Terraform, Pulumi, CloudFormation
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
- Metrics: Prometheus, Grafana, CloudWatch
- Logging: ELK Stack, Loki, CloudWatch Logs
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Alerting: PagerDuty, Opsgenie
- Model Monitoring: Arize, Fiddler, Evidently
Programming:
- Python: Primary language for ML/AI
- Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
- Serving: FastAPI, Flask (see the endpoint sketch after this list)
- Go: For high-performance services and tooling
- Shell Scripting: Bash and Python for automation
- SQL: Advanced queries, optimization
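For the serving skills above, a minimal FastAPI endpoint sketch is shown below; the route, schema, and scoring logic are placeholders for an actual model call.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-serving-sketch")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Placeholder scoring logic; a real service would invoke the loaded model.
    score = min(len(req.text) / 100.0, 1.0)
    return PredictResponse(label="ok" if score > 0.5 else "low", score=score)

# Run with: uvicorn serving_sketch:app --port 8080  (module name assumed)
```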
AI-Assisted Operations:
- Autonomous agents for incident response
- AI-powered log analysis and anomaly detection
- Automated root cause analysis
- Intelligent alerting and noise reduction
Other Highly Desirable Skills:
- Experience with LLM fine-tuning and deployment at scale
- Background in data engineering or ML engineering
- Startup or high-growth environment experience
- Security certifications (CISSP, AWS Security)
- Contributions to open source MLOps projects
- Experience with multi-cloud or hybrid cloud
- Prior software engineering experience
Success Metrics
- Uptime: 99.9%+ availability for AI services
- Deployment Frequency: Daily or on-demand deployments
- Model Performance: Latency (p95 < 500 ms), accuracy tracking
- Cost Efficiency: Cost per inference, infrastructure utilization
- Developer Velocity: Time to deploy new models, AI feature adoption rate
- Incident Response: MTTD (mean time to detect), MTTR (mean time to resolve)
#KATIM