Qureos

LLM Inference Engineer

Job Title:

Machine Learning Engineer (Inference & Systems)

(also known as Inference Engineer)

Location:

Work from Home (fully remote, flexible)

Job Timing:

Part-time with fully flexible hours, with the option to transition to full-time

About the Role:

We’re building a next-generation cloud platform to serve multimodal AI (LLMs, vision, audio, and other machine learning models) at scale. As a Machine Learning Engineer (Inference & Systems), you’ll design and optimize runtime systems, OpenAI-compatible APIs, and distributed GPU pipelines for fast, cost-efficient inference and fine-tuning.

You’ll work with frameworks like vLLM, TensorRT-LLM, and TGI to design, optimize, and deploy distributed inference engines that serve text, vision, and multimodal models with low latency and high throughput. This includes deploying models such as LLaMA 3, Mistral, diffusion, ASR, TTS, and embeddings, while focusing on GPU/accelerator optimizations, software–hardware co-design, and fault-tolerant large-scale systems that power real-world applications and developer tools.
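
To make the serving-engine side of this work concrete, below is a minimal sketch of batched offline inference with vLLM; the model ID, sampling settings, and parallelism degree are illustrative assumptions rather than project specifics.

    # Minimal vLLM sketch: batched generation with a LLaMA 3 checkpoint.
    # The model ID, sampling settings, and tensor_parallel_size are
    # illustrative assumptions, not values prescribed by this role.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # any HF-hosted causal LM
        tensor_parallel_size=1,           # >1 shards the model across GPUs
        gpu_memory_utilization=0.90,      # leave headroom for the KV cache
    )

    sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

    prompts = [
        "Explain paged KV caching in one paragraph.",
        "Summarize tensor parallelism for LLM serving.",
    ]

    # vLLM schedules these requests with continuous batching and PagedAttention.
    for output in llm.generate(prompts, sampling):
        print(output.prompt, "->", output.outputs[0].text.strip())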

You’ll work at the intersection of machine learning, cloud infrastructure, and systems engineering, focusing on high-throughput, low-latency inference and cost-efficient deployment. This role offers a unique opportunity to shape the future of AI inference infrastructure, from cutting-edge model serving systems to production-grade deployment pipelines.

If you're passionate about pushing the boundaries of AI inference, we’d love to hear from you!

Key Responsibilities:

· Deploy and maintain LLMs (e.g., LLaMA 3, Mistral) and other ML models using serving engines such as vLLM, Hugging Face TGI, TensorRT-LLM, or FasterTransformer.

· Design and develop large-scale distributed inference engines for text, image, LLM, and multimodal models that are fault-tolerant, high-concurrency, high-performance, and cost-efficient.

· Implement and optimize distributed inference and parallelism strategies, including Mixture of Experts (MoE), tensor parallelism, and pipeline parallelism, for high-performance serving.

· Integrate vLLM, TGI, SGLang, FasterTransformer, and explore emerging inference frameworks.

· Build and scale an OpenAI-compatible API layer to expose models for customer use (a minimal sketch appears after this list).

· Experiment with model quantization, caching, and parallelism to reduce inference costs.

· Optimize GPU usage, memory, and batching to achieve low-latency, high-throughput inference.

· Optimize GPU performance using CUDA graph optimizations, TensorRT-LLM, Triton kernels, PyTorch compilation (torch.compile), quantization, and speculative decoding to maximize efficiency.

· Work with cloud GPU providers (RunPod, Vast.ai, AWS, GCP, Azure) to manage costs and availability.

· Develop runtime inference services and APIs for LLMs, multimodal models, and fine-tuning pipelines.

· Build monitoring and observability for inference services, integrating inference metrics (latency, throughput, GPU utilization) into dashboards (Grafana, Prometheus, Loki, OpenTelemetry).

· Collaborate with backend and DevOps engineers to ensure secure, reliable APIs with rate-limiting and billing hooks.

· Document deployment processes and provide guidance to other engineers using the platform.
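
As a sketch of what the OpenAI-compatible API layer and its metrics instrumentation might look like, the endpoint below mirrors the public chat-completions request/response shape; the run_model() helper, default parameters, and metric names are hypothetical placeholders, and a real deployment would forward requests to vLLM, TGI, or another serving engine.

    # Hedged sketch: OpenAI-compatible /v1/chat/completions endpoint with a
    # Prometheus latency histogram. run_model() is a hypothetical placeholder.
    import time
    import uuid

    from fastapi import FastAPI
    from prometheus_client import Histogram, make_asgi_app
    from pydantic import BaseModel

    app = FastAPI()
    app.mount("/metrics", make_asgi_app())  # scraped by Prometheus/Grafana

    REQUEST_LATENCY = Histogram(
        "inference_request_latency_seconds",
        "End-to-end chat completion latency",
    )

    class Message(BaseModel):
        role: str
        content: str

    class ChatRequest(BaseModel):
        model: str
        messages: list[Message]
        max_tokens: int = 256
        temperature: float = 0.7

    def run_model(req: ChatRequest) -> str:
        # Placeholder: forward to the serving engine (vLLM, TGI, ...).
        return "stubbed completion"

    @app.post("/v1/chat/completions")
    def chat_completions(req: ChatRequest):
        start = time.time()
        text = run_model(req)
        REQUEST_LATENCY.observe(time.time() - start)
        # Response shaped like the OpenAI chat-completions schema.
        return {
            "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
            "object": "chat.completion",
            "created": int(start),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }],
        }

Served behind any ASGI server (e.g., uvicorn), a standard OpenAI SDK client can target this service simply by overriding its base URL.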

Requirements:

· Experience: 3+ years in deep learning inference, fault-tolerant distributed systems, or high-performance computing.

· Proven experience in deploying ML/LLM models to production.

· Inference: Hands-on experience with at least one inference engine: vLLM, TGI, SGLang, TensorRT-LLM, FasterTransformer, or Triton.

· Runtime Services: Prior work implementing large-scale inference or serving pipelines.

· Solid understanding of GPU memory management, batching, and distributed inference. Strong knowledge of GPU programming (CUDA, Triton, TensorRT), ML compilers, model quantization, and GPU cluster scheduling (a brief quantization and compilation sketch follows this list).

· Experienced in the GPU/ML stack including PyTorch, Hugging Face Transformers, and GPU-accelerated inference.

· Deep understanding of Transformer architectures, LLM/VLM/Diffusion model optimization, and KV cache systems such as Mooncake, PagedAttention, or custom in-house variants, with experience enhancing them for long-context serving and applying inference optimization techniques such as workload scheduling, efficient kernels, and CUDA graphs.

· Comfortable working with cloud GPU platforms (AWS/GCP/Azure) or GPU marketplaces (RunPod, Vast.ai, TensorDock) to profile bottlenecks and optimize GPU utilization.

· Experience benchmarking and tuning multi-GPU clusters for throughput and memory efficiency.

· Experience building REST APIs or gRPC services (FastAPI, Flask, or similar).

· Programming: Proficient in Python, Go, Rust, C++, and/or CUDA for high-performance systems.

· Systems knowledge: Demonstrated experience with distributed systems (storage, search, compute, or inference). Strong understanding of multi-threading, memory management, networking, storage, and performance tuning.

· Familiarity with containerization (Docker) and orchestration (Kubernetes).

· Strong problem-solving and debugging skills across ML + infra stack.

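As a brief illustration of the quantization and compilation techniques referenced above, the sketch below applies post-training dynamic int8 quantization and torch.compile to a toy module; the layer sizes are arbitrary, and production LLM deployments would typically use dedicated paths (TensorRT-LLM, vLLM quantization backends, GPTQ/AWQ, and similar).

    # Hedged sketch: dynamic int8 quantization and torch.compile on a toy MLP.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.GELU(),
        nn.Linear(4096, 1024),
    ).eval()

    # Post-training dynamic quantization of the Linear layers (int8, CPU path).
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # torch.compile traces and fuses the forward graph for repeated calls.
    compiled = torch.compile(model)

    x = torch.randn(8, 1024)
    with torch.no_grad():
        y_fp32 = compiled(x)    # compiled full-precision path
        y_int8 = quantized(x)   # dynamically quantized path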

Nice to Have:

· Experience with Stripe or other billing systems for metered API usage.

· Experience with large-scale datacenter networking (RDMA/RoCE).

· Familiarity with distributed storage (Ceph, HDFS, 3FS).

· Knowledge of Redis or Envoy for request rate limiting.

· Familiarity with observability tools (Grafana, Prometheus, Loki).

· Exposure to MLOps pipelines (CI/CD with Azure DevOps or GitHub Actions).

· Experience with model fine-tuning pipelines and GPU scheduling.

· Understanding of rate limiting, quota enforcement, and billing hooks in ML APIs.

· Prior work at an AI infra company (Together.ai, Modal, Anyscale, Replicate, etc.).

Why Join Us?

· Work from Anywhere – 100% remote, with the freedom to work from anywhere in the USA.

· Fully Flexible Shifts – complete control over your working hours; results matter more than clocking in.

· Career Growth & Fast-Track Promotions – rapid promotion opportunities and clear pathways for advancement.

· Professional Development – training budget, mentorship, and exposure to cutting-edge Salesforce, AI/ML, and cloud technologies.

· Global Collaboration – work with an international, diverse, and inclusive team.

· Innovative Environment – freedom to experiment with new tools, frameworks, and ideas.

· Accelerated Salary Growth + Performance Incentives – ambitious and hard-working team members are rewarded with fast upward salary progression alongside strong performance bonuses.

Job Type: Part-time

Benefits:

  • Flexible schedule

Work Location: Remote
