Qureos


AI Model Optimization & Fine-Tuning Engineer

(On-Device, Fully Offline)

About the Role

We are seeking a hands-on AI Model Optimization Engineer with proven experience taking large base models and fine-tuning, distilling, and quantizing them for fully offline mobile deployment. The role requires real-world experience with model compression, dataset preparation, and mobile inference optimization for Android and iOS devices.

Responsibilities

  • End-to-end pipeline: data prep → fine-tuning → distillation → quantization → mobile packaging → benchmarking.
  • Apply PTQ/QAT quantization and distillation to deploy LLMs and multimodal models onto devices with limited memory/thermal budgets.
  • Format and prepare datasets for fine-tuning (tokenization, tagging, deduplication, versioning).
  • Optimize models for battery efficiency, low latency, and minimal RAM usage.
  • Benchmark and debug inference performance with Perfetto, Battery Historian, Instruments, etc.
  • Collaborate with app teams to integrate optimized models.

Mandatory Skills Checklist (Applicants must demonstrate experience in ALL of the following)

Quantization & Distillation

  • Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
  • Familiarity with methods such as AWQ, GPTQ, SmoothQuant, and RPTQ.
  • Knowledge of 4-bit/8-bit schemes (INT4, INT8, FP4, NF4).
  • Distillation methods (teacher–student, logit matching, feature distillation).
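To illustrate the baseline skill in this list, here is a minimal PTQ sketch using PyTorch's dynamic quantization. The tiny two-layer model and its dimensions are purely illustrative stand-ins, not part of this role's actual pipeline:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a transformer block: PTQ targets nn.Linear layers,
# which dominate LLM parameter counts.
model = nn.Sequential(
    nn.Linear(64, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
)
model.eval()

# Post-Training Quantization (dynamic): weights are converted to INT8 offline;
# activations are quantized on the fly at inference time. Unlike QAT, no
# retraining is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
with torch.no_grad():
    y = quantized(x)  # same interface and output shape as the float model
```

Methods such as AWQ and GPTQ go further by using calibration data to choose per-channel scales that preserve accuracy at 4-bit widths.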

Fine-Tuning & Data Handling

  • LoRA/QLoRA/DoRA/AdaLoRA fine-tuning.
  • Instruction-tuning pipelines with PyTorch + Hugging Face.
  • Dataset formatting: JSONL, multi-turn dialogs, tagging, tokenization (SentencePiece/BPE).
  • Deduplication, stratified sampling, and eval set creation.
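The dataset-handling bullets above can be sketched with the standard library alone. The chat-style "messages" schema and the sample dialogs below are illustrative assumptions, not a mandated format:

```python
import hashlib
import json

# Hypothetical raw multi-turn dialogs as (role, text) pairs.
raw_dialogs = [
    [("user", "What is PTQ?"), ("assistant", "Post-Training Quantization ...")],
    [("user", "What is PTQ?"), ("assistant", "Post-Training Quantization ...")],  # exact duplicate
    [("user", "Explain LoRA."), ("assistant", "Low-Rank Adaptation ...")],
]

def to_record(dialog):
    """Convert a list of (role, text) turns into a chat-style JSONL record."""
    return {"messages": [{"role": r, "content": t} for r, t in dialog]}

def dedup_key(record):
    """Content hash used to drop exact duplicates before training."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen, lines = set(), []
for dialog in raw_dialogs:
    record = to_record(dialog)
    key = dedup_key(record)
    if key in seen:
        continue  # deduplication: skip repeated samples
    seen.add(key)
    lines.append(json.dumps(record, ensure_ascii=False))

jsonl = "\n".join(lines)  # one JSON object per line, ready for fine-tuning
print(len(lines))  # 2 unique records
```

In practice the same dedup key feeds stratified sampling and held-out eval-set splits so that no near-identical sample leaks across the train/eval boundary.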

On-Device Deployment

  • Hands-on with at least two runtimes: llama.cpp / GGUF, MLC LLM, ExecuTorch, ONNX Runtime Mobile, TensorFlow Lite, Core ML.
  • Experience with hardware acceleration: Metal (iOS), NNAPI (Android), GPU/Vulkan, Qualcomm DSP/NPU, XNNPACK.
  • Real-world deployment: must provide examples of models running fully offline on mobile (tokens/s, RAM usage, device specs).
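Each runtime listed above has its own converter (ExecuTorch, ONNX Runtime Mobile, TFLite, and Core ML all differ). As a dependency-free sketch of the common export step, here is TorchScript tracing, the packaging format used by the older PyTorch Mobile runtime; the tiny model is a placeholder for a fine-tuned, quantized checkpoint:

```python
import io
import torch
import torch.nn as nn

# Placeholder model; a real pipeline would load the quantized checkpoint here.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8)).eval()

example = torch.randn(1, 32)
# Trace the model into a self-contained TorchScript artifact.
scripted = torch.jit.trace(model, example)

buf = io.BytesIO()
torch.jit.save(scripted, buf)  # in a real pipeline this is a .pt file shipped in the app
buf.seek(0)
reloaded = torch.jit.load(buf)

# Sanity check: the exported artifact reproduces the source model's outputs.
with torch.no_grad():
    assert torch.allclose(model(example), reloaded(example))
```

The same verify-after-export step applies to every runtime in the list: compare on-device outputs against the reference model before measuring tokens/s and RAM.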

Performance & Benchmarking

  • Tools: Perfetto, systrace, Battery Historian, adb stats (Android); Xcode Instruments, Energy Log (iOS).
  • Profiling decode speed, cold start vs. warm start latency, RAM usage, and energy consumption.
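The cold-start vs. warm-decode distinction above can be captured in a small harness. `fake_generate` is a hypothetical stand-in for a real runtime's decode loop (e.g. a llama.cpp binding); only the timing structure is the point:

```python
import time

def fake_generate(n_tokens, delay_s=0.0005):
    """Hypothetical stand-in for an on-device decode loop."""
    out = []
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulate per-token decode cost
        out.append(i)
    return out

def benchmark(generate, n_tokens=50):
    """Measure cold-start latency and warm decode throughput in tokens/s."""
    t0 = time.perf_counter()
    generate(1)                      # cold start: time to first token
    cold_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    generate(n_tokens)               # warm decode: steady-state throughput
    warm_s = time.perf_counter() - t0
    return {"cold_start_s": cold_s, "decode_tok_per_s": n_tokens / warm_s}

stats = benchmark(fake_generate)
print(stats)
```

On real hardware these wall-clock numbers are then correlated with Perfetto traces (Android) or Instruments timelines (iOS) to attribute latency to CPU, GPU, or memory pressure.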

General

  • Strong PyTorch and Hugging Face experience.
  • Clear documentation and ability to explain optimization trade-offs.

Nice to Have

  • Open-source contributions to LLM quantization/edge-AI frameworks.
  • Prior deployment of Qwen, LLaMA, Gemma, or Mistral families onto mobile devices.
  • Multilingual or low-resource dataset experience (Urdu, Arabic, Hindi, etc.), including tokenization, script handling, and fine-tuning.
  • Familiarity with multimodal (ASR/TTS/VAD) integration on device.

Application Requirements

Applicants must include in their application:

  • A short case study of a model they have fine-tuned (dataset + method + results).
  • A short case study of a model they have quantized/distilled for mobile (framework + bit-depth + device + performance metrics).
  • Links to GitHub repos, papers, or APK/TestFlight builds if available.

Job Type: Full-time

Pay: Rs250,000.00 - Rs400,000.00 per month

Work Location: In person

© 2025 Qureos. All rights reserved.