AI Model Optimization & Fine-Tuning Engineer
(On-Device, Fully Offline)
About the Role
We are seeking a hands-on AI Model Optimization Engineer with proven experience taking large base models and fine-tuning, distilling, and quantizing them for fully offline mobile deployment. The role requires real-world experience with model compression, dataset preparation, and mobile inference optimization on Android and iOS devices.
Responsibilities
- Own the end-to-end pipeline: data prep → fine-tuning → distillation → quantization → mobile packaging → benchmarking.
- Apply PTQ/QAT and distillation to deploy LLMs and multimodal models on devices with tight memory and thermal budgets.
- Format and prepare datasets for fine-tuning (tokenization, tagging, deduplication, versioning).
- Optimize models for battery efficiency, low latency, and minimal RAM usage.
- Benchmark and debug inference performance with Perfetto, Battery Historian, Instruments, etc.
- Collaborate with app teams to integrate optimized models.
Mandatory Skills Checklist (Applicants must demonstrate experience in ALL of the following)
✅ Quantization & Distillation
- Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
- Methods like AWQ, GPTQ, SmoothQuant, RPTQ.
- Knowledge of 4-bit/8-bit schemes (INT4, INT8, FP4, NF4).
- Distillation methods: teacher–student training, logit matching, feature distillation (see the sketch after this list).
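For context, here is a minimal sketch of the logit-matching distillation named above, assuming a standard Hinton-style KD loss; all shapes and hyperparameters are illustrative, not a prescribed recipe:

```python
# Illustrative teacher-student logit matching (Hinton-style KD).
# Temperature and alpha are assumed values, not a recommendation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL (scaled by T^2) with hard-label CE."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # rescale gradients to match CE magnitude
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Quick check with random tensors (batch of 4, vocab of 100):
s = torch.randn(4, 100)
t = torch.randn(4, 100)
y = torch.randint(0, 100, (4,))
print(distillation_loss(s, t, y))
```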
✅ Fine-Tuning & Data Handling
- LoRA/QLoRA/DoRA/AdaLoRA fine-tuning (see the LoRA sketch after this list).
- Instruction-tuning pipelines with PyTorch + Hugging Face.
- Dataset formatting: JSONL, multi-turn dialogs, tagging, tokenization (SentencePiece/BPE).
- Deduplication, stratified sampling, and eval set creation.
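For context, a minimal sketch of attaching LoRA adapters with Hugging Face PEFT; the base model name, target modules, and hyperparameters are assumptions for illustration only:

```python
# Illustrative LoRA setup with Hugging Face PEFT.
# Model name and hyperparameters are placeholders, not a prescribed config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # assumed base model

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base weights
```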
✅ On-Device Deployment
- Hands-on experience with at least two of these runtimes: llama.cpp / GGUF, MLC LLM, ExecuTorch, ONNX Runtime Mobile, TensorFlow Lite, Core ML (see the throughput sketch after this list).
- Experience with hardware acceleration: Metal (iOS), NNAPI (Android), GPU/Vulkan, Qualcomm DSP/NPU, XNNPACK.
- Real-world deployment: must provide examples of models running fully offline on mobile (tokens/s, RAM usage, device specs).
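For context, a rough sketch of the kind of tokens/s measurement we expect in case studies, using llama-cpp-python against a GGUF file; the model path is a placeholder, and real numbers depend entirely on the device:

```python
# Rough tokens/s harness with llama-cpp-python and a GGUF model.
import time
from llama_cpp import Llama

llm = Llama(model_path="model-q4_k_m.gguf", n_ctx=2048)  # hypothetical path

prompt = "Explain quantization in one sentence."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

# The completion dict mirrors the OpenAI format, including token usage.
n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.2f}s "
      f"({n_generated / elapsed:.1f} tok/s)")
```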
✅ Performance & Benchmarking
- Tools: Perfetto, systrace, Battery Historian, and adb/dumpsys stats (Android); Xcode Instruments, including the Energy Log template (iOS).
- Profiling decode speed, cold-start vs. warm-start latency, RAM usage, and energy consumption (a minimal RAM-sampling sketch follows).
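For context, a minimal sketch of sampling an Android app's memory footprint over adb; it assumes adb is on PATH with a device attached, the package name is hypothetical, and the exact dumpsys TOTAL-line format varies across Android versions:

```python
# Minimal PSS sampler via adb; package name and regex are assumptions.
import re
import subprocess

def total_pss_kb(package: str) -> int:
    """Return TOTAL PSS in kB as reported by `adb shell dumpsys meminfo`."""
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "meminfo", package],
        capture_output=True, text=True, check=True,
    ).stdout
    # Newer Android prints "TOTAL PSS:", older prints a bare "TOTAL" row.
    match = re.search(r"TOTAL(?:\s+PSS:)?\s+(\d+)", out)
    if not match:
        raise RuntimeError("could not parse dumpsys meminfo output")
    return int(match.group(1))

print(total_pss_kb("com.example.offlinellm"), "kB PSS")  # hypothetical package
```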
✅ General
- Strong PyTorch and Hugging Face experience.
- Clear documentation and ability to explain optimization trade-offs.
Nice to Have
- Open-source contributions to LLM quantization/edge-AI frameworks.
- Prior deployment of Qwen, LLaMA, Gemma, or Mistral families onto mobile devices.
- Multilingual or low-resource dataset experience (Urdu, Arabic, Hindi, etc.), including tokenization, script handling, and fine-tuning.
- Familiarity with multimodal (ASR/TTS/VAD) integration on device.
Application Requirements
Applications must include:
- A short case study of a model they have fine-tuned (dataset + method + results).
- A short case study of a model they have quantized/distilled for mobile (framework + bit-depth + device + performance metrics).
- Links to GitHub repos, papers, or APK/TestFlight builds if available.
Job Type: Full-time
Pay: Rs250,000.00 - Rs400,000.00 per month
Work Location: In person