AI Speech Engineer – Custom TTS (On-Device, Multilingual)
About the Role
We are looking for an experienced AI Speech Engineer with deep expertise in fine-tuning open-source TTS engines on custom voice datasets. The role requires the ability to build high-quality, multilingual, expressive TTS voices (with prosody cues, non-verbal expressions, and emotional variation) and to optimize them for fully offline use on mobile devices.
You will be responsible for building a world-class TTS pipeline end to end: dataset preparation → fine-tuning → evaluation → on-device deployment (Android/iOS).
Responsibilities
- Fine-tune open-source TTS models (e.g., VITS, Glow-TTS, Tacotron2, FastSpeech2, Coqui TTS, fairseq speech-synthesis models, Bark-like models); a representative fine-tuning sketch follows this list.
- Build custom voices from multi-actor recordings, including prosody cues and non-verbal expressions (NVEs) such as laughter, sighs, and whispers, plus emphasis and emotional tones.
- Format and preprocess multilingual datasets (e.g., English, Urdu, Arabic, Hindi).
- Implement voice cloning and speaker adaptation methods (speaker embeddings, x-vectors, HuBERT/ContentVec conditioning).
- Apply latest fine-tuning techniques (gradual unfreezing, adversarial training, vocoder adaptation, multi-speaker conditioning).
- Deploy optimized TTS engines on-device using ONNX Runtime, TensorFlow Lite, Core ML, or custom inference runtimes.
- Optimize for low-latency, low-memory, and battery-efficient speech generation on mobile CPUs/NPUs/GPUs.
- Evaluate output quality (MOS, prosody accuracy, multilingual pronunciation consistency).
- Collaborate with engineers to integrate TTS modules into apps for real-time, offline speech synthesis.
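To make the bar concrete: fine-tuning here means hands-on work at roughly the level of the sketch below, which resumes VITS training from a pretrained checkpoint with Coqui TTS. The dataset path, LJSpeech-style formatter, checkpoint filename, and hyperparameters are illustrative assumptions, not our actual setup.

```python
# Minimal VITS fine-tuning sketch with Coqui TTS; paths/settings are illustrative.
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# LJSpeech-style metadata.csv; a custom formatter would be used for real actor data.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path="data/my_voice/"
)
config = VitsConfig(
    run_name="vits_finetune",
    batch_size=16,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",  # swapped per target language
    output_path="runs/",
    datasets=[dataset_config],
)
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
model = Vits(config, ap, tokenizer, speaker_manager=None)

# restore_path warm-starts from a pretrained checkpoint rather than training from scratch.
trainer = Trainer(
    TrainerArgs(restore_path="pretrained/vits_checkpoint.pth"),
    config,
    config.output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```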
Mandatory Skills Checklist (Applicants must demonstrate experience in ALL of the following)
✅ TTS Model Fine-Tuning
- Hands-on fine-tuning of open-source TTS engines (Coqui TTS, VITS, Glow-TTS, Tacotron, FastSpeech).
- Building multilingual and multi-speaker models.
- Dataset alignment: phoneme extraction, grapheme-to-phoneme conversion (G2P), and forced alignment with the Montreal Forced Aligner (MFA) or equivalent; a G2P sketch follows this list.
- Handling prosody, cues, and NVEs in dataset labeling.
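For multilingual G2P, we expect comfort with tooling along the lines of this sketch, which phonemizes short sentences via the phonemizer package (assumes the espeak-ng backend is installed; the sample sentences all mean "This is an example."):

```python
# G2P sketch: text -> IPA-style phoneme strings with the phonemizer package.
from phonemizer import phonemize

samples = {
    "en-us": "This is an example.",
    "ur": "یہ ایک مثال ہے۔",
    "ar": "هذا مثال.",
    "hi": "यह एक उदाहरण है।",
}
for lang, text in samples.items():
    phones = phonemize(text, language=lang, backend="espeak", strip=True)
    print(f"{lang}: {phones}")
```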
✅ Voice Dataset Engineering
- Preparing raw actor recordings → cleaned, labeled dataset.
- Handling multilingual phoneme sets (IPA, G2P for Urdu, Arabic, Hindi, English).
- Speaker embedding extraction (d-vectors, x-vectors, ECAPA, HuBERT units); an extraction sketch follows this list.
- Noise reduction, augmentation, silence trimming, forced alignment.
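As one example of the embedding work above, the sketch below pulls ECAPA-TDNN speaker embeddings from a pretrained SpeechBrain model; the audio path is a placeholder and the input is assumed to be 16 kHz mono:

```python
# Speaker-embedding sketch: 192-dim ECAPA-TDNN vectors via SpeechBrain.
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)
signal, sr = torchaudio.load("data/actor_01/utt_0001.wav")  # expects 16 kHz mono
embedding = encoder.encode_batch(signal)  # shape: (1, 1, 192)
print(embedding.squeeze())  # speaker vector for conditioning or clustering
```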
✅ On-Device Deployment
- Exporting TTS models to ONNX/TFLite/Core ML.
- Running inference with optimized vocoders (HiFi-GAN, WaveGlow, Parallel WaveGAN).
- Experience with quantization/pruning of speech models for mobile.
- Benchmarking real-time inference: latency, RAM usage, and energy efficiency; a quantization-and-latency sketch follows this list.
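To illustrate the deployment side, here is a sketch of dynamic INT8 quantization plus a crude CPU latency check with ONNX Runtime. It assumes a single phoneme-ID input for brevity (a real VITS export typically takes several inputs: token IDs, lengths, noise scales), and the filenames are placeholders:

```python
# Deployment sketch: dynamic INT8 quantization + latency benchmark (ONNX Runtime).
import time
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic("vits.onnx", "vits.int8.onnx", weight_type=QuantType.QInt8)

sess = ort.InferenceSession("vits.int8.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
dummy = np.random.randint(0, 100, size=(1, 64)).astype(np.int64)  # fake phoneme IDs

runs = 20
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {inp.name: dummy})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```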
✅ Latest Techniques Knowledge
- Expressive/controllable TTS (prosody embeddings, style tokens, GST, variational prosody models).
- Speaker adaptation & cross-lingual voice transfer.
- Handling low-resource languages (Urdu, Arabic).
- Evaluation frameworks (MOS testing, AB preference tests, WER for intelligibility); a scoring sketch follows this list.
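On evaluation, candidates should be comfortable with scoring at least at this level: aggregating listener MOS ratings with a confidence interval, and checking intelligibility by transcribing synthesized audio with an ASR model and computing WER against the input text (the ASR step is elided below; ratings and transcripts are made-up examples):

```python
# Evaluation sketch: MOS aggregation (numpy) and intelligibility WER (jiwer).
import numpy as np
import jiwer

mos_ratings = np.array([4, 5, 3, 4, 4, 5, 4])  # 1-5 listener scores for one system
mean = mos_ratings.mean()
ci95 = 1.96 * mos_ratings.std(ddof=1) / np.sqrt(len(mos_ratings))
print(f"MOS: {mean:.2f} ± {ci95:.2f}")

reference = "the quick brown fox jumps over the lazy dog"  # TTS input text
hypothesis = "the quick brown fox jumps over a lazy dog"   # ASR transcript of the audio
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```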
Nice to Have
- Contributions to open-source TTS projects (Coqui, ESPnet, Fairseq, Bark, etc.).
- Experience with speech-to-speech systems or multimodal pipelines.
- Familiarity with distillation/quantization of TTS models for edge devices.
- Experience with custom vocoder design for emotional or non-verbal cues.
Application Requirements
Applicants must include:
- A short case study of a TTS model they fine-tuned (dataset type, model used, output samples).
- A short case study of deploying a TTS model on-device (framework, device, latency, memory usage).
- Links to audio samples, demos, GitHub repos, or production apps showing custom voices.
Job Type: Full-time
Pay: Rs250,000.00 - Rs400,000.00 per month
Work Location: In person