Qureos


AI Speech Engineer – Custom TTS (On-Device, Multilingual)

About the Role

We are looking for an experienced AI Speech Engineer with deep expertise in fine-tuning open-source TTS engines using custom voice datasets. The role requires the ability to build high-quality, multilingual, expressive TTS voices (with cues, non-verbal expressions, and prosody variations), and to optimize them for fully offline use on mobile devices.

You will be responsible for creating a world-class TTS pipeline: from dataset preparation → fine-tuning → evaluation → on-device deployment (Android/iOS).

Responsibilities

  • Fine-tune open-source TTS models (e.g., VITS, Glow-TTS, Tacotron2, FastSpeech2, Coqui TTS, fairseq S^2, Bark-like models).
  • Build custom voices from multi-actor recordings, including prosody cues, NVEs (non-verbal expressions), laughter, sighs, whispers, emphasis, and emotional tones.
  • Format and preprocess multilingual datasets (e.g., English, Urdu, Arabic, Hindi).
  • Implement voice cloning and speaker adaptation methods (speaker embeddings, x-vectors, HuBERT/ContentVec conditioning).
  • Apply latest fine-tuning techniques (gradual unfreezing, adversarial training, vocoder adaptation, multi-speaker conditioning).
  • Deploy optimized TTS engines on-device using ONNX Runtime, TensorFlow Lite, Core ML, or custom inference runtimes.
  • Optimize for low-latency, low-memory, and battery-efficient speech generation on mobile CPUs/NPUs/GPUs.
  • Evaluate output quality (MOS, prosody accuracy, multilingual pronunciation consistency).
  • Collaborate with engineers to integrate TTS modules into apps for real-time, offline speech synthesis.

Mandatory Skills Checklist (Applicants must demonstrate experience in ALL of the following)

TTS Model Fine-Tuning

  • Hands-on fine-tuning of open-source TTS engines (Coqui TTS, VITS, Glow-TTS, Tacotron, FastSpeech).
  • Building multilingual and multi-speaker models.
  • Dataset alignment: phoneme extraction, grapheme-to-phoneme (G2P), forced alignment (Montreal Forced Aligner (MFA) or equivalent).
  • Handling prosody, cues, and NVEs in dataset labeling.
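
To give a flavor of the G2P and cue-labeling work above, here is a minimal sketch. The toy lexicon, the [laugh]/[sigh] inline markup, and the label_utterance helper are illustrative assumptions; a real pipeline would use a trained G2P model plus forced alignment.

```python
# Minimal sketch of a G2P + cue-tagging pass for TTS dataset labeling.
# The toy lexicon and the [laugh]/[sigh] cue markup are hypothetical;
# production pipelines use a full G2P model and forced alignment.
import re

LEXICON = {  # toy grapheme-to-phoneme lexicon (ARPAbet-style)
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

CUE_RE = re.compile(r"\[(laugh|sigh|whisper)\]")

def label_utterance(text: str) -> list:
    """Convert a transcript with inline cues into a phoneme/cue sequence."""
    tokens = []
    for raw in text.lower().split():
        cue = CUE_RE.fullmatch(raw)
        if cue:                                 # keep NVE cues as single symbols
            tokens.append("<%s>" % cue.group(1))
        else:
            word = re.sub(r"[^a-z']", "", raw)  # strip punctuation
            tokens.extend(LEXICON.get(word, ["<oov:%s>" % word]))
    return tokens

print(label_utterance("Hello, world! [laugh]"))
```

Keeping non-verbal expressions as dedicated symbols lets the acoustic model learn them as first-class tokens rather than losing them in text normalization.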

Voice Dataset Engineering

  • Preparing raw actor recordings → cleaned, labeled dataset.
  • Handling multilingual phoneme sets (IPA, G2P for Urdu, Arabic, Hindi, English).
  • Speaker embedding extraction (d-vectors, x-vectors, ECAPA, HuBERT units).
  • Noise reduction, augmentation, silence trimming, forced alignment.
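
The silence-trimming step above can be sketched as a simple frame-energy gate. The frame size and threshold below are illustrative; production cleaning usually relies on a VAD or forced-alignment boundaries.

```python
# Minimal sketch of energy-based silence trimming for dataset cleaning.
# Frame size and threshold are illustrative, not tuned values.
import numpy as np

def trim_silence(audio, frame=160, thresh=1e-3):
    """Drop leading/trailing frames whose mean energy is below `thresh`."""
    n = len(audio) // frame * frame
    frames = audio[:n].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    voiced = np.flatnonzero(energy > thresh)
    if voiced.size == 0:
        return audio[:0]                       # all silence
    start, stop = voiced[0] * frame, (voiced[-1] + 1) * frame
    return audio[start:stop]

# synthetic check: 1 s silence + 1 s 440 Hz tone + 1 s silence at 16 kHz
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
sig = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = trim_silence(sig)
```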

On-Device Deployment

  • Exporting TTS models to ONNX/TFLite/Core ML.
  • Running inference with optimized vocoders (HiFi-GAN, WaveGlow, Parallel WaveGAN).
  • Experience with quantization/pruning of speech models for mobile.
  • Benchmarking real-time inference: latency, RAM usage, and energy efficiency.
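
A benchmarking harness for the latency and real-time numbers above might look like the following sketch, where synthesize stands in for an exported ONNX/TFLite session's run call (the helper names and run count are assumptions):

```python
# Minimal sketch of a latency benchmark harness for an on-device TTS engine.
# `synthesize` is a stand-in for an exported ONNX/TFLite session call;
# real-time factor (RTF) = synthesis time / duration of generated audio.
import time
import statistics

def benchmark(synthesize, text, audio_seconds, runs=20):
    """Return p50/p95 latency (ms) and real-time factor for a TTS callable."""
    synthesize(text)                           # warm-up (JIT, caches, allocator)
    lat = []
    for _ in range(runs):
        t0 = time.perf_counter()
        synthesize(text)
        lat.append((time.perf_counter() - t0) * 1e3)
    lat.sort()
    p50 = statistics.median(lat)
    p95 = lat[int(0.95 * (runs - 1))]
    rtf = (p50 / 1e3) / audio_seconds          # < 1.0 means faster than real time
    return {"p50_ms": p50, "p95_ms": p95, "rtf": rtf}

# dummy workload in place of a real TTS session
stats = benchmark(lambda t: sum(range(10_000)), "hello", audio_seconds=1.0)
```

Reporting p95 alongside the median matters on mobile, where thermal throttling and background load make tail latency the user-visible number.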

Latest Techniques Knowledge

  • Expressive/controllable TTS (prosody embeddings, Global Style Tokens (GST), variational prosody models).
  • Speaker adaptation & cross-lingual voice transfer.
  • Handling low-resource languages (Urdu, Arabic).
  • Evaluation frameworks (MOS testing, AB preference tests, WER for intelligibility).
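
For the MOS testing mentioned above, a minimal aggregation sketch with a normal-approximation 95% confidence interval follows; the ratings are illustrative only, and real MOS studies need balanced listener panels with many more ratings per sample.

```python
# Minimal sketch of MOS aggregation with a normal-approximation 95% CI.
# The ratings below are made-up illustration data.
import math
import statistics

def mos_summary(ratings):
    """Mean opinion score with a 95% confidence interval (normal approx.)."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    ci = 1.96 * statistics.stdev(ratings) / math.sqrt(n) if n > 1 else float("inf")
    return mean, ci

mean, ci = mos_summary([4, 5, 4, 4, 3, 5, 4, 4])
# report as "MOS mean ± ci (95% CI)"
```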

Nice to Have

  • Contributions to open-source TTS projects (Coqui, ESPnet, Fairseq, Bark, etc.).
  • Experience with speech-to-speech systems or multimodal pipelines.
  • Familiarity with distillation/quantization of TTS models for edge devices.
  • Experience with custom vocoder design for emotional or non-verbal cues.

Application Requirements

Applicants must include:

  • A short case study of a TTS model they fine-tuned (dataset type, model used, output samples).
  • A short case study of deploying a TTS model on-device (framework, device, latency, memory usage).
  • Links to audio samples, demos, GitHub repos, or production apps showing custom voices.

Job Type: Full-time

Pay: Rs250,000.00 - Rs400,000.00 per month

Work Location: In person

© 2025 Qureos. All rights reserved.