---
license: apache-2.0
language:
- mr
library_name: onnxruntime
pipeline_tag: text-to-speech
base_model: shreyask/bol-tts-marathi
base_model_relation: quantized
datasets:
- ai4bharat/Rasa
- ai4bharat/indicvoices_r
- SPRINGLab/IndicTTS_Marathi
tags:
- text-to-speech
- tts
- kokoro
- marathi
- minglish
- indic
- styletts2
- bol-tts
- onnx
- webgpu
- transformers.js
---

# bol-tts-marathi-onnx — ONNX export

ONNX-format export of the Marathi Kokoro-82M fine-tune at [shreyask/bol-tts-marathi](https://huggingface.co/shreyask/bol-tts-marathi). Designed for WebGPU / transformers.js / onnxruntime deployments.

- **Live demo:** [shreyask/bol-tts-marathi](https://huggingface.co/spaces/shreyask/bol-tts-marathi) (in-browser via WebGPU, using this very ONNX file)
- **Write-up:** [kshreyas.dev/post/bol-tts-marathi](https://kshreyas.dev/post/bol-tts-marathi/)
- **Code + export script:** [github.com/shreyaskarnik/bol-tts-marathi](https://github.com/shreyaskarnik/bol-tts-marathi)

Architecture: Kokoro-82M exported with `disable_complex=True` (uses `CustomSTFT` instead of `TorchSTFT`, whose complex tensors ONNX doesn't support).

## Files

```
onnx/model.onnx    — fp32 model, 326 MB
config.json        — Kokoro inference config with ɭ at slot 144 (Marathi retroflex lateral)
voice_speeds.json  — per-voice optimal default speed
voices/*.pt        — 25 voicepack .pt files, [510, 1, 256] float32 each
```

## Model I/O

```
Inputs:
  input_ids: int64 [1, n_phonemes]
      Phoneme token IDs (per config.json vocab). MUST be wrapped with
      BOS=0 and EOS=0: [0, *content_ids, 0]
  style: float32 [1, 256]
      Voicepack slice at position [content_n_phonemes]. (Naming follows
      the kokoro-js + thewh1teagle/kokoro-onnx ecosystem convention.)
  speed: float32 [1]
      Pacing multiplier (1.0 = neutral; <1.0 slows, >1.0 speeds up).
      Divides the predictor's per-phoneme duration BEFORE rounding, so it
      scales actual frame allocation — not just playback rate.

Outputs:
  audio: float32 [1, n_samples]
      24 kHz waveform.
      Includes BOS+EOS audio at start/end — strip bos_frames * 600 samples
      from the front and eos_frames * 600 from the back if you want
      content-only audio (Rasa-trained voicepacks generate a soft breathy
      pre-roll for BOS that surfaces as "umm" if not stripped).
  pred_dur: int64 [1, n_phonemes]
      Per-phoneme durations in predictor frames. 1 frame = 600 audio
      samples at 24 kHz. pred_dur[0] = BOS duration; pred_dur[-1] = EOS.
```

`pred_dur` is exposed so downstream apps can build phoneme/word-level timestamps.

## Usage — onnxruntime (Python)

```python
import json

import numpy as np
import onnxruntime as ort
import soundfile as sf
import torch
from misaki import espeak

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
vocab = json.load(open("config.json"))["vocab"]
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

g2p = espeak.EspeakG2P(language="mr")
text = "नमस्कार, मी मराठी बोलतो."
phonemes, _ = g2p(text)
content_ids = [vocab[p] for p in phonemes if p in vocab]

# Wrap with BOS=0, EOS=0
input_ids = np.array([[0, *content_ids, 0]], dtype=np.int64)

# Voicepack indexed by CONTENT length (not wrapped length): [510, 1, 256] -> slot
style = voice[len(content_ids)].numpy().astype(np.float32)
speed = np.array([1.0], dtype=np.float32)

audio, pred_dur = sess.run(None, {
    "input_ids": input_ids,
    "style": style,
    "speed": speed,
})

# Strip BOS+EOS audio (optional but recommended; see I/O notes above).
# Flatten first: audio is [1, n_samples], so slicing the 2-D array directly
# would cut along the wrong axis.
HOP = 600
audio = audio.flatten()
bos_frames = int(pred_dur.flatten()[0])
eos_frames = int(pred_dur.flatten()[-1])
audio = audio[bos_frames * HOP : len(audio) - eos_frames * HOP]

sf.write("out.wav", audio, 24000)
```

## Usage — WebGPU / transformers.js

The live demo at [shreyask/bol-tts-marathi](https://huggingface.co/spaces/shreyask/bol-tts-marathi) uses this exact ONNX file via `@huggingface/transformers`.
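Back in Python, the `pred_dur` output turns into timestamps with a single cumulative sum over frames (1 frame = 600 samples at 24 kHz). A minimal sketch; the helper name and the example durations are illustrative, not part of this repo:

```python
import numpy as np

SR = 24_000   # output sample rate
HOP = 600     # audio samples per predictor frame

def phoneme_timestamps(phonemes, pred_dur, strip_bos_eos=True):
    """Map each phoneme to a (phoneme, start_sec, end_sec) triple.

    pred_dur is the model's int64 [1, n_phonemes] output; index 0 is the
    BOS duration and index -1 is EOS, matching the wrapped input_ids.
    """
    dur = np.asarray(pred_dur).flatten()
    if strip_bos_eos:
        dur = dur[1:-1]  # keep content phonemes only
    assert len(dur) == len(phonemes)
    ends = np.cumsum(dur) * HOP / SR            # cumulative frames -> seconds
    starts = np.concatenate(([0.0], ends[:-1]))
    return list(zip(phonemes, starts, ends))

# Illustrative durations only: BOS=3, 'n'=8, 'a'=10, 'm'=12, EOS=2 frames
stamps = phoneme_timestamps(list("nam"), np.array([[3, 8, 10, 12, 2]]))
# stamps[0] -> ('n', 0.0, 0.2): 8 frames * 600 / 24000 = 0.2 s
```

Word-level timestamps follow by taking the first start and last end over each word's phoneme span.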
The TS client calls `await model({ input_ids, style, speed })` and applies the BOS/EOS strip + per-utterance silence injection at punctuation boundaries client-side. Source: [the Space's `src/model.ts`](https://huggingface.co/spaces/shreyask/bol-tts-marathi/blob/main/src/model.ts).

For Marathi support in upstream Kokoro-JS pipelines, you'll need to monkey-patch `'m'` as a Marathi `lang_code` (espeak `'mr'`).

## Voicepacks (25)

This repo ships all 25 voicepacks deployed in the live demo as `.pt` files (use them as the `style` input):

- **4 trained on Marathi corpora:** `mf_asha`, `mm_vivek` (Rasa), `mf_mukta`, `mm_dnyanesh` (SPRINGLab)
- **19 stock-Kokoro crossovers:** `af_heart` (Svara), `af_nova` (Tara), `am_liam` (Atharv), `bf_emma`-style (Ira), `hm_omega` (Vihaan), `zf_xiaoxiao` (Pari, kid), `zf_xiaoyi` (Vir, kid), etc. See the [demo's voicepacks.json](https://huggingface.co/spaces/shreyask/bol-tts-marathi/blob/main/voicepacks.json) for the full ID → display-name mapping.
- **2 synthetic:** `syn_sama` (centroid mean of 5 voicepacks), `syn_navya` (centroid + Gaussian noise) — generated arithmetically with no reference audio.

## Export details

Exported via [`scripts/export_onnx.py`](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/scripts/export_onnx.py):

```python
torch.onnx.export(
    KModelForONNX(kmodel),  # upstream wrapper, runs forward_with_tokens
    (dummy_input_ids, dummy_style, dummy_speed),
    output_path,
    input_names=["input_ids", "style", "speed"],
    output_names=["audio", "pred_dur"],
    dynamic_axes={
        "input_ids": {1: "n_phonemes"},
        "audio": {1: "n_samples"},
        "pred_dur": {1: "n_phonemes"},
    },
    opset_version=17,
    dynamo=False,  # legacy TorchScript tracer; pinned for torch ≤ 2.8
    do_constant_folding=True,
)
```

**⚠️ torch ≤ 2.8 required for export.** With the legacy tracer (`dynamo=False`), torch ≥ 2.9 silently emits a static-output ONNX on Kokoro's InstanceNorm-under-spectral-norm + LSTM + CustomSTFT combo.
The exported file loads + runs in onnxruntime but produces silence. We pin `torch==2.6` in our export venv; see the [bol-tts-marathi pyproject.toml](https://github.com/shreyaskarnik/bol-tts-marathi/blob/main/pyproject.toml) for the constraint.

`disable_complex=True` is mandatory — Kokoro's default `TorchSTFT` uses complex tensors that ONNX doesn't support.

## License

Apache 2.0. See the [base PyTorch model](https://huggingface.co/shreyask/bol-tts-marathi) for full citation/attribution.
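## Appendix: synthetic voicepack sketch

The "generated arithmetically" recipe behind the two synthetic voicepacks is easy to reproduce. A hedged sketch in NumPy (the function name, `noise_std`, and `seed` are assumptions; the actual parameters used for `syn_sama` / `syn_navya` aren't published here):

```python
import numpy as np

def make_synthetic_voicepacks(packs, noise_std=0.05, seed=0):
    """Centroid recipe: element-wise mean of existing voicepacks,
    plus a Gaussian-noise variant.

    packs: list of [510, 1, 256] float32 arrays (e.g. torch voicepacks
    converted via .numpy()).
    Returns (centroid, centroid_plus_noise), both [510, 1, 256] float32.
    """
    stacked = np.stack(packs)               # [n_voices, 510, 1, 256]
    centroid = stacked.mean(axis=0)         # "syn_sama"-style mean voice
    rng = np.random.default_rng(seed)
    noisy = centroid + rng.normal(0.0, noise_std, size=centroid.shape)
    return centroid.astype(np.float32), noisy.astype(np.float32)
```

An array produced this way can be saved back as a voicepack via `torch.save(torch.from_numpy(pack), "voices/syn_custom.pt")` and fed to the model as the `style` input like any other.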