- PrimeTTS — tiny bilingual zh‑TW + English TTS (24 kHz, CPU)
- Pronunciation diagnostic — two-ASR cross-check (entity sentence)
- 3-pass phonemic diagnostic — teacher vs v1 (4.63M) vs v2 (6.85M)
- Quickstart (inference, CPU)
- Training data
- How it was trained — the levers
- Architecture
- Reproduce / fine‑tune your own
- Train on your OWN voice — one command
- Credits & licenses
PrimeTTS — tiny bilingual zh‑TW + English TTS (24 kHz, CPU)
A tiny Mandarin (Taiwan) + English text‑to‑speech model (6.85M parameters in the default v4; a 4.63M v3 is also kept) that runs entirely on CPU and emits 24 kHz audio, sized for on‑device (Jetson‑class) and contact‑centre / GPS / transit use. One model, one young‑female voice: Chinese, English, and code‑mix through a single frontend (no language routing). Built for entity correctness: phone numbers, emails, addresses, prices, dates, temperatures, percentages, serial numbers, and a broad bank of Taiwan/world named entities.
🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 · 🧩 Base: `owensong/Inflect-Nano-v1` (warm‑started fine‑tune, same frozen architecture)
| | |
|---|---|
| Parameters | 6.85M (5.43M acoustic + 1.17M vocoder), the v4 default; the 4.63M v3 remains in checkpoints/ |
| Sample rate | 24 kHz |
| Runtime | onnxruntime, CPU‑only, torch‑free at inference |
| Languages | zh‑TW (Traditional) + English + code‑mix, single voice |
| Voice | young female, Taiwan‑Mandarin accent |
| Architecture | FastSpeech‑style (no attention) + Snake‑HiFiGAN — frozen, no NAS |
| License | Apache‑2.0 |
Held‑out quality (eval_big, 36 unseen phone‑attendant sentences)
| metric | v4 — default (6.85M) | v3 (4.63M, v3_4.6M/) |
|---|---|---|
| zh‑CER overall (Breeze‑ASR‑25) | 0.108 | 0.087 |
| · pure‑zh | 0.083 | 0.087 |
| · code‑mix | 0.134 | 0.092 |
| en‑WER (Whisper) | 0.083 | 0.083 |
| SQUIM PESQ | 3.11 | 3.15 |
| SQUIM STOI | 0.968 | 0.969 |
| SQUIM MOS | 4.41 | 4.42 |
Why v4 is the default despite the higher ASR CER: v4 was chosen by ear for clearer, more
naturally‑timed speech. It matches v3 on pure Mandarin (zh‑CER 0.083 ≈ 0.087); the overall gap is
code‑mixed zh+en (0.134 vs 0.092), where v3's English‑base warm‑start still helps and v4 (trained from
scratch) hasn't caught up. Same CPU real‑time envelope as v3 — RTF ≈ 0.04 on 2 cores (≈25× real‑time)
on a desktop; 0.75 (1.3× real‑time) on 2 cores of a Jetson Nano. If code‑mix accuracy matters more than
the perceptual gain for your deployment, v3_4.6M/ is the better choice and is kept in‑repo. (Both far
exceed the original 8 kHz release: zh‑CER 0.090, code‑mix 0.178, MOS 4.24.)
¹ CER(generic ASR) − CER(Taiwan‑tuned Breeze‑ASR‑25) per zh clip; >0 ⇒ a Taiwan‑tuned recognizer
understands it better ⇒ genuine Taiwan accent present.
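The real‑time figures quoted above follow from taking the reciprocal of the real‑time factor (RTF). A minimal sketch of that arithmetic; the helper names are made up for illustration and are not part of the repository:

```python
def realtime_multiple(rtf: float) -> float:
    """Seconds of audio produced per second of compute (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


# RTF values taken from the paragraph above.
print(round(realtime_multiple(0.04), 1))   # desktop, 2 cores
print(round(realtime_multiple(0.75), 2))   # Jetson Nano, 2 cores
```

An RTF below 1.0 means synthesis finishes faster than playback, which is the condition for streaming on the target device.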
v3 pronunciation fix: corrected the forced-aligner's bopomofo→IPA map (ㄜ/ㄟ/ㄡ were mapped to
ɤ/ei/ou, absent from the aligner vocab, so those vowels were starved to ~2 frames and dropped) and added the syllabic-vowel symbol ㄭ for empty-rime syllables (是/十/日/司/資…, previously rendered as a bare consonant). Both classes — 額/給/走 and 是/司 — now render correctly; zh‑CER 0.106→0.087. 88 phone symbols.
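The empty‑rime half of that fix can be illustrated with a short sketch. It assumes each syllable arrives as a bopomofo string with an optional tone mark (trailing, or a leading ˙ for neutral tone); the function name and that input format are assumptions for illustration, not the repository's frontend code:

```python
# Initials that can form a syllable with no written rime (是/十/日/司/資 ...).
EMPTY_RIME_INITIALS = set("ㄓㄔㄕㄖㄗㄘㄙ")
TONE_MARKS = set("ˊˇˋ˙")
SYLLABIC_VOWEL = "ㄭ"


def add_syllabic_vowel(syllable: str) -> str:
    """Append ㄭ when the syllable is a bare sibilant initial."""
    prefix, body, tone = "", syllable, ""
    if body.startswith("˙"):               # neutral tone written first
        prefix, body = "˙", body[1:]
    if body and body[-1] in TONE_MARKS:    # tone written last
        body, tone = body[:-1], body[-1]
    if len(body) == 1 and body in EMPTY_RIME_INITIALS:
        body += SYLLABIC_VOWEL
    return prefix + body + tone


print(add_syllabic_vowel("ㄕˋ"))   # 是: bare ㄕ gains the syllabic vowel
print(add_syllabic_vowel("ㄕㄨˋ"))  # 樹: already has a rime, left unchanged
```

Giving the vowel its own symbol means the aligner and the duration predictor have something to assign frames to, instead of a lone consonant.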
Pronunciation diagnostic — two-ASR cross-check (entity sentence)
The entity-dense diagnostic sentence was synthesized by the teacher, PrimeTTS v1 (4.63M) and PrimeTTS v2 (6.85M), then transcribed by two recognizers that treat empty-rime syllables (a bare retroflex/dental sibilant + the syllabic vowel ㄭ — 十/日/之…) very differently:
- Breeze-ASR-25 — Taiwan-tuned, robust; tends to over-read short empty-rimes.
- X-ASR — a fine-tuned zh-TW/en streaming zipformer2 transducer; stricter on those syllables.
(All evidence below is ASR transcripts only — no listening judgment is implied.)
Diagnostic sentence:
Anderson 先生您好,您 2024年3月15日 訂的 3 件商品總共 NT$1,299,序號 AB1234CD,…,降雨機率 70%,謝謝。
| token | Breeze-ASR-25 (robust) | X-ASR (strict) |
|---|---|---|
| 3月15日 (teacher) | 3 月 15 日 ✓ | date span lost (…二零二四年 YING 总共…) |
| 3月15日 (v1 4.63M) | 3 月 15 日 ✓ | 三月十五日 ✓ |
| 3月15日 (v2 6.85M) | 3 月 15 日 ✓ | Y 号 (lost) |
| 序號 AB1234CD | teacher AB1234CD ✓ · v1 1B1234City · v2 PB1234CD | teacher A B 1 2 3 4 C D ✓ · v1 serial dropped · v2 DB … C D |
| 降雨機率 70% | numeral 70% for all three | 百分之七 (teacher) · 百分之 (v1, v2); 十 blurred |
| 松高路11號5樓櫃台 | ✓ all three | v2 松高路十一号五楼柜台 ✓ · v1 dropped · teacher garbled (LL ZU Y 号) |
Findings
- Empty-rime 日 (3月15日). Robust Breeze renders the date for all three (it over-reads the syllabic ㄭ). The stricter X-ASR yields it only for v1 (4.63M); v2 (6.85M) and the teacher both lose it. So the syllable is fragile across the board and the two ASRs disagree: v2 shows no X-ASR gain on 日, a soft regression vs v1, but not vs the teacher. Caveat: v1's X-ASR transcript is otherwise the most degraded of the three (it also drops the serial, the address tail, and 七十), and `三月十五日` is a high-frequency date the strict ASR may be pattern-completing, so treat v1's 日 as a soft win, not proof of an acoustic edge.
- Serial `AB1234CD`. Only the teacher is clean on both ASRs. Both students mis-render the leading letter "A"; even with the frontend emitting the correct letter name, a 4–7M acoustic model renders an isolated spelled letter weakly. (The frontend letter-name fix lands the right phoneme; the acoustic model is the limit.)
- 降雨機率 70% (百分之七十). X-ASR blurs the final 十 even on the teacher (`百分之七`); Breeze sidesteps it by emitting the numeral `70%` for all three. This is a hard empty-rime (十 = ㄕㄭ), present on the teacher and both students, not a v2-specific regression (under X-ASR v2 drops both 七 and 十, slightly worse than the teacher here, consistent with a hard syllable).
- Address tail (松高路11號5樓櫃台). v2 renders it cleanly (X-ASR ✓); v1 drops it and the teacher is itself garbled there, so v2 beats both v1 and the teacher on the long tail.
Takeaway. The two ASRs disagree precisely on the empty-rime syllables, and that disagreement is the method: a robust ASR (Breeze) over-reads them, so its CER understates empty-rime fragility; a strict ASR (X-ASR) exposes it. Net, v2 (6.85M) trades a clearer, more complete long tail (address/temperature — beating both v1 and the teacher) for no gain on the empty-rime 日/十 family and a soft regression vs v1 on 日. Catching this required cross-checking a robust against a strict ASR — a single CER number hides it.
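The cross-check itself is mechanical and easy to script. A minimal sketch, assuming each recognizer's transcript is already available as a plain string and that a token counts as rendered when any accepted surface form appears in it (whitespace ignored); the function names and the four labels are illustrative choices:

```python
def rendered(transcript: str, accepted_forms: list[str]) -> bool:
    """True if any accepted surface form occurs in the transcript."""
    squashed = "".join(transcript.split())
    return any("".join(form.split()) in squashed for form in accepted_forms)


def cross_check(robust: str, strict: str, accepted_forms: list[str]) -> str:
    """Classify one token from a robust-ASR and a strict-ASR transcript."""
    r = rendered(robust, accepted_forms)
    s = rendered(strict, accepted_forms)
    if r and s:
        return "clean"        # both recognizers recover the token
    if r:
        return "fragile"      # only the robust recognizer recovers it
    if s:
        return "strict-only"  # unusual: inspect by hand
    return "lost"


forms = ["3月15日", "三月十五日"]
# Transcript fragments copied from the table above.
print(cross_check("3 月 15 日", "三月十五日", forms))  # v1
print(cross_check("3 月 15 日", "Y 号", forms))        # v2
```

The "fragile" label marks exactly the tokens where a single robust-ASR CER would report no problem.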
3-pass phonemic diagnostic — teacher vs v1 (4.63M) vs v2 (6.85M)
One utterance can't cover every phoneme inside the model's ~1400-frame input window, so the diagnostic is 3 natural-prose passes, each ≤1 window, that together exercise the full inventory + the entity normalizer. Coverage was verified with the frontend; the strings are fluent (not pangram-gibberish), so a mis-render reflects the model, not out-of-distribution text.
| Pass | Coverage | Utterance |
|---|---|---|
| 1 · zh-TW | all 37 bopomofo + ㄭ = 38/38 | 小明知道今天是好日子,他喝了四次熱湯。女兒給媽媽買肥皂、青菜、綠茶和八顆雞蛋,婆婆很歡喜。二月的雨不停,風很涼,我們走回家。 |
| 2 · English | all 39 arpabet = 39/39 | On a beige autumn morning, she shyly measured both choices with joy and pleasure; he laughed, thought it through, and quickly chose the rough path. Good vision, you know, can bring change now. |
| 3 · entities | date · price (NT$) · % · °C · phone · alphanumeric serial · ordinal | 訂單 AB1234CD,三月十五日,NT$299,折扣 70%,氣溫 28 度,請撥 0918,第3名,謝謝。 |
Each was synthesized by the teacher (VoxCPM2), v1 (4.63M) and v2 (6.85M), then transcribed by Breeze-ASR-25 (robust) and X-ASR (a stricter zh-TW/en zipformer2). Raw results below — bold = error, ✓ = clean — so you can judge for yourself.
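A coverage check of the kind described for these passes can be sketched as a set difference. The sketch assumes the frontend yields a flat list of phone symbols per utterance; `missing_phones`, `coverage`, and the toy inventory below are hypothetical and are not the project's verification script:

```python
def missing_phones(inventory: set[str], passes: list[list[str]]) -> set[str]:
    """Phones in the inventory that no pass exercises."""
    seen: set[str] = set()
    for phones in passes:
        seen.update(phones)
    return inventory - seen


def coverage(inventory: set[str], passes: list[list[str]]) -> str:
    """Coverage as 'covered/total', in the style of the table above."""
    covered = len(inventory) - len(missing_phones(inventory, passes))
    return f"{covered}/{len(inventory)}"


# Toy inventory for illustration only (the real zh inventory has 38 symbols).
toy_inventory = {"ㄅ", "ㄆ", "ㄇ", "ㄭ"}
toy_passes = [["ㄅ", "ㄆ"], ["ㄇ", "ㄅ"]]
print(coverage(toy_inventory, toy_passes))
print(missing_phones(toy_inventory, toy_passes))
```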
Pass 1 (zh) — ref: 小明知道今天是好日子,…,二月的雨不停,風很涼,我們走回家。
| model | Breeze (robust) | X-ASR (strict) |
|---|---|---|
| teacher | ✓ | ✓ |
| v1 4.63M | ✓ | ✓ |
| v2 6.85M | …與兒給媽媽…(女兒→與兒) | …八顆雞蛋QUIP'N'月的雨…風很亮…回家家(婆婆很歡喜二月→Latin bleed) |
Pass 2 (English) — ref: …she shyly measured both choices…he laughed…chose the rough path. Good vision, you know, can bring change now.
| model | Breeze | X-ASR |
|---|---|---|
| teacher | ✓ | drops "morning" |
| v1 4.63M | ✓ | she shyly→XI SHAILY, laughed→LOVED |
| v2 6.85M | ✓ | Good→QUOTE |
Pass 3 (entities) — ref: 訂單 AB1234CD,三月十五日,NT$299,折扣 70%,氣溫 28 度,請撥 0918,第3名。
| model | Breeze | X-ASR |
|---|---|---|
| teacher | 訂單→DingDang, else ✓ (AB1234CD ✓) | serial ✓ but tail garbled |
| v1 4.63M | serial→AB1234C (drops D), else ✓ | A→EIGHT, 三月→先月 |
| v2 6.85M | AB1234CD ✓, 70%→77%, else ✓ | A→AV(B→V), 三月→现月 |
Diagnosis
- Dense zh (P1): v1 ≥ v2. The teacher and v1 are clean on both ASRs; v2 still degrades on a dense zh run (婆婆很歡喜二月 → Latin "QUIP'N'" under X-ASR, 女兒→與兒 under Breeze). This persists on natural text, so it is a real v2 weakness, not OOD noise: the larger, more code-mix-exposed model over-triggers English on dense zh.
- English (P2): on natural prose all three are good. v2 and the teacher are clean on Breeze; X-ASR shows only minor slips (v1 "she shyly"→"XI SHAILY"/laughed→loved; v2 Good→"QUOTE"). This corrects an earlier pangram result that made the students look far worse than the teacher — that gap was largely an out-of-distribution artifact of dense, unnatural English.
- Entities (P3): the serial letter "A" is the hard case. Under robust Breeze, teacher and v2 render the full serial AB1234CD correctly (v1 drops the final letter D); v2's only number error is 70%→77%. Under the strict X-ASR the leading A garbles for both students (v1 →EIGHT, v2 →AV) and even the teacher's tail breaks up, i.e. an isolated spelled letter is near the capacity limit for a 4–7M model. The text-norm expansions themselves are correct; the failures are acoustic.
Net: naturalizing the test removed an OOD penalty on English (P2) — students are close to the teacher there. The weaknesses that survive natural text are v2's dense-zh Latin-bleed and the serial letter "A"; on entity-dense, long-form text v2 (6.85M) is otherwise the strongest student.
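For readers unfamiliar with what the text-norm expansions do, here is a minimal sketch of two of the readings exercised in pass 3: a phone number read digit by digit and a percentage read as a cardinal. It only illustrates the idea; the function names are invented and the real `text_norm.py` covers far more entity types and contexts:

```python
ZH_DIGITS = "零一二三四五六七八九"


def read_digit_by_digit(digits: str) -> str:
    """Phone-number style: every digit is read on its own."""
    return "".join(ZH_DIGITS[int(d)] for d in digits)


def read_cardinal_under_100(n: int) -> str:
    """Cardinal reading for 0..99, e.g. 15 -> 十五, 70 -> 七十."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 10:
        return ZH_DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "" if tens == 1 else ZH_DIGITS[tens]
    tail = ZH_DIGITS[ones] if ones else ""
    return head + "十" + tail


def read_percent(token: str) -> str:
    """'70%' -> 百分之七十 (the cardinal follows 百分之)."""
    return "百分之" + read_cardinal_under_100(int(token.rstrip("%")))


print(read_digit_by_digit("0918"))
print(read_percent("70%"))
```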
Quickstart (inference, CPU)
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
# from inside the PrimeTTS dir (uses the bundled frontend + scripts)
import sys; sys.path.insert(0, "scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate # numpy length‑regulator
meta = json.load(open("meta.json"))
enc = ort.InferenceSession("acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("vocoder.onnx", providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.") # text -> phone/tone/lang ids
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn,
"lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
The whole pipeline (acoustic_encoder.onnx → numpy length‑regulator → acoustic_decoder.onnx → vocoder.onnx) is torch‑free and runs as‑is on a Jetson Nano CPU. See scripts/synth_from_text.py for the full runtime.
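The length‑regulation step in the middle of that pipeline is the standard FastSpeech‑style expansion: each phone's hidden vector is repeated for its predicted number of frames. The sketch below shows only that core idea in plain Python. It is not the repository's `host_regulate`, which additionally builds the positional and pitch inputs listed in the quickstart; the name `simple_length_regulate` and the list-of-lists shapes are assumptions:

```python
def simple_length_regulate(cond, durations, max_frames):
    """Repeat each phone vector by its (rounded, non-negative) frame count.

    cond:      list of per-phone feature vectors
    durations: list of predicted frame counts, one per phone
    Returns at most `max_frames` frame vectors.
    """
    frames = []
    for vector, duration in zip(cond, durations):
        repeat = max(int(round(duration)), 0)
        frames.extend([vector] * repeat)
    return frames[:max_frames]


cond = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # 3 phones, 2 features each
frames = simple_length_regulate(cond, [2, 0, 3], max_frames=1400)
print(len(frames), frames[0], frames[-1])
```

With numpy arrays the same expansion is `np.repeat(cond, durations, axis=0)` followed by a slice to `max_frames`.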
Training data
Everything is distilled from a single teacher voice so zh / en / code‑mix share one timbre and accent.
- Reference voice: a young Taiwan‑female speaker from Mozilla Common Voice zh‑TW, released CC0 / public domain (commercial‑use and voice‑cloning clear). ~13 s assembled from that one speaker's cleanest validated clips. This fixes the accent (Taiwan Mandarin comes from the reference, not from prompting) and keeps the model commercially shippable, with no proprietary/voice‑likeness encumbrance.
- Teacher: VoxCPM2 (`openbmb/VoxCPM2`) voice‑clones that one reference for every line, giving a consistent young‑female voice across all three languages (48 kHz, resampled to 24 kHz for training).
- Text: Taiwan office / phone‑attendant / GPS / transit register, covering diverse Mandarin, general + domain English, and frame‑bank code‑mix with English in varied positions, plus a large named‑entity bank (Taiwan place & road names, transit stations, top Taiwan + world companies, famous people (TW + world), movies, electronics products, and time/date/metric expressions).
- Entity normalization (`text_norm.py`, applied identically to teacher text and at inference) gives consistent readings for phone numbers, extensions, email addresses, street addresses, prices, dates (zh + en), times, temperatures (°C), percentages, decimals, counts, and serial numbers; digit‑by‑digit vs cardinal vs ordinal is chosen by entity + language context.
- ASR quality gate: generic clips are transcribed and kept only if they match their text, using a Taiwan‑tuned recognizer so the gate never penalizes the accent we want (proper‑noun‑heavy coverage clips are trusted unfiltered, since ASR mangles proper nouns):
  - zh & code‑mix → Breeze‑ASR‑25 Han‑level CER
  - English → Whisper‑medium WER
| split | clips (post‑gate) |
|---|---|
| pure Chinese | 11,842 |
| code‑mix (zh+en) | 13,422 |
| pure English | 4,283 |
| total | 29,547 |
The corpus is assembled from 32,500 teacher clips; the generic subset passes the ASR gate (≈21% dropped), the named‑entity coverage subset is trusted unfiltered, and English rows are upsampled ×2 (~27% exposure) to protect English quality. English phones additionally carry v1's native pronunciation via the warm‑start.
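The gate's keep/drop decision can be sketched as a character error rate compared against a threshold. The sketch reads "Han‑level CER" as edit distance over Han characters only, divided by the reference length; that reading, the helper names, and the `max_cer` default are assumptions, since the cutoff used by the real `asr_filter.py` is not stated here:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def han_only(text: str) -> list[str]:
    """Keep CJK Unified Ideographs; drop punctuation, Latin, digits."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def han_cer(reference: str, hypothesis: str) -> float:
    ref, hyp = han_only(reference), han_only(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def passes_gate(reference: str, hypothesis: str, max_cer: float = 0.1) -> bool:
    """Keep the clip only if the transcript is close enough to its text."""
    return han_cer(reference, hypothesis) <= max_cer


print(passes_gate("您好,歡迎來電。", "您好歡迎來電"))  # punctuation is ignored
print(han_cer("風很涼", "風很亮"))                      # one substitution in three
```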
How it was trained — the levers
The recipe was established on the 4.63M v3 (kept in v3_4.6M/) and grown into the 6.85M v4
default. Three data/alignment levers carry across both and matter most for a tiny model:
- Phone‑level alignment (`scripts/align_durations_v4.py`): true per‑phone durations (espeak phoneme‑CTC + `torchaudio.forced_align`) instead of crude char/letter CTC. Sub‑syllable boundary accuracy is what separates intelligible speech from fluent babble; skipping this makes tiny TTS garble.
- Vocabulary coverage + diverse code‑mix: broad character coverage and a code‑mix frame bank (varied syntax, English in varied positions) so the model isn't overfit to a few templates.
- Teacher choice: the English a tiny model learns is only as native as the teacher's; VoxCPM2 gives clean, natural zh and en in one voice.
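To make "true per‑phone durations" concrete: a CTC forced aligner yields one label per audio frame (a phone id or a blank), and durations are obtained by counting frames per phone. The sketch below assumes such a frame‑level path is already available, that blank is id 0, and that blank frames belong to the preceding phone (leading blanks go to the first phone). That blank policy and the function name are assumptions, not necessarily what `align_durations_v4.py` does:

```python
def durations_from_path(path, num_phones, blank=0):
    """Convert a frame-level CTC alignment path into per-phone frame counts."""
    durations = [0] * num_phones
    leading_silence = 0
    phone_idx = -1
    prev = blank
    for label in path:
        # A new phone starts on any non-blank label that differs from the
        # previous frame (a blank separates repeated identical phones).
        if label != blank and label != prev:
            phone_idx += 1
            if phone_idx >= num_phones:
                raise ValueError("path contains more phones than expected")
        if phone_idx < 0:
            leading_silence += 1
        else:
            durations[phone_idx] += 1
        prev = label
    if phone_idx != num_phones - 1:
        raise ValueError("path contains fewer phones than expected")
    if durations:
        durations[0] += leading_silence
    return durations


path = [0, 0, 5, 5, 0, 7, 7, 7, 0, 5, 0]   # toy path: phones 5, 7, 5
print(durations_from_path(path, num_phones=3))
```

The durations always sum to the number of frames, which is what the length regulator needs at training time.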
What changed from v3 (4.63M) → v4 (6.85M, current default):
- More capacity — hidden 168→184, decoder 6→7 layers, ff×3→×4, and contextual prosody predictors on (a Conv‑FFN refinement block + per‑phone duration/energy/bright/pitch deltas). Latency is unchanged (vocoder + host length‑regulation dominate) — still CPU real‑time, incl. 1.3× real‑time on 2 cores of a Jetson Nano.
- Multi‑resolution‑STFT clarity loss added alongside the 2D mel‑GAN (both ramp in after a 25k‑step pure‑reconstruction warmup) to sharpen the predicted mel.
- Trained from scratch (the architecture changed, so the v1 warm‑start no longer applies). v3 was warm‑started from Inflect‑Nano‑v1's English‑native checkpoint, which is why v3 still edges v4 on code‑mixed text; v4 matches v3 on pure Mandarin (zh‑CER 0.083) and was chosen for clearer, more naturally‑timed delivery (preferred by ear).
The shipped v4 checkpoint is the held‑out best (50k step) — past that the mel‑GAN keeps sharpening the train mel but held‑out intelligibility drifts, so sweep the held‑out set and pick the optimum rather than the last step. A v1 warm‑start of the v4 architecture is the clear next lever for closing the code‑mix gap, since pure‑zh is already at parity.
Architecture
Acoustic: `MicroFastSpeech` (v4: 5.43M): depthwise Conv‑FFN, no attention, external durations + length regulator, frame‑pitch, BiGRU, postnet, plus contextual prosody predictors. Exact v4 config (read from the checkpoint; the trainer builds from these flags):

    {
      "vocab_size": 256, "tone_size": 16, "lang_size": 4, "n_mels": 80,
      "hidden": 184, "encoder_layers": 6, "decoder_layers": 7, "decoder_ff_mult": 4,
      "kernel_size": 7, "speaker_count": 2, "speaker_dim": 64, "dropout": 0.08,
      "sample_rate": 24000, "max_frames": 1400, "postnet_scale": 0.1,
      "use_frame_pitch": true, "use_frame_pitch_refiner": true, "abs_frame_bins": 512,
      "use_contextual_predictors": true, "use_group_duration_planner": true
    }

The group‑duration planner is trained but disabled at export (it uses a non‑ONNX‑able host loop and only adjusts inference‑time durations); `scripts/export_8k.py` sets it to `None` automatically, so the ONNX uses the plain per‑phone durations (with the contextual delta). The v3 4.63M config (hidden 168, enc 5 / dec 6, ff×3, predictors off) is preserved in `v3_4.6M/`.

Vocoder: Snake‑HiFiGAN (~1.17M), 24 kHz variant `snake_v2mid` (sr 24000, n_fft 1024, hop 256, 80 mels, fmax 12000), retrained on the teacher corpus. Shared by v3 and v4.

Frontend: `g2pw` (Taiwan bopomofo + polyphone disambiguation) + `g2p_en` (arpabet), merged into one phone sequence with per‑phone language ids, so zh, en, and code‑mix are handled in a single pass. 88‑symbol table (`symbol_table.json`), identical for v3 and v4.

Long text: the absolute positional code saturates past `max_frames` (~1400 frames ≈ 15 s), so utterances longer than that are auto‑chunked at punctuation (`scripts/synth_long.py`); the live Space does this transparently.
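The 15 s figure and the chunking idea can both be sketched briefly. The frame arithmetic assumes acoustic frames use the vocoder's hop of 256 samples at 24 kHz. The chunker uses a character budget as a stand‑in for a frame budget and a simple punctuation set, so it is an illustration of the approach and not a reproduction of `scripts/synth_long.py`; all names below are hypothetical:

```python
import re

HOP, SAMPLE_RATE, MAX_FRAMES = 256, 24000, 1400
PUNCT = "。!?;,、,.!?;"   # naive: "." also splits decimals and e-mail addresses


def frames_to_seconds(frames: int, hop: int = HOP, sr: int = SAMPLE_RATE) -> float:
    return frames * hop / sr


def chunk_at_punctuation(text: str, max_chars: int) -> list[str]:
    """Greedily pack punctuation-delimited pieces into chunks of <= max_chars.

    A single piece longer than max_chars is kept whole rather than cut mid-clause.
    """
    pieces = re.findall(f"[^{PUNCT}]+[{PUNCT}]*|[{PUNCT}]+", text)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


print(round(frames_to_seconds(MAX_FRAMES), 2))   # window length in seconds
print(chunk_at_punctuation("您好,歡迎使用 PrimeTTS。Thank you for calling.", 12))
```

Each chunk is synthesized separately and the waveforms are concatenated; splitting only at punctuation keeps the joins at natural pauses.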
Reproduce / fine‑tune your own
Pipeline: teacher corpus → ASR gate → align → train vocoder → warm‑start + train acoustic → export. Repo layout:
acoustic_encoder.onnx acoustic_decoder.onnx vocoder.onnx meta.json symbol_table.json ← DEFAULT = v4 6.85M (24 kHz)
v3_4.6M/{acoustic_encoder,acoustic_decoder,vocoder}.onnx v3_4.6M/meta.json ← prior 4.63M default, for record/rollback
checkpoints/inflect-micro-fastspeech-v4-50000.pt ← v4 acoustic (shipped)
checkpoints/inflect-micro-fastspeech-v3-30000.pt ← v3 acoustic (4.63M)
checkpoints/hifigan-snake_v2mid-final.pt ← vocoder (shared by v3 & v4)
scripts/ frontend, aligner, corpus‑gen, train/export (export_8k.py), long‑text chunking (synth_long.py), eval
inflect_nano/ the trainer (acoustic.py + vocoder.py), forked from Inflect‑Nano‑v1 (LICENSE included)
Prerequisites: Python 3.12, a GPU for training; pip install torch torchaudio transformers onnxruntime soundfile librosa g2pw g2p_en cn2an opencc faster-whisper edge-tts.
1 · Teacher corpus (one cloned voice)
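# make a Taiwan‑female reference, then VoxCPM2‑clone every line in that voice
edge-tts --voice zh-TW-HsiaoChenNeural --text "<ref sentence>" --write-media ref.mp3
# convert to wav and write the transcript (same conversion as in the own-voice recipe)
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav ; printf '%s' "<ref sentence>" > ref.txt
python gen_voxcpm_corpus.py --texts texts.jsonl --ref ref.wav --ref-text ref.txt \
--out-dir corpus --manifest manifest.jsonl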
2 · ASR quality gate (Taiwan‑tuned)
python asr_filter.py --manifest manifest.jsonl --out manifest \
--device cuda # Breeze‑ASR‑25 (zh/mix) + Whisper‑medium (en) → manifest.clean.jsonl
3 · Phone‑level alignment ⭐ the key step
python scripts/align_durations_v4.py --manifest manifest.clean.jsonl --out align.jsonl
4 · Train the 24 kHz vocoder
PYTHONPATH=. python -m inflect_nano.vocoder --train-jsonl voc_rows.jsonl \
--out-dir vocoder_24k --variant snake_v2mid --steps 40000 --segment-size 16384 --stft-weight 2.5
5 · Train the acoustic model
v4 (current default, 6.85M, from scratch) — capacity + contextual prosody + mel‑GAN + MR‑STFT clarity:
PYTHONPATH=. PITCH_CACHE_DIR=pitch_cache python -m inflect_nano.acoustic \
--durations-jsonl align.jsonl --out-dir acoustic_24k_v4 \
--hidden 184 --encoder-layers 6 --decoder-layers 7 --decoder-ff-mult 4 \
--contextual-predictors --group-duration-planner --group-duration-weight 0.05 \
--vocoder-variant snake_v2mid --sample-rate 24000 \
--vocoder-checkpoint vocoder_24k/hifigan-snake_v2mid-final.pt --vocoder-mel-weight 1.0 \
--vocoder-mrstft-weight 1.0 --mrstft-warmup-steps 25000 \
--mel-gan-weight 0.1 --gan-2d --gan-fm-auto --gan-r1-gamma 1.0 --gan-crop 128 --gan-warmup-steps 25000 \
--frame-pitch-weight 1.0 --duration-weight 0.08 --pitch-weight 0.04 \
--steps 60000 --batch-size 8 --lr 2e-4 --max-frames 1400 --en-upsample 2 \
--save-interval 5000 --preload-features --device cuda
# ~5 h on one 24 GB GPU. Ship the HELD-OUT best (step 50000 here), not the last step.
v3 (4.63M, warm‑started) — the prior default, reproduced for record:
PYTHONPATH=. python -m inflect_nano.acoustic --durations-jsonl align.jsonl \
--out-dir acoustic_24k_v3 --vocoder-variant snake_v2mid --sample-rate 24000 \
--vocoder-checkpoint vocoder_24k/hifigan-snake_v2mid-final.pt --vocoder-mel-weight 1.0 \
--init-checkpoint inflect_nano_v1_acoustic.pt \
--mel-gan-weight 0.1 --gan-2d --gan-fm-auto --gan-r1-gamma 1.0 --gan-crop 128 --gan-warmup-steps 25000 \
--steps 60000 --batch-size 8 --max-frames 1400 --en-upsample 2 # ship step 30000
6 · Export to ONNX + evaluate
python scripts/export_8k.py --acoustic-ckpt acoustic_24k_v4/…pt --vocoder-ckpt vocoder_24k/…pt --out-dir onnx/
python scripts/synth_from_text.py --onnx-dir onnx --out-dir syn --texts eval.jsonl
python scripts/assess_big.py --synth-dir syn # offline CER/WER
Evaluate on ≥30 held‑out sentences — small eval sets are too noisy to trust. Sweep checkpoints and pick the held‑out sweet spot (the GAN keeps improving train‑set sharpness past the held‑out optimum).
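The checkpoint sweep reduces to picking the step with the lowest held‑out error, not the last one saved. A minimal sketch, assuming the held‑out scores have already been collected into a mapping from step to CER (for example by running the export and assessment commands once per checkpoint); the function name is hypothetical and the numbers in the usage line are toy values, not measurements:

```python
def pick_best_checkpoint(heldout_cer: dict[int, float]) -> int:
    """Return the step with the lowest held-out CER (earliest step wins ties)."""
    if not heldout_cer:
        raise ValueError("no checkpoints were evaluated")
    return min(heldout_cer, key=lambda step: (heldout_cer[step], step))


# Toy values for illustration only: error falls, then drifts back up.
toy_scores = {1: 0.30, 2: 0.20, 3: 0.25}
print(pick_best_checkpoint(toy_scores))
```

Preferring the earliest step on ties is a deliberate choice here: it favours the checkpoint with less GAN sharpening when held‑out accuracy is equal.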
Train on your OWN voice — one command
Swap the reference voice; everything else (text pools, ASR gate, alignment, recipe) is fixed. Both
vocoder and acoustic are retrained (both are voice-specific). Text pools + eval sets are bundled in
data/ and at the repo root, so it reproduces exactly.
# 0. one venv with the deps (see prereqs in scripts/rebuild_voice.sh), PYTHONPATH=repo root,
# and inflect_nano_v1_acoustic.pt from owensong/Inflect-Nano-v1 for the warm-start.
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS && cd PrimeTTS
cp data/*.jsonl data/*.txt . # text pools at root
# 1. a ~10 s clip of your voice. For a commercial-clear reference, use a CC0 source such as
# Mozilla Common Voice zh-TW (the shipped model uses a young-female Common Voice speaker). Or synth one:
edge-tts --voice zh-TW-HsiaoYuNeural --text "您好,歡迎來電。Thank you for calling." --write-media ref.mp3
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav ; printf '%s' "您好,歡迎來電。Thank you for calling." > ref.txt
# 2. ONE command -> corpus -> gate -> align -> vocoder -> acoustic -> export
PY=/path/to/venv/bin/python ./scripts/rebuild_voice.sh ref.wav ref.txt myvoice
# -> pick best corpus_myvoice/onnx_<K>/ (~35k is the usual held-out sweet spot)
Time on dual RTX 5090: ≈ 9 h end-to-end (6.5 h to a shippable 35k checkpoint) — synth ~2 h,
gate+align ~25 min, then vocoder (3 h) ∥ acoustic (~4–7 h) in parallel, export ~15 min.
Credits & licenses
- Base model / trainer: `owensong/Inflect-Nano-v1` (Apache‑2.0; see `inflect_nano/LICENSE.inflect-nano`)
- Teacher TTS: `openbmb/VoxCPM2` · Reference voice: Mozilla Common Voice zh‑TW (CC0 / public domain)
- Gate ASR: `Breeze-ASR-25` (MediaTek Research, Taiwan Mandarin + code‑switch) · OpenAI Whisper‑medium
- Aligner: `facebook/wav2vec2-lv-60-espeak-cv-ft` + `torchaudio.forced_align`
- Frontend: `g2pw` (Taiwan readings) + `g2p_en` · Eval ASR: sherpa‑onnx X‑ASR (zh‑en Zipformer)
This repository: Apache‑2.0.