Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 supports 31 languages, improves reading accuracy, reduces duplicated and skipped words, and ships generic ONNX assets compatible with v2, making it a fast, accurate, multilingual option for on-device text-to-speech.
What changed from v2 to v3
Compared to Supertonic 2, Supertonic 3 reduces repetition and skip failures, improves speaker similarity across the common language corpus, and expands language coverage from 5 to 31 languages. Version 2 supports English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for a total of 31 ISO language codes. There is also a special fallback code for text whose language is unknown or outside the supported set.
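The language-selection behavior described above can be sketched as a simple lookup with a fallback. This is an illustrative sketch only; the helper name, the set literal, and the fallback handling are assumptions for illustration, not the SDK's actual API.

```python
# Illustrative language-code resolution with a fallback.
# The code list follows the article; everything else is assumed.
SUPPORTED_LANGS = {
    "en", "ko", "es", "pt", "fr",              # carried over from v2
    "ja", "ar", "bg", "cs", "da", "de", "el",  # added in v3
    "et", "fi", "hr", "hu", "id", "it", "lt",
    "lv", "nl", "pl", "ro", "ru", "sk", "sl",
    "sv", "tr", "uk", "vi",
}

def resolve_lang(code: str) -> str:
    """Return the requested ISO code if supported, else a fallback marker."""
    code = code.lower()
    return code if code in SUPPORTED_LANGS else "na"
```

A caller would resolve the language once, then pass the result to synthesis, so unrecognized inputs degrade gracefully to the fallback path instead of failing.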
The model grows modestly to accommodate the added languages. At approximately 99 million parameters across the public ONNX assets, Supertonic 3 is significantly smaller than 0.7B to 2B open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device deployment. The update brings the total disk footprint of the public ONNX assets to 404 MB. Supertone also recently launched Supertone Audio Creator, which lets developers create custom, original TTS voices from their own audio recordings.
One capability new in version 3, absent from version 2, is support for emoji expression tags. Supertonic 3 supports simple expression tags written directly as emoji in the input text, which let you embed prosodic cues such as breaths and laughter without a separate preprocessing step or expression model. For engineers building voice interfaces or accessibility tools, this means breathing pauses or laughter can be specified inline in the text payload.
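To make the idea concrete, here is a hypothetical helper that locates emoji expression cues in a text payload before synthesis. The specific emoji, their cue names, and the function itself are assumptions for illustration; the article does not list Supertonic 3's actual tag set.

```python
import re

# Hypothetical tag set for illustration only; not Supertonic's actual tags.
EXPRESSION_TAGS = {"😤": "breath", "😄": "laugh", "😢": "sob"}

TAG_PATTERN = re.compile("|".join(map(re.escape, EXPRESSION_TAGS)))

def find_expression_cues(text: str):
    """Return (position, cue_name) pairs for each emoji tag found in the text."""
    return [(m.start(), EXPRESSION_TAGS[m.group()])
            for m in TAG_PATTERN.finditer(text)]
```

In Supertonic's design the engine consumes such cues directly from the text, so a helper like this would only matter if you wanted to log or validate cues on your side.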
Architecture and runtime
The basic architecture carries over from previous versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching-based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field transforming a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just two inference steps. Version 3 further integrates length-aware rotary position embedding (LARoPE) to align text and speech more precisely, and self-purifying flow matching, an in-training technique that stays robust to noisy data labels.
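Few-step flow-matching sampling can be illustrated with a toy example. The sketch below is not Supertonic's learned model: it uses a point target, for which the vector field along the linear probability path is known analytically, so two Euler steps of the flow ODE carry Gaussian noise all the way to the target.

```python
import numpy as np

def vector_field(x, t, target):
    # For the linear path x_t = (1 - t) * x0 + t * target, the
    # conditional vector field toward a fixed point target is analytic.
    return (target - x) / (1.0 - t)

def sample(target, n_steps=2, n_samples=4, seed=0):
    """Integrate the flow ODE from t=0 to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)  # start from the base distribution
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # evaluate the field before each step (t < 1)
        x = x + dt * vector_field(x, t, target)
    return x
```

In a real TTS model the vector field is a neural network and the target is a distribution over audio latents, but the same few-step Euler integration is what makes two-step inference possible.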
In terms of runtime efficiency, Supertonic 3 runs fast on CPU, even compared to larger baselines measured on an A100 GPU, and uses significantly less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.
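The metric usually behind such runtime claims is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. A quick sketch, with illustrative numbers rather than measured ones:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds

# Illustrative: producing 10 s of audio in 0.06 s of compute
factor = rtf(0.06, 10.0)
speedup = 1.0 / factor  # roughly 167x faster than real time
```

The inverse of RTF is the "N-times faster than real time" figure quoted in TTS benchmarks.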
Reading accuracy
Across the languages measured, Supertonic 3 stays within a competitive WER/CER range against larger open TTS models such as VoxCPM2, while retaining a lightweight on-device deployment path. WER (word error rate) and CER (character error rate) are standard TTS readability metrics: you synthesize a passage, run ASR on the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. Efficiency holds up even on modest hardware: Supertonic 3 achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an e-ink e-reader) in airplane mode. The ecosystem has also expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation relies on onnxruntime-web for pure client-side inference.
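The WER/CER evaluation loop described above reduces to edit distance between the reference text and the ASR transcript. A self-contained sketch of both metrics (a generic implementation, not the benchmark's exact scoring code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over word tokens / reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, for languages without clear word boundaries."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both; a WER of 0.05 means roughly one word in twenty was misread (or mistranscribed by the ASR, which is why the metric is a proxy rather than ground truth).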
Text normalization
A differentiating feature carried over from version 2 is inline text normalization. Supertonic handles complex surface forms without any pre-processing pipeline or audio annotations: financial expressions such as $5.2M, phone numbers with area codes and extensions such as (212) 555-0142 ext. 402, date and time formats such as 4:45 PM on Wed, Apr 3, 2024, and technical units such as 2.3h and 30kph. The financial expression "$5.2M" should be read as "five point two million dollars," and "$450,000" as "four hundred fifty thousand dollars"; all four competing systems failed here. The technical unit "2.3h" should be read as "2.3 hours" and "30kph" as "thirty kilometers per hour"; all four competitors failed in this category as well. The competing systems evaluated were ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
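To make the normalization task concrete, here is a toy rule-based normalizer for two of the patterns above. It is purely illustrative; Supertonic's internal normalizer is not public, and these hand-written rules cover only the exact shapes shown.

```python
import re

DIGITS = {str(i): w for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}
TENS = {"10": "ten", "20": "twenty", "30": "thirty"}

def normalize(text: str) -> str:
    """Expand a couple of surface forms into speakable words (toy rules)."""
    # "$5.2M" -> "five point two million dollars"
    text = re.sub(r"\$(\d)\.(\d)M",
                  lambda m: f"{DIGITS[m[1]]} point {DIGITS[m[2]]} million dollars",
                  text)
    # "30kph" -> "thirty kilometers per hour"
    text = re.sub(r"\b(\d+)kph\b",
                  lambda m: f"{TENS.get(m[1], m[1])} kilometers per hour",
                  text)
    return text
```

The point of the comparison in the article is that Supertonic performs this kind of expansion inside the model, so callers do not have to maintain rule sets like this one.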


Getting started
To install the Python SDK, run pip install supertonic. On first run, the SDK automatically downloads the model assets from Hugging Face. A minimal example:
from supertonic import TTS

# Downloads model assets from Hugging Face on first use
tts = TTS(auto_download=True)

# Pick one of the bundled voice styles
style = tts.get_voice_style(voice_name="M1")

text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
Key takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404MB
- New in version 3: emoji expression tags, more stable reading of short and long passages, and improved speaker similarity compared to version 2
- The public ONNX interface is compatible with v2, so existing integrations can upgrade without changing inference code
- Reading accuracy was benchmarked against VoxCPM2; version 3 stays within a competitive WER/CER range while being much smaller
- RTF/throughput numbers for v3 are not published; the 167x-faster-than-real-time figure is the v2 benchmark and should not be assumed to carry over to v3
- Native output is 16-bit WAV, ensuring high-fidelity audio for engineering applications
Check out the GitHub repo and Hugging Face Space for the models and code.