Supertone launches Supertonic 3: an on-device text-to-speech model with support for 31 languages, fewer reading failures, and expression tags


Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 adds support for 31 languages, improves reading accuracy with fewer repetitions and skips, and ships public ONNX assets that remain compatible with v2. The result is an extremely fast, accurate, multilingual TTS that runs entirely on-device.

What changed from v2 to v3

Compared to Supertonic 2, Supertonic 3 reduces repetition and skip failures, improves speaker similarity across the shared languages, and expands language coverage from 5 to 31 languages. Version 2 supports English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese – a total of 31 ISO language codes. There is also a special na fallback code for text whose language is unknown or outside the supported set.

The model grows modestly to accommodate the added languages. At approximately 99 million parameters across the public ONNX assets, Supertonic 3 is significantly smaller than 0.7B-2B open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device memory use. The update brings the total disk footprint of the public ONNX assets to 404 MB. Supertone also recently launched Supertone Audio Creator, which lets developers create custom, original TTS voices from their own audio recordings.

One capability new in version 3 that was not present in version 2 is support for expression tags. Supertonic 3 supports simple inline tags that embed prosodic cues, such as breaths and laughter, directly into the input text without a separate preprocessing step or a separate expression model. For engineers designing voice interfaces or accessibility tools, this means pauses, breathing, or laughter can be specified right in the text payload.

Architecture and runtime

The basic architecture carries over from previous versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching-based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just two inference steps. To further improve output quality, version 3 integrates length-aware rotary position embedding (LARoPE) for tighter text-speech alignment, and self-purifying flow matching, an in-training technique that stays robust to noisy data labels.
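To see why two-step sampling is viable, consider a toy sketch (not Supertonic's actual code): when the vector field points straight along the interpolation path, a couple of Euler steps already land on the target. Here we integrate the exact conditional field for a linear path toward a fixed target point.

```python
# Toy illustration of few-step flow-matching sampling (not Supertonic's code).
# For the linear path x_t = (1 - t) * x0 + t * x1, the conditional vector field
# is u_t(x) = (x1 - x) / (1 - t). Because this field is "straight", coarse
# Euler integration reaches the target x1 in very few steps.

def sample_with_euler(x0: float, x1: float, num_steps: int) -> float:
    """Integrate dx/dt = (x1 - x) / (1 - t) from t=0 to t=1 with Euler steps."""
    x, t = x0, 0.0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        v = (x1 - x) / (1.0 - t)  # conditional vector field toward the target
        x += dt * v
        t += dt
    return x

# Two steps are enough to reach the target exactly for this straight field.
print(sample_with_euler(x0=-1.0, x1=5.0, num_steps=2))  # 5.0
```

In the real model, the field is predicted by a neural network conditioned on the text; the straighter the learned field, the fewer inference steps are needed for usable audio.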

In terms of runtime efficiency, Supertonic 3 runs fast on CPU, even compared to larger baselines measured on an A100 GPU, and uses significantly less memory. It does not require a GPU, which makes local, browser, and edge deployment much easier.
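Throughput for on-device TTS is usually quoted as a real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. A minimal helper (illustrative, not part of the SDK):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1.0 means the system generates audio faster than real time."""
    return synthesis_seconds / audio_seconds

# Example: 0.9 s of compute to produce 3.0 s of audio gives an RTF of ~0.3.
print(real_time_factor(0.9, 3.0))  # ~0.3
```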

Reading accuracy

Across the measured languages, Supertonic 3 stays within a competitive WER/CER range against larger open TTS models such as VoxCPM2, while maintaining a lightweight on-device deployment path. WER (word error rate) and CER (character error rate) are standard TTS readability metrics: you synthesize a passage, run ASR on the output, and compare the transcription to the original text. CER is used for languages that do not have clear word boundaries; the others use WER. The system's efficiency shows even on modest hardware: it achieves an average RTF of 0.3 on an Onyx Boox Go 6 (an e-ink e-reader) in airplane mode. The ecosystem has also expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation relies on onnxruntime-web for purely client-side inference.
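The WER/CER computation described above can be sketched with a standard edit-distance routine (illustrative; real evaluation harnesses typically also normalize casing and punctuation first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, for languages without clear word boundaries."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the quick brown fox", "the quik brown fox"))  # 0.25
```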

Text normalization

A differentiating feature carried over from version 2 is inline text normalization. Supertonic handles complex surface forms – financial expressions such as $5.2M, phone numbers with area codes and extensions such as (212) 555-0142 ext. 402, date and time formats such as 4:45 PM on Wed, Apr 3, 2024, and technical units such as 2.3h and 30kph – without any preprocessing pipeline or phonetic annotations. The financial expression “$5.2M” should be read as “five point two million dollars,” and “$450,000” as “four hundred and fifty thousand dollars”; all four competing systems failed on these. The technical unit “2.3h” should be read as “2.3 hours” and “30kph” as “thirty kilometers per hour”; all four competitors failed in this category as well. The competing systems evaluated were ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
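To make the normalization task concrete, here is a toy sketch of the kind of expansion a TTS front end performs for technical units. This is purely illustrative – Supertonic does this inside the model, not via an explicit regex pass like this one:

```python
import re

# Toy unit expansions (illustrative only; Supertonic handles these internally,
# without a rule-based normalizer like this).
UNIT_WORDS = {
    "kph": "kilometers per hour",
    "h": "hours",
}

def normalize_units(text: str) -> str:
    """Expand compact technical units such as '2.3h' or '30kph' into words."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)(kph|h)\b")
    return pattern.sub(lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}", text)

print(normalize_units("The trip takes 2.3h at 30kph"))
# The trip takes 2.3 hours at 30 kilometers per hour
```

A full normalizer must also disambiguate by context (e.g. "$5.2M" versus "5.2M pixels"), which is why doing it inside the model is attractive.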

https://github.com/supertone-inc/supertonic

Getting started

Installing the Python SDK is a single pip install supertonic. On first run, the SDK automatically downloads the model assets from Hugging Face. Minimal example:

from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")


Summary

Supertonic 3: On-device text-to-speech, now in 31 languages

Supertonic 3 is a lightweight, open-source text-to-speech system from Supertone Inc. It runs entirely via the ONNX Runtime on your device – no cloud, no API calls, and no data leaving your device. Version 3 expands from 5 to 31 languages, adds expression tags, reduces reading failures, and remains compatible with the v2 ONNX interface.

31
Languages

~99M
Parameters

404 MB
ONNX assets

MIT
Code license

What’s new in version 3

Four key improvements over Supertonic 2

Version 3 is a focused upgrade – same inference interface, measurably better output.

  • 🌐
    31 languages – expanded from v2's five languages (en, ko, es, pt, fr). Now includes Japanese, Arabic, German, Hindi, Russian, Turkish, Vietnamese, and 19 other ISO codes, plus a special na fallback for unknown languages.

  • More stable reading – fewer instances of repetition and skipping, especially on very short and very long inputs. This was a known limitation of v2 that v3 addresses directly.
  • 🎭
    Expression tags – prosodic cues such as breaths and laughter can be embedded directly in the text, with no separate preprocessing or external model.
  • 🔊
    Higher speaker similarity – improved similarity across the shared languages compared to Supertonic 2. Voices are now more consistent across languages.

Installation

Get running in less than a minute

Install the Python SDK via pip. Upon first launch, model assets are automatically downloaded from Hugging Face – no manual setup required.

pip install supertonic

Quick start

Basic Python usage

The SDK automatically downloads model assets on first run. Select a voice, synthesize the text with a language code, and save the WAV output.

from supertonic import TTS

# Auto-downloads ONNX assets on first run
tts = TTS(auto_download=True)

# Select a preset voice (M1-M5 male, F1-F5 female)
style = tts.get_voice_style(voice_name="M1")

text = "A gentle breeze moved through the open window."

# synthesize() returns (wav_array, duration_in_seconds)
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")

text = "I can't believe it  that actually worked!"
wav, duration = tts.synthesize(text, voice_style=style, lang="en")

Languages

31 supported languages + na fallback

All 31 languages share the same ONNX model architecture and inference pipeline. Use the na code for text whose language is unknown or outside the supported set.
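The fallback behavior can be wrapped in a small helper (illustrative, not part of the SDK; the set of codes is taken from the table below):

```python
# ISO 639-1 codes supported by Supertonic 3, plus the special "na" fallback.
SUPPORTED_LANGS = {
    "en", "ko", "ja", "ar", "bg", "cs", "da", "de", "el", "es", "et",
    "fi", "fr", "hi", "hr", "hu", "id", "it", "lt", "lv", "nl", "pl",
    "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "vi",
}

def resolve_lang(code: str) -> str:
    """Return the code if supported, otherwise fall back to 'na'."""
    return code if code in SUPPORTED_LANGS else "na"

print(resolve_lang("de"))  # de
print(resolve_lang("zh"))  # zh is not supported -> na
```

A call site would then pass the resolved code, e.g. tts.synthesize(text, voice_style=style, lang=resolve_lang(user_lang)).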

en  English
ko  Korean
ja  Japanese
ar  Arabic
bg  Bulgarian
cs  Czech
da  Danish
de  German
el  Greek
es  Spanish
et  Estonian
fi  Finnish
fr  French
hi  Hindi
hr  Croatian
hu  Hungarian
id  Indonesian
it  Italian
lt  Lithuanian
lv  Latvian
nl  Dutch
pl  Polish
pt  Portuguese
ro  Romanian
ru  Russian
sk  Slovak
sl  Slovenian
sv  Swedish
tr  Turkish
uk  Ukrainian
vi  Vietnamese

Text normalization

Handles complex inputs without preprocessing

Supertonic 3 reads financial expressions, dates, phone numbers, and technical units correctly out of the box – no G2P module or phonetic annotations required. Below: Supertonic versus four major commercial/open systems.

Category | Example input | Supertonic 3 | ElevenLabs / OpenAI / Gemini / Microsoft
Financial expressions | $5.2M / $450,000 | Correct | All four failed
Time and date | 4:45 PM on Wed, Apr 3, 2024 | Correct | All four failed
Phone number | (212) 555-0142 ext. 402 | Correct | All four failed
Technical units | 2.3h / 30kph | Correct | All four failed

Publishing and resources

Works everywhere – 11 platforms, no GPU required

The public ONNX assets run on the CPU with no GPU dependency. Browsers are supported via WebGPU and WASM through onnxruntime-web. Audio output is 16-bit WAV; batch inference is supported.

🐍 Python – ONNX Runtime
🟨 Node.js – server-side JS
🌐 Browser – WebGPU/WASM
Java – JVM
⚙️ C++ – high performance
🔷 C# – .NET
🔵 Go – Go runtime
🍎 Swift/iOS – native
🦀 Rust – systems
💙 Flutter – cross-platform
📄 Code: MIT license
🤖 Model: OpenRAIL-M license

Key takeaways

  • Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404MB
  • New in version 3: expression tags, more stable reading of short and long inputs, and improved speaker similarity compared to version 2
  • The public ONNX interface is compatible with v2 – existing integrations can upgrade without changing inference code
  • Reading accuracy was benchmarked against VoxCPM2; version 3 stays within a competitive WER/CER range while being much smaller
  • Full RTF/throughput numbers for v3 are not published; the 167x faster-than-real-time figure is a v2 benchmark and should not be assumed to carry over to v3
  • Native output is 16-bit WAV, ensuring high-fidelity audio for engineering applications

Check out the GitHub repo and the Hugging Face Space for the models and demos.


