Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Translation Across 60 Languages with 2.8-Second Latency


Simultaneous interpretation is one of the hardest problems in applied AI. You are asking a model to translate speech before the speaker has finished the sentence, and every extra second of delay breaks the illusion of real-time communication. Alibaba's Qwen team has been chipping away at that delay with every release. Their latest model, Qwen3.5-LiveTranslate-Flash, brings latency down to 2.8 seconds and expands input language coverage to 60 languages.

https://qwen.ai/blog?id=qwen3.5-livetranslate

A meaningful leap from the previous version

Qwen3-LiveTranslate-Flash handled 18 input languages with a response time of about three seconds. Qwen3.5-LiveTranslate-Flash brings this down to 2.8 seconds, expands input coverage to 60 languages, and adds speech output in 29 languages. That is more than a 3x expansion in language coverage on the input side. For developers building multilingual products, this reduces the need to switch models per language in most global enterprise scenarios.

The latency improvement comes from a technique for processing what the team calls "read units." Instead of waiting for a complete sentence to arrive before producing output, the model decides when enough meaning has accumulated in a segment to commit to a translation. It streams output continuously while the speaker is still talking. This is the same basic logic as semantic unit prediction, but with a tighter implementation that shaves off roughly another 200 ms.
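As a rough intuition for this kind of incremental commit policy, the toy sketch below buffers incoming words and emits a segment as soon as a simple boundary heuristic fires. It is not the model's actual method: the `BOUNDARY_WORDS` set, the `max_words` cap, and the function name are all hypothetical, and the real system presumably makes this decision with learned semantics rather than a word list.

```python
from typing import Iterable, Iterator, List

# Hypothetical boundary cues; a real system would use a learned predictor.
BOUNDARY_WORDS = {"and", "but", "because", "so"}


def commit_segments(words: Iterable[str], max_words: int = 6) -> Iterator[List[str]]:
    """Yield segments as soon as they look complete, without waiting for the full sentence."""
    buffer: List[str] = []
    for word in words:
        # A conjunction suggests the buffered words already form a unit: commit them first.
        if word.lower() in BOUNDARY_WORDS and buffer:
            yield buffer
            buffer = []
        buffer.append(word)
        # Also commit on sentence-final punctuation or when the buffer grows too long.
        if word.endswith((".", "?", "!")) or len(buffer) >= max_words:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # flush whatever remains when the stream ends
```

The point of the sketch is the shape of the trade-off: committing earlier lowers latency, but each committed segment can no longer be revised when later words arrive.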

Vision is now a first-class input

Most translation systems treat audio as the only input signal. That works well in clean studio conditions. It breaks down in a crowded conference room, on a noisy trade show floor, or anywhere with overlapping voices and poor acoustics.

Qwen3.5-LiveTranslate-Flash takes a different approach. It analyzes visual information in parallel with the audio: on-screen text, physically displayed objects, lip movements, and gestures. When a word is phonetically ambiguous or the audio stream degrades, visual context fills the gap and improves the accuracy of the translation decision. This is not a secondary feature. In real-world deployments, audio quality is rarely guaranteed, and a vision channel lets the model handle the messy reality of live interpretation more gracefully than audio-only systems.

Voice cloning happens in real time

This is the part that stands out most in the Qwen3.5 release. Standard translation systems replace the speaker's voice with a generic synthetic voice. Qwen3.5-LiveTranslate-Flash instead reproduces the distinctive vocal characteristics of the original speaker during the translation itself. One spoken sentence is enough for the model to make this voice adaptation.

To listeners on the receiving end, the translated output sounds like the same person speaking the target language rather than a robotic substitute. This matters for simultaneous interpretation at live conferences, multilingual live streams, and international client calls. The experience feels significantly more human than what current systems offer.

Configure domain-specific keywords

One consistent failure mode of translation models in professional settings is proper nouns and specialized vocabulary. A model translating a medical briefing may consistently mistranslate the name of a drug. A legal interpretation session stumbles over a technical legal term.

Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can supply a glossary of brand names, medical terms, legal terms, or technical vocabulary, and the model handles these terms more reliably. This is not available in most general-purpose translation APIs, and it fills a real gap for domain-specific enterprise deployments.

Benchmark performance

On FLEURS and CoVoST2, two major benchmarks for multilingual speech translation, Qwen3.5-LiveTranslate-Flash outperforms leading commercial alternatives. FLEURS tests translation quality across a wide range of language pairs under real audio conditions. CoVoST2 covers 21 speech translation directions, making it a practical proxy for the performance of multilingual pipelines.

Marktechpost's Visual Explainer

What it does
Qwen3.5-LiveTranslate-Flash at a glance
Qwen3.5-LiveTranslate-Flash is an API-only, lightweight real-time translation model from Alibaba's Qwen team. It takes audio and video frames as input and outputs translated text and speech simultaneously. The model is served over a WebSocket-based protocol through Alibaba Cloud Model Studio.

Latency: 2.8 seconds (to audio output)
Input languages: 60 (speech + visual input)
Speech output: 29 languages with voice
Protocol: WebSocket (persistent connection)

  • Vision-enhanced understanding – Lip movements, gestures, and on-screen text all inform the translation decision alongside the audio

  • Real-time voice cloning – Reproduces the original speaker's voice profile in the translated output from a single spoken sentence

  • Semantic unit prediction – Commits to output segments before a full sentence ends, allowing continuous streaming without waiting for the complete utterance

  • Dynamic keyword configuration – Inject domain-specific glossaries at runtime for technical, medical, or legal terms

Before you start
Basic requirements
You need an Alibaba Cloud account with access to Model Studio and a valid DashScope API key. The model is available under the model ID qwen3-livetranslate-flash-realtime.

Create an Alibaba Cloud account

Sign up at alibabacloud.com and activate Alibaba Cloud Model Studio in your account console.

Get your DashScope API key

Go to Model Studio → API Keys. Create a key and store it in the DASHSCOPE_API_KEY environment variable. Never hardcode it in source files.

Install the Python dependencies

Install the websocket-client package for WebSocket communication. To capture audio, also install pyaudio.

Check your audio settings

The model accepts 16 kHz, mono, 16-bit PCM audio as input. Make sure your microphone or audio source can output this format before connecting.

Bash

# Install dependencies
pip install websocket-client pyaudio

# Set your API key as an environment variable
export DASHSCOPE_API_KEY="your_key_here"
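To sanity-check the audio requirement before connecting, a small helper like the one below can confirm that a WAV file matches the stated 16 kHz, mono, 16-bit format and work out how many bytes a chunk of a given duration occupies. This is a sketch using only the Python standard library; the function and constant names are illustrative and not part of any Alibaba SDK.

```python
import io
import wave

TARGET_RATE = 16000      # Hz, from the input format stated above
TARGET_CHANNELS = 1      # mono
TARGET_SAMPLE_WIDTH = 2  # bytes per sample (16-bit PCM)


def chunk_bytes(duration_ms: int) -> int:
    """Bytes of raw PCM needed for duration_ms of audio in the target format."""
    samples = TARGET_RATE * duration_ms // 1000
    return samples * TARGET_CHANNELS * TARGET_SAMPLE_WIDTH


def wav_matches_target(source) -> bool:
    """Return True if a WAV file (path or binary file object) is 16 kHz, mono, 16-bit."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == TARGET_RATE
            and wav.getnchannels() == TARGET_CHANNELS
            and wav.getsampwidth() == TARGET_SAMPLE_WIDTH
        )
```

For example, `chunk_bytes(100)` gives 3200: 1,600 samples at 2 bytes each for a 100 ms chunk.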

Step 3 – Connect
Establish a WebSocket connection
The model uses the WebSocket protocol for persistent two-way communication. You authenticate with a Bearer token in the connection header, using your DashScope API key.

Python

import json, websocket, os

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = (
    "wss://dashscope-intl.aliyuncs.com"
    "/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)

def on_open(ws):
    print("Connected to Qwen3.5-LiveTranslate-Flash")

def on_message(ws, message):
    data = json.loads(message)
    print("Translation event:", data)

def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)
ws.run_forever()

The connection stays open for the full session. You do not reconnect for every utterance. Send audio chunks and image frames continuously through the same socket.
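Because audio and image producers typically run in separate threads but share one socket, one option is to funnel every outgoing event through a single writer thread. The sketch below shows that pattern with a queue. `EventWriter` is a hypothetical helper rather than part of websocket-client or DashScope, and it works with any `send` callable that accepts a string.

```python
import json
import queue
import threading
from typing import Callable


class EventWriter:
    """Serialize events from several producer threads onto one send function."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def put(self, event: dict) -> None:
        """Queue an event; safe to call from any thread."""
        self._queue.put(event)

    def close(self) -> None:
        """Send everything queued so far, then stop the writer thread."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._thread.join()

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:
                break
            self._send(json.dumps(event))
```

In the snippets in this guide you would create `EventWriter(ws.send)` once the socket is open, then have the audio and video loops call `writer.put({...})` instead of `ws.send(...)`.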

Step 4 – Stream Audio
Configure and stream audio input
After connecting, send a session configuration event to set the source and target languages. Then stream PCM audio chunks continuously. The model uses session.input_audio_transcription.language to select the input language.

Python

import base64, json, threading, pyaudio

# Audio input config: 16 kHz, 16-bit PCM, mono
INPUT_RATE     = 16000
INPUT_CHUNK    = 1600   # 100 ms per chunk
INPUT_FORMAT   = pyaudio.paInt16
INPUT_CHANNELS = 1

def stream_microphone(ws):
    # Capture microphone audio and send it as base64-encoded PCM chunks
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=INPUT_RATE, channels=INPUT_CHANNELS,
        format=INPUT_FORMAT, input=True,
        frames_per_buffer=INPUT_CHUNK
    )
    while True:
        chunk = stream.read(INPUT_CHUNK)
        audio_b64 = base64.b64encode(chunk).decode()
        ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_b64
        }))

def on_open(ws):
    # 1. Send session config first
    session_cfg = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "language": "zh"   # source: Chinese
            },
            "translation": {
                "target_language": "en"   # target: English
            }
        }
    }
    ws.send(json.dumps(session_cfg))

    # 2. Stream microphone audio from a background thread, so this
    #    callback returns and incoming server events keep being dispatched
    threading.Thread(
        target=stream_microphone, args=(ws,), daemon=True
    ).start()

Do not send audio before the session.update event has been acknowledged. Wait for the server's session confirmation event before streaming audio chunks.
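One way to enforce that ordering is a small gate the audio thread waits on until the message handler has seen the confirmation. The sketch below uses `threading.Event`. The confirmation event name `session.updated` is an assumption, not something this article confirms, so check the Model Studio documentation for the exact event type.

```python
import json
import threading

# Assumed name of the server's confirmation event; verify against the official docs.
CONFIRM_EVENT_TYPE = "session.updated"

session_ready = threading.Event()


def handle_message(raw: str) -> dict:
    """Parse a server message and open the gate once the session is confirmed."""
    data = json.loads(raw)
    if data.get("type") == CONFIRM_EVENT_TYPE:
        session_ready.set()
    return data


def wait_until_ready(timeout: float = 5.0) -> bool:
    """Block the calling (audio) thread until confirmation arrives; False on timeout."""
    return session_ready.wait(timeout)
```

You would call `handle_message(message)` from `on_message`, and have the audio thread call `wait_until_ready()` before its first `input_audio_buffer.append`.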

Step 5 – Vision input
Send video frames for vision enhancement
Qwen3.5-LiveTranslate-Flash reads lip movements, gestures, and on-screen text from video frames alongside the audio. Send base64-encoded JPEG frames at regular intervals during the session. Even a low frame rate dramatically improves accuracy in noisy audio conditions.

Python

import cv2, base64, json, threading, time

def stream_video_frames(ws):
    cap = cv2.VideoCapture(0)   # 0 = default camera
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Encode frame as JPEG, then base64
        ok, buf = cv2.imencode(".jpg", frame)
        if not ok:
            continue   # skip frames that fail to encode
        img_b64 = base64.b64encode(buf.tobytes()).decode()
        ws.send(json.dumps({
            "type": "input_image_buffer.append",
            "image": img_b64
        }))
        time.sleep(0.5)   # ~2 fps is sufficient
    cap.release()   # free the camera when the capture loop ends

# Run video streaming in a separate thread
threading.Thread(
    target=stream_video_frames,
    args=(ws,), daemon=True
).start()

Vision input is optional but recommended for live human speech scenarios. For pre-recorded audio files without a camera feed, you can omit image frames entirely and rely on audio alone.
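If frames come from a source that produces them faster than you want to send them (for example a 30 fps capture loop), pacing by timestamp instead of sleeping keeps the outgoing rate near a target. The sketch below is one way to do that. `FramePacer` and `image_event` are hypothetical helpers, and the event shape simply mirrors the snippet above.

```python
import base64
import json
from typing import Optional


class FramePacer:
    """Decide which frames to send so the outgoing rate stays near a target."""

    def __init__(self, target_fps: float = 2.0) -> None:
        self.min_interval = 1.0 / target_fps
        self._last_sent: Optional[float] = None

    def should_send(self, now: float) -> bool:
        """Return True if enough time has passed since the last sent frame."""
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False


def image_event(jpeg_bytes: bytes) -> str:
    """Wrap JPEG bytes in the image event shape used in the snippet above."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(jpeg_bytes).decode("ascii"),
    })
```

Inside a capture loop you would call `pacer.should_send(time.monotonic())` for each frame and only encode and send the frame when it returns True.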

Step 6 – Domain accuracy
Dynamic keyword configuration
For technical, medical, legal, or brand-specific vocabulary, you can supply a keyword glossary at the start of the session. The model uses this list to significantly improve translation reliability for terms that standard training data may handle inconsistently.

Python

# Add to your session.update payload
session_cfg = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "language": "zh"
        },
        "translation": {
            "target_language": "en"
        },
        # Inject domain keywords here
        "keywords": [
            {"source": "达芬奇机器人",  "target": "da Vinci Surgical System"},
            {"source": "腹腔镜",      "target": "laparoscope"},
            {"source": "实体瘤",      "target": "solid tumor"}
        ]
    }
}
ws.send(json.dumps(session_cfg))

  • Works with brand names, drug names, legal codes, and technical model numbers
  • Keywords are session-scoped and do not persist across connections
  • Keep the list focused – only terms where mistranslation could cause real errors
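If your glossary lives in a plain source-to-target mapping, a small helper can turn it into the keyword list shape shown above while dropping blank entries and keeping the list short. This is a sketch: `build_keywords` and the `max_terms` cap are illustrative assumptions, and the article does not state an actual limit on glossary size.

```python
from typing import Dict, List


def build_keywords(glossary: Dict[str, str], max_terms: int = 50) -> List[dict]:
    """Turn a source->target mapping into the keyword list shape shown above."""
    keywords = []
    for source, target in glossary.items():
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # skip incomplete entries rather than send blanks
        keywords.append({"source": source, "target": target})
    if len(keywords) > max_terms:
        # max_terms is a hypothetical guard to keep the list focused
        raise ValueError(f"glossary has {len(keywords)} terms; limit is {max_terms}")
    return keywords
```

The result can be assigned to the `"keywords"` field of the `session.update` payload.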
Reference
Supported languages
Qwen3.5-LiveTranslate-Flash understands 60 input languages and can produce speech output in 29 languages. The pills highlighted below are confirmed speech output languages. All pills represent supported inputs.

Chinese
English
French
German
Spanish
Japanese
Korean
Russian
Portuguese
Italian
Arabic
Hindi
Turkish
Indonesian
Thai
Vietnamese
Greek
Mandarin
Cantonese
Wu dialect
Sichuanese
Tianjin dialect
Beijing dialect
+ 37 others
Highlighted pills have confirmed speech (audio) output support. Regular pills are input-only or not confirmed for audio output. Check your target language pair in the Alibaba Cloud Model Studio documentation before building audio output pipelines.

The model supports text output for all 60 input languages. Speech output is available for only 29 languages. If your pipeline requires voice delivery and your target language is not on the confirmed list, plan a fallback text-to-speech (TTS) step.
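The fallback decision above can be made explicit in code before a session starts. In the sketch below, `CONFIRMED_SPEECH_OUTPUT` is a placeholder set, not the official list, and should be filled in from the Model Studio documentation. The function name and return labels are likewise illustrative.

```python
from typing import FrozenSet

# Placeholder values only: replace with the official list of speech output languages.
CONFIRMED_SPEECH_OUTPUT: FrozenSet[str] = frozenset({"en", "zh"})


def plan_output(target_language: str, need_voice: bool) -> str:
    """Choose an output path for a target language code."""
    if not need_voice:
        return "text"
    if target_language in CONFIRMED_SPEECH_OUTPUT:
        return "native_speech"
    # Voice is required but not confirmed: take text output and synthesize it separately.
    return "text_plus_external_tts"
```

Making the choice up front means a missing speech language shows up as a routing decision, not as silent output at runtime.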

Key takeaways

  • Qwen3.5-LiveTranslate-Flash provides real-time multimodal translation across 60 input languages and 29 speech output languages with a latency of 2.8 seconds.
  • The model uses vision-enhanced understanding (reading lip movements, gestures, and on-screen text) to maintain accuracy in noisy or degraded audio environments.
  • Real-time voice cloning replicates the original speaker's voice profile in the translated output from just one spoken sentence.
  • Semantic unit prediction via read-unit processing enables continuous streaming output without waiting for complete sentences, reducing latency to 2.8 seconds.
  • Dynamic keyword configuration lets developers inject domain-specific glossaries at runtime, improving translation reliability for technical, medical, and legal terms.
