One model, three tasks: ByteDance releases Lance for understanding, generating, and editing images and video


Building a single model that can both understand and generate images and videos is harder than it seems. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features that are closely aligned with language. Generation needs continuous low-level representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by splitting the two into separate components and then linking them together.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across image and video modalities, co-trained from the beginning.

https://arxiv.org/pdf/2605.18678

What can Lance do?

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and video (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including consistent multi-turn editing across both modalities.

This breadth of capability is notable. While typical unified models usually stop at basic image understanding and text-to-image generation, Lance is among the few that natively span the full range of image and video tasks across understanding and generation.


How the architecture works

The architecture is based on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs (text, images, and videos) into a single shared, interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All of these heterogeneous token types (textual, semantic-visual, and latent-visual) live in the same sequence. The model then applies generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
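The mixed attention pattern can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not Lance's implementation: it assumes every token may attend to all earlier tokens, and that tokens inside the same visual block may additionally attend to each other in both directions. The function name `build_attention_mask` and the segment format are hypothetical.

```python
from typing import List, Tuple


def build_attention_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    segments: ordered (kind, length) pairs, with kind in {"text", "visual"}.
    Assumption: causal attention everywhere, plus full bidirectional
    attention inside each individual visual segment.
    """
    seg_ids: List[int] = []
    kinds: List[str] = []
    for idx, (kind, length) in enumerate(segments):
        if kind not in ("text", "visual"):
            raise ValueError(f"unknown segment kind: {kind}")
        seg_ids.extend([idx] * length)
        kinds.extend([kind] * length)

    n = len(seg_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j <= i:
                # Causal part: every token sees itself and everything before it.
                mask[i][j] = True
            elif kinds[i] == "visual" and seg_ids[i] == seg_ids[j]:
                # Bidirectional part: later tokens of the same visual block.
                mask[i][j] = True
    return mask


# Example layout: 2 text tokens, a 3-token visual block, then 1 text token.
example_mask = build_attention_mask([("text", 2), ("visual", 3), ("text", 1)])
```

In this layout the three visual tokens see each other fully, while the text tokens before and after them only look backward.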

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLMUND) handles text and semantic visual tokens and produces output for multimodal reasoning and text generation. The generation expert (LLMGEN) handles latent VAE tokens for visual synthesis and editing. Importantly, both experts operate on the same shared interleaved sequence: they share context, but they do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
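How the two objectives might be combined can be sketched in a few lines of plain Python. This is an illustration only: the linear-path velocity target (data minus noise) is one common flow matching convention and is assumed here, and the function names and default weights are hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence


def next_token_loss(logits: Sequence[float], target: int) -> float:
    """Cross-entropy for a single position, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def flow_matching_loss(
    pred_velocity: Sequence[float],
    noise: Sequence[float],
    data: Sequence[float],
) -> float:
    """Mean squared error against a velocity target.

    Assumption: linear interpolation path between noise and data, so the
    target velocity is (data - noise). Other conventions exist.
    """
    target = [d - n for n, d in zip(noise, data)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)


def combined_loss(
    und_loss: float, gen_loss: float, w_und: float = 1.0, w_gen: float = 1.0
) -> float:
    """Weighted sum of the understanding and generation losses."""
    return w_und * und_loss + w_gen * gen_loss


# Toy values: two-way uniform logits and a perfectly predicted velocity.
und = next_token_loss([0.0, 0.0], target=0)
gen = flow_matching_loss([1.0, 1.0], noise=[0.0, 0.0], data=[1.0, 1.0])
total = combined_loss(und, gen, w_und=1.0, w_gen=0.5)
```

Because the weights are plain scalars, the balance between the two experts' objectives can be tuned per training stage without touching either loss.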

Modality-aware rotary position encoding (MaPE)

Running ViT tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone and has no way to tell these token groups apart. When multiple groups of visual tokens occupy the same sequence, their positional boundaries become ambiguous, which can harm cross-task alignment.

Lance introduces modality-aware rotary position encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates remain unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate token groups in the global positional space without disturbing the temporal ordering within any individual video.
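The offset idea can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the offset for a group is its index multiplied by a constant, and the function name `mape_positions` and the value of `temporal_offset` are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index fed to a 3D rotary encoding


def mape_positions(
    groups: List[Tuple[int, int, int]], temporal_offset: int = 1000
) -> List[List[Position]]:
    """Assign (t, h, w) positions to each visual token group.

    groups: (frames, height, width) for each group, in sequence order.
    Assumption: group g has its temporal index shifted by g * temporal_offset,
    while the h and w indices are left untouched.
    """
    all_positions: List[List[Position]] = []
    for g, (frames, height, width) in enumerate(groups):
        shift = g * temporal_offset
        all_positions.append(
            [
                (t + shift, h, w)
                for t in range(frames)
                for h in range(height)
                for w in range(width)
            ]
        )
    return all_positions


# Example: a single-frame condition image followed by a two-frame target clip.
positions = mape_positions([(1, 2, 2), (2, 2, 2)], temporal_offset=100)
```

Without the shift, the first frame of both groups would share identical (t, h, w) positions. With it, the groups occupy disjoint temporal ranges, provided the offset is larger than any group's frame count, and frame order inside each group is unchanged.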

Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent decline across generation, editing, and understanding.

Training: four stages, one unified framework

Lance is trained through four successive stages, each building on the last.

Pre-training (PT) lays the foundation using approximately 1 billion image-text pairs and 140 million video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generative capacity. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continued Training (CT) expands the task space by introducing multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) across nearly 300 billion tokens. A progressive data-mixing schedule gradually increases the proportion of more difficult tasks, such as editing, as training continues.
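One simple way to picture a progressive mixing schedule is a linear interpolation between a starting and an ending mixture. The sketch below is purely illustrative: the linear shape, the task names, and the ratios are all assumptions, since the article does not specify the actual schedule.

```python
from typing import Dict


def mixture_weights(
    step: int,
    total_steps: int,
    start: Dict[str, float],
    end: Dict[str, float],
) -> Dict[str, float]:
    """Linearly interpolate sampling weights from `start` to `end`.

    Assumption: a linear ramp over training; weights are renormalized
    so they always sum to 1.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if start.keys() != end.keys():
        raise ValueError("start and end must cover the same tasks")
    frac = min(max(step / total_steps, 0.0), 1.0)
    raw = {k: (1.0 - frac) * start[k] + frac * end[k] for k in start}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}


# Hypothetical ratios: editing grows from 20% to 50% of sampled data.
start_mix = {"text_to_image": 0.8, "editing": 0.2}
end_mix = {"text_to_image": 0.5, "editing": 0.5}
halfway = mixture_weights(500, 1000, start_mix, end_mix)
```

A data loader could call `mixture_weights` at each step and sample the next batch's task according to the returned probabilities.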

Supervised Fine-Tuning (SFT) emphasizes instruction following, editing accuracy, and identity consistency, using high-quality curated data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR as the reward model, to improve text rendering accuracy and image-text alignment.
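The core of GRPO is that each sample's advantage is computed relative to the other samples generated for the same prompt, rather than from a learned value function. The sketch below shows only that normalization step. The reward values are made up, and how Lance turns PaddleOCR output into a scalar reward is not described here, so the rewards are simply assumed to be given as floats.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here; some implementations
    use the sample standard deviation instead.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical OCR-based rewards for four images generated from one prompt,
# e.g. 1.0 when the rendered text was read back correctly and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Samples that score above their group's mean receive positive advantages and are reinforced, while those below the mean are pushed down. If every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.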

Everything fits within a maximum training budget of 128 GPUs.

Results

Image generation. On GenEval, Lance scored 0.90 overall, tying with TUNA for first place among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial positioning (0.87). On DPG-Bench, Lance scored 84.67 overall, with particularly strong relation modeling, although TUNA (86.76) and TUNA-2 (86.54) top this benchmark. To put parameter efficiency into perspective: Janus-Pro-7B scored 0.80 on GenEval, and Show-o2 (7B) scored 0.76. Lance matches the best unified-model result with 3B active parameters.

Video generation. On VBench, Lance achieved an overall score of 85.11 (using LLM rewriting), the highest among unified models. The next best unified model, TUNA, scored 84.06. Lance also outperforms generation-only models, including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image editing. On GEdit-Bench, Lance reaches an average G_O score of 7.30, the highest among unified models. It handles background change, texture adjustment, motion change, image beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video understanding. On MVBench, Lance achieved an overall score of 62.0, the highest among unified models. Show-o2 (7B), the second-best unified model, scored 55.7. Lance also outperforms many understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.


Key takeaways

  1. Lance is a unified multimodal model with 3B active parameters that handles understanding, generating, and editing images and video within a single jointly trained framework.
  2. Its dual-stream mixture-of-experts architecture with modality-aware rotary position encoding (MaPE) separates the understanding and generation pathways while keeping them within a shared, interleaved multimodal context.
  3. Lance scored 0.90 on GenEval and 85.11 on VBench, the highest overall scores among unified models, and was trained within a maximum budget of 128 GPUs.
  4. On MVBench, Lance scored 62.0, the highest among unified models, beating Show-o2 (7B) at 55.7 while also supporting generation and editing.
  5. Lance is open-sourced under Apache 2.0, with weights available on Hugging Face.
