ByteDance Debuts Lance, a 3B-Parameter Multimodal Model for Image and Video Understanding, Generation and Editing

News

5/21/2026, 8:50:18 AM

ByteDance Debuts Lance, a 3B-Parameter Multimodal Model for Image and Video Understanding, Generation and Editing

ByteDance’s research team has released Lance, a 3B-parameter multimodal model that jointly trains understanding, generation, and editing for both images and videos rather than stitching separate subsystems together. The model is presented as open source and is initialized from a Qwen2.5 — VL 3B backbone, aiming to cover high-level semantic tasks and low-level continuous visual synthesis within one compact architecture. This unified approach reduces engineering bridges between perception and synthesis and gives developers configurable trade — offs between reasoning and generation.

Lance exposes three output families — X2T (text), X2I (images) and X2V (videos)—and supports a wide set of capabilities. For understanding it handles image/video captioning, visual question answering, OCR, visual grounding and multimodal reasoning; for generation it covers text-to-image, text-to-video, image — to-video and subject — driven generation; and it also provides image and video editing with multi — turn consistency across modalities.

At the input level Lance converts text, images and videos into a single interleaved multimodal sequence. Text tokens use Qwen2.5 — VL embeddings; understanding — oriented visual inputs are processed by a Qwen2.5 — VL ViT encoder into compact semantic visual tokens; and generation — oriented inputs are encoded by a Wan2.2 3D causal VAE into continuous latent tokens with 16× spatial and 4× temporal downsampling. The model applies a generalized 3D causal attention mechanism across the combined context, using causal attention for text tokens and bidirectional attention for visual tokens.

To separate roles without parameter conflict, Lance employs a dual-stream mixture — of-experts initialized from Qwen2.5 — VL 3B: an understanding expert (LLMUND) that operates on text and semantic visual tokens, and a generation expert (LLMGEN) that operates on VAE latent tokens. Training optimizes the two streams with different objectives — next-token prediction for LLMUND and a flow-matching objective in continuous latent space for LLMGEN-with those losses combined under configurable weights throughout training.

Lance introduces Modality — Aware Rotary Positional Encoding (MaPE) to disambiguate co-located token groups: MaPE applies a fixed temporal offset per modality group while preserving spatial coordinates so intra — frame layout remains intact. Ablations reported in the paper show removing MaPE lowers GenEval from 80.94 to 80.56 and drops GEdit — Bench from 6.86 to 6.30, indicating that this positional separation materially improves cross — task alignment and editing quality.

For builders, Lance’s single — sequence design and decoupled experts reduce the need for bespoke bridges between understanding and synthesis pipelines and offer concrete levers to tune behavior. The combination of joint training, latent VAE conditioning with explicit spatiotemporal downsampling, and the MaPE mechanism provide knobs to balance multimodal consistency, editing fidelity, and runtime or resource trade — offs when integrating image/video understanding and creative workflows.

Sources

MarkTechPost AI · 5/21/2026

Replies (0)

No replies in this topic yet.

Back