One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

ByteDance introduces Lance, a new AI model for image and video understanding, generation, and editing. It handles these tasks within a single architecture and, on the reported benchmarks, outperforms many existing unified models while activating fewer parameters than comparable models for image generation. Lance achieves this by treating all inputs as a shared multimodal sequence and employing a dual-stream mixture-of-experts approach.
ByteDance has introduced Lance, an AI model designed to integrate the understanding, generation, and editing of both images and videos. Unlike previous approaches that split these tasks across distinct architectures, Lance uses a single, unified framework trained jointly from the outset. This design lets it handle a wide range of visual tasks, from captioning and visual question answering to text-to-video generation and intricate editing.
The model's architecture is built on two core principles: unified context modeling and decoupled capability pathways. All inputs (text, images, and videos) are converted into a single, shared multimodal sequence, so the model processes every data type in one context. Lance also employs a dual-stream mixture-of-experts system that handles understanding and generation through separate pathways, so the two tasks share context without competing for the same parameters.
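The two principles above can be illustrated with a toy sketch. It is not Lance's implementation: the `Token` fields, the `stream` label, the averaging stand-in for attention, and the two scalar "experts" are all hypothetical, chosen only to show one shared sequence feeding two separate parameter pathways.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Token:
    modality: str  # "text", "image", or "video"
    stream: str    # "understand" or "generate" (hypothetical routing label)
    value: float   # scalar stand-in for an embedding vector


def build_sequence(segments: List[Tuple[str, str, List[float]]]) -> List[Token]:
    """Flatten (modality, stream, values) segments into one shared sequence."""
    return [Token(m, s, v) for m, s, values in segments for v in values]


def shared_context(seq: List[Token]) -> float:
    """Stand-in for shared attention: every token sees the mean of all tokens."""
    return sum(t.value for t in seq) / len(seq)


# Two separate "experts" with their own (toy) parameters. Assumption: routing
# is decided by the token's stream, not by a learned gate.
EXPERTS: Dict[str, Callable[[float, float], float]] = {
    "understand": lambda x, ctx: x + ctx,
    "generate": lambda x, ctx: 2.0 * x + ctx,
}


def forward(seq: List[Token]) -> List[float]:
    """Compute the context once over all tokens, then route each token."""
    ctx = shared_context(seq)
    return [EXPERTS[t.stream](t.value, ctx) for t in seq]


if __name__ == "__main__":
    seq = build_sequence([
        ("text", "understand", [1.0, 2.0]),
        ("image", "generate", [3.0]),
    ])
    print(forward(seq))
```

In this sketch the context is computed over text and image tokens together, while each stream applies its own function, which is the sense in which the pathways share context but not parameters.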
Lance outperforms many existing unified models across the reported benchmarks. For image generation, it matches top scores on GenEval and performs strongly on DPG-Bench, with significantly fewer activated parameters than comparable models. In video generation, Lance leads all unified models on VBench and surpasses several dedicated generation-only models.
To handle diverse token types within a single sequence, Lance incorporates Modality-Aware Rotary Positional Encoding (MaPE). This mechanism keeps positional encoding unambiguous across modalities and improves cross-task alignment. The model is trained in four stages: pre-training, continual training, supervised fine-tuning, and reinforcement learning, which refine its instruction following and content generation.
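The article does not describe how MaPE assigns positions, so the following is only a sketch of one plausible reading: standard rotary encoding applied to positions that restart within each modality segment and are shifted by a per-modality offset. The `MODALITY_OFFSET` values and the `modality_aware_positions` helper are assumptions made for illustration, not details from ByteDance.

```python
import math
from typing import Dict, List

# Hypothetical offsets that keep each modality in its own position range.
MODALITY_OFFSET: Dict[str, int] = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Standard rotary encoding: rotate each (even, odd) pair of dimensions
    by an angle of position * base ** (-i / d), where i is the even index."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_positions(modalities: List[str]) -> List[int]:
    """Assumed scheme: local index within each contiguous modality run,
    plus that modality's offset, so runs never share a position."""
    positions: List[int] = []
    local, prev = 0, None
    for m in modalities:
        local = local + 1 if m == prev else 0
        positions.append(MODALITY_OFFSET[m] + local)
        prev = m
    return positions


if __name__ == "__main__":
    mods = ["text", "text", "image", "image", "video"]
    for m, p in zip(mods, modality_aware_positions(mods)):
        print(m, p, rope_rotate([1.0, 0.0, 1.0, 0.0], p))
```

Under this assumed scheme, the first text token and the first image token no longer share position 0, which is one way to avoid the ambiguity the article mentions.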
