ByteDance and HKUST show QA supervision outperforms transcription for long‑document multimodal models

5/24/2026, 2:17:00 PM

ByteDance Seed and HKUST introduce MMProLong, a Qwen2.5‑VL-based multimodal model trained with automatically generated question-answer supervision on long, image‑heavy documents.

ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) published a joint study on May 24, 2026, reporting that MMProLong, a multimodal long‑context model, learns to find and extract relevant passages more effectively when trained on question-answer examples than when trained to transcribe pages. The result matters because robust passage localization within document‑length contexts is the core barrier to scaling multimodal models to collections of pages or frames, and the team’s results point to lower‑cost supervision strategies that improve that ability.

MMProLong uses Alibaba’s open Qwen2.5‑VL as its base and was trained with a pipeline that combines OCR parsing, automatic question generation using ByteDance’s Seed 2.0, and re‑embedding to produce long‑context training examples. In controlled comparisons the researchers contrasted two supervision regimes: having the model transcribe full or partial pages, versus inserting a generated question alongside the full document so the model must locate the answer within the long context. The question-answer regime came out ahead, and in some cases pure OCR transcription training actually degraded long‑document performance.
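The difference between the two supervision regimes can be sketched as two ways of turning the same pages into a training example. This is a minimal illustration, not the authors' pipeline: the `Page` and `Example` types, the function names, and the instruction string are all hypothetical, and the OCR text and the generated question are assumed to be supplied by upstream tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    image_ref: str  # hypothetical handle for the page image given to the model
    ocr_text: str   # text recovered for this page by an OCR parser


@dataclass
class Example:
    inputs: List[str]  # ordered model inputs: page images, then a text prompt
    target: str        # text the model is trained to produce


def transcription_example(pages: List[Page]) -> Example:
    """Transcription regime: the target is the OCR text of the pages."""
    inputs = [p.image_ref for p in pages] + ["Transcribe the pages above."]
    target = "\n".join(p.ocr_text for p in pages)
    return Example(inputs, target)


def qa_example(pages: List[Page], question: str, answer: str) -> Example:
    """QA regime: the full document plus a question; the target is the answer.

    The model is never told which page holds the answer, so it has to
    locate the relevant passage inside the long context itself.
    """
    inputs = [p.image_ref for p in pages] + [question]
    return Example(inputs, answer)
```

In this framing the transcription target grows with the document, while the QA target stays short and depends on a single passage, which is one way to see why the QA format exercises localization rather than copying.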

The study highlights a practical training economy: despite a relatively small training context of 128,000 tokens, MMProLong remained stable when evaluated at much longer input lengths (256,000 and 512,000 tokens) and outperformed larger open models such as InternVL3‑38B and Gemma3‑27B on long‑document tasks. The authors set this finding against a broader trend in multimodal work, where labs tout million‑token context windows but publish little about the supervision needed to teach models how to search and extract information inside those windows.

Three empirical lessons emerge. First, mixing shorter and longer documents during training outperformed a regime focused only on the maximum context length: long‑context ability depends on flexible search across distances, not exclusively on exposure to very long examples. Second, the principal bottleneck for long documents is locating the relevant passage rather than the downstream reasoning once that passage is available; tasks weighted toward extraction produced the largest gains. Third, MMProLong retained short‑task capability without explicit short examples, so long as the task format remained question-answer rather than transcription.
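The first lesson, training on a mixture of document lengths instead of only the maximum, can be sketched as a sampler that draws a target context length per example. The bucket sizes, the uniform weights, and both function names are assumptions made for illustration; the article reports only the 128,000-token upper bound, not the actual mixture.

```python
import random
from typing import List, Sequence


def sample_context_lengths(
    n: int,
    buckets: Sequence[int] = (8_000, 32_000, 64_000, 128_000),  # assumed values
    weights: Sequence[float] = (1, 1, 1, 1),                    # assumed uniform
    seed: int = 0,
) -> List[int]:
    """Draw a target context length for each of n training examples."""
    rng = random.Random(seed)  # local generator keeps sampling reproducible
    return rng.choices(buckets, weights=weights, k=n)


def pages_within_budget(page_token_counts: Sequence[int], budget: int) -> int:
    """Return how many leading pages fit inside a token budget."""
    total = 0
    kept = 0
    for count in page_token_counts:
        if total + count > budget:
            break
        total += count
        kept += 1
    return kept
```

A length-only-at-maximum baseline corresponds to `buckets=(128_000,)`; the mixed regime simply widens that tuple so the distance between a question and its answer varies across examples.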

For builders, the implications are concrete: prioritize supervision that forces retrieval inside large contexts (QA pairs presented with the full document as context), invest in passage retrieval and re‑embedding pipelines, and train on a diversified mixture of document lengths rather than only very long examples. The authors suggest this route offers a cost‑efficient path to robust long‑context performance without relying solely on larger model sizes or exhaustive page‑level transcription.

Sources

The Decoder AI · 5/24/2026
