
Tsinghua University researchers have released WorldReasonBench, a new benchmark showing that modern AI video generators can render visually convincing clips yet repeatedly fail to model real‑world cause‑and‑effect. The gap matters because generators that look realistic can still produce implausible or misleading sequences when used for tasks that require physical, social or informational fidelity.
WorldReasonBench contains roughly 400 test cases organized across four reasoning areas: world knowledge (physics, weather, cultural norms), human‑centered scenes (object handling, social interaction), logical reasoning (math, geometry, experiments), and information‑based tasks (reading data and diagrams). The suite is further divided into 22 subcategories and is paired with WorldRewardBench, a preference dataset of about 6,000 human‑annotated video comparisons intended to support broad, reproducible evaluation.
Evaluation uses a two‑stage scoring pipeline. A process‑aware method asks structured questions to verify whether a generated video reaches a correct end state in a plausible way; that is followed by human ratings that judge reasoning quality, temporal consistency and visual aesthetics. The authors released the task catalog and the WorldRewardBench comparisons alongside their paper to enable reproducible assessments.
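The process-aware stage can be pictured as a checklist of structured yes/no questions per clip, some about the final outcome and some about how the clip gets there. The following is a minimal sketch under that assumption, not the authors' implementation: the `Check` structure, the `process_aware_score` function, the aggregation rule and the example questions are all hypothetical, and the article does not specify how the benchmark actually combines answers.

```python
from dataclasses import dataclass

@dataclass
class Check:
    question: str       # structured yes/no question about the clip
    is_end_state: bool  # True if the question verifies the final outcome
    passed: bool        # answer supplied by a judge (human or model)

def process_aware_score(checks):
    """Return (end_state_ok, process_fraction, strict_pass) for one video.

    Assumed rule: a clip passes strictly only if every end-state check
    and every process check is answered "yes".
    """
    end = [c for c in checks if c.is_end_state]
    proc = [c for c in checks if not c.is_end_state]
    end_ok = bool(end) and all(c.passed for c in end)
    proc_frac = sum(c.passed for c in proc) / len(proc) if proc else 1.0
    strict = end_ok and proc_frac == 1.0
    return end_ok, proc_frac, strict

# Hypothetical checklist for a "ball rolls off a table" prompt.
checks = [
    Check("Does the ball end up on the floor?", True, True),
    Check("Does the ball fall downward rather than float?", False, True),
    Check("Does the ball keep its shape during the fall?", False, False),
]
result = process_aware_score(checks)
print(result)
```

In this example the clip reaches the right end state but fails one of the two process checks, so it would not count as a strict pass. That is the kind of case a process-aware method is meant to catch and an end-state-only check would miss.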
The team benchmarked five closed‑source commercial systems: Sora 2, Kling, Wan 2.6, Seedance 2.0 and Veo 3.1‑Fast. It also tested six open‑source models: LTX 2.3, Wan 2.2 (14B), UniVideo, HunyuanVideo 1.5, Cosmos‑Predict 2.5 and LongCat‑Video. On the study’s core reasoning metric, commercial generators scored roughly twice as high as the open‑source models, with no statistical overlap between the two groups.
ByteDance’s Seedance 2.0 finished first in nearly nine out of ten statistical re‑runs and also won the overall human ratings. Veo 3.1‑Fast performed best on world knowledge tasks, while Sora 2 led on human‑centered scenes. Despite those rank differences, every tested system shared the same dominant failure modes: incorrect physics in motion, broken causal chains across multi‑step scenarios, and errors when preserving textual or numeric information.
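A figure like "first in nearly nine out of ten re‑runs" is the kind of number a resampling analysis produces. The sketch below assumes a bootstrap over test cases: tasks are resampled with replacement and the model with the highest mean score on each resample is counted as the winner. The article does not say which procedure the authors used, and the function name, model names and scores here are made up for illustration.

```python
import random

def first_place_rate(scores, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which each model ranks first.

    scores maps a model name to its list of per-task scores; all lists
    must have the same length and the same task order.
    """
    rng = random.Random(seed)
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    wins = {m: 0 for m in models}
    for _ in range(n_resamples):
        # Draw task indices with replacement, same draw for every model.
        idx = [rng.randrange(n_tasks) for _ in range(n_tasks)]
        means = {m: sum(scores[m][i] for i in idx) / n_tasks for m in models}
        wins[max(means, key=means.get)] += 1
    return {m: wins[m] / n_resamples for m in models}

# Hypothetical per-task scores, not results from the benchmark.
scores = {
    "model_a": [0.9, 0.8, 0.7, 0.9],
    "model_b": [0.2, 0.3, 0.1, 0.2],
}
rates = first_place_rate(scores)
print(rates)
```

With real data the rates would rarely be all‑or‑nothing. A model that wins about 90% of resamples is a clear but not unanimous leader, which matches how the article describes Seedance 2.0.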
Logical reasoning proved the weakest dimension for all models; information‑based tasks were the next most difficult, especially when transitions needed to be physically grounded or when text and numbers had to be preserved exactly. The study also finds that commercial models score higher on dynamic, process‑based correctness, whereas open‑weight models show their largest gains when given detailed, step‑by‑step prompts. That pattern points to prompt dependence rather than robust, inherent world modeling.