
Apple Machine Learning Research has announced STARFlow-V, a normalizing flow-based model for end-to-end video generation, with a paper slated for publication at CVPR in April 2026. Normalizing flows (NFs), which are end-to-end likelihood-based generative models for continuous data, have recently seen encouraging progress in image generation. With STARFlow-V, the researchers revisit this design space for video generation, using a global-local architecture that operates in a spatiotemporal latent space.
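To make "likelihood-based" concrete: a normalizing flow maps data x to a latent z = f(x) through an invertible transform, and the change-of-variables rule gives the exact log-density as log p(x) = log p_Z(f(x)) + log |det df/dx|. The following is a minimal sketch of that rule for a single elementwise affine layer with a standard normal base distribution. The function names and the fixed shift and scale values are illustrative assumptions, not part of STARFlow-V, whose transforms are learned and far larger.

```python
import math

def affine_flow_log_prob(x, shift, log_scale):
    """Exact log p(x) under z = (x - shift) * exp(-log_scale), with z ~ N(0, I).

    Illustrative single-layer flow (assumed for this sketch, not the paper's model).
    """
    # Forward pass: data -> latent.
    z = [(xi - m) * math.exp(-s) for xi, m, s in zip(x, shift, log_scale)]
    # Log-density of the latent under a standard normal base distribution.
    log_base = sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)
    # Log |det dz/dx| of the elementwise map is minus the sum of the log-scales.
    log_det = -sum(log_scale)
    return log_base + log_det

def invert(z, shift, log_scale):
    """Map a latent back to data space; exact because the transform is invertible."""
    return [m + math.exp(s) * zi for zi, m, s in zip(z, shift, log_scale)]

x = [1.0, -2.0]
shift = [1.0, -2.0]
log_scale = [0.5, -0.5]
log_p = affine_flow_log_prob(x, shift, log_scale)
x_back = invert([0.0, 0.0], shift, log_scale)
```

The same invertible map serves both directions: the forward pass scores data with an exact likelihood, and the inverse pass turns sampled latents into data.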
The introduction of STARFlow-V marks a notable moment for normalizing flows in video generation, an area where diffusion models have almost exclusively held the state-of-the-art position because of the spatiotemporal complexity and computational demands of video. STARFlow-V offers several benefits, including end-to-end learning, robust causal prediction, and native likelihood estimation. This development challenges the prevalent reliance on diffusion models by demonstrating a viable alternative for generating dynamic visual content.
Building upon the recently proposed STARFlow, an architecture for scaling latent normalizing flows to high-resolution image synthesis, STARFlow-V introduces further innovations specific to video. Central among them is a technique called flow-score matching, which equips the model with a lightweight causal denoiser. This denoiser improves the consistency of video generated in an autoregressive fashion, yielding smoother and more coherent temporal sequences. To improve sampling efficiency, STARFlow-V also employs a video-aware Jacobi iteration scheme. This scheme recasts inner updates as parallelizable iterations, allowing faster sampling without breaking causality within the generated video.
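The general idea behind Jacobi-style sampling can be sketched on a toy autoregressive flow. Sampling normally inverts the flow one position at a time, because each output depends on the outputs before it. A Jacobi iteration instead updates every position at once from the previous iterate, and the causal (triangular) dependency guarantees that after k sweeps the first k positions are exact. The conditioner below is a hypothetical stand-in chosen for illustration; this is not the video-aware scheme from the paper, whose details are not given here.

```python
import math

def conditioner(prefix):
    """Toy causal conditioner (hypothetical): shift and scale from past values only."""
    c = sum(prefix) / len(prefix) if prefix else 0.0
    return 0.5 * c, math.exp(0.1 * math.tanh(c))

def sequential_inverse(z):
    """Reference sampler: invert the flow one position at a time."""
    x = []
    for t in range(len(z)):
        mu, sigma = conditioner(x[:t])
        x.append(mu + sigma * z[t])
    return x

def jacobi_inverse(z, tol=1e-10, max_iters=None):
    """Fixed-point sampler: update every position from the previous iterate."""
    T = len(z)
    max_iters = T if max_iters is None else max_iters
    x = [0.0] * T
    for k in range(1, max_iters + 1):
        # Each entry reads only the old iterate x, so the T updates are
        # independent of one another and could run in parallel on real hardware.
        new_x = []
        for t in range(T):
            mu, sigma = conditioner(x[:t])
            new_x.append(mu + sigma * z[t])
        delta = max((abs(a - b) for a, b in zip(new_x, x)), default=0.0)
        x = new_x
        if delta < tol:
            return x, k  # iterate stopped changing: fixed point reached
    return x, max_iters

z = [0.3, -1.2, 0.8, 0.1, -0.5, 1.4]
x_seq = sequential_inverse(z)
x_jac, iters = jacobi_inverse(z)
```

In this sketch the Jacobi sampler reproduces the sequential result in at most as many sweeps as there are positions. The practical gain comes when the iterate converges in far fewer sweeps than that, since each sweep is a single parallel pass.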
A key advantage of STARFlow-V's invertible structure is its native support for a range of generation tasks. The same model can handle text-to-video, image-to-video, and video-to-video generation, making it versatile across creative and analytical applications. Empirically, STARFlow-V has demonstrated strong visual fidelity and temporal consistency, with practical sampling throughput relative to diffusion-based baselines. These results highlight its potential for real-world use in a computationally intensive field.
The empirical success of STARFlow-V provides what the researchers describe as the first substantial evidence that normalizing flows are indeed capable of high-quality autoregressive video generation. This achievement establishes normalizing flows as a promising direction for fundamental research, particularly in the ambitious pursuit of building world models capable of understanding and simulating complex dynamic environments. The work represents a significant step forward in generative AI, potentially broadening the architectural landscape for future developments in video synthesis.
This groundbreaking research was led by a team of distinguished scientists including Jiatao Gu from the University of Pennsylvania, Ying Shen from the University of Illinois Urbana-Champaign, and Apple researchers Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Ángel Bautista, David Berthelot, Josh Susskind, and Shuangfei Zhai. Their collective expertise has culminated in a model that not only pushes the boundaries of video generation but also re-evaluates the efficacy of normalizing flows in complex spatiotemporal domains, building upon earlier work such as STARFlow, which was presented at NeurIPS on June 30, 2025.