
A May 2026 paper presents BalCapRL, a reinforcement‑learning framework that jointly optimizes utility‑aware correctness, reference coverage, and linguistic quality to improve multimodal large language model image captioning across LLaVA and Qwen variants.
A May 2026 paper by Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo and Peter Grasch introduces BalCapRL, a reinforcement‑learning framework designed to address trade‑offs in open‑ended image captioning and raise captioning quality on multimodal large language models. The authors report that BalCapRL produces consistent gains on LLaVA‑1.5‑7B and on Qwen2.5‑VL in both 3B and 7B variants, signaling improvements for builders who fine‑tune MLLMs for captioning or downstream multimodal tasks.
BalCapRL treats captioning as a continuous multi‑objective reward problem that jointly optimizes three dimensions: utility‑aware correctness, reference coverage, and linguistic quality. Framing these targets together aims to avoid the single‑objective pitfalls that can skew models toward overfitting one notion of quality at the expense of others. To make optimization tractable for continuous‑valued caption rewards, the authors adopt a GDPO‑style reward‑decoupled normalization and add a length‑conditional reward masking mechanism to impose a more appropriate length penalty during RL fine‑tuning. Those two algorithmic steps are presented as the practical core of BalCapRL’s approach to balancing competing reward signals.
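The two steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "reward‑decoupled normalization" means standardizing each reward dimension separately within a group of captions sampled for the same image before summing, and that the length term is masked to zero for captions within a token budget. The function names, the `budget` parameter, the equal weighting, and the linear penalty shape are all hypothetical.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Z-score one reward dimension within a single sampling group."""
    mu = mean(values)
    sigma = pstdev(values)
    # eps keeps the division safe when every caption gets the same score.
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(rewards_by_dim):
    """Normalize each reward dimension separately, then sum per caption.

    rewards_by_dim maps a dimension name to the per-caption rewards for one
    group. Normalizing before summing keeps a dimension with a large numeric
    range from dominating the others (assumed reading of "reward-decoupled").
    """
    normed = [normalize(vals) for vals in rewards_by_dim.values()]
    return [sum(per_caption) for per_caption in zip(*normed)]


def length_masked_penalty(lengths, budget, scale=1.0):
    """Hypothetical length-conditional mask.

    Captions at or under `budget` tokens receive no penalty; longer ones are
    penalized in proportion to how far they overshoot the budget.
    """
    return [
        -scale * (n - budget) / budget if n > budget else 0.0
        for n in lengths
    ]


if __name__ == "__main__":
    # Four captions sampled for one image; the scores are made up.
    group = {
        "correctness": [0.2, 0.8, 0.5, 0.5],
        "coverage": [10.0, 30.0, 20.0, 20.0],  # deliberately on a larger scale
        "quality": [0.9, 0.9, 0.9, 0.9],       # no variance, contributes zero
    }
    advantages = decoupled_advantages(group)
    penalties = length_masked_penalty([80, 260, 120, 100], budget=200)
    print([a + p for a, p in zip(advantages, penalties)])
```

In this sketch the coverage scores span a range tens of times wider than the correctness scores, yet both contribute equally after per‑dimension normalization. Only the second caption, which exceeds the assumed 200‑token budget, receives a length penalty.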
The method was evaluated across multiple base MLLMs. Relative to baseline RL or supervised methods, the paper reports peak improvements of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena, with gains observed across different model sizes and families. These results suggest that balancing the three reward dimensions can yield sizable metric‑level improvements on existing LLaVA and Qwen variants.
The authors position BalCapRL against recent RL and evaluation work that emphasizes narrower notions of quality and exposes common failure modes. They note that utility‑oriented objectives may push models toward hallucination or overly long outputs that help downstream QA but harm fluency, while arena‑style objectives can produce fluent yet generic descriptions. BalCapRL seeks to reconcile those trade‑offs by encoding correctness, coverage, and linguistic quality in a single, balanced reward signal.
For practitioners, the paper highlights two deployment‑relevant mechanisms: reward normalization adapted for continuous captioning scores, and a length‑conditional masking rule that adjusts penalties by target length. The authors recommend both when using RL or reward‑based tuning to adapt LLaVA or Qwen family models for captioning pipelines. The full paper and related readings are available on the publication page linked in the source notes.
Sources