
A May 2026 paper by Ivan Montero, Tomasz Jurczyk and Bhuwan Dhingra introduces Reward‑Variance Policy Optimization (RVPO), a risk‑sensitive alignment method that reduces constraint neglect in multi‑reward training. RVPO matters because it shifts the focus from maximizing aggregated reward totals to ensuring consistency across multiple objectives, preventing systems from excelling on easy signals while failing on critical constraints.
RVPO alters the advantage aggregation step used in critic‑less RLHF‑style pipelines by adding an inter‑reward variance penalty. Rather than optimizing the arithmetic sum of rewards, the method biases optimization toward policies that perform uniformly across reward channels. The authors show, via a Taylor expansion, that applying a LogSumExp (SoftMin) operator to the per‑reward advantages acts as a smooth variance regularizer: to second order it equals the mean advantage minus a penalty proportional to the variance across reward channels. This makes it a practical aggregation operator for policy updates.
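The Taylor-expansion claim can be checked numerically. The sketch below is an illustration only, not the paper's code: the function names, the temperature parameter `tau`, and the mean-normalized form of the operator are assumptions, since the summary does not give the exact formulation. It compares a LogSumExp-based SoftMin with the second-order approximation "mean minus variance / (2 * tau)".

```python
import numpy as np

def softmin_aggregate(advantages, tau):
    """Smooth minimum over reward channels: -tau * log(mean(exp(-a / tau))).

    `tau` is a hypothetical temperature. A large tau approaches the plain
    mean; a small tau approaches the hard minimum.
    """
    a = np.asarray(advantages, dtype=float)
    z = -a / tau
    # Subtract the max before exponentiating for numerical stability.
    z_max = z.max(axis=-1, keepdims=True)
    lse = z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))
    return -tau * lse

def mean_minus_variance(advantages, tau):
    """Second-order Taylor approximation of the SoftMin above."""
    a = np.asarray(advantages, dtype=float)
    return a.mean(axis=-1) - a.var(axis=-1) / (2.0 * tau)

# Two advantage vectors with the same mean (0.5) but different spread.
balanced = np.array([0.5, 0.5, 0.5, 0.5])
lopsided = np.array([1.5, 1.5, 1.5, -2.5])

for tau in (10.0, 100.0):
    print(
        tau,
        softmin_aggregate(balanced, tau),
        softmin_aggregate(lopsided, tau),
        mean_minus_variance(lopsided, tau),
    )
```

The balanced vector keeps its mean under SoftMin, while the lopsided one is pulled down by roughly its variance divided by `2 * tau`. The two values agree more closely as `tau` grows, which is the sense in which the operator behaves like a variance penalty.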
The paper evaluates RVPO under two regimes. The first covers rubric‑based medical and scientific reasoning with up to 17 concurrent LLM‑judged reward signals, using Qwen2.5 models at 3B, 7B and 14B parameter scales. The second focuses on tool‑calling tasks constrained by rule‑based checks, using Qwen2.5 variants at 1.5B and 3B. Comparisons include GDPO and other multi‑reward aggregation baselines to measure constraint adherence alongside general capabilities.
Empirical results indicate RVPO improves constraint handling without sacrificing overall capability. On HealthBench at the 14B scale, RVPO scored 0.261 versus 0.215 for GDPO (p < 0.001). The method also maintained competitive accuracy on GPQA‑Diamond and avoided the late‑stage degradation seen in other multi‑reward approaches. Across model sizes, the authors report that variance regularization prevents models from trading off difficult constraints for high scores on easier objectives.
For practitioners, RVPO offers a concrete, implementable change to advantage aggregation that prioritizes consistent behavior across many rubric or rule‑based signals. The use of LogSumExp as a smooth penalty provides an efficient operator that can be integrated into existing policy‑optimization pipelines, especially where satisfying multiple concurrent constraints is essential without degrading general performance.
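As a rough picture of where such a change would sit in a critic-less pipeline, the sketch below swaps a mean over reward channels for a SoftMin in a group-relative setup. Everything here is an assumption made for illustration: the per-channel normalization across the sampled group, the helper names, the toy reward values and `tau = 1.0` are not taken from the paper.

```python
import numpy as np

def per_channel_advantages(rewards, eps=1e-8):
    """Normalize each reward channel across the sampled group.

    rewards has shape (G, K): G completions for one prompt, K reward channels.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean(axis=0)) / (r.std(axis=0) + eps)

def softmin_over_channels(adv, tau):
    """Aggregate K per-channel advantages into one scalar per completion."""
    z = -adv / tau
    z_max = z.max(axis=1, keepdims=True)  # for numerical stability
    return -tau * (z_max[:, 0] + np.log(np.exp(z - z_max).mean(axis=1)))

# Toy group of four completions scored on three hypothetical reward channels.
rewards = np.array([
    [1.0, 1.0, 0.0],  # excels on two channels, fails the third
    [0.6, 0.6, 0.6],  # moderate but uniform
    [0.2, 0.3, 0.4],
    [0.1, 0.0, 0.9],
])

adv = per_channel_advantages(rewards)
mean_adv = adv.mean(axis=1)                        # plain averaging baseline
softmin_adv = softmin_over_channels(adv, tau=1.0)  # variance-sensitive variant

print("mean aggregation   :", np.round(mean_adv, 3))
print("softmin aggregation:", np.round(softmin_adv, 3))
```

In this toy group, plain averaging assigns the highest advantage to the first completion despite its failed channel, whereas the SoftMin variant favors the uniform second completion. The resulting per-completion scalar would then be used wherever the pipeline consumes a single advantage value.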