PORTool converts outcome-only feedback into step-level rewards using a rewarded rollout tree

Avalon Reed

5/5/2026, 3:38:44 AM

A May 2026 paper presents PORTool, a policy-optimization method that trains large language model agents to interleave natural-language reasoning with external tool calls when only outcome-level rewards are available. The work, accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026, targets the credit-assignment problem that arises when episode-level success or failure conceals which intermediate decisions or tool invocations produced the result.

PORTool constructs a rewarded rollout tree in which sampled trajectories share prefixes and then branch into alternatives. By generating multiple continuations from the same contextual prefix, the tree enables direct comparisons among alternative tool-use decisions at a given decision point, letting the method isolate the relative merits of competing actions while holding earlier context fixed.
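To make the shared-prefix idea concrete, here is a minimal Python sketch of a rollout tree. It is an illustration only, not the paper's implementation: the `StepNode` class, the `add_trajectory` and `sibling_groups` helpers, and the tool-call strings are all hypothetical names chosen for this example. It also assumes trajectories are merged wherever their step strings match, whereas a real system would more likely branch explicitly from a saved prefix.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StepNode:
    """One reasoning-plus-tool-call step in a rollout tree (hypothetical structure)."""
    action: str
    children: List["StepNode"] = field(default_factory=list)
    outcome: Optional[float] = None  # set on leaves only: 1.0 correct, 0.0 incorrect


def add_trajectory(root: StepNode, actions: List[str], outcome: float) -> StepNode:
    """Insert a trajectory, reusing existing nodes wherever the prefix matches."""
    node = root
    for action in actions:
        match = next((c for c in node.children if c.action == action), None)
        if match is None:
            match = StepNode(action)
            node.children.append(match)
        node = match
    node.outcome = outcome
    return node


def sibling_groups(root: StepNode) -> Iterator[List[StepNode]]:
    """Yield each set of alternative steps that branch from the same prefix."""
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            yield node.children
        stack.extend(node.children)


# Three sampled trajectories; the first two share the prefix "search(q)".
root = StepNode("<question>")
add_trajectory(root, ["search(q)", "calc(a)"], 1.0)
add_trajectory(root, ["search(q)", "calc(b)"], 0.0)
add_trajectory(root, ["lookup(q)"], 0.0)

groups = [sorted(c.action for c in g) for g in sibling_groups(root)]
```

Each group returned by `sibling_groups` is a decision point where alternatives can be compared with the earlier context held fixed.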

For each step in the rollout tree, PORTool computes an importance estimate that combines two components. The primary component is a correctness-dominant signal that measures whether descendants of the step can ultimately lead to a correct final answer. An auxiliary term captures execution success of the step’s tool calls. Together these step-wise importance scores quantify how much individual steps contributed to final outcomes and whether their tool calls behaved as intended.
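The article does not give the exact formula, so the following Python sketch only illustrates the two-component idea under stated assumptions: the correctness term is taken to be the fraction of descendant leaves that end in a correct answer, and the execution term is a 0/1 flag scaled by a small weight `aux_weight` so that correctness dominates. The `Step` class, the `importance` function, and the value 0.1 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """A step in the rollout tree (hypothetical structure for illustration)."""
    name: str
    tool_ok: bool = True             # did the step's tool calls execute without error?
    correct: Optional[bool] = None   # final-answer correctness, set on leaves only
    children: List["Step"] = field(default_factory=list)


def leaf_outcomes(step: Step) -> List[float]:
    """Collect 1.0/0.0 outcomes of every leaf reachable from this step."""
    if not step.children:
        return [1.0 if step.correct else 0.0]
    outcomes: List[float] = []
    for child in step.children:
        outcomes.extend(leaf_outcomes(child))
    return outcomes


def importance(step: Step, aux_weight: float = 0.1) -> float:
    """Assumed form: correctness fraction plus a small execution-success bonus."""
    outcomes = leaf_outcomes(step)
    correctness = sum(outcomes) / len(outcomes)
    execution = 1.0 if step.tool_ok else 0.0
    return correctness + aux_weight * execution


good = Step("search", children=[Step("answer_a", correct=True),
                                Step("answer_b", correct=False)])
bad = Step("lookup", tool_ok=False, children=[Step("answer_c", correct=False)])

good_score = importance(good)  # half of its leaves are correct, tool call succeeded
bad_score = importance(bad)    # no correct leaves, tool call failed
```

Keeping `aux_weight` well below 1 is one way to ensure that a step with a working tool call but no path to a correct answer never outranks a step that does reach one.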

Algorithmically, PORTool uses these importance-weighted step estimates to update the policy: steps judged more important (by local branching comparisons and by overall trajectory quality) are reinforced, while less valuable or failing tool calls are de-emphasized. This converts sparse, episode-level feedback into granular training signals without requiring additional stepwise annotations, encouraging the policy to prefer efficient, higher-value tool-call behavior.
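As a rough illustration of how step-level scores could drive an update, the Python sketch below centers importance scores within a group of sibling steps and plugs them into a generic policy-gradient surrogate. This is an assumption for illustration, not the paper's objective: the mean-centered baseline is borrowed from common group-relative methods, and the function names, scores, and probabilities are hypothetical.

```python
import math
from typing import List


def sibling_advantages(importances: List[float]) -> List[float]:
    """Center importance scores within one group of steps that share a prefix."""
    mean = sum(importances) / len(importances)
    return [score - mean for score in importances]


def surrogate_loss(log_probs: List[float], advantages: List[float]) -> float:
    """Policy-gradient style surrogate: minimizing it raises the log-probability
    of steps with positive advantage and lowers it for negative ones."""
    total = sum(lp * adv for lp, adv in zip(log_probs, advantages))
    return -total / len(log_probs)


# Hypothetical importance scores for three alternative steps at one decision point,
# and the policy's current log-probabilities for choosing each of them.
importances = [0.6, 0.0, 0.3]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

advantages = sibling_advantages(importances)
loss = surrogate_loss(log_probs, advantages)
```

With these numbers the first step receives a positive advantage and the second a negative one, so a gradient step on `loss` would shift probability toward the locally better tool decision.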

Empirical evaluation reported in the paper shows that PORTool both improves final-answer accuracy and reduces the number of tool-call steps compared with state-of-the-art baselines. Ablation studies attribute the performance gains to the proposed step-wise importance estimates, demonstrating the method's robustness across the benchmarked tool-calling agent setups. The results indicate PORTool can raise solution quality while cutting unnecessary tool invocations.

For builders, the paper outlines a practical recipe for credit assignment from outcome-only labels: produce shared-prefix rollouts to compare alternative tool decisions, estimate step importance by propagating correctness and checking execution success, and update policies to prefer locally better choices that improve end outcomes. The authors situate PORTool alongside other efforts on inference-time feedback and stateful evaluation for tool use, and they link to related group work such as a May 1, 2026 piece on inference-time feedback and a 2025 stateful benchmark called ToolSandbox.

Authorship includes Feijie Wu (work done while at Apple), Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rong Luo, and Jing Gao; the author list notes Purdue University affiliation for some contributors.

Sources

  1. Apple Machine Learning Research · 5/4/2026