
ServiceNow engineers discovered a train-inference mismatch after moving rollout inference from vLLM V0 to the substantially rewritten V1 engine. PipelineRL uses vLLM as its inference engine for sampling tokens and returning per-token logprobs; the reference runs used vLLM 0.8.5, while the V1 experiments ran on vLLM 0.18.1. The V1 rollout returned logprob values in a form the trainer did not expect, which caused early GSPO training metrics to diverge from the V0 reference.
PipelineRL’s trainer consumes per-token logprobs to compute policy ratios, KL, clip rate, entropy and reward. The first concrete fix was to enable processed rollout logprobs. By default, vLLM V1 returns logprobs computed from the raw model outputs, before temperature, penalties, and top-k/top-p post-processing are applied, so they do not describe the distribution the sampler actually draws from. The team set logprobs-mode=processed_logprobs so that the returned logprobs match the sampler’s distribution. That change removed an obvious mean offset in rollout logprobs.
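The difference between raw and processed logprobs can be illustrated with temperature alone. The following is a minimal sketch in plain Python, not vLLM's implementation: the function names, the logits, and the temperature value are assumptions chosen for illustration, and penalties and top-k/top-p filtering are left out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def raw_logprobs(logits):
    # Logprobs taken straight from the model outputs (no sampling transforms).
    return log_softmax(logits)

def processed_logprobs(logits, temperature):
    # Logprobs of the distribution the sampler draws from when only
    # temperature scaling is applied (illustrative simplification).
    return log_softmax([x / temperature for x in logits])

# Hypothetical logits and temperature, for illustration only.
logits = [2.0, 1.0, 0.0, -1.0]
temperature = 0.7

raw = raw_logprobs(logits)
proc = processed_logprobs(logits, temperature)

# For the sampled token, the two conventions disagree whenever
# temperature != 1, which shows up as an offset in the policy ratio.
sampled = 0
offset = proc[sampled] - raw[sampled]
print(f"raw={raw[sampled]:.4f} processed={proc[sampled]:.4f} offset={offset:.4f}")
```

If the trainer assumes one convention while the rollout engine reports the other, every sampled token carries an offset of this kind, which is consistent with the mean shift described above.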
The issue appeared quickly in trainer-side diagnostics during a GSPO run. The metrics that diverged included clamp_log_ratio_new_old_indicator, kl_new_old, entropy and reward; visual comparisons showed the initial V1 attempt separating from the V0 reference early in training. Clip rate proved the clearest signal of the gap between the rollout policy and the trainer policy, and policy-ratio plots showed the mean bias shrinking once processed logprobs were enabled.
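Diagnostics of this kind can be sketched from two lists of per-token logprobs. The example below is an assumption-laden simplification rather than PipelineRL's code: it works at the token level (GSPO itself uses sequence-level ratios), the function name and the clip_eps value are hypothetical, and the KL term uses the common (r - 1) - log r estimator.

```python
import math

def ratio_diagnostics(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare trainer ("new") and rollout ("old") logprobs of sampled tokens.

    Hypothetical helper: returns the mean policy ratio, the fraction of
    tokens whose ratio falls outside [1 - clip_eps, 1 + clip_eps], and a
    non-negative estimate of KL(old || new).
    """
    log_ratios = [new - old for new, old in zip(trainer_logprobs, rollout_logprobs)]
    ratios = [math.exp(d) for d in log_ratios]
    n = len(ratios)
    clip_rate = sum(1 for r in ratios if r < 1.0 - clip_eps or r > 1.0 + clip_eps) / n
    # (r - 1) - log r is >= 0 for every r, and 0 only when r == 1.
    kl = sum((r - 1.0) - d for r, d in zip(ratios, log_ratios)) / n
    return {"mean_ratio": sum(ratios) / n, "clip_rate": clip_rate, "kl": kl}

# Illustrative rollout logprobs for four sampled tokens.
rollout = [-1.0, -0.5, -2.0, -0.1]

# Case 1: both sides agree exactly.
matched = ratio_diagnostics(rollout, rollout)

# Case 2: a constant 0.3 offset between the two conventions.
shifted = ratio_diagnostics([lp + 0.3 for lp in rollout], rollout)

print(matched)
print(shifted)
```

In this toy setup a constant offset pushes every token's ratio outside the clip range, which suggests why clip rate reacts so sharply to a logprob-convention mismatch.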
To diagnose root causes, the engineers separated failure modes into three layers: semantic mismatch (the meaning and processing of logprobs), inference-path mismatch (runtime defaults such as caching, scheduling and request handling that create different execution paths), and objective mismatch (whether RL-objective corrections were required for staleness). The team initially suspected that the objective needed changing, but it first treated the semantic and inference-path layers as backend behavior problems and ruled them out before touching the RL objective.
Beyond enabling processed_logprobs, ServiceNow implemented three additional V1 backend fixes to restore parity with V0: aligning V1-specific runtime defaults, repairing the inflight weight-update execution path, and using an fp32 lm_head for the final projection. After applying those fixes, and before changing the RL objective, the final V1 run tracked close to the V0 trajectory across clip rate, KL, entropy and reward, bringing trainer-side metrics back into alignment with the 0.8.5 reference.
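Why the dtype of the final projection can matter for logprobs is easier to see with a small numeric sketch. The code below is an illustration under stated assumptions, not the team's fix: it emulates bfloat16 by rounding float32 bit patterns, applies that rounding only to already-computed logits (a real lm_head also differs in how the matrix product is accumulated), and uses hypothetical logit values.

```python
import math
import struct

def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 pattern,
    rounding to nearest-even (valid for ordinary finite values)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

# Hypothetical logits with two close competitors near 10.
logits = [10.03, 9.97, 3.21, -1.4]

lp_fp32 = log_softmax([to_fp32(x) for x in logits])
lp_bf16 = log_softmax([to_bf16(x) for x in logits])

# bfloat16 spacing near 10 is 0.0625, so both top logits collapse to the
# same value and the sampled token's logprob moves noticeably.
token = 0
logprob_gap = lp_bf16[token] - lp_fp32[token]
ratio = math.exp(logprob_gap)
print(f"logprob gap={logprob_gap:.4f} implied ratio={ratio:.4f}")
```

A gap of this size on individual tokens is small compared with a convention mismatch, but it is not zero, and it feeds directly into the policy ratio.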
The practical implication for builders of online RL systems is that rollout logprob semantics and inference runtime behavior directly affect optimization. Any mismatch in how an inference engine computes or post-processes logprobs can shift policy ratios and downstream rewards; the same class of mismatch can appear in PPO, GRPO or other online RL setups. When migrating model servers, restore backend parity (logprob processing, runtime defaults, projection dtype and update paths) before adjusting RL objectives.