arxiv:2606.20008

VIMPO: Value-Implicit Policy Optimization for LLMs

Published on Jun 18

Abstract

VIMPO is a critic-free policy optimization method that uses policy-implied value functions derived from KL-regularized reinforcement learning to improve language model reasoning with better credit assignment than existing group-relative methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
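
The recurrence the abstract describes can be sketched as follows. This assumes the standard optimality condition of KL-regularized RL with temperature \beta, deterministic token-level transitions, and a per-step reward r_t that is zero except for the final outcome reward. The notation is an assumption made for illustration and may not match the paper's own equations.

```latex
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^*(s_t, a_t) - V^*(s_t)\big)\Big), \\
Q^*(s_t, a_t) &= r_t + V^*(s_{t+1}), \\
V^*(s_{t+1}) &= V^*(s_t)
  + \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} - r_t, \\
V^*(s_T) &= 0 .
\end{align}
```

Under these assumptions, summing the third line from t = 0 to T - 1 and applying the terminal condition gives V^*(s_0) = \sum_t r_t - \beta \sum_t \log(\pi^* / \pi_{\mathrm{ref}}), which suggests how an outcome-level reward could constrain the log-ratios without a learned critic.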


Community

Merakakrem

about 18 hours ago

Thanks for the elegant formulation — expressing value purely through the policy-reference log-ratio is a really clean idea. I have a few questions about the role of the reference model π_ref that I'd like to make sure I understand correctly.

(1) Because V_π and the actor advantage are both defined as cumulative log-ratios against π_ref (Eq. 6, 9), π_ref seems to act not just as a KL anchor but as the coordinate system in which value is measured — its zero and scale. Concretely: if I swap π_ref for a different reference without touching the policy at all, the entire V_π curve shifts, and a policy that previously satisfied the terminal anchor V_π(s_T)=0 now violates it, producing a nonzero L_V gradient that comes purely from the change of reference rather than from the policy getting better or worse. Is this the intended reading, or am I missing a normalization that cancels this?

(2) Relatedly, the derivation in Sec. 3.2 holds for any fixed π_ref and never constrains where π_ref should sit — so theoretically its placement is free. Yet in practice it's frozen, and the limitations suggest periodically updating it. Since each update redefines the value recurrence and the anchor's zero point, wouldn't periodic updates effectively re-define the optimization target each time? How do you reconcile "derive the value from optimality under a fixed π_ref" with "move π_ref during training"?

(3) Did you run any experiments with a moving / periodically-updated π_ref? If so, did you observe loss discontinuities at the update steps (which I'd expect from the coordinate shift in Q1)? Curious whether annealing β vs. updating π_ref behave differently.
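
The coordinate-shift concern in (1) and (3) can be illustrated with a toy calculation. This is a minimal sketch, not the paper's implementation: it assumes a forward recurrence of the form V_{t+1} = V_t + beta * (log pi - log pi_ref) - r_t, a squared terminal-anchor penalty, and a hypothetical starting value `v0`. The function names and all numbers are made up for illustration.

```python
def implied_values(logp_policy, logp_ref, rewards, beta, v0=0.0):
    """Roll an assumed policy-implied value recurrence forward in time.

    Assumed form (not taken from the paper):
        V_{t+1} = V_t + beta * (logp_policy[t] - logp_ref[t]) - rewards[t]
    """
    values = [v0]
    for lp, lr, r in zip(logp_policy, logp_ref, rewards):
        values.append(values[-1] + beta * (lp - lr) - r)
    return values


def terminal_anchor_loss(values):
    """Squared violation of the terminal condition V(s_T) = 0 (assumed loss form)."""
    return values[-1] ** 2


# Toy trajectory: three tokens, outcome reward of 1 on the last step.
beta = 0.5
rewards = [0.0, 0.0, 1.0]
logp_policy = [-1.0, -0.5, -0.25]  # the policy is held fixed throughout

# Reference A: chosen so the terminal anchor is satisfied exactly.
logp_ref_a = [-2.0, -1.0, -0.75]
# Reference B: same policy, different reference on the first token only.
logp_ref_b = [-1.5, -1.0, -0.75]

values_a = implied_values(logp_policy, logp_ref_a, rewards, beta)
values_b = implied_values(logp_policy, logp_ref_b, rewards, beta)

print(values_a, terminal_anchor_loss(values_a))
print(values_b, terminal_anchor_loss(values_b))
```

In this toy setup the policy log-probabilities never change, yet swapping reference A for reference B moves the terminal value away from zero and makes the anchor penalty nonzero. Whether the actual VIMPO loss behaves this way depends on details of Eq. 6 and Eq. 9 that are not reproduced here.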
