LLMs之RL之GDPO:《GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization》翻译与解读
导读:本文识别并定量说明了在多奖励强化学习中广泛使用的 GRPO 存在“奖励信号压缩/信息丢失”的结构性问题,会降低训练信号的分辨率并影响训练稳定性;为此提出 GDPO(对每个奖励分量先做组内归一化、再汇总并做 batch-wise 归一化),以保留跨奖励差异、控制数值尺度,并显著提升训练稳定性与下游性能;实验表明在工具调用、数学与编程推理任务上 GDPO 一致优于 GRPO,论文同时给出了关于 reward 设计与权重管理的实务建议和未来研究方向,适合直接应用于多偏好对齐的 RL 微调流水线。
>> 背景痛点
● 多奖励整合挑战:多奖励(multi-reward)强化学习在对话/推理/工具调用等场景被广泛采用,以同时优化准确性、格式、长度、约束等多种人类偏好,但如何把这些异质奖励高效且稳定地整合进策略优化仍存在难题。
● GRPO 的内在压缩问题:现行常用的 Group Relative Policy Optimization(GRPO)方法先将多个奖励相加再在组内归一化,这会把具有不同组合意义的 rollout 奖励压缩为相同或极为相近的 advantage,从而丢失跨奖励维度的重要区分信号,影响训练精度与稳定性(见本节末尾的数值示例)。
● 训练不稳定与性能退化风险:论文观察到,在多奖励场景下直接使用 GRPO 有时会导致优势估计不准确、学习信号分辨率下降,进而出现收敛不佳或训练早期失败(部分设置下 correctness reward 开始下降)。
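为直观理解这一“先求和再归一化”造成的坍缩,下面给出一个极简的 NumPy 数值示例(数据与写法均为本文为说明而构造的假设,并非论文中的原始实验),对比 GRPO 式归一化与逐奖励解耦归一化的区别:

```python
import numpy as np

eps = 1e-8
# 同一 prompt 下 4 条 rollout 的两维奖励 [correctness, format](构造数据)
rewards = np.array([
    [1.0, 0.0],   # 只答对、格式不符
    [0.0, 1.0],   # 答错、格式符合
    [1.0, 1.0],   # 两者都满足
    [1.0, 0.0],   # 只答对、格式不符
])

# GRPO:先对奖励求和,再做组内归一化
s = rewards.sum(axis=1)                        # [1, 1, 2, 1]
grpo_adv = (s - s.mean()) / (s.std() + eps)    # 第 1、2、4 条 rollout 的优势完全相同

# 逐奖励解耦归一化后再求和(GDPO 的核心思想)
per = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
decoupled_adv = per.sum(axis=1)                # 第 1 条与第 2 条被区分开

print("GRPO:", np.round(grpo_adv, 3))
print("GDPO:", np.round(decoupled_adv, 3))
```

可以看到,“只答对”与“只满足格式”这两类含义完全不同的 rollout 在 GRPO 下得到同一个 advantage,而逐奖励归一化保留了这一差别。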
>> 具体的解决方案
● 方法总体:论文提出 Group reward-Decoupled Normalization Policy Optimization(GDPO),通过对每一类奖励单独进行组内归一化(decoupled group-wise normalization),再将这些归一化后的各奖励优势(per-reward advantage)求和,最后做 batch-wise advantage normalization,以兼顾保留跨奖励差异与数值稳定性。
● 保持奖励区分度:GDPO 的核心在于先单独标准化每个奖励,避免“先求和再归一化”导致的不同奖励组合被压缩为相同 advantage 的问题,从而为优化提供更有辨识度的训练信号。
● 批次级归一化保障稳定:在将各奖励的归一化优势求和后,GDPO 再做 batch-wise advantage normalization,以保证当奖励数量增加时数值尺度不会膨胀,并改善训练稳定性(论文指出,去掉该步会偶发收敛失败)。
● 奖励优先级/权重处理:论文还提供如何调整奖励函数与权重以反映不同偏好优先级的系统性说明,便于在实际中对偏好权重进行可解释调整。
>> 核心思路步骤(可操作化流程)
● 步骤 1 — 构造多奖励目标:为目标任务设计若干互补但可能冲突的奖励分量(如 correctness、format、length、bug_ratio 等),并定义其度量函数。
● 步骤 2 — 逐奖励组内归一化:在每个问题/每组 rollout 内,分别对每个奖励维度计算组内(group-wise)归一化优势,而不是先把奖励求和。
● 步骤 3 — 汇总并做批次归一化:对每条 rollout,将各奖励维度的归一化优势相加得到总优势,再在整个 batch 上对该总优势进行 batch-wise 归一化,保持数值稳定且避免随奖励数量增加而放大(完整计算流程见本节末尾的示意代码)。
● 步骤 4 — 策略更新与监控:用 GDPO 得到的 advantage 进行策略更新(类似 GRPO 的更新范式),同时监控各奖励的收敛行为与训练稳定性,如有必要调整 reward 权重或归一化设置。
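结合上述四个步骤,下面给出 GDPO 优势计算部分的一个最小 NumPy 示意(非论文官方实现;函数名 gdpo_group_advantages、batch_normalize,以及权重参数 weights、平滑项 eps 等均为本文为说明而假设的命名与细节,具体公式请以论文为准):

```python
import numpy as np

def gdpo_group_advantages(rewards, weights=None, eps=1e-8):
    """步骤 2:对同一 prompt 下一组 rollout 的每个奖励维度分别做组内归一化,
    再按(可选的)优先级权重加权求和,得到每条 rollout 的总优势。

    rewards: 形状为 (组内 rollout 数, 奖励维度数) 的数组。
    weights: 各奖励维度的权重(此处仅为示意的一种优先级编码方式),默认等权。
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    if weights is None:
        weights = np.ones(rewards.shape[1])
    # 逐奖励维度的组内(group-wise)归一化,即解耦归一化
    per_reward_adv = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    # 步骤 3(前半):按奖励维度加权求和
    return per_reward_adv @ np.asarray(weights, dtype=np.float64)

def batch_normalize(advantages, eps=1e-8):
    """步骤 3(后半):batch-wise 优势归一化,避免数值随奖励数量增加而膨胀。"""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# 用法示意:一个 batch 含 4 个 group,每组 8 条 rollout、3 个奖励维度(随机数据)
rng = np.random.default_rng(0)
groups = [rng.random((8, 3)) for _ in range(4)]
summed = np.concatenate([gdpo_group_advantages(g) for g in groups])
final_adv = batch_normalize(summed)  # 步骤 4:作为类 GRPO 策略更新中使用的优势
```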
>> 实验设计与评价(论文中的实验要点)
● 三类任务覆盖:在工具调用(tool calling)、数学推理(math reasoning)与代码推理(coding reasoning)三类任务上比较 GDPO 与 GRPO 的表现,指标涵盖正确率、格式遵守、长度约束、代码通过率与 bug 比率等。
● 结果概览:在所有设置中,GDPO 在收敛性、下游准确率与约束遵守方面均优于 GRPO;例如在 AIME 数学任务上,GDPO 分别为 DeepSeek-R1-1.5B 和 Qwen3-4B-Instruct 带来最高 6.3% 和 2.3% 的准确率提升,同时更好地满足响应长度约束。论文通过训练曲线与稳态性能多次展示了这一改进。
● 对比消融:论文还检验了移除 GRPO 中标准差归一化(std normalization)的变体,发现仅去掉 std 归一化并不能从根本上恢复被压缩的信息;GDPO 的解耦归一化在保留不同优势取值组合(distinct advantage groups)数量方面更具优势(下方给出一个统计 distinct advantage 取值数量的小示例)。
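为说明“解耦归一化能保留更多不同的 advantage 取值”这一点,下面给出一个统计 distinct advantage 取值数量的小示例(随机构造的二值奖励,仅作直观说明,并非论文原始统计口径):

```python
import numpy as np

def count_distinct(adv, decimals=6):
    """统计一组 advantage 中不同取值的数量(四舍五入以消除浮点噪声)。"""
    return len(np.unique(np.round(adv, decimals)))

eps = 1e-8
rng = np.random.default_rng(0)
# 假设一组 16 条 rollout、3 个二值奖励(随机构造)
rewards = rng.integers(0, 2, size=(16, 3)).astype(float)

# GRPO 风格:先求和再组内归一化;3 个二值奖励的和只有 0~3 四种,优势取值至多 4 种
s = rewards.sum(axis=1)
grpo_adv = (s - s.mean()) / (s.std() + eps)

# GDPO 风格:逐奖励归一化后求和;当各维度统计量不同时,至多可区分 2^3 = 8 种组合
per = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
gdpo_adv = per.sum(axis=1)

print("GRPO distinct advantages:", count_distinct(grpo_adv))
print("GDPO distinct advantages:", count_distinct(gdpo_adv))
```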
>> 优势(论文方法/贡献的强项)
● 信息保留更好:GDPO 通过对每个奖励分量独立归一化,能保留更多“不同奖励组合”的细粒度差别,从而提升学习信号的分辨率。
● 更高训练稳定性:加入 batch-wise advantage normalization 后,GDPO 在多奖励场景下显著降低训练崩溃或早期退化的风险,得到更平滑的收敛曲线。
● 通用性与实用性:论文在三类不同任务与多种模型上复现对比实验,表明 GDPO 在多奖励 RL 优化中具有广泛适用性,是直接替代 GRPO 的工程可行方案。
● 操作建议伴随:不仅提出算法,还给出关于权重调整与 reward 设计的系统性指南,方便工程实践落地。
>> 后续结论观点(经验、建议与工程/研究导向)
● 实务建议:在多奖励微调/强化学习流水线中,应优先考虑对每个奖励分量采取分开归一化以防信息压缩,并保留批次级归一化以维护数值稳定。
● 设计奖励时慎选权重:论文强调对奖励权重和优先级的系统性调整是必要的——不同优先级的偏好需要在归一化与加权步骤中得到体现以避免次优偏置。
● 对现有 GRPO 变体的反思:去掉标准差归一化(GRPO w/o std)虽能带来少量改进,但不足以替代对奖励解耦的做法,提示未来算法设计应更关注保留跨奖励信息的能力。
● 研究方向:可进一步研究在极多数奖励、稀疏奖励或奖励相互冲突更严重的场景下 GDPO 的扩展(例如自适应权重调度、层级优先级编码),以及更严格的理论解析奖励解耦对收敛性的影响。
目录
《GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization》翻译与解读
Abstract
Figure 1
1、Introduction
6 Conclusion
《GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization》翻译与解读
地址 | 论文地址:https://arxiv.org/abs/2601.05242 |
时间 | 2026年01月08日 |
作者 | NVIDIA |
Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization. | 随着语言模型能力的不断增强,用户期望它们不仅能够给出准确的回答,还能在各种场景中展现出与人类多样偏好相一致的行为。为了实现这一目标,强化学习(RL)流程已开始引入多种奖励,每种奖励捕捉一种不同的偏好,以引导模型朝着这些期望的行为发展。然而,近期的研究在多奖励设置下默认采用组相对策略优化(GRPO),却未对其适用性进行考察。在本文中,我们证明直接用 GRPO 对不同的 rollout 奖励组合进行归一化,会导致它们坍缩为相同的优势值(advantage),从而降低训练信号的分辨率,导致次优收敛,甚至在某些情况下出现训练早期失败。随后,我们引入了组奖励解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过解耦各个奖励的归一化过程,更忠实地保留它们之间的相对差异,从而实现更精确的多奖励优化,并显著提高训练的稳定性。我们在工具调用、数学推理和编程推理这三项任务上将 GDPO 与 GRPO 进行了比较,评估了正确性指标(准确率、bug 比率)和约束遵守指标(格式、长度)。在所有设置中,GDPO 始终优于 GRPO,这表明其在多奖励强化学习优化方面的有效性和通用性。 |
Figure 1:(a): An overview of GDPO, which performs group-wise normalization per reward and then applies batch-wise advantage normalization to preserve a stable numerical range independent of reward count and improve update stability. (b): Median and IQR reward curves over five runs of Qwen2.5-Instruct-1.5B tool-calling RL, demonstrating that GDPO consistently converges to higher correctness and format reward score than GRPO.图 1:(a):GDPO 的概述,它对每个奖励执行组内归一化,然后应用批内优势归一化,以保持稳定的数值范围,不受奖励数量的影响,并提高更新的稳定性。(b):Qwen2.5-Instruct-1.5B 工具调用 RL 五次运行的中位数和四分位距奖励曲线,表明 GDPO 一直收敛到比 GRPO 更高的正确性和格式奖励得分。
1、Introduction
As language models continue to advance in capability, expectations for their behavior have grown accordingly. Demand for models to not only provide accurate responses but also exhibit behaviors aligned with a wide range of human preferences across diverse scenarios has continued to increase. These preferences span efficiency [1, 2, 3], safety [4], response coherence and logic [5, 6], gender biases [7] and many other objectives. Meeting such heterogeneous requirements within a single model is a challenging task. Reinforcement learning (RL) has emerged as the de facto training pipeline for aligning large language models to fulfill such diverse human preferences. In particular, recent RL-based approaches have begun to incorporate multiple rewards into training, with each reward designed to capture different human preferences and collectively guide models toward human-favored behaviors. Despite this growing interest in multi-reward RL, recent work [1, 3, 5] has largely focused on the reward design itself and often directly relied on applying Group Relative Policy Optimization (GRPO) directly for multi-reward RL optimization, often without examining whether GRPO is well-suited for optimizing combinations of heterogeneous rewards. In this paper, we revisit the applicability of GRPO in multi-reward settings and show that directly applying GRPO to normalize different combinations of rollout rewards can cause them to collapse into identical advantage values, which effectively limits the precision of the training signal, as illustrated in Fig. 2. This collapse removes important distinctions across reward dimensions and leads to inaccurate policy updates, suboptimal reward convergence, and, in many cases, early training failure. To overcome these challenges, we propose Group reward-Decoupled Normalization Policy Optimization (GDPO) which decouples the group-wise normalization of each individual reward as illustrated in Fig. 1(a), to ensure that distinctions across different reward combinations are better preserved and more accurately reflect the relative differences in model responses. This leads to more precise multi-reward optimization and substantially improved training convergence. After this decoupled group-wise normalization, we apply batch-wise advantage normalization to ensure that the magnitude of advantage does not increase as the number of individual rewards increases. | 随着语言模型能力的不断提升,人们对它们行为的期望也相应提高。人们不仅希望模型能提供准确的响应,还希望它们在各种场景中展现出与广泛的人类偏好相一致的行为。这些偏好涵盖效率 [1, 2, 3]、安全性 [4]、响应的连贯性与逻辑性 [5, 6]、性别偏见 [7] 以及许多其他目标。在单一模型中满足如此异质的需求是一项艰巨的任务。 强化学习(RL)已成为将大型语言模型与各种人类偏好对齐的事实标准训练流程。特别是,最近基于强化学习的方法开始在训练中纳入多个奖励,每个奖励旨在捕捉不同的人类偏好,并共同引导模型朝着人类偏好的行为发展。尽管人们对多奖励强化学习的兴趣日益浓厚,但近期的研究 [1, 3, 5] 主要集中在奖励设计本身,并且常常直接将组相对策略优化(GRPO)应用于多奖励强化学习的优化,而很少探究 GRPO 是否适合用于优化异质奖励的组合。 在本文中,我们重新审视了 GRPO 在多奖励环境中的适用性,并表明直接用 GRPO 对不同的 rollout 奖励组合进行归一化,可能导致它们坍缩为相同的优势值,这实际上限制了训练信号的精度,如图 2 所示。这种坍缩消除了奖励维度之间的重要差异,导致策略更新不准确、奖励收敛不理想,并且在很多情况下会导致训练提前失败。 为了克服这些挑战,我们提出了组奖励解耦归一化策略优化(GDPO),如图 1(a) 所示,它将每个个体奖励的组内归一化解耦,以确保不同奖励组合之间的差异得到更好的保留,并更准确地反映模型响应的相对差异。这带来了更精确的多奖励优化,并显著提高了训练收敛性。在进行这种解耦的组内归一化之后,我们应用批次级优势归一化,以确保优势的大小不会随着个体奖励数量的增加而增大。 |
We compare GDPO and GRPO across three tasks: tool calling, math reasoning, and code reasoning. These tasks cover a wide range of objectives, including tool-calling accuracy and format correctness, mathematical reasoning accuracy and adherence to reasoning-length constraints, and code pass rate and bug ratio. Across all tasks, GDPO converges better. For example, in Fig. 1(b), training Qwen2.5-1.5B-Instruct with GDPO attains both higher correctness and format compliance than GRPO on the tool-calling task. On challenging math tasks, GDPO consistently outperforms GRPO. For instance, training DeepSeek-R1-1.5B and Qwen3-4B-Instruct with GDPO yields up to 6.3% and 2.3% higher accuracy on AIME compared to GRPO, while keeping more responses short simultaneously. Taken together, these results demonstrate the effectiveness and generalizability of GDPO, showing it to be a better alternative to GRPO for multi-reward RL optimization. Our contributions are as follows: • Analysis of GRPO reward collapse. We demonstrate that applying GRPO naively for multi-reward RL optimization can collapse distinct rollout reward combinations into identical advantage values, thereby diminishing the resolution of the learning signal. • Remediation of GRPO reward collapse. We propose GDPO, which performs group-wise decoupled normalization of each reward separately to better preserve cross-reward distinctions and enable more accurate multi-reward optimization. • In addition to GDPO, we provide a systematic overview of how to modify reward functions and adjust reward weights to more faithfully align with preferences of varying priority. • We carry out extensive experiments on three tasks: tool calling, math reasoning, and code reasoning, and compare the effectiveness of GDPO on optimizing a wide range of rewards corresponding to accuracy, format correctness, length constraints, and code quality. In all settings, GDPO consistently outperforms GRPO, showing improved training convergence and stronger downstream performance that align more closely with a diverse set of preferences. | 我们在三个任务上对 GDPO 和 GRPO 进行了比较:工具调用、数学推理和代码推理。这些任务涵盖了广泛的目标,包括工具调用的准确性和格式正确性、数学推理的准确性和推理长度约束的遵循情况,以及代码的通过率和 bug 比率。在所有任务中,GDPO 的收敛效果更好。例如,在图 1(b) 中,使用 GDPO 训练 Qwen2.5-1.5B-Instruct 在工具调用任务上比 GRPO 达到了更高的正确性和格式合规性。在具有挑战性的数学任务上,GDPO 也始终优于 GRPO。例如,使用 GDPO 训练 DeepSeek-R1-1.5B 和 Qwen3-4B-Instruct 在 AIME 上的准确率分别比 GRPO 最高高出 6.3% 和 2.3%,同时让更多回答保持简短。 综上所述,这些结果证明了 GDPO 的有效性和通用性,表明它是多奖励强化学习优化中比 GRPO 更好的替代方案。 我们的贡献如下: • 对 GRPO 奖励坍缩的分析。我们证明,若在多奖励强化学习优化中直接应用 GRPO,可能会将不同的 rollout 奖励组合坍缩为相同的优势值,从而降低学习信号的分辨率。 • 对 GRPO 奖励坍缩的修复。我们提出了 GDPO,它对每个奖励分别进行组内解耦归一化,以更好地保留跨奖励的差异,并实现更准确的多奖励优化。 • 除了 GDPO 之外,我们还系统地概述了如何修改奖励函数和调整奖励权重,以更忠实地与不同优先级的偏好保持一致。 • 我们在三个任务上进行了大量实验:工具调用、数学推理和代码推理,并比较了 GDPO 在优化对应于准确性、格式正确性、长度限制和代码质量的广泛奖励方面的有效性。在所有设置中,GDPO 始终优于 GRPO,表现出更好的训练收敛性和更强的下游性能,更紧密地与多样化的偏好保持一致。 |
6 Conclusion
In contrast to prior work that focuses on designing new reward functions for multi-reward reinforcement learning while assuming GRPO is the default optimization method, this study revisits a fundamental but often overlooked question: whether GRPO is actually suitable for multi-reward optimization. Our analysis shows that applying GRPO directly to the summed reward can cause different reward combinations to collapse into the same advantage values. This collapse eliminates important distinctions across reward dimensions, produces inaccurate policy updates and weaker optimization performance, and can in many cases lead to early training failure. To address this limitation, we introduce Group-wise Decoupled Policy Optimization (GDPO), a simple and effective modification to GRPO tailored for multi-reward reinforcement learning. GDPO performs normalization separately for each reward to preserve cross-reward differences, and it incorporates batch-wise advantage normalization to maintain a stable numerical range as additional rewards are included. These changes result in better convergence behavior and models that more faithfully reflect the intended preference structure. We further present a systematic study for incorporating human preference priorities into the training process and explain how reward functions can be adjusted when the difficulty disparity between objectives is large. Through extensive experiments on tool calling, math reasoning, and coding reasoning, we show that GDPO consistently outperforms GRPO. Its advantages hold across different numbers of rewards, across different models, and across different reward functions. Overall, our findings establish GDPO as a more stable, accurate, and preference-aligned optimization method than GRPO for multi-reward reinforcement learning, making it a strong foundation for aligning language models with diverse human preferences in real-world settings. | 与以往专注于为多奖励强化学习设计新奖励函数、同时默认使用 GRPO 作为优化方法的研究不同,本研究重新审视了一个基本但常被忽视的问题:GRPO 是否真的适合多奖励优化。我们的分析表明,直接将 GRPO 应用于求和后的奖励,会导致不同的奖励组合坍缩为相同的优势值。这种坍缩消除了奖励维度之间的重要差异,导致策略更新不准确和优化性能变弱,并且在许多情况下会导致训练提前失败。 为了解决这一局限性,我们引入了组级解耦策略优化(GDPO),这是一种针对多奖励强化学习对 GRPO 进行的简单而有效的改进。GDPO 对每个奖励分别进行归一化以保留跨奖励的差异,并采用批次级优势归一化以在加入更多奖励时保持稳定的数值范围。这些改变带来了更好的收敛行为,并使模型更忠实地反映预期的偏好结构。我们进一步系统地研究了如何将人类偏好优先级纳入训练过程,并解释了当各目标之间难度差异较大时如何调整奖励函数。通过在工具调用、数学推理和编程推理上的大量实验,我们表明 GDPO 始终优于 GRPO,其优势在不同奖励数量、不同模型以及不同奖励函数的情况下均成立。 总体而言,我们的研究结果表明,对于多奖励强化学习,GDPO 是一种比 GRPO 更稳定、更准确且更符合偏好对齐的优化方法,这使其成为在现实世界环境中将语言模型与多样化的人类偏好对齐的坚实基础。 |