How Is Reinforcement Learning Applied to Large Models? (2) RLHF-PPO - 育师

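Building on the previous post, this article uses PPO as the example to lay out the overall RLHF framework in the NLP/LLM setting.

If you only want a walkthrough of the end-to-end RLHF-PPO pipeline, you can jump straight to Section 2.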

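1. The Four Models in RLHF-PPO

RLHF-PPO involves four models. During the RLHF-PPO stage, the Actor and Critic Models are trained, while the Reward and Reference Models stay frozen.

Actor Model: the policy model, i.e. the target language model being trained. It is initialized directly from the SFT model.

Critic Model: the value model, which predicts the expected total return. It is obtained by adding a scalar value head on top of the SFT model and training it.

Reward Model: computes the immediate reward. It is also obtained by adding a scalar value head on top of the SFT model and training it (this refers to the separate reward-model training stage, not to RLHF-PPO training).

Reference Model: adds a "constraint" to the language model during the RLHF stage so that it does not drift off course. It is initialized directly from the SFT model.

The sections on the individual models below also mention the other models, so it helps to have a rough picture of all four first.

The overall framework is shown in the figure below. What each of the four models does, how they interact, and why the framework is designed this way are covered in detail in the following sections.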

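1.1 Policy Model / Actor Model

1.1.1 Role of the Actor Model

Given the prompt plus the tokens generated so far as the state $s_t$, the Actor Model produces the next token according to the policy $\pi_\theta$. It is updated iteratively with PPO so that the return of the whole response is maximized.

The Actor Model is therefore trained during RLHF-PPO, and it is initialized directly from the SFT model (Step 1 in the figure).

The goal of training is for the Actor to produce responses that match human preferences: the user query is fed to the Actor, and the generated response together with the query is used to compute the loss that updates the Actor.

So how is the Actor Model's loss computed?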

1.1.2 Computing the Actor Model Loss

(1) The initial Actor Model loss

The core objective of reinforcement learning is to maximize the cumulative reward:
\begin{equation}\max_\pi\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^T\gamma^t r(s_t,a_t)\right]\end{equation}
From this, the Actor Model loss can be defined as:
\begin{equation}Loss_{Actor}=-\sum_t^T V_t\log\pi(a_t|s_t)\end{equation}
where $V_t$ is the expected return at time $t$ and $\pi(a_t|s_t)$ is the probability of taking action $a_t$.

When the value $V_t$ estimated by the Critic is positive, the probability of that action should be increased.

More intuitively: when a step is "good" ($V_t$ is large), the log-probability of sampling that step is increased; otherwise it is decreased.

(2) Replacing $V_t$ with the advantage function $A_t$

$V_t$ is predicted by the Critic, whereas the realized return after taking action $a_t$ is $Q(s_t, a_t) = r_t + \gamma V_{t+1}$. The gap between the two, i.e. how much better the action turned out than expected, is the advantage.

On-policy algorithms such as PPO use the advantage $A_t$ computed with GAE:
\begin{equation}A_t=(r_t+\gamma V_{t+1}-V_t)+\gamma\lambda A_{t+1}\end{equation}
$\lambda$ is the GAE hyperparameter. If you are not familiar with GAE, see my earlier post: GAE-Paper2Code

After the substitution, the loss becomes (written per token from here on; the total loss sums or averages over $t$):
\begin{equation}Loss_{Actor}=-A_t\log\pi(a_t|s_t)\end{equation}
(3) Introducing importance sampling

To allow several (minibatch) update passes over the same batch of sampled data (a key point of the surrogate objective in the original PPO paper), importance sampling is introduced: the samples were drawn from the old policy $\pi_{old}$, so each term is reweighted by the ratio $\pi/\pi_{old}$.

With importance sampling the loss becomes (the logarithm disappears):
\begin{equation}Loss_{Actor}=-A_t\frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}\end{equation}
(4) Introducing PPO's key optimization: the clip

Because samples drawn from the old policy are reused, the ratio $r_t(\theta)=\pi/\pi_{\text{old}}$ is needed as an importance correction (this ratio is not the reward $r_t$, despite the similar symbol).

However, the ratio can become very large and blow up the update, so PPO stabilizes training with a clipped objective, and the loss becomes:
\begin{equation}Loss_{Actor}=-\min\left(A_t\frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},\;A_t\,\mathrm{clip}\left(\frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)},1-\epsilon,1+\epsilon\right)\right)\end{equation}
This is the final form of the PPO actor loss.
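
To make the clipped loss concrete, here is a minimal sketch in plain Python. It assumes the per-token log-probabilities and advantages are already available as lists; the function name `ppo_clip_loss` and the numbers are illustrative, not taken from any particular library.

```python
import math

def ppo_clip_loss(new_logps, old_logps, advantages, eps=0.2):
    """Mean clipped-surrogate loss over the tokens of one response."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                # pi / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(ratio, 1-eps, 1+eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# Illustrative numbers: token 1 has ratio 1.5 and A > 0, token 2 has ratio 0.5 and A < 0.
old_logps = [math.log(0.20), math.log(0.50)]
new_logps = [math.log(0.30), math.log(0.25)]
advantages = [1.0, -1.0]
print(ppo_clip_loss(new_logps, old_logps, advantages))
```

In this example both tokens land on the clipped branch (ratio above $1+\epsilon$ with a positive advantage, ratio below $1-\epsilon$ with a negative one), so neither would contribute gradient toward pushing the ratio further from 1.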

1.1.3 Assigning the Immediate Reward $r_t$ to Tokens in NLP Tasks

From Eq. (3), computing the advantage $A_t$ requires the immediate reward $r_t$. So how is the reward signal broken down to the token level in NLP tasks?

To make the problem concrete: the model generates its response one token at a time, and only once a complete response exists does the Reward Model return a single score for the response as a whole (how it does so is covered in the Reward Model section below). A step-wise policy optimization algorithm like PPO, however, needs a "reward signal" for every token. So how is the reward assigned to each token?

If nothing were done at the token level, the reward sequence would be:
\begin{equation}r_1=0,r_2=0,...,r_{T-1}=0,r_T=\mathrm{RM~score}\end{equation}
This can work, but the signal is very sparse, and PPO tends to perform poorly with it.

The token-level "immediate reward" actually used consists of two parts:

(1) Token-level KL penalty

To keep the generated text from drifting away from the original language model (the reference policy), a KL penalty is added at every step as an "immediate reward":
\begin{equation}r_{t}^{\mathrm{KL}}=-\beta\cdot\mathrm{KL}\left(\pi_{\theta}(\cdot\mid s_{t})\parallel\pi_{\mathrm{ref}}(\cdot\mid s_{t})\right)\end{equation}
where $\pi_\theta$ is the current policy, $\pi_{\text{ref}}$ is the frozen reference LM (for example the SFT model), and $\beta$ is a weight.

(2) The end-of-episode reward from the Reward Model

The overall reward given by the Reward Model:
\begin{equation}R_{\mathrm{RM}}(x,y)\end{equation}
This score only appears when generation ends.

Combining the two parts gives the immediate reward $r_t$ at each step:
\begin{equation}r_t=\begin{cases}-\beta\,\mathrm{KL}_t,&t<T\\-\beta\,\mathrm{KL}_T+R_{\mathrm{RM}}(x,y),&t=T\end{cases}\end{equation}
In practice $\mathrm{KL}_t$ is estimated from the sampled token as a log-ratio, so this can also be written as:
\begin{equation}r_t=\begin{cases}-\beta\log\frac{\pi(a_t|s_t)}{\pi_{ref}(a_t|s_t)},&t<T\\-\beta\log\frac{\pi(a_t|s_t)}{\pi_{ref}(a_t|s_t)}+R_{\mathrm{RM}}(x,y),&t=T\end{cases}\end{equation}
An intuitive reading of this formula:

When $t < T$ (the response is not finished yet): the reward only measures how far the generated token deviates from the reference policy. This is the per-token immediate reward.

When $t = T$ (the last token of the response has been generated): the reward accounts for both how well the policy respects the reference-model constraint and the actual score $R_{\mathrm{RM}}(x,y)$.

In other words, only the reward at the final step includes the Reward Model's score for the full query plus response; at every other step the immediate reward is just the distance between the policy and the reference model.

The Reward Model only judges the response as a whole, but the combination of a terminal reward and per-token KL penalties gives every token a reward. GAE then propagates the terminal score back to earlier tokens, so PPO can compute advantages and update the policy.
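
The reward construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-token log-probabilities of the sampled tokens under the Actor and the Reference Model are already available; the function name `token_rewards` and the numbers are hypothetical.

```python
def token_rewards(logp_actor, logp_ref, rm_score, beta=0.1):
    """Per-token rewards: -beta * log-ratio at every step, plus the RM score at the last step."""
    rewards = [-beta * (la - lr) for la, lr in zip(logp_actor, logp_ref)]
    rewards[-1] += rm_score  # the Reward Model score is only added at t = T
    return rewards

# Illustrative log-probabilities for a 3-token response.
logp_actor = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]
rewards = token_rewards(logp_actor, logp_ref, rm_score=0.8)
print(rewards)
```

Note that the per-token log-ratio can be negative for an individual sample (as for the last token here), in which case that token's KL term is a small bonus rather than a penalty; only its expectation under the policy is guaranteed to be non-negative.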

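1.2 Value Model / Critic Model

1.2.1 Role of the Critic Model

The Actor loss section stated that "$V_t$ is predicted by the Critic".

So the Critic's role is clear: it estimates the "future return" $V(s_t)$ of the current state $s_t$, which serves as a baseline for computing the advantage $A_t$ and thereby guides the policy update. The Critic Model therefore also has to be trained during RLHF-PPO.

Which raises the question: how is the Critic Model trained?

1.2.2 Training the Critic Model

In PPO, the Critic's job is not to judge the Reward Model's score itself. It gives the Actor a value baseline from which the advantage is computed, which makes policy updates more robust and lowers their variance.

The Critic is trained as follows:

1.2.2.1 Sampling trajectories with the policy (the Actor collects data)

First, the current policy model (the Actor) generates many trajectories, like this:

state s_0 → generate token a_0 → receive reward r_0 → enter s_1
state s_1 → generate token a_1 → receive reward r_1 → enter s_2
...
until the complete response has been generated (termination)

Each trajectory consists of:

① $s_t$ (state)

② $a_t$ (token)

③ $r_t$ (immediate reward, combining the Reward Model score and the KL penalty)

④ $s_{t+1}$ (next state)

1.2.2.2 What is the Critic's target?

The Critic Model outputs:
\begin{equation}V_\phi(s_t)\end{equation}
This is the Critic's estimate of **the sum of all future rewards starting from $s_t$**.

In practice, however, the true future return cannot be observed directly; it can only be approximated from sampled trajectories.

1.2.2.3 How is this "target value" constructed?

With Generalized Advantage Estimation (GAE), of course:
\begin{equation}A_t^{\mathrm{GAE}}=\delta_t+\gamma\lambda A_{t+1}^{\mathrm{GAE}}\end{equation}
where:
\begin{equation}\delta_t=r_t+\gamma V_\phi(s_{t+1})-V_\phi(s_t)\end{equation}
For details on GAE, see my earlier post: GAE-Paper2Code

GAE exponentially smooths these TD errors, which gives a more robust advantage estimate.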

1.2.2.4 Obtaining the Critic's training target

By the definition of the advantage estimate:
\begin{equation}A_t^{\mathrm{GAE}}=(r_t+\gamma V_\phi(s_{t+1})-V_\phi(s_t))+\gamma\lambda A_{t+1}^{\mathrm{GAE}}\end{equation}
The Critic's target value is defined as the current prediction plus the advantage:
\begin{equation}V_t^{\mathrm{target}}=V_\phi(s_t)+A_t^{\mathrm{GAE}}\end{equation}
Expanding:
\begin{equation}V_t^\mathrm{target}=V_\phi(s_t)+(r_t+\gamma V_\phi(s_{t+1})-V_\phi(s_t))+\gamma\lambda A_{t+1}^\mathrm{GAE}\end{equation}
Simplifying:
\begin{equation}V_t^{\mathrm{target}}=(r_t+\gamma V_\phi(s_{t+1}))+\gamma\lambda A_{t+1}^{\mathrm{GAE}}\end{equation}
This is the Critic's target value. It combines ① the actually observed reward $r_t$; ② the Critic's estimate of the next state's value; ③ the smoothed advantage information.
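
As a sanity check on this target, it helps to write out the two limiting cases of $\lambda$. The short derivation below assumes that nothing follows the final token, i.e. $V_\phi(s_{T+1})=0$ and $A_{T+1}^{\mathrm{GAE}}=0$; the text above does not state this boundary condition explicitly.

```latex
\begin{align*}
\lambda=0:\quad & V_t^{\mathrm{target}} = r_t+\gamma V_\phi(s_{t+1})
  && \text{(one-step TD target)}\\
\lambda=1:\quad & A_t^{\mathrm{GAE}} = \sum_{k=0}^{T-t}\gamma^{k}\delta_{t+k}
  = \sum_{k=0}^{T-t}\gamma^{k}r_{t+k} - V_\phi(s_t)
  && \text{(the value terms telescope)}\\
 & \Rightarrow\; V_t^{\mathrm{target}} = \sum_{k=0}^{T-t}\gamma^{k}r_{t+k}
  && \text{(Monte Carlo return)}
\end{align*}
```

Intermediate values of $\lambda$ interpolate between these two: closer to 0 leans on the Critic's own bootstrap estimate (lower variance, more bias), closer to 1 leans on the sampled rewards (less bias, more variance).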

1.2.3 Critic Model Loss

The Critic Model's loss is the same as the value loss in a standard Actor-Critic setup.

It uses **MSE (mean squared error)** to bring its prediction $V_\phi(s_t)$ as close as possible to the "target value" $V_t^{\text{target}}$ obtained from sampling plus GAE:
\begin{equation}Loss_{Critic}=(V_\phi(s_t)-V_t^{\mathrm{target}})^2\end{equation}
Expanding the target:
\begin{equation}Loss_{Critic}=\left[V_\phi(s_t)-\left(r_t+\gamma V_\phi(s_{t+1})+\gamma\lambda A_{t+1}^{\mathrm{GAE}}\right)\right]^2\end{equation}
The target is computed once from the sampled rollout and treated as a constant during the update; gradients flow only through the prediction $V_\phi(s_t)$.

1.3 Reference Model

1.3.1 Role of the Reference Model

In the RLHF-PPO framework, the Reference Model is a language model that is neither updated nor trained; it stays fixed throughout RLHF-PPO training.

It exists not to provide rewards or value judgments, but to constrain the Actor's behavior.

How exactly does it do that?

1.3.2 How the Reference Model Constrains the Actor's Policy

Section 1.1.3 covered in detail how the immediate reward $r_t$ is assigned to tokens in NLP tasks. The token-level KL term in that reward,
\begin{equation}r_t^\mathrm{KL}=-\beta\cdot\mathrm{KL}\left(\pi_\theta(\cdot|s_t)\parallel\pi_\mathrm{ref}(\cdot|s_t)\right)\end{equation}
comes from the Reference Model.

The optimization objective is:
\begin{equation}\max_\pi\mathbb{E}_{x\sim\mathcal{D},y\sim\pi}\left[r(x,y)-\beta D_{KL}\left(\pi(y|x)\,\|\,\pi_{ref}(y|x)\right)\right]\end{equation}
Expanding the KL divergence:
\begin{equation}D_{KL}(\pi\|\pi_{\mathrm{ref}})=\mathbb{E}_{y\sim\pi(\cdot|x)}\left[\log\pi(y|x)-\log\pi_{\mathrm{ref}}(y|x)\right]\end{equation}
The objective becomes:
\begin{equation}\max_\pi\mathbb{E}_{x\sim\mathcal{D},y\sim\pi}\left[r(x,y)-\beta\left(\log\pi(y|x)-\log\pi_{\mathrm{ref}}(y|x)\right)\right]\end{equation}
Here $\log\pi(y|x)$ is produced by the Actor Model, while $\log\pi_{\mathrm{ref}}(y|x)$ comes from the frozen Reference Model: it is a fixed quantity that receives no gradient and acts as the anchor the Actor is held close to.

In implementations these two terms are usually called log_probs and ref_log_probs.

The closer the average of log_probs - ref_log_probs is to zero, the closer the policy model is to the reference model, i.e. the Actor has not drifted off course.

Without the Reference Model, the Actor may aggressively exploit loopholes in the Reward Model, updates can become very unstable, and the generated distribution can lose semantic coherence.
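
The relation between the full KL divergence and the `log_probs - ref_log_probs` quantity can be checked numerically. The sketch below assumes a toy three-token vocabulary with made-up probabilities; `exact_kl` and `sampled_kl` are illustrative helper names. It compares the exact KL over the vocabulary with the average log-ratio on tokens sampled from the Actor.

```python
import math
import random

def exact_kl(p, q):
    """KL(p || q) for two distributions over the same small vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sampled_kl(p, q, n_samples, rng):
    """Average of log p(a) - log q(a) over tokens a sampled from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[a]) - math.log(q[a]) for a in tokens) / n_samples

actor = [0.7, 0.2, 0.1]      # pi_theta(. | s_t), hypothetical numbers
reference = [0.5, 0.3, 0.2]  # pi_ref(. | s_t), hypothetical numbers

rng = random.Random(0)
kl_exact = exact_kl(actor, reference)
kl_sampled = sampled_kl(actor, reference, 20_000, rng)
print(kl_exact, kl_sampled)
```

With enough samples the two numbers agree closely, which is why a per-token log-ratio on the sampled token can stand in for the KL term, even though any single sample may be negative.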

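1.4 Reward Model

1.4.1 Role of the Reward Model

As mentioned in the Actor Model section, the Reward Model does not directly give each token an immediate reward. It scores the quality of the whole response, and this score represents "how much humans prefer this response".

Put more broadly: the end goal is for the model's generated responses to better match human preference judgments.

Note: the reward output by the Reward Model is at the sentence (response) level. From its point of view, the state $s_0$ is the user query x, the action $a_0$ is the complete response y, and the reward $r(x,y)$ is given by the Reward Model. At this level the problem is a single-step MDP.

The Reward Model is also not trained during RLHF-PPO. Its parameters are frozen, i.e. an already-trained Reward Model is used. So how is the Reward Model trained?

Training it by regressing absolute scores is impractical, because humans find it hard to assign an absolute score to each response (for example 0.72 vs. 0.95). What they can do is compare two responses: which one is better?

So the Reward Model is trained with **comparison-based preference learning**, so that it can "judge which response is better" (Step 2 in the figure).

1.4.2 Training the Reward Model

How does the Reward Model learn from comparisons?

For the same user input x, Reward Model training uses pairwise comparisons:

y_w: the response humans consider better (preferred)

y_l: the response humans consider worse (dispreferred)

The Reward Model's task is to score y_w higher than y_l.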

1.4.3 Reward Model Loss

Two equivalent forms of the Reward Model loss are given below.

1.4.3.1 Sigmoid form

\begin{equation}Loss_{Reward}=-\mathbb{E}_{(x,y_w,y_l)\in\mathcal{D}}\left[\log\sigma(r(x,y_w)-r(x,y_l))\right]\end{equation}

where:

$r(x, y)$ is the score the Reward Model outputs for a response

$\sigma(\cdot)$ is the sigmoid function

When $r(x,y_w)$ is much larger than $r(x,y_l)$, the sigmoid approaches 1 → the log approaches 0 → the loss is small.

In other words: make the Reward Model's scores reflect the true human preference ordering.

1.4.3.2 Softmax form

\begin{equation}Loss_{Reward}=-\mathbb{E}_{(x,y_w,y_l)\in\mathcal{D}}\left[\log\frac{e^{r(x,y_w)}}{e^{r(x,y_w)}+e^{r(x,y_l)}}\right]\end{equation}

This is a two-class contrastive loss: it pushes $e^{r(x,y_w)}$ above $e^{r(x,y_l)}$, since the softmax assigns the higher probability to the larger score.

The two losses are equivalent: dividing the numerator and denominator of the softmax by $e^{r(x,y_w)}$ gives $1/(1+e^{-(r(x,y_w)-r(x,y_l))})=\sigma(r(x,y_w)-r(x,y_l))$. The sigmoid form optimizes the probability of a "preference margin", while the softmax form casts the preference problem as two-class classification.

The form of the loss also makes it clear that the Reward Model is simply a scoring model.
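
The equivalence of the two forms is easy to confirm numerically. The sketch below evaluates both losses for a few hypothetical score pairs; the function names are illustrative, and the direct use of `math.exp` is fine for small scores but is not a numerically robust implementation for large ones.

```python
import math

def sigmoid_form(r_w, r_l):
    """-log sigmoid(r_w - r_l) for one preference pair."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

def softmax_form(r_w, r_l):
    """-log( e^{r_w} / (e^{r_w} + e^{r_l}) ) for one preference pair."""
    return -math.log(math.exp(r_w) / (math.exp(r_w) + math.exp(r_l)))

# (score of preferred reply, score of dispreferred reply), hypothetical values
pairs = [(2.0, 0.5), (0.0, 0.0), (-1.0, 1.5)]
for r_w, r_l in pairs:
    print(r_w, r_l, sigmoid_form(r_w, r_l), softmax_form(r_w, r_l))
```

The loss is small when the preferred reply already scores higher, equals $\log 2$ when the two scores tie, and grows when the ordering is wrong.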

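2. RLHF-PPO End to End

2.1 Walking Through the Training Loop

Describing RLHF-PPO model by model is rather scattered and makes it hard to form an overall picture of the algorithm. This section therefore walks through the RLHF-PPO loop from the training perspective. To make each step easy to follow, every stage is broken into four parts: object (one of the four models only), input (excluding hyperparameters), operation, and output.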

2.1.1 Step 1: Sampling (the Actor generates a trajectory)

Object: the Actor policy model $\pi_{\theta}^{\mathrm{RL}}(a_t|s_t)$

Input: the user's query (prompt) $x$

Operation: the Actor samples token by token to generate a complete response, for example:

s0 = prompt
a0 = token1
s1 = prompt + token1
a1 = token2
…
sT = prompt + all previous tokens
aT = final token

Output: (1) one trajectory:

(s0,a0,πθ(a0|s0)), (s1,a1,πθ(a1|s1)), ... (sT,aT,πθ(aT|sT))

(2) the log-probabilities under the current policy, $\log\pi_\theta(a_t|s_t)$

(3) the complete response $y$

Next, the overall score and the per-step immediate rewards need to be computed.
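
Step 1 can be illustrated with a toy rollout. The sketch below does not use a real language model: `toy_policy` is a hypothetical stand-in that returns next-token probabilities over a three-word vocabulary, and `rollout` records the (state, action, log-probability) triples described above.

```python
import math
import random

VOCAB = ["good", "bad", "<eos>"]

def toy_policy(state):
    """Hypothetical policy: next-token probabilities given the tokens so far."""
    # The longer the sequence, the more likely the policy is to stop.
    p_eos = min(0.2 + 0.2 * len(state), 0.9)
    rest = (1.0 - p_eos) / 2
    return [rest, rest, p_eos]

def rollout(prompt, rng, max_new_tokens=8):
    """Sample one response token by token and record log pi(a_t | s_t)."""
    state = list(prompt)
    steps = []  # one (state, action, log_prob) triple per generated token
    for _ in range(max_new_tokens):
        probs = toy_policy(state)
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        steps.append((tuple(state), VOCAB[idx], math.log(probs[idx])))
        state.append(VOCAB[idx])
        if VOCAB[idx] == "<eos>":
            break
    return steps

steps = rollout(["hello"], random.Random(0))
for state, action, log_prob in steps:
    print(state, "->", action, round(log_prob, 3))
```

The recorded log-probabilities are exactly what later serves as the "old policy" term in the PPO ratio, which is why they are stored at sampling time rather than recomputed after the Actor has changed.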

2.1.2 Step 2: Computing the KL penalty

Object: Reference Model

Input: every token $a_t$ of the trajectory generated by the Actor, together with its state $s_t$

Operation: compute the per-step KL value $\mathrm{KL}_t$ (the log-ratio on the sampled token, a single-sample estimate of the KL divergence)
\begin{equation}\mathrm{KL}_t=\log\frac{\pi_\theta(a_t|s_t)}{\pi_\mathrm{ref}(a_t|s_t)}\end{equation}
Output: the per-step KL penalty $r_t^\mathrm{KL}$
\begin{equation}r_t^\mathrm{KL}=-\beta\cdot\mathrm{KL}_t\end{equation}

2.1.3 Step 3: The Reward Model scores the whole response

Object: Reward Model

Input: the complete response $y$ generated by the Actor in the sampling step, and the user's prompt $x$

(prompt x, generated reply y)

Operation: compute the reward of the reply generated by the Actor

Output: a scalar reward score $r_{RM}(x,y)$

2.1.4 Step 4: Building the per-step reward sequence

Object: None

Input: (1) the per-step KL penalties $r_t^{\text{KL}}$ computed in Step 2

(2) the overall score $r_{\text{RM}}(x,y)$ output by the Reward Model in Step 3

Operation: assemble the per-step immediate reward sequence
\begin{equation}r_t=\begin{cases}r_t^\mathrm{KL},&t<T\\r_T^\mathrm{KL}+r_\mathrm{RM}(x,y),&t=T\end{cases}\end{equation}
Output: the complete reward sequence $r_t$

r0, r1, …, rT

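2.1.5 Step 5: The Critic Model estimates values

Object: Critic Model

Input: the state sequence $s_t$ from the trajectory in Step 1

s0, s1, …, sT

Operation: predict the expected total return of each state

Output: the value prediction $V_\phi(s_t)$ for each state, used to compute the advantage

V(s0), V(s1), ..., V(sT)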

2.1.6 Step 6: Advantage estimation (GAE)

Object: None

Input: (1) the complete reward sequence $r_t$ from Step 4

(2) the per-state value predictions $V_\phi(s_t)$ from Step 5

Operation: (1) first compute the TD error (at the final step $t=T$ there is no next state, so $V_\phi(s_{T+1})$ is taken as 0)
\begin{equation}\delta_t=r_t+\gamma V_\phi(s_{t+1})-V_\phi(s_t)\end{equation}
(2) then accumulate with GAE smoothing, working backward from $t=T$ with $A_{T+1}^{GAE}=0$
\begin{equation}A_t^{GAE}=\delta_t+\gamma\lambda A_{t+1}^{GAE}\end{equation}
Output: the advantage sequence $A_t^{GAE}$

A0, A1, ..., AT
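
The backward recursion of this step can be sketched as follows. This is a minimal pure-Python version under the assumption that the value after the final token is zero; the function name `compute_gae` and the numbers are illustrative.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Backward GAE recursion; also returns the value targets V + A."""
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0    # A_{T+1} = 0
    next_value = 0.0  # V(s_{T+1}) = 0: nothing follows the final token
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        advantages[t] = delta + gamma * lam * next_adv
        next_adv = advantages[t]
        next_value = values[t]
    targets = [v + a for v, a in zip(values, advantages)]
    return advantages, targets

# Illustrative 3-token episode: reward only at the last step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]
adv_mc, tgt_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)
adv_td, tgt_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)
print(adv_mc, tgt_mc)
print(adv_td, tgt_td)
```

With `lam=1.0` and `gamma=1.0` every value target comes out at about 1.0, the full return of the episode, which shows the terminal reward reaching the earlier tokens; with `lam=0.0` each advantage is just the one-step TD error.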

2.1.7 Step 7: Computing the PPO ratio

Object: None

Input: (1) the current policy probability $\pi_\theta(a_t|s_t)$ (recomputed with the Actor parameters currently being updated)

(2) the old policy probability $\pi_{\text{old}}(a_t|s_t)$ (the policy that sampled the trajectory in Step 1)

Operation: compute the ratio
\begin{equation}r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\mathrm{old}}(a_t|s_t)}\end{equation}
Output: the ratio sequence $r_t(\theta)$, used in the PPO clip loss

r0(θ), r1(θ), ..., rT(θ)

2.1.8 Step 8: Actor policy update (PPO loss)

Object: Actor Model

Input: (1) the advantage sequence $A_t^{GAE}$ from Step 6

(2) the ratio sequence $r_t(\theta)$ from Step 7

Operation: minimize the PPO clip loss
\begin{equation}L_{\mathrm{policy}}=-\mathbb{E}_t\left[\min\left(r_t(\theta)A_t,\;\mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]\end{equation}
Output: the gradient for the Actor's parameters → after the update, a new Actor model (the new policy)

2.1.9 Step 9: Critic update (value loss)

Object: Critic Model

Input: (1) the per-state value predictions $V_\phi(s_t)$ from Step 5

(2) the advantage sequence $A_t^{GAE}$ from Step 6

Operation: (1) construct $V_t^{\mathrm{target}}$
\begin{equation}V_t^{\mathrm{target}}=V_\phi(s_t)+A_t^{GAE}\end{equation}
(2) minimize the loss
\begin{equation}L_{\mathrm{value}}=\mathbb{E}_t\left[(V_\phi(s_t)-V_t^{\mathrm{target}})^2\right]\end{equation}
Output: the gradient for the Critic's parameters → after the update, a new Critic model
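
One detail of this step is easy to miss: the target is built from the values recorded at sampling time and then held fixed, while the loss is evaluated on fresh Critic predictions. The sketch below illustrates this with plain lists; the function name `value_loss` and the numbers are hypothetical.

```python
def value_loss(new_values, old_values, advantages):
    """MSE between fresh Critic predictions and fixed targets V_old + A."""
    targets = [v + a for v, a in zip(old_values, advantages)]  # treated as constants
    sq_err = [(nv - tg) ** 2 for nv, tg in zip(new_values, targets)]
    return sum(sq_err) / len(sq_err)

old_values = [0.5, 0.6, 0.7]  # V(s_t) recorded during the rollout
advantages = [0.5, 0.4, 0.3]  # GAE advantages for the same steps

loss_before = value_loss(old_values, old_values, advantages)      # Critic not yet updated
loss_after = value_loss([1.0, 1.0, 1.0], old_values, advantages)  # Critic matches the targets
print(loss_before, loss_after)
```

Before the first update the loss is simply the mean squared advantage; it drops to zero once the Critic's predictions match the targets.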

2.1.10 Step 10: LM loss (optional)

Object: Actor Model

Input: general pretraining data $x'$

Operation: add an LM loss so that the policy does not forget its pretrained behavior
\begin{equation}L_{\mathrm{LM}}=-\mathbb{E}_{x^{\prime}}[\log\pi_\theta(x^{\prime})]\end{equation}
Output: added to the Actor's total loss (optional)

2.2 RLHF-PPO Training Pseudocode

# Pseudocode: the model classes and helpers (SFT_pretrained_model, RewardModel,
# ValueModel, ExperienceBuffer, compute_GAE, exp, clip, minimum, mean, mse, ...)
# are placeholders, not a specific library's API.
from copy import deepcopy

# ---------------------------------------------
# 1) Initialize the models
# ---------------------------------------------
# Actor and Reference both start from the SFT model
actor = SFT_pretrained_model()    # trained
reference = deepcopy(actor)       # frozen, never trained

# Reward Model (trained beforehand on preference data with pairwise comparisons)
reward_model = RewardModel()
train_reward_model(reward_model, preference_data)

# Critic
critic = ValueModel()
critic.initialize_like(actor)     # the value network can share part of its weights with the Reward Model/Actor

# PPO hyperparameters
beta = 0.1        # KL penalty coefficient
gamma = 1.0       # discount factor
lam = 0.95        # GAE parameter
epsilon = 0.2     # PPO clip range
lm_coef = 0.02    # (optional) weight of the LM retention loss

buffer = ExperienceBuffer()

# ---------------------------------------------
# 2) Training loop
# ---------------------------------------------
for epoch in range(num_epochs):

    # -----------------------------------------
    # Step A) Sample trajectories
    # -----------------------------------------
    buffer.clear()
    for _ in range(rollouts_per_epoch):
        prompt = sample_prompt(train_data)

        # The Actor generates the reply y = [a0, a1, ..., aT]
        trajectory = actor.generate(prompt)
        # trajectory holds: states [s0,...,sT], actions [a0,...,aT], log_probs

        # -------------------------------------
        # Step B) Compute the reward sequence
        # -------------------------------------
        # 1) Overall Reward Model score r(x, y), a scalar
        r_rm = reward_model.score(prompt, trajectory.reply)

        # 2) Per-step KL penalty
        kl_rewards = []
        for t, (s_t, a_t) in enumerate(trajectory.steps):
            logp_actor = actor.log_prob(a_t, s_t)
            logp_ref = reference.log_prob(a_t, s_t)
            kl = logp_actor - logp_ref
            kl_rewards.append(-beta * kl)

        # 3) Immediate rewards: the Reward Model score is added at the last step only
        rewards = list(kl_rewards)
        rewards[-1] += r_rm

        # -------------------------------------
        # Step C) The Critic estimates state values
        # -------------------------------------
        values = [critic.predict(s) for s in trajectory.states]

        # Store in the experience buffer (log_probs become the "old" log-probs)
        buffer.store(trajectory.states, trajectory.actions,
                     trajectory.log_probs, rewards, values)

    # -----------------------------------------
    # Step D) Compute advantages and value targets with GAE
    # -----------------------------------------
    for traj in buffer:
        advantages, value_targets = compute_GAE(traj.rewards, traj.values, gamma, lam)
        traj.advantages = advantages
        traj.value_targets = value_targets

    # -----------------------------------------
    # Step E) PPO update of Actor & Critic
    # -----------------------------------------
    for _ in range(update_epochs):
        for batch in buffer.minibatches():
            # 1) Policy loss (PPO with clip)
            new_log_probs = actor.log_probs(batch.states, batch.actions)
            ratio = exp(new_log_probs - batch.old_log_probs)
            surrogate1 = ratio * batch.advantages
            surrogate2 = clip(ratio, 1 - epsilon, 1 + epsilon) * batch.advantages
            ppo_loss = -mean(minimum(surrogate1, surrogate2))   # element-wise minimum

            # (optional) LM loss to retain general language ability
            lm_loss = 0
            if use_lm_loss:
                lm_loss = compute_lm_loss(actor, lm_train_data)

            # 2) Value loss
            new_values = critic.predict(batch.states)
            value_loss = mse(new_values, batch.value_targets)

            # Total loss
            total_loss = ppo_loss + value_loss + lm_coef * lm_loss

            # 3) Backpropagation (clear stale gradients first)
            actor_optimizer.zero_grad()
            critic_optimizer.zero_grad()
            total_loss.backward()
            actor_optimizer.step()
            critic_optimizer.step()

    # -----------------------------------------
    # Step F) Update the old policy
    # -----------------------------------------
    # In the next round the current Actor serves as the old policy
    # (its log-probs are recorded in the buffer at sampling time)
    actor_old = deepcopy(actor)

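2.3 Pros and Cons of RLHF-PPO

2.3.1 Pros

(1) Works well together with a Reward Model

PPO can handle a delayed, response-level reward (the Reward Model gives the whole response a single score) and turn it into a usable policy optimization signal, so that the model better matches human preferences.

(2) Stable updates and more reliable training

The core of PPO is clipping the policy update ratio, which keeps the policy from changing too much in a single update and makes training considerably more stable than traditional policy gradient methods such as SPG/REINFORCE.

(3) Relatively high sample efficiency

Compared with pure policy gradient methods (such as plain REINFORCE), PPO allows several small update steps on one batch of sampled trajectories, so the data is used more efficiently.

(4) Suitable for training large models

PPO has been the most widely adopted algorithm for industrial-scale RLHF fine-tuning of large language models. It was validated in OpenAI's InstructGPT/ChatGPT and has behaved stably in many academic and practical settings.

(5) Good controllability

Some PPO components (such as the KL penalty coefficient and the clip range) can be used to explicitly control how far the policy moves per update. This keeps the model from drifting too far from the original language model's behavior during training, which improves the controllability and safety of generation.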

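2.3.2 Cons

(1) The training process is complex and costly

A full RLHF + PPO setup needs four models (Actor/Policy, Reward Model, Critic, Reference). Training involves repeated sampling and computation, and the models depend on one another, so both the compute cost and the engineering complexity are high.

(2) Sensitive to hyperparameters

PPO has many hyperparameters (such as the clip range ε, the KL penalty coefficient β, the GAE λ, and the discount γ) that strongly affect training stability and final quality, so a lot of tuning is required.

(3) Reward Model bias gets amplified

If the Reward Model's scores are inaccurate or systematically biased (for example, unfair toward certain kinds of output), PPO will optimize toward that "bias", and the model may end up chasing the wrong objective.

(4) Compute and sample efficiency are still not ideal

Although more efficient than REINFORCE, this approach is still expensive compared with supervised fine-tuning (SFT) in both compute and annotation (human feedback), especially because it needs a large amount of preference data.

(5) Dependence on Reward Model quality

The Reward Model itself has to be trained on large-scale, high-quality preference annotations. Inconsistent labels, bias, or noise all directly reduce the effectiveness of PPO optimization.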

3. Preview

That wraps up RLHF-PPO. Next up: GRPO.