

I know the basics of Reinforcement Learning, but which terms do I need to understand to be able to read the arXiv PPO paper?

What is the roadmap for learning and using PPO?



To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".


From the original PPO paper:



1. The Clipped Surrogate Objective

The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.

For vanilla policy gradients (e.g., REINFORCE), which you should be familiar with or familiarize yourself with before reading this, the objective used to optimize the neural network looks like:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$


This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
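As a rough illustration of the vanilla objective, here is a minimal sketch in plain Python. The function name `policy_gradient_objective` and the sample numbers are made up for this example. In a real implementation the log probabilities would come from your network, and an autodiff framework would take the gradient.

```python
import math

def policy_gradient_objective(log_probs, advantages):
    """Sample mean of log pi(a_t | s_t) * A_hat_t over a batch (hypothetical helper)."""
    assert len(log_probs) == len(advantages)
    total = sum(lp * adv for lp, adv in zip(log_probs, advantages))
    return total / len(log_probs)

# Two made-up samples: one action with a positive advantage estimate,
# one with a negative advantage estimate.
log_probs = [math.log(0.5), math.log(0.25)]
advantages = [1.0, -2.0]

objective = policy_gradient_objective(log_probs, advantages)
```

Gradient ascent on this quantity raises the log probability of the first action and lowers that of the second, each weighted by the size of its advantage.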

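The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. PPO instead uses the probability ratio between the current policy and the old policy that collected the data, r(θ) = π_θ(a | s) / π_θold(a | s), and clips it:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; the relevant sections can be found in both the OpenAI baselines and anyrl-py implementations.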
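If it helps to see it as code, the clipped objective from the paper, min(r·Â, clip(r, 1-ε, 1+ε)·Â) with r the probability ratio between the current and the old policy, can be sketched per sample as below. This is only an illustration in plain Python, not the baselines or anyrl-py code. The function names are hypothetical, and ε = 0.2 is assumed as a typical value.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def l_clip(ratios, advantages, epsilon=0.2):
    """Batch objective: the sample mean of the per-sample clipped terms."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Good action (A > 0) that became much more probable: the term is capped.
capped = clipped_surrogate(1.5, 1.0)
# Bad action (A < 0) that became much more probable: the term is not capped.
uncapped = clipped_surrogate(1.5, -1.0)
```

With these inputs `capped` is held at (1 + ε)·A, while `uncapped` keeps the full r·A penalty.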



Next, let's see what effect the L^CLIP function creates. Here is a diagram from the paper that plots the value of the clipped objective when the advantage is positive and when it is negative:

On the left half of the diagram (A > 0), the action had an estimated positive effect on the outcome. On the right half of the diagram (A < 0), the action had an estimated negative effect on the outcome.


Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).


On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region similarly prevents us from making the action much less probable right after we already took a big step to make it less probable.


So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.
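To make the "flat line means zero gradient" point concrete, here is a small sketch that estimates the slope of the per-sample clipped term with respect to r using a central finite difference. The helper names and ε = 0.2 are assumptions for this example only.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/d(ratio)."""
    upper = clipped_surrogate(ratio + h, advantage, epsilon)
    lower = clipped_surrogate(ratio - h, advantage, epsilon)
    return (upper - lower) / (2.0 * h)

# Inside the trust region the term still responds to changes in r.
slope_inside = slope_wrt_ratio(1.0, 1.0)
# Good action already pushed past 1 + eps: flat, nothing more to gain.
slope_good_clipped = slope_wrt_ratio(1.5, 1.0)
# Bad action already pushed below 1 - eps: flat as well.
slope_bad_clipped = slope_wrt_ratio(0.5, -1.0)
```

The two clipped cases come out flat, so those samples stop contributing to the update.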

But why are we letting r(θ) grow indefinitely on the far right side of the diagram? This seems odd at first, but what would cause r(θ) to grow really large in this case? Growth of r(θ) in this region is caused by a gradient step that made our action a lot more probable, which turned out to make our policy worse. If that is the case, we want to be able to undo that gradient step. And it just so happens that the L^CLIP function allows this. The function is negative here, so the gradient will tell us to walk in the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)


These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
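A quick way to check the role of the min is to compare the objective with a hypothetical variant that only clips (no min) in the two "undo" regions. The sketch below does this with finite-difference slopes. The function names and ε = 0.2 are assumptions for this example, and the clip-only variant is not something the paper proposes.

```python
def clip(x, low, high):
    return max(low, min(x, high))

def with_min(ratio, advantage, epsilon=0.2):
    """PPO-style term: min(r * A, clip(r) * A)."""
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(ratio * advantage, clipped)

def clip_only(ratio, advantage, epsilon=0.2):
    """Hypothetical variant without the min: clip(r) * A."""
    return clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage

def slope(fn, ratio, advantage, h=1e-6):
    """Central finite-difference estimate of d(fn)/d(ratio)."""
    return (fn(ratio + h, advantage) - fn(ratio - h, advantage)) / (2.0 * h)

# Mistake 1: a bad action (A < 0) was made more probable (r = 1.5).
undo_bad = slope(with_min, 1.5, -1.0)     # nonzero: pushes r back down
stuck_bad = slope(clip_only, 1.5, -1.0)   # zero: mistake cannot be fixed

# Mistake 2: a good action (A > 0) was made less probable (r = 0.5).
undo_good = slope(with_min, 0.5, 1.0)     # nonzero: pushes r back up
stuck_good = slope(clip_only, 0.5, 1.0)   # zero: mistake cannot be fixed
```

In both mistake cases the min keeps a usable gradient, while the clip-only variant is flat.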


Here is a diagram summarizing this:


And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed up. One thing I didn't discuss is what is meant by the PPO objective forming a "lower bound", as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.


2. Multiple Epochs for Policy Updating

Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.

PPO runs the policy using N parallel actors, each collecting data, and then samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See the full algorithm below (approximate parameter values: K = 3-15 epochs, mini-batch size M = 64-4096, horizon T = 128-2048):


The parallel actors part was popularized by the A3C paper and has become a fairly standard way for collecting data.

The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take steps that are too large on the local samples, and this wrecks the policy. PPO, on the other hand, has a built-in mechanism to prevent too much of an update.
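The data reuse itself is simple to sketch: collect N·T samples once, then walk over them K times in shuffled mini-batches of size M. The generator below only produces that schedule and leaves out the gradient step. Its name, `minibatch_schedule`, and the small sizes are made up for this example and are not taken from any particular implementation.

```python
import random

def minibatch_schedule(num_samples, minibatch_size, num_epochs, rng):
    """Yield (epoch, sample_indices) pairs: K shuffled passes over one batch."""
    for epoch in range(num_epochs):
        order = list(range(num_samples))
        rng.shuffle(order)  # a fresh shuffle of the same collected data each epoch
        for start in range(0, num_samples, minibatch_size):
            yield epoch, order[start:start + minibatch_size]

# Assumed toy sizes: N = 4 actors, T = 8 steps each, M = 8, K = 3.
num_actors, horizon = 4, 8
num_samples = num_actors * horizon
schedule = list(
    minibatch_schedule(num_samples, minibatch_size=8, num_epochs=3, rng=random.Random(0))
)
num_updates = len(schedule)  # (N * T / M) * K gradient steps per iteration
```

Each `(epoch, indices)` pair would correspond to one gradient ascent step on the clipped objective, so every sample is used K times before new data is collected.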


For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
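This "nothing is clipped at first, then training stalls" behavior can be reproduced on a toy problem. The sketch below assumes a one-parameter, two-action policy with π(a=1) = sigmoid(θ), a made-up batch of two samples, ε = 0.2, and a learning rate of 0.5. All of these are illustrative assumptions, and the gradient is written out by hand instead of using an autodiff library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def action_prob(theta, action):
    """Two-action toy policy: pi(a=1) = sigmoid(theta), pi(a=0) = 1 - sigmoid(theta)."""
    p1 = sigmoid(theta)
    return p1 if action == 1 else 1.0 - p1

def run_epochs(batch, theta_old=0.0, epsilon=0.2, lr=0.5, num_epochs=6):
    theta = theta_old  # the optimization starts with pi equal to pi_old
    clip_fractions, thetas = [], [theta]
    for _ in range(num_epochs):
        p1 = sigmoid(theta)
        grad_sum, num_clipped = 0.0, 0
        for action, advantage in batch:
            old_prob = action_prob(theta_old, action)
            ratio = action_prob(theta, action) / old_prob
            clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
            if ratio * advantage <= clipped_ratio * advantage:
                # The min selects the unclipped term, so the gradient flows.
                dprob = p1 * (1.0 - p1) if action == 1 else -p1 * (1.0 - p1)
                grad_sum += advantage * dprob / old_prob
            else:
                # The min selects the clipped (constant) term: zero gradient.
                num_clipped += 1
        clip_fractions.append(num_clipped / len(batch))
        theta += lr * grad_sum / len(batch)
        thetas.append(theta)
    return clip_fractions, thetas

# Action 1 looked good, action 0 looked bad; both push theta upward.
batch = [(1, 1.0), (0, -1.0)]
clip_fractions, thetas = run_epochs(batch)
```

In this toy run the clipped fraction starts at zero, and once every sample is clipped θ stops moving until new samples are collected with the updated policy.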



And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.


[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.


