Are Q-learning and SARSA with greedy selection equivalent?


The difference between Q-learning and SARSA is that Q-learning updates the current state-action value toward the value of the best possible action in the next state, whereas SARSA updates it toward the value of the action actually taken in the next state.
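For reference, the two update rules can be written side by side. This uses the usual notation from Sutton and Barto, which the question itself does not spell out: step size \alpha, discount factor \gamma, reward r, next state s', and next action a'.

```latex
% SARSA: the target uses the action a' that is actually selected in s'
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right]

% Q-learning: the target uses the best action value available in s'
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```

The only difference between the two rules is the term multiplied by \gamma.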

If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?


Well, not quite. A key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm (it follows the policy it is learning), whereas Q-learning is an off-policy algorithm (it can follow any policy that fulfills some convergence requirements).

Notice that in the pseudocode of the two algorithms, SARSA observes s', chooses a', and only then updates the Q-function, whereas Q-learning updates the Q-function first and selects the next action in the following iteration. That action is derived from the updated Q-function and is not necessarily equal to the a' used to update Q.
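The ordering difference can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the book's pseudocode: the helper names (`greedy`, `sarsa_step`, `q_learning_step`), the single-state self-loop environment, and the numbers are all hypothetical. With purely greedy selection the two update targets coincide, but when the update changes the values of s' itself (here s' = s), the next action executed can differ.

```python
def greedy(q_row):
    """Return the index of the highest action value (ties go to the lowest index)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def sarsa_step(Q, s, a, r, s_next, alpha, gamma):
    """One SARSA step with greedy selection: choose a' first, then update Q."""
    a_next = greedy(Q[s_next])  # a' is fixed BEFORE the update
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
    return a_next  # SARSA will execute exactly this a'


def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step: update Q first, then choose the next action."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    return greedy(Q[s_next])  # chosen AFTER the update


# Hypothetical setup: one state (index 0) that loops back to itself, two actions.
Q_sarsa = [[1.0, 0.5]]
Q_qlearn = [[1.0, 0.5]]

# Both agents take action 0 in state 0, receive reward -10, and land in state 0 again.
next_sarsa = sarsa_step(Q_sarsa, 0, 0, -10.0, 0, alpha=0.5, gamma=0.9)
next_qlearn = q_learning_step(Q_qlearn, 0, 0, -10.0, 0, alpha=0.5, gamma=0.9)

print(Q_sarsa, next_sarsa)    # SARSA repeats action 0, chosen from the old values
print(Q_qlearn, next_qlearn)  # Q-learning switches to action 1, chosen from the new values
```

In this sketch both tables end up with the same values after the update, yet SARSA is committed to action 0 while Q-learning picks action 1 from the updated table.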


In any case, both algorithms require exploration (i.e., taking actions different from the greedy action) to converge.
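One common way to get such exploration is epsilon-greedy selection. The answer does not name a specific scheme, so this choice, the function name `epsilon_greedy`, and its parameters are assumptions made for illustration only.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore: any action, including the greedy one
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit: greedy action


# Hypothetical action values for a single state.
action_values = [0.1, 0.9, 0.3]
always_greedy = epsilon_greedy(action_values, epsilon=0.0)
sampled = [epsilon_greedy(action_values, epsilon=1.0, rng=random.Random(0)) for _ in range(5)]
```

With `epsilon=0.0` this reduces to the purely greedy policy from the question; any positive `epsilon` makes the agent occasionally take non-greedy actions.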

The pseudocode of SARSA and Q-learning was taken from Sutton and Barto's book, Reinforcement Learning: An Introduction (HTML version).

