the difference between on and off policy in RL

简单说一下，首先这个on-policy和off-policy是指的我们每一步的策略，也就是我们的探索的目标值是从什么地方来的。

首先看看Google上面有人对这个的定义：

This is the definition: If the algorithm estimates the value function of the policy generating the data, the method is called on-policy.
Otherwise it is called off-policy.
- Csaba

图片1.png

这个就可以看出来差别了

然后我们举两个具体的例子，一个是q-learning，还有一个是sarsa

图片2.png

Sarsa是标准的on-policy算法，也就是图中的那个目标Q(s’,a’)也就是我们产生当前这个动作的东西，而q-learnig是一个贪婪算法，也就是每一步取最大，这样其实就是和我们的产生的这个动作的value不一样了，是独立的

回复列表：

By王炳宁 on Dec. 19, 2016 | 类别 ML

关于本站

子曰：“唯上知与下愚不移。”

the difference between on and off policy in RL

留下您的评论

回复列表：