Deep Reinforcement Learning with Double Q-learning: DDQN 简约不简单

论文链接：

在传统强化学习领域里面，学者们已经认识到了Q-learning 存在overestimate的问题。overestimation 会损害performance，因为overestimate很可能是不均匀的.造成overestimation的原因多种多样，根本原因还是我们不知道action value的真实值、

DQN的参数更新公式

\[ \theta_{t+1} = \theta_t + \alpha (Y_t - Q(S_t,A_t;\theta_t) )\nabla_{\theta_t} Q(S_t,A_t;\theta_t)\]

从公式上看\(Y_t^{DQN} = R_{t+1} + \gamma\underset{a}{max} Q(S_{t+1},a;\theta_t^{-})\). 我们既用了estimator \(Q\)找到能让value 最大的action \(a\)，并且继续使用这个estimator \(Q\)估计这个action value。有种权利不受约束自己监管自己的感觉。更好的方式是一个estimator找到具有最优value 的\(a\),另一个estimator去估计这个\(a\)对应的value。这就是action selection 和action evaluation

传统的 double Q-learning 提出的解决方式很简单就是使用了两个estimator 随机交替更新。

我们可以把double Q-learning 的公式写作下面的公式，\(\theta,\theta'\)分别对应两个estimator 的参数.我们交替更新\(\theta,\theta'\)

\[Y_t^{Double_Q} = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{max}Q(S_{t+1},a;\theta_t);\theta_t^{'})\]

在DQN 中我们天然的就有两个estimator 一个是target policy 对应estimator,参数是\(\theta^{-}\),一个是 behavior policy 对应的estimator,参数是\(\theta\)。（不需要交替更新）

\[Y_t^{Double_DQN} = R_{t+1} + \gamma Q(S_{t+1},\underset{a}{max}Q(S_{t+1},a;\theta_t);\theta_t^{-})\]

代码

代码参考

def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state      = Variable(torch.FloatTensor(np.float32(state)))
    next_state = Variable(torch.FloatTensor(np.float32(next_state)))
    action     = Variable(torch.LongTensor(action))
    reward     = Variable(torch.FloatTensor(reward))
    done       = Variable(torch.FloatTensor(done))

    q_values      = current_model(state)
    next_q_values = current_model(next_state)
    next_q_state_values = target_model(next_state)

    q_value       = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_state_values.gather(1, torch.max(next_q_values, 1)[1].unsqueeze(1)).squeeze(1)
    expected_q_value = reward + gamma * next_q_value * (1 - done) # target

    loss = (q_value - Variable(expected_q_value.data)).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss