This article describes how to deal with Q-Learning values that grow far too high. It should be a useful reference for anyone running into the same problem; let's work through it below.

Problem Description



I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.

Here's how I implemented the solution to an m,n,k-game environment: at each given time t, the agent holds the last state-action (s, a) and the acquired reward for it; the agent selects a move a' based on an epsilon-greedy policy, calculates the reward r, and then proceeds to update the value of Q(s, a) for time t-1:

func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}

Note:

    agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move); I use that in place of the state-action tuple, but I'm not quite sure that's the right approach
    agent.prevScore holds the reward for the previous state-action
    The reward argument represents the reward for the current step's state-action (Qmax)

With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8, the agent fails to reach 100K episodes because of state-action value overflow. I'm using Golang's float64 (standard IEEE 754-1985 double-precision floating point), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value, I'd say!
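
Just to illustrate the failure mode, here is a small Go sketch (not from the question) of how float64 saturates to +Inf once a value passes math.MaxFloat64:

package main

import (
    "fmt"
    "math"
)

func main() {
    v := math.MaxFloat64 // ≈ 1.80×10^308, the largest finite float64
    fmt.Println(v * 2)              // prints +Inf: the product is no longer representable
    fmt.Println(math.IsInf(v*2, 1)) // true
}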

Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08 which got through 2M episodes (1M games with itself):

Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000

The reward function returns:

    Agent won: 1
    Agent lost: -1
    Draw: 0
    Game continues: 0.5
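
For reference, such a reward function might look like the sketch below; the Outcome type, its constants, and the reward helper are hypothetical names used only to make the example compile, not the asker's actual code:

// Outcome is a hypothetical result type from this agent's point of view.
type Outcome int

const (
    InProgress Outcome = iota
    Won
    Lost
    Draw
)

// reward maps a game outcome to the reward values listed above.
func reward(o Outcome) float64 {
    switch o {
    case Won:
        return 1
    case Lost:
        return -1
    case Draw:
        return 0
    default: // game continues
        return 0.5
    }
}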

But you can see that the minimum value is zero, and the maximum value is too high.

It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, most of the time the result is a draw (it even wins if I play carelessly), whereas with the standard Q-Learning method I can't even let it win!

agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))

Any ideas on how to fix this? Is that kind of state-action value normal in Q-Learning?


Update: After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of the previous step (equal to oldVal) instead of the reward of the previous step (in this example, -1, 0, 0.5 or 1).
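
In code, the only change is what gets stored in prevScore after each move. Below is a minimal sketch of the corrected bookkeeping; the simplified string state key and the observe helper are assumptions made to keep the example self-contained, and only the field names come from the question:

type RLAgent struct {
    id             int
    values         map[string]float64 // Q-values keyed by marshalled state
    prevState      string             // marshalled state of the last move (simplified to a string here)
    prevScore      float64            // raw reward of the previous step: -1, 0, 0.5 or 1
    LearningRate   float64
    DiscountFactor float64
}

// observe records the outcome of the move the agent just made.
// The fix: store the raw reward here, not the Q-value of the previous state.
func (agent *RLAgent) observe(state string, reward float64) {
    agent.prevState = state
    agent.prevScore = reward
}

// learn applies the same update as before; with prevScore now bounded,
// the Q-values stay in a sane range instead of compounding every step.
func (agent *RLAgent) learn(qMaxNext float64) {
    oldVal := agent.values[agent.prevState]
    agent.values[agent.prevState] = oldVal + agent.LearningRate*
        (agent.prevScore+agent.DiscountFactor*qMaxNext-oldVal)
}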

After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:

Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718

and out of 5 games with the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.

Solution

If I've understood well, in your Q-learning update rule, you are using the current reward and the previous reward. However, the Q-learning rule only uses one reward (x are states and u are actions):
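
    Q(x, u) ← Q(x, u) + α · [ r + γ · max_u' Q(x', u') - Q(x, u) ]

where α is the learning rate, γ the discount factor, and r the single reward received for taking action u in state x (the standard tabular Q-learning update).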

On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So you are probably misunderstanding the Q-learning algorithm.
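
To make the contrast concrete, here is the standard update rule as a small Go sketch; qUpdate and its parameter names are illustrative, not code from the question:

// qUpdate applies the tabular Q-learning rule to one (state, action) entry:
// exactly one reward r appears in the target, alongside the discounted
// maximum Q-value of the next state.
func qUpdate(q map[string]float64, stateAction string, r, qMaxNext, alpha, gamma float64) {
    old := q[stateAction]
    q[stateAction] = old + alpha*(r+gamma*qMaxNext-old)
}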

That wraps up this article on Q-Learning values being too high. We hope the answer recommended above is helpful, and thanks for your continued support!
