Question
Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. To understand it better, I'm trying to remake the TensorFlow code of this FrozenLake example in Keras.
My code:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99
e = 0.1
# create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  # random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            # Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break

print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, it doesn't work well: Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better: Percent of succesful episodes: 0.352%
plt.plot(rList)
What am I doing wrong?
Answer
Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is starting with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to only decrease the epsilon value if you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep exploring the map randomly until it starts to converge on a good route, and then it becomes a good idea to reduce the epsilon parameter.
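To make that concrete, here is a minimal sketch (my own illustration, not code from the answer or its gist) of how those tweaks could be applied to the question's loop: use_bias=False on both Dense layers, a larger starting epsilon, and an epsilon decay that only fires when an episode ends with a positive reward, i.e. when the goal was actually reached.

# Sketch only: assumes env, num_episodes and custom_loss from the question's code.
model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform',
                input_shape=(16,), use_bias=False))   # no bias, as suggested in the comments
model.add(Dense(4, activation='softmax', kernel_initializer='uniform', use_bias=False))
model.compile(loss=custom_loss, optimizer='sgd')

y = .99
e = 0.5   # start with more exploration
for i in range(num_episodes):
    current_state = env.reset()
    d = False
    j = 0
    while j < 99:
        j += 1
        Qs = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.argmax(Qs)
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        new_state, reward, d, _ = env.step(action)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        targetQ = Qs
        targetQ[0, action] = reward + y * np.max(new_Qs)
        model.fit(np.identity(16)[current_state:current_state+1], targetQ,
                  verbose=0, batch_size=1)
        current_state = new_state
        if d:
            if reward > 0:                # decay epsilon only after reaching the goal
                e = 1. / ((i / 50) + 10)
            break

The success test (reward > 0) and the decay schedule shown here are just one possible choice; the point is that epsilon stays high until the agent has actually found the goal at least once.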
I've actually implemented a similar model in Keras in this gist, using Convolutional layers instead of Dense layers, and managed to get it to work in under 2000 episodes. Might be of some help to others :)
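The gist itself isn't reproduced here; purely as an illustration of the idea (the layer sizes and state encoding below are my assumptions, not taken from the gist), a convolutional variant might treat the 4x4 FrozenLake board as a one-hot "image" instead of a flat one-hot vector:

import numpy as np
from keras.layers import Conv2D, Flatten, Dense
from keras.models import Sequential

# Hypothetical conv-based Q-network: input is the 4x4 board with a single channel.
conv_model = Sequential()
conv_model.add(Conv2D(16, (2, 2), activation='relu', input_shape=(4, 4, 1)))
conv_model.add(Flatten())
conv_model.add(Dense(4, activation='linear'))   # one Q-value per action
conv_model.compile(loss='mse', optimizer='adam')

def one_hot_board(state):
    # Encode a discrete state (0..15) as a 4x4x1 grid with a 1 at the agent's cell.
    board = np.zeros((1, 4, 4, 1))
    board[0, state // 4, state % 4, 0] = 1.0
    return board

# Usage: conv_model.predict(one_hot_board(current_state), batch_size=1)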