I am using RLlib's PPOTrainer with a custom environment. I call trainer.train() twice: the first call completes successfully, but the second one crashes with an error:



Here is my code:

Main file

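import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

# TreeObsPreprocessor and MyEnv are defined in my own modules.
ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)
ray.init()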

trainer = PPOTrainer(env=MyEnv, config={
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    }
})

for i in range(2):
    print(trainer.train())

My environment file

import gym
import numpy as np
from ray import rllib

class MyEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.n_agents = 2

        self.env = *CREATES ENV*
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 12))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)

        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for i_agent in range(len(self.env.agents)):
            if i_agent not in self.agents_done:
                o[i_agent] = obs[i_agent]
                r[i_agent] = rewards[i_agent]
                d[i_agent] = dones[i_agent]
                i[i_agent] = infos[i_agent]
        d['__all__'] = dones['__all__']

        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)

        return o, r, d, i

I don't know what the problem is. Any suggestions?
What does this error mean?

Best answer

This comment really helped me:



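In my case, I had to modify my observations: the agent was unable to learn a policy, and at some point during training (at a random timestep) the action it returned was NaN.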
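One way to catch this kind of problem early is to check the observations before they are returned from the environment. The following is only a minimal sketch, assuming the observations are per-agent dicts of numeric arrays as in the step method of the question. The helper names find_non_finite and clean_observations and the clip bound are hypothetical; they are not part of RLlib, and the right fix for a given environment may be different (for example, normalizing the observations instead).

```python
import numpy as np


def find_non_finite(obs_dict):
    """Return the agent ids whose observation contains NaN or +/-inf."""
    bad_agents = []
    for agent_id, obs in obs_dict.items():
        arr = np.asarray(obs, dtype=np.float64)
        if not np.all(np.isfinite(arr)):
            bad_agents.append(agent_id)
    return bad_agents


def clean_observations(obs_dict, clip=1e6):
    """Replace NaN with 0 and clip every value to [-clip, clip].

    The clip bound is an arbitrary assumption for illustration.
    """
    cleaned = {}
    for agent_id, obs in obs_dict.items():
        arr = np.asarray(obs, dtype=np.float64)
        arr = np.where(np.isnan(arr), 0.0, arr)  # NaN -> 0
        arr = np.clip(arr, -clip, clip)          # +/-inf -> +/-clip
        cleaned[agent_id] = arr.astype(np.float32)
    return cleaned
```

Calling find_non_finite(o) just before "return o, r, d, i" in step (and on the result of reset) shows at which timestep and for which agent the observations go bad, which is usually more informative than the TensorFlow error raised later inside the trainer.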

A similar question, "python - RLLib - Tensorflow - InvalidArgumentError: Received a label value of N which is outside the valid range of [0, N)", can be found on Stack Overflow: https://stackoverflow.com/questions/59272939/
