Implementation of OpenAI's Hide and Seek — Part 1

Dohyeong Kim
4 min read · Mar 12, 2023



Collaboration is an essential part of multiplayer games such as MOBAs and soccer games. In reinforcement learning, the transition probabilities should be stationary for training to work well. For this reason, a famous early study by OpenAI applied an additional method to deal with the fluctuating transition probabilities.

However, recent DeepMind research on MARL shows that multi-agent games can also converge to a Nash equilibrium despite unstable transition probabilities.

In theory such multi-agent systems may continue to explore forever. In practice they often reach an equilibrium point and stop exploring. Efforts to understand how these systems work create tension with the dominant “single-agent paradigm” of artificial intelligence and cognitive science. In short, all the representations that matter are no longer inside the agent’s “head”, but rather are distributed in some fashion between the agent, the population, the environment, and the training protocol itself.

Hide and Seek Environment

In this series, we will use OpenAI's Multi-Agent Emergence Environments, announced in 2019. This environment is more difficult than the organization's earlier MARL environments because it has higher state and action dimensions.

Trained agents in the multi-agent emergence environments

OpenAI provides the weights of the trained model, which is shown in the video above. You can see that the agents on the home team collaborate to prevent the agents on the opposing team from breaking in.

If you take a deeper look into the related paper, you will see that the agents do not learn such cooperative behavior all at once; it emerges through several curriculum phases.

Autocurricula of multi-agent learning

The amazing fact about this experiment is that the agents learn these behaviors automatically by sharing the reward within the same team, even though the behaviors are not hard-coded anywhere.
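To make the idea concrete, here is a minimal sketch of team-level reward sharing. This is not OpenAI's actual code; the share_team_reward helper and its signature are hypothetical, but it shows the mechanism: every agent receives the total reward of its team, so behavior that helps a teammate is reinforced without being hard-coded.

```python
import numpy as np

def share_team_reward(rewards, team_ids):
    """Give every agent the summed reward of its team.

    rewards: per-agent rewards; team_ids: team index per agent.
    (Hypothetical helper, for illustration only.)
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    team_ids = np.asarray(team_ids)
    shared = np.empty_like(rewards)
    for team in np.unique(team_ids):
        mask = team_ids == team
        # Every teammate gets the team total, so helping a teammate pays off.
        shared[mask] = rewards[mask].sum()
    return shared

# Example: two hiders (team 0) and two seekers (team 1).
print(share_team_reward([1.0, 0.0, -1.0, -1.0], [0, 0, 1, 1]))
# -> [ 1.  1. -2. -2.]
```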

Toy Environment Test

Before trying a complex environment, it is a good idea to test the model and training method in a simple environment first.

First, install TensorFlow 2 and https://github.com/openai/multi-agent-emergence-environments to run the code below.
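Note that the Simple Tag toy environment used below comes from the MPE (multi-agent particle environments) family rather than the emergence environments themselves. As a quick sanity check that the environment loads and steps, here is a minimal random-action loop; it uses the maintained PettingZoo port (pip install "pettingzoo[mpe]") instead of the original OpenAI repo, so the exact import path is an assumption:

```python
# Random-action sanity check for Simple Tag (PettingZoo port of the MPE envs).
from pettingzoo.mpe import simple_tag_v3

env = simple_tag_v3.parallel_env(max_cycles=25)
observations, infos = env.reset(seed=0)

while env.agents:
    # One random action per live agent, just to verify the step loop runs.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```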

Multi-agent training in the toy environment

The MLP network, the DDPG algorithm, and the Simple Tag environment are used for this experiment. The agents are trained for a total of 30,000 episodes and evaluated every 10,000 episodes. After several tests, setting the gradient clipping threshold to 0.5 and changing the discount factor from 0.99 to 0.95 gave a faster training speed in the multi-agent environment than in the single-agent case.
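The snippet below sketches how those two settings could look in TensorFlow 2. It is not the exact training code of this experiment (the model and batch-tensor names are placeholders); it only shows a DDPG-style critic update with the discount factor set to 0.95 and global-norm gradient clipping at 0.5:

```python
import tensorflow as tf

GAMMA = 0.95      # discount factor lowered from 0.99
CLIP_NORM = 0.5   # gradient clipping threshold

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def critic_update(critic, target_critic, target_actor,
                  obs, actions, rewards, next_obs, dones):
    # Bootstrapped TD target computed with the target networks.
    next_actions = target_actor(next_obs)
    target_q = rewards + GAMMA * (1.0 - dones) * tf.squeeze(
        target_critic([next_obs, next_actions]), axis=-1)

    with tf.GradientTape() as tape:
        q = tf.squeeze(critic([obs, actions]), axis=-1)
        loss = tf.reduce_mean(tf.square(target_q - q))

    grads = tape.gradient(loss, critic.trainable_variables)
    # Clip the global gradient norm to 0.5, as in the experiment above.
    grads, _ = tf.clip_by_global_norm(grads, CLIP_NORM)
    optimizer.apply_gradients(zip(grads, critic.trainable_variables))
    return loss
```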

After training for 10,000 episodes, the predator agents have only learned to chase the prey agent. Collaborative behavior has not yet emerged.

Evaluation after 10,000 episodes of training

After training for 20,000 episodes, one agent starts to move in a different direction from the other two. The agents have finally realized that this behavior obtains more reward.

Evaluation after 20,000 episodes of training

After training for over 30,000 episodes, the predator agents begin to drive the prey agent into a corner, which yields the highest reward. As the previous steps show, this kind of collaborative behavior can only be learned through such a curriculum.

Evaluation after 30,000 episodes of training

This experiment confirms that multi-agent reinforcement learning can converge just like the single-agent case, without adding anything extra to the RL algorithm.

Conclusion

In the next post, we will look at the results of applying the method used here to the Hide and Seek environment. Because that environment is much more complicated than the one in this post, the current model and training method will need some changes and additions to train properly.


Written by Dohyeong Kim

I am a deep learning researcher. Currently, I am trying to build AI agents for various games such as MOBA, RTS, and soccer games.
