Playing MOBA game using Deep Reinforcement Learning — part 3

Dohyeong Kim
3 min read · Dec 18, 2021
image from wallpapercave

In the previous two posts, we prepared the basic components needed to train an agent for a large-scale MOBA game with Deep Reinforcement Learning. In this post, we will see how to combine the Dota2 environment code with Seed RL.

You can find the code for this post at https://github.com/kimbring2/MOBA_RL/tree/main/dota2.

Learning-based code for Dota2

The major difference between the Derk and Dota2 networks is the number of action heads. In the former case, only one head is used because the game is simple. The latter, however, needs six heads for action selection. Therefore, the following changes are required when creating the agent network in the Seed RL code to reflect this difference, as sketched below.

Dota2 Agent Network
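The snippet below is a minimal sketch of what "multiple action heads" means in practice. The head names, sizes, and layer shapes are illustrative assumptions, not the exact ones used in the repository; the real agent network lives in the Seed RL code linked above.

```python
import tensorflow as tf

class Dota2Agent(tf.keras.Model):
  """Sketch of an agent network with several action heads (hypothetical sizes)."""

  def __init__(self, core_units=256):
    super(Dota2Agent, self).__init__()
    self._core = tf.keras.layers.Dense(core_units, activation='relu')
    # One logits layer per action argument instead of the single head used for Derk.
    self._heads = {
        'enum': tf.keras.layers.Dense(4),          # which action type
        'x': tf.keras.layers.Dense(9),             # move offset x
        'y': tf.keras.layers.Dense(9),             # move offset y
        'target_unit': tf.keras.layers.Dense(64),  # attack target
        'ability': tf.keras.layers.Dense(3),       # ability slot
        'item': tf.keras.layers.Dense(6),          # item slot
    }
    self._baseline = tf.keras.layers.Dense(1)      # value estimate for the critic

  def call(self, observation):
    hidden = self._core(observation)
    logits = {name: head(hidden) for name, head in self._heads.items()}
    baseline = tf.squeeze(self._baseline(hidden), axis=-1)
    return logits, baseline
```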

As in the Derk case, the Dota2 agent sends an observation from the actor to the learner to select an action. The difference is that the action consists of multiple arguments. Hence, we have to combine them into the final action before sending it to the environment step.

Dota2 Action Selection from RL
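As a rough illustration of combining the arguments, the sketch below samples one value from each head of the hypothetical network above and packs them into a single action dictionary for the environment step.

```python
import tensorflow as tf

def sample_action(logits):
  """Sample one value per action head and pack them into one action dict.

  `logits` is the per-head dictionary returned by the network sketch above;
  the key names are assumptions for illustration.
  """
  action = {}
  for name, head_logits in logits.items():
    # Categorical sampling from each head's policy distribution.
    sampled = tf.random.categorical(head_logits, num_samples=1)
    action[name] = tf.squeeze(sampled, axis=-1)
  return action
```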

During training, the Actor-Critic DRL loss is calculated from the log probability of the selected action under the policy distribution. Therefore, you need to calculate this loss for the added action heads as well.

Dota2 Loss Calculation
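The following sketch shows the idea of summing the policy-gradient loss over every action head. It mirrors the reasoning in the post rather than the exact Seed RL implementation; the dictionary keys and the `advantages` argument are assumptions.

```python
import tensorflow as tf

def policy_loss(logits, actions, advantages):
  """Sum the policy-gradient loss over all action heads (illustrative)."""
  total_loss = 0.0
  for name in logits:
    # Negative log probability of the action that was actually taken.
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions[name], logits=logits[name])
    # Standard policy-gradient term, weighted by the advantage estimate.
    total_loss += tf.reduce_mean(neg_log_prob * tf.stop_gradient(advantages))
  return total_loss
```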

Rule-based code for Dota2

In Dota2, unlike Derk, the agent needs to obtain items and abilities during the game. Because this part is a little difficult to implement with DRL, I save the item and ability names in lists and use them in order whenever the hero's gold and level meet certain conditions, as sketched after the figures below.

Item buying route of each hero
Ability learning route of each hero
Dota2 Action Selection from Rule
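A minimal sketch of this rule-based logic is shown below. The item and ability names, costs, and conditions are placeholder assumptions; the real routes in the repository are longer and hero specific.

```python
# Illustrative routes; the actual lists in the repository differ per hero.
ITEM_ROUTE = ['item_tango', 'item_branches', 'item_boots', 'item_magic_stick']
ITEM_COST = {'item_tango': 90, 'item_branches': 50,
             'item_boots': 500, 'item_magic_stick': 200}
ABILITY_ROUTE = ['nevermore_shadowraze1', 'nevermore_necromastery',
                 'nevermore_shadowraze1', 'nevermore_necromastery']

def rule_based_action(hero_gold, hero_level, items_bought, abilities_learned):
  """Return a purchase or level-up action when its condition is met,
  otherwise None so the RL policy keeps control."""
  if items_bought < len(ITEM_ROUTE):
    next_item = ITEM_ROUTE[items_bought]
    if hero_gold >= ITEM_COST[next_item]:
      return ('purchase', next_item)
  if abilities_learned < len(ABILITY_ROUTE) and hero_level > abilities_learned:
    return ('train', ABILITY_ROUTE[abilities_learned])
  return None
```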

Reward Setting

One of the most important things in DRL is reward setting. The agent will learn completely different behavior depending on the reward we define. Unlike Derk, where the reward is calculated automatically inside the environment, here we need to calculate the reward from the raw observation.

Reward Table

The Reinforcement Learning reward is basically aimed at acquiring XP. Additionally, rewards for dealing damage to enemy units and restoring health to friendly units are weighted differently from the Last Hit reward. Finally, a large reward is given according to the result of the game so the agent learns a long-term strategy over the whole match.

Reward example for ShadowFiend

The most basic reward components we can use are XP, HP, and Last Hit, which counts the number of times the hero lands the final blow on a creep and earns gold.
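A hedged sketch of such a shaped reward, computed from the change between two consecutive observations, is shown below. The field names and weights are assumptions for illustration; the actual reward table in the repository may differ.

```python
def compute_reward(prev_obs, obs, weights=None):
  """Shaped reward from raw observation deltas (illustrative fields and weights)."""
  if weights is None:
    weights = {'xp': 0.002, 'hp': 0.5, 'last_hit': 0.5}
  xp_delta = obs['xp'] - prev_obs['xp']
  # Track HP as a fraction so damage and healing are comparable
  # across heroes with different maximum health.
  hp_delta = (obs['hp'] / obs['max_hp']) - (prev_obs['hp'] / prev_obs['max_hp'])
  last_hit_delta = obs['last_hits'] - prev_obs['last_hits']
  return (weights['xp'] * xp_delta
          + weights['hp'] * hp_delta
          + weights['last_hit'] * last_hit_delta)
```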

Training result of Dota2

If you train the agent with this basic reward set for two days, you can see that the rewards of the two agents increase to some point, as in the following graph.

Reward Graph of Basic Reward Set

It is not possible to fully confirm that training succeeded from the graph alone. Therefore, we need to launch the game client and check the agent's behavior visually, as in the video below.

Evaluation Result of Agent

We can see that the heroes move to the middle lane, where only creeps come, and fight to earn XP, even though we do not use any rule-based code for moving and attacking.

Conclusion

In this post, we looked at what needs to be modified and added to train Dota2 with DRL, as we did for Derk. As you can see, a little rule-based logic is needed for item and ability management in Dota2, while moving and attacking can be handled by DRL. We also saw that the training time is considerably longer than for Derk.

