AlphaStar implementation series — Reinforcement Learning

Dohyeong Kim
5 min read · May 20, 2021


I made various attempts to implement the AlphaStar paper. However, I could not obtain any meaningful training results because of the huge size of the model. Therefore, I first implemented the earlier StarCraft 2 paper, StarCraft II: A New Challenge for Reinforcement Learning, for which code has already been written by other people.

I learned a lot from the GitHub repos of simonmeister and xhujoy about how to combine the losses of each action and how to process the complex observations of PySC2. Thanks again to them for their well-organized work.

All code is written in Python 3.6 and TensorFlow 2.

This post is part of a series on the AlphaStar implementation project.

Warm-Up

Before we tackle the complex PySC2 environment, let's solve a simple environment with the Actor-Critic Reinforcement Learning algorithm we are going to use. The official TensorFlow tutorial at https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic covers this case. I add extra code on top of it for the StarCraft 2 case, so becoming familiar with the tutorial will make the next sections easier to follow.
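For reference, the Actor-Critic model in that tutorial is a small network with a shared trunk and two heads, one for the action logits and one for the state value. A minimal sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers


class ActorCritic(tf.keras.Model):
    """Shared-trunk Actor-Critic network, as in the TensorFlow tutorial."""

    def __init__(self, num_actions: int, num_hidden_units: int):
        super().__init__()
        self.common = layers.Dense(num_hidden_units, activation="relu")
        self.actor = layers.Dense(num_actions)  # action logits
        self.critic = layers.Dense(1)           # state value

    def call(self, inputs: tf.Tensor):
        x = self.common(inputs)
        return self.actor(x), self.critic(x)


model = ActorCritic(num_actions=2, num_hidden_units=128)
# CartPole observations have 4 values; the batch dimension comes first.
logits, value = model(tf.random.normal([1, 4]))
```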

Differences in observations and actions

Environment specs of StarCraft 2 and CartPole

The above is a summary of the observations and actions of each environment. You can see that StarCraft 2 has far more observations and actions than CartPole.
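If you want to check the difference yourself, the snippet below (a quick sketch, assuming gym and pysc2 are installed) prints the relevant specs:

```python
import gym
from pysc2.lib import actions, features

# CartPole: a single 4-dimensional observation and one discrete action with 2 choices.
cartpole = gym.make("CartPole-v0")
print(cartpole.observation_space)
print(cartpole.action_space)

# PySC2: many feature layers and hundreds of parameterised action functions.
print(len(features.SCREEN_FEATURES))   # screen feature layers (unit_type, player_relative, ...)
print(len(features.MINIMAP_FEATURES))  # minimap feature layers
print(len(actions.FUNCTIONS))          # action functions, each with its own argument list
print(len(actions.TYPES))              # argument types (screen, minimap, queued, ...)
```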

Let’s take a look at the model definition part of PySC2.

Model of PySC2

Compared to the CartPole model, which has only 2 networks for observation and action, the PySC2 model has 6 networks for observations and 14 networks for actions. Furthermore, the CartPole model returns only 1 set of logits for the action. The PySC2 model, on the other hand, returns a total of 14 sets of logits for the action type and action arguments. When calculating the loss, we need to consider all of them.
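To make the contrast concrete, here is a trimmed-down sketch of such a multi-encoder, multi-head model. The layer sizes, the function count, and the two argument heads shown are illustrative; the real model has one encoder per observation and one head per argument type.

```python
import tensorflow as tf
from tensorflow.keras import layers

_NUM_FUNCTIONS = 573  # number of PySC2 action functions (assumption for this sketch)
_SCREEN_SIZE = 16     # illustrative screen resolution


class PySC2Model(tf.keras.Model):
    """Sketch: several observation encoders feeding one core, many action heads."""

    def __init__(self):
        super().__init__()
        # Observation encoders (screen, minimap, player, ...).
        self.screen_encoder = tf.keras.Sequential([
            layers.Conv2D(16, 5, padding="same", activation="relu"),
            layers.Conv2D(32, 3, padding="same", activation="relu")])
        self.minimap_encoder = tf.keras.Sequential([
            layers.Conv2D(16, 5, padding="same", activation="relu"),
            layers.Conv2D(32, 3, padding="same", activation="relu")])
        self.player_encoder = layers.Dense(32, activation="relu")
        self.flatten = layers.Flatten()
        self.core = layers.Dense(256, activation="relu")
        # One head for the action type, one head per argument, plus the value.
        self.fn_head = layers.Dense(_NUM_FUNCTIONS)             # action type logits
        self.screen_arg_head = layers.Dense(_SCREEN_SIZE ** 2)  # e.g. screen point argument
        self.queued_head = layers.Dense(2)                      # e.g. queued argument
        # ... the real model has one head per argument type (13 in total).
        self.value = layers.Dense(1)

    def call(self, screen, minimap, player):
        s = self.flatten(self.screen_encoder(screen))
        m = self.flatten(self.minimap_encoder(minimap))
        p = self.player_encoder(player)
        core = self.core(tf.concat([s, m, p], axis=-1))
        return (self.fn_head(core), self.screen_arg_head(core),
                self.queued_head(core), self.value(core))
```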

Unlike the CartPole step function, which returns only a single state, the reward, and the done flag, the step function for PySC2 returns 10 state components and 14 action components.

Step function of PySC2
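The exact keys depend on the PySC2 agent interface format; the sketch below, with illustrative field names, shows the kind of unpacking involved when turning one PySC2 TimeStep into the pieces the model consumes.

```python
def split_observation(timestep):
    """Sketch: unpack one PySC2 TimeStep into model inputs, reward, and done flag."""
    obs = timestep.observation
    state = {
        'feature_screen': obs['feature_screen'],        # spatial screen layers
        'feature_minimap': obs['feature_minimap'],      # spatial minimap layers
        'player': obs['player'],                        # scalar player statistics
        'available_actions': obs['available_actions'],  # mask of legal action types
        'game_loop': obs['game_loop'],
        # ... the full agent keeps around 10 such entries.
    }
    reward = timestep.reward
    done = timestep.last()
    return state, reward, done
```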

The run_episode function for CartPole can collect the states of an entire episode and train on them at once because they are small. However, the PySC2 version needs to split the episode into chunks of 8 steps because the state size is large, as sketched below.
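In this rough sketch of the chunked rollout collection, select_action and train_on_rollout are hypothetical helpers standing in for the real code, and split_observation is the sketch from above:

```python
UNROLL_LENGTH = 8  # train on 8-step chunks instead of whole episodes


def run_episode(env, model, initial_timestep):
    timestep = initial_timestep
    while not timestep.last():
        states, chosen_actions, rewards = [], [], []
        for _ in range(UNROLL_LENGTH):
            state, _, _ = split_observation(timestep)
            action = select_action(model, state)        # sample from the policy heads
            next_timestep = env.step([action])[0]
            states.append(state)
            chosen_actions.append(action)
            rewards.append(next_timestep.reward)
            timestep = next_timestep
            if timestep.last():
                break
        # One gradient update per 8-step chunk keeps memory usage manageable.
        train_on_rollout(model, states, chosen_actions, rewards)
```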

The compute_loss function for CartPole calculates the loss for only one set of action logits. In the case of PySC2, the loss is calculated for the logits of one action type and 13 action arguments. Additionally, the loss of any unused argument is masked to 0.
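A minimal sketch of such a masked, combined policy loss (function and argument names are illustrative, not the exact code from the repository):

```python
import tensorflow as tf

cce = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)


def compute_policy_loss(fn_logits, arg_logits_list, fn_ids, arg_ids_list,
                        arg_used_masks, advantages):
    """One cross-entropy term for the action type plus one per argument head."""
    # Action-type term, weighted by the advantage.
    loss = cce(fn_ids, fn_logits) * advantages
    # One term per argument head; `used` is 1 only when the chosen action type
    # actually takes that argument, so unused arguments contribute 0 loss.
    for arg_logits, arg_ids, used in zip(arg_logits_list, arg_ids_list, arg_used_masks):
        loss += cce(arg_ids, arg_logits) * used * advantages
    return tf.reduce_mean(loss)
```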

The function for calculating the return in A2C Reinforcement Learning is the same as the one used for CartPole. After much testing, I confirmed that normalizing the return is good for training stability.

Function for calculating the return for Reinforcement Learning
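Below is a sketch of that return calculation, closely following the get_expected_return function from the TensorFlow tutorial, including the normalization step:

```python
import tensorflow as tf

eps = 1e-8


def get_expected_return(rewards, gamma: float, standardize: bool = True) -> tf.Tensor:
    """Compute discounted returns per time step, optionally standardized."""
    rewards = tf.cast(rewards, dtype=tf.float32)
    n = tf.shape(rewards)[0]
    returns = tf.TensorArray(dtype=tf.float32, size=n)

    # Accumulate discounted sums from the end of the rollout to the beginning.
    rewards = rewards[::-1]
    discounted_sum = tf.constant(0.0)
    for i in tf.range(n):
        discounted_sum = rewards[i] + gamma * discounted_sum
        returns = returns.write(i, discounted_sum)
    returns = returns.stack()[::-1]

    if standardize:
        # Normalizing the returns was important for training stability.
        returns = (returns - tf.math.reduce_mean(returns)) / (
            tf.math.reduce_std(returns) + eps)
    return returns
```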

The train_step function for CartPole can use the gradient calculated from the loss directly to train the network without any problem. However, the PySC2 version should clip the gradient because it has many more networks than CartPole.

train_step function of PySC2
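A sketch of a train step with gradient clipping, assuming a hypothetical compute_total_loss that combines the policy and value terms; the clipping threshold is illustrative:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)


@tf.function
def train_step(model, batch):
    with tf.GradientTape() as tape:
        loss = compute_total_loss(model, batch)  # hypothetical: policy + value terms
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip by global norm so one large gradient from any of the many heads
    # cannot blow up the whole update.
    grads, _ = tf.clip_by_global_norm(grads, 10.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```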

You can see the effect of gradient clipping below.

Without gradient clipping
With gradient clipping

You can see the following gradient norm and reward graphs if you train a network on the MoveToBeacon environment, one of the PySC2 minigames, using the code at https://github.com/kimbring2/AlphaStar_Implementation/blob/master/run_reinforcement_learning.py.

Training history of the MoveToBeacon environment

In the case of StarCraft 2, it is important to preserve the spatial information of the screen and minimap because some actions have to select a point on them. Therefore, the original screen and minimap features are added to the concatenation of the encoded screen, minimap, and player features. It is similar to a normal Residual Network, except that a Conv network is used to match the channel numbers of the original and encoded features, since they are 3-dimensional arrays.

Network Architecture of FullyConv model
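A simplified sketch of this residual-style connection follows; the layer sizes and names are mine, not the exact FullyConv code, and the screen and minimap are assumed to be float tensors at the same resolution.

```python
import tensorflow as tf
from tensorflow.keras import layers


class FullyConvState(layers.Layer):
    """Concatenate encoded spatial features with the broadcast player feature,
    then add a 1x1-conv projection of the original screen so spatial
    information is preserved (a residual-style connection)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.screen_encoder = layers.Conv2D(channels, 3, padding="same", activation="relu")
        self.minimap_encoder = layers.Conv2D(channels, 3, padding="same", activation="relu")
        self.state_conv = layers.Conv2D(channels, 3, padding="same", activation="relu")
        self.project = layers.Conv2D(channels, 1, padding="same")  # match channel numbers

    def call(self, screen, minimap, player):
        # Broadcast the 1-D player feature over the spatial dimensions.
        ones = tf.ones_like(screen[..., :1])           # [B, H, W, 1]
        player_map = ones * player[:, None, None, :]   # [B, H, W, P]
        x = tf.concat([self.screen_encoder(screen),
                       self.minimap_encoder(minimap),
                       player_map], axis=-1)
        x = self.state_conv(x)
        # Residual add of the original screen feature, projected to the same channels.
        return x + self.project(screen)
```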

Without the residual part, the probability of successful training in the MoveToBeacon environment drops from 100% to below 50%. When training fails, the reward sum never rises no matter how long the network is trained.

When only one GPU is available, the size of the network has to be limited. For that reason, the unit list is selected manually, because the unit_type layer of the screen feature accounts for a large portion of the network size. Additionally, training is likely to fail if the channel size of the screen encoder is not larger than the channel number of the screen feature.

Preprocess Screen function for PySC2
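A sketch of that manual selection, with placeholder unit-type ids (the real list depends on the minigame being trained):

```python
import numpy as np

# Illustrative subset of unit-type ids to keep; these values are placeholders.
_SELECTED_UNIT_TYPES = [0, 48, 317]


def preprocess_unit_type(unit_type_layer):
    """Map the raw unit_type screen layer onto a small one-hot encoding so the
    screen encoder stays small enough for a single GPU."""
    out = np.zeros(unit_type_layer.shape + (len(_SELECTED_UNIT_TYPES),), dtype=np.float32)
    for channel, unit_id in enumerate(_SELECTED_UNIT_TYPES):
        out[..., channel] = (unit_type_layer == unit_id)
    return out
```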

If you have enough GPU memory, the unit list can be larger. After increasing it, do not forget to add more channels to the screen encoder.

Feature Screen Network of PySC2
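For example, a screen encoder builder might look like the following sketch; the sizing rule is just my rule of thumb, not the exact code from the repository.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_screen_encoder(num_screen_channels: int) -> tf.keras.Sequential:
    """Keep the encoder's channel count larger than the number of preprocessed
    screen channels; in my runs, training tended to fail otherwise."""
    encoder_channels = max(2 * num_screen_channels, 32)  # illustrative rule of thumb
    return tf.keras.Sequential([
        layers.Conv2D(encoder_channels, 5, padding="same", activation="relu"),
        layers.Conv2D(encoder_channels, 3, padding="same", activation="relu")])
```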

Conclusion and future work

In this post, we investigated how to train the PySC2 network using Reinforcement Learning. In the next post, we are going to train networks using human expert data.

Thank you for reading.


Written by Dohyeong Kim

I am a Deep Learning researcher. Currently, I am trying to make an AI agent for various situations such as MOBA, RTS, and Soccer games.