Until the last post, we looked at how to extract human play data from replay files and how to implement the agent and network of AlphaStar. With these three pieces ready, we can now proceed to the training process described in the paper. The first step is to train the agent's network through Supervised Learning, using the data extracted from the replays.

Training process of AlphaStar

Model structure of TensorFlow for batch training

Unlike the inference process used in the step function of the agent, training requires multiple samples to be fed in at once as a batch. Therefore, the network must be able to handle a variable batch size.

Core function for batch datasets

When using the reshape function in the Core created in the previous post, you can put -1 in the size argument so that the size of that dimension is set automatically according to the batch size. If the same method is applied to the reshape calls in the encoders and heads, the model can be used regardless of the batch size.
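As a small illustration of the -1 trick, the sketch below flattens a feature map while leaving the batch dimension to be inferred. The function name `flatten_features` and the tensor shapes are assumptions made for this example, not code from the actual Core.

```python
import tensorflow as tf


def flatten_features(feature_map):
    """Flatten every axis except the batch axis (hypothetical helper)."""
    # The two trailing axes have a fixed, known size.
    feature_dim = feature_map.shape[1] * feature_map.shape[2]
    # -1 lets TensorFlow infer the batch dimension from the input.
    return tf.reshape(feature_map, [-1, feature_dim])


single = flatten_features(tf.zeros([1, 4, 8]))   # inference: batch of 1
batch = flatten_features(tf.zeros([16, 4, 8]))   # training: batch of 16
print(single.shape, batch.shape)
```

The same function works for both calls because the batch size is never hard-coded.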

Getting policy logit from batch datasets

After changing the model structure to support batch training, the policy logits of the network can be calculated from a batch dataset.

Unroll function of agent class

According to the AlphaStar pseudocode, this part is handled by the unroll function of the agent class. It can be implemented as in the code above.
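For readers who want a feel for the idea, here is a minimal sketch of an unroll-style method. It assumes a recurrent core (an LSTM cell here) and a single policy head; the class `TinyAgent`, its layer sizes, and the input shapes are all hypothetical and far simpler than the real agent.

```python
import tensorflow as tf


class TinyAgent:
    """Toy stand-in for the agent: one LSTM cell plus one policy head."""

    def __init__(self, num_actions=5, hidden=8):
        self.cell = tf.keras.layers.LSTMCell(hidden)
        self.head = tf.keras.layers.Dense(num_actions)

    def unroll(self, observations, initial_state):
        # observations has shape [batch, time, features].
        state = initial_state
        logits_per_step = []
        for t in range(observations.shape[1]):
            # Feed one time step for the whole batch and carry the state over.
            output, state = self.cell(observations[:, t], state)
            logits_per_step.append(self.head(output))
        # Stack back into [batch, time, num_actions].
        return tf.stack(logits_per_step, axis=1), state


agent = TinyAgent()
observations = tf.random.normal([4, 6, 10])
initial_state = [tf.zeros([4, 8]), tf.zeros([4, 8])]
policy_logits, final_state = agent.unroll(observations, initial_state)
print(policy_logits.shape)
```

The output keeps one set of logits per sample and per time step, which is the shape needed to compare against the replay actions.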

Supervised training from replay file

After getting the predictions for the batch with the unroll function, the loss is calculated from the difference between these predictions and the actions in the replay data.
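One common way to measure that difference for a categorical output is softmax cross-entropy. The sketch below assumes the action type is stored as an integer index and uses made-up logits; the variable names are illustrative only.

```python
import tensorflow as tf

# Hypothetical batch: 3 samples, 5 possible action types.
action_type_logits = tf.constant([[2.0, 0.1, 0.1, 0.1, 0.1],
                                  [0.1, 0.1, 3.0, 0.1, 0.1],
                                  [0.1, 0.1, 0.1, 0.1, 0.1]])
# Action types actually taken by the human player in the replay.
replay_action_types = tf.constant([0, 2, 4])

# One loss value per sample, computed directly from the raw logits.
per_sample_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=replay_action_types, logits=action_type_logits)
# Average over the batch to get a single scalar for training.
loss = tf.reduce_mean(per_sample_loss)
print(per_sample_loss.numpy(), loss.numpy())
```

The third sample has uniform logits, so its loss is the highest of the three: the network expresses no preference for the action the player chose.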

Supervised training of AlphaStar

To use the calculated loss for training, this part relies on TensorFlow's tf.GradientTape(), which records the forward pass so that gradients can be computed from the loss.
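A minimal sketch of such a training step is shown below. The two-layer model, the Adam optimizer, the learning rate, and the random data are assumptions chosen to keep the example small; the real network is much larger.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Toy stand-in for the agent network: 10 input features, 5 action types.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(5),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)


def train_step(observations, replay_action_types):
    # Operations inside the tape are recorded for differentiation.
    with tf.GradientTape() as tape:
        logits = model(observations, training=True)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=replay_action_types, logits=logits))
    # Differentiate the loss with respect to every trainable weight.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


observations = tf.random.normal([8, 10])
replay_action_types = tf.constant([0, 1, 2, 3, 4, 0, 1, 2])
losses = [float(train_step(observations, replay_action_types))
          for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the step on the same batch should drive the loss down, which is a quick sanity check that gradients are flowing.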

Action type of AlphaStar

The select_point, select_rect, select_control_group, select_unit, and select_idle_worker actions each have their own unique arguments. Therefore, it is convenient to separate them into different action types.

Action list of Terran agent

Loss function of AlphaStar

An action in AlphaStar consists of several components, unlike in other Reinforcement Learning environments. First, the type of action is chosen, and then the arguments for that action follow. As explained earlier, the action type is predicted by ActionTypeHead. If the prediction of that head differs from the actual action type, the loss can be calculated from that comparison alone. If not, the loss is calculated by comparing the predicted values of SelectedUnitsHead, TargetUnitHead, ScreenLocationHead, and MinimapLocationHead with the arguments in the replay data.
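The rule described above can be sketched in plain Python as follows. The action names and the `ARGUMENT_HEADS` table are hypothetical placeholders for illustration, not the real action list, and each head is reduced to a flat list of logits.

```python
import math

# Hypothetical action types and the argument heads each one uses, in order.
# These names are illustrative only, not the real action table.
ARGUMENT_HEADS = {
    "no_op": [],
    "move_screen": ["screen_location"],
    "attack_unit": ["selected_units", "target_unit"],
}
ACTION_TYPES = list(ARGUMENT_HEADS)


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed from raw logits."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def supervised_loss(predicted, replay):
    target_type = ACTION_TYPES.index(replay["action_type"])
    type_logits = predicted["action_type"]
    # The action type loss is always part of the total.
    loss = cross_entropy(type_logits, target_type)

    # Wrong action type: the argument heads are not compared at all.
    predicted_type = max(range(len(type_logits)), key=type_logits.__getitem__)
    if predicted_type != target_type:
        return loss

    # Right action type: add one term per argument that this action uses.
    for head in ARGUMENT_HEADS[replay["action_type"]]:
        loss += cross_entropy(predicted[head], replay[head])
    return loss


predicted = {"action_type": [0.1, 2.0, 0.3],
             "screen_location": [0.0, 0.0, 0.0, 0.0]}
matched = supervised_loss(
    predicted, {"action_type": "move_screen", "screen_location": 1})
mismatched = supervised_loss(predicted, {"action_type": "no_op"})
print(matched, mismatched)
```

Keeping the head order in a lookup table means a new action type only needs a new table entry, not a new branch in the loss function.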

Loss calculation of AlphaStar

When the action type matches, the loss needs to be calculated by comparing the predicted values of SelectedUnitsHead, TargetUnitHead, ScreenLocationHead, and MinimapLocationHead with the arguments of the replay data in order. The required arguments and their order differ for each action. That information can be found at


Supervised Learning is not complete yet because the network structure and some other parts are not fully implemented. However, once the basic structure for training is in place, we can easily move on to the next step.

The code for this post can be found at

Written by

I am a Deep Reinforcement Learning researcher from South Korea. My final goal is to build an AI robot that can cook and clean for me using Deep Learning.
