In the previous posts, we looked at how to extract human play data from replay files and how to implement the agent and network of AlphaStar. With these three parts ready, we can now proceed to the training process described in the paper. The first step is training the agent's network through Supervised Learning on the data extracted from replays.
Model structure of TensorFlow for batch training
Unlike the inference process used in the step function of the agent, training requires multiple datasets to be fed in at once as a batch. Therefore, the network needs to handle a variable batch size.
When using the reshape function in the Core created in the previous post, the tensor size of that part is set automatically according to the batch size if you pass -1 for the size argument of reshape. The model can then be used regardless of the batch size, as long as the same method is applied to the reshape functions of the encoders and heads.
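As a minimal sketch of this idea, the helper below flattens a spatial feature map while leaving the batch dimension to be inferred with -1. The feature sizes here are illustrative, not the exact AlphaStar dimensions:

```python
import tensorflow as tf

def flatten_spatial(feature_map):
    # feature_map: [batch, height, width, channels]
    # Passing -1 lets TensorFlow infer the batch dimension, so the same
    # code works for batch size 1 (inference) or N (training).
    h, w, c = feature_map.shape[1], feature_map.shape[2], feature_map.shape[3]
    return tf.reshape(feature_map, [-1, h * w * c])

x1 = tf.zeros([1, 16, 16, 32])    # inference-style input
x16 = tf.zeros([16, 16, 16, 32])  # training batch
print(flatten_spatial(x1).shape)   # (1, 8192)
print(flatten_spatial(x16).shape)  # (16, 8192)
```

The same -1 trick applies to any reshape inside the encoders and heads whose output size depends only on the feature dimensions, not on the batch size.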
Getting policy logit from batch datasets
After changing the model structure for batch training, the policy logits of the network can be calculated from a batch dataset. Examining the pseudocode of AlphaStar, this part is handled by the unroll function of the agent class.
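A minimal sketch of an unroll-style helper is shown below, assuming a hypothetical `model` that maps one observation to policy logits. It runs the model over every timestep of a trajectory batch and stacks the logits so the loss can later be computed against the replay actions in one pass:

```python
import tensorflow as tf

def unroll(model, trajectory):
    # trajectory: [seq_len, batch, feature] tensor of observations.
    # Apply the model to each timestep, then stack the per-step logits.
    logits_per_step = [model(obs) for obs in tf.unstack(trajectory, axis=0)]
    return tf.stack(logits_per_step, axis=0)  # [seq_len, batch, num_actions]

model = tf.keras.layers.Dense(5)   # stand-in for the real AlphaStar network
traj = tf.zeros([8, 4, 32])        # 8 timesteps, batch of 4
print(unroll(model, traj).shape)   # (8, 4, 5)
```

The real agent also carries LSTM state between timesteps; that is omitted here to keep the sketch focused on the batching.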
Supervised training from replay file
After obtaining the predictions for the batch dataset with the unroll function, the loss value is calculated from the difference between these predictions and the actions in the replay data.
This part uses tf.GradientTape of TensorFlow so that the calculated loss can be used for training.
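The gradient-tape update can be sketched as follows, using a stand-in Keras model and a plain cross-entropy loss; the real AlphaStar network and loss would replace them:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(5)])  # stand-in network
optimizer = tf.keras.optimizers.Adam(1e-4)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(observations, replay_actions):
    with tf.GradientTape() as tape:
        logits = model(observations)            # predicted policy logits
        loss = loss_fn(replay_actions, logits)  # compare with replay actions
    # Differentiate the loss and apply the update to the network weights.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

obs = tf.random.normal([16, 32])
actions = tf.random.uniform([16], maxval=5, dtype=tf.int32)
loss = train_step(obs, actions)
print("loss:", float(loss))
```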
Action type of AlphaStar
The select_point, select_rect, select_control_group, select_unit, and select_idle_worker actions each have their own unique arguments. Therefore, it is convenient to separate them into different action types.
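One way to express this separation, sketched below with arbitrary illustrative ids (real code would index into `pysc2.lib.actions.FUNCTIONS`), is a simple mapping from function name to action type:

```python
# Each select-style PySC2 function becomes its own action type because
# each one has a different argument signature (noted in the comments).
SELECT_ACTION_TYPES = {
    "select_point": 0,          # args: select_point_act, screen
    "select_rect": 1,           # args: select_add, screen, screen2
    "select_control_group": 2,  # args: control_group_act, control_group_id
    "select_unit": 3,           # args: select_unit_act, select_unit_id
    "select_idle_worker": 4,    # args: select_worker
}

def action_type_of(function_name):
    return SELECT_ACTION_TYPES[function_name]

print(action_type_of("select_rect"))  # 1
```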
Loss function of AlphaStar
An action in AlphaStar consists of several components, unlike in other Reinforcement Learning environments. First, the type of action is defined, and then the arguments for that action follow. As explained earlier, the action type is predicted by ActionTypeHead. If the prediction of that head differs from the actual action type, the loss can be calculated from that comparison alone.
If not, the loss also needs to be calculated by comparing the predicted values of SelectedUnitsHead, TargetUnitHead, ScreenLocationHead, and MinimapLocationHead with the arguments of the replay data, in order. The required arguments and their order differ for each action; that information can be found at https://github.com/deepmind/pysc2/blob/master/pysc2/lib/actions.py.
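A sketch of such a combined loss is shown below, assuming hypothetical per-head logits and replay labels. The argument losses only contribute when the replay's action actually uses that argument, implemented here with a 0/1 mask per head:

```python
import tensorflow as tf

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def combined_loss(type_logits, type_labels, arg_logits, arg_labels, arg_mask):
    # type_logits: [batch, num_action_types]; arg_logits, arg_labels and
    # arg_mask are dicts keyed by head name ("screen", "minimap", ...).
    loss = ce(type_labels, type_logits)  # action type loss, always applied
    for head in arg_logits:
        per_example = tf.keras.losses.sparse_categorical_crossentropy(
            arg_labels[head], arg_logits[head], from_logits=True)
        # Mask out examples whose action type does not use this argument.
        loss += tf.reduce_mean(arg_mask[head] * per_example)
    return loss

loss = combined_loss(
    type_logits=tf.zeros([2, 4]),
    type_labels=tf.constant([1, 2]),
    arg_logits={"screen": tf.zeros([2, 10])},
    arg_labels={"screen": tf.constant([3, 0])},
    arg_mask={"screen": tf.constant([1.0, 0.0])},  # only example 0 uses "screen"
)
print("loss:", float(loss))
```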
Supervised Learning is not yet complete because the network structure and some other parts are not fully implemented. However, once the basic structure for training is in place, we can easily move on to the next step.
The code for this post can be found at https://github.com/kimbring2/AlphaStar_Implementation.