An easy explanation of Stabilizing Transformers for Reinforcement Learning, with real code
In my previous post, I explained how to use the Transformer network to find the relationships between objects within a single screen frame. However, we sometimes need to find relationships across screen frames.
For example, in the Obstacle Tower environment, the agent must pick up a key before it can open the door to the next mission. In such cases, there is a delay between obtaining the key and opening the door. To solve that task, the agent's neural network should keep this information for some time, even if other reward signals arrive in the meantime. From the perspective of a classic AI algorithm this is an easy problem, because we could simply store that information in a variable. However, we need a different approach when the network is trained with backpropagation.
Many papers have been published to tackle this problem. For example, another DeepMind paper, titled Synthetic Returns for Long-Term Credit Assignment, uses the Key-to-Door environment.
As you can see, this environment is a 2D version of Obstacle Tower.
Among these methods, DeepMind's Stabilizing Transformers shows prominently good performance. In this post, we are going to walk through that method with TensorFlow code.
I have divided the full source code of this post into small pieces to make it easier to explain.
Relational Environment
Before using a heavy environment, it is nice to test the model on a simple one, as in the previous post. Therefore, I made a box-world environment that is sequentially relational.
The agent obtains a reward of 10 if it selects the red box in the first frame and the blue box in the last frame. Both selections must occur for the reward to be given; otherwise, the agent receives 0. Additionally, there are green boxes in the middle frames that give a reward of 1 if chosen. The positions of the first and last boxes are fixed to make the task easier. A network with some form of memory is needed to perform well in this environment.
This environment only requires basic Python packages and is really light. Just copy the code below into a BoxWorldEnvMemory.py file.
After putting the environment file in your workspace, you can test the environment in the OpenAI Gym format using the code below.
The action_num and max_episode arguments are the number of box positions in one frame and the length of the frame sequence, respectively. We can change them to test the relational capacity of the Transformer.
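Here is a minimal sketch of such an environment in the old OpenAI Gym format. The class name BoxWorldEnvMemory and the two constructor arguments follow the description above, but the exact observation encoding and reward bookkeeping are my assumptions, so the original file may differ in the details.

```python
# Sketch of a sequentially relational box-world environment (assumed layout).
import numpy as np
import gym
from gym import spaces


class BoxWorldEnvMemory(gym.Env):
    """Pick the red box in the first frame and the blue box in the last frame (+10).
    Green boxes in the middle frames give +1 each."""

    RED, BLUE, GREEN, EMPTY = 0, 1, 2, 3

    def __init__(self, action_num=3, max_episode=5):
        self.action_num = action_num      # number of box positions in one frame
        self.max_episode = max_episode    # number of frames in one episode
        self.action_space = spaces.Discrete(action_num)
        # one-hot color per box position
        self.observation_space = spaces.Box(0, 1, (action_num, 4), dtype=np.float32)

    def _frame(self, step):
        colors = np.full(self.action_num, self.EMPTY)
        if step == 0:
            colors[0] = self.RED                           # fixed red position
        elif step == self.max_episode - 1:
            colors[0] = self.BLUE                          # fixed blue position
        else:
            colors[np.random.randint(self.action_num)] = self.GREEN
        self._colors = colors
        return np.eye(4, dtype=np.float32)[colors]

    def reset(self):
        self.step_count = 0
        self.picked_red = False
        return self._frame(0)

    def step(self, action):
        reward = 0.0
        color = self._colors[action]
        if self.step_count == 0 and color == self.RED:
            self.picked_red = True                         # remember the key-like event
        elif color == self.GREEN:
            reward = 1.0                                   # distractor reward
        elif self.step_count == self.max_episode - 1:
            if self.picked_red and color == self.BLUE:
                reward = 10.0                              # delayed joint reward
        self.step_count += 1
        done = self.step_count >= self.max_episode
        obs = (self._frame(self.step_count) if not done
               else np.zeros((self.action_num, 4), np.float32))
        return obs, reward, done, {}
```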
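A short sanity-check loop with a random policy looks like this, assuming the environment class lives in BoxWorldEnvMemory.py as described above:

```python
# Quick test of the environment in the Gym interface with random actions.
from BoxWorldEnvMemory import BoxWorldEnvMemory

env = BoxWorldEnvMemory(action_num=3, max_episode=5)
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()            # random policy, just for testing
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```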
Transformer-XL
The Stabilizing Transformer uses the Transformer-XL, which is similar to the classic Transformer but has a memory like an LSTM. I found really good reference code for it in a text-generation project. When I ran that code, the model seemed to generate longer, more sensible text than an LSTM-based model.
Let's compare the Transformer-XL with the original Transformer, which we already used to solve the stacking-box environment. It can be drawn like the image below. As you can see, there is no memory transfer between segments, which is why it can only be used to find the relations between objects within one frame.
In contrast, the Transformer-XL has a memory that is carried over between segments. Even though the LSTM also has memory, it shows poor performance compared to the Transformer-XL.
Below is the code for the Transformer-XL. You can check some of the math behind it in this post. Unlike the vanilla Transformer, it uses the rel_enc_shift function to shift the relative-position attention scores.
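To illustrate the two ingredients mentioned here, below is a minimal sketch of the relative shift and the segment-level memory update in TensorFlow. The tensor layout ([batch, heads, query, key]) is my assumption; the reference code may order the dimensions differently.

```python
import tensorflow as tf


def rel_enc_shift(scores):
    """Shift relative-position attention scores (Transformer-XL trick).
    scores: [batch, heads, q_len, k_len]"""
    b, h, q, k = tf.unstack(tf.shape(scores))
    scores = tf.pad(scores, [[0, 0], [0, 0], [0, 0], [1, 0]])  # pad one column on the left
    scores = tf.reshape(scores, [b, h, k + 1, q])              # the pad "rotates" the rows
    scores = tf.reshape(scores[:, :, 1:, :], [b, h, q, k])     # drop the pad, restore shape
    return scores


def update_memory(prev_memory, current_hidden, mem_len):
    """New memory = last mem_len hidden states of [old memory ; current segment].
    Shapes assumed to be [batch, time, d_model]."""
    new_memory = tf.concat([prev_memory, current_hidden], axis=1)[:, -mem_len:]
    return tf.stop_gradient(new_memory)                        # no backprop through memory
```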
Gated Transformer
The next component needed to apply the Transformer to a sequentially relational environment is the gating layer. In practice, it can be implemented as shown below.
The paper says that the placement of the Layer Norm is related to Identity Map Reordering. Furthermore, performance is boosted again by using the gating layer. It is always awesome when they find that kind of breakthrough.
You can also see how this is implemented in TensorFlow in the code below.
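As a reference point, here is a minimal sketch of the GRU-style gating described in the paper, where the gated output replaces the usual residual connection. The layer name GRUGate and the bias value are my choices, not the post's original code.

```python
import tensorflow as tf


class GRUGate(tf.keras.layers.Layer):
    """GRU-style gating from Stabilizing Transformers: out = (1 - z) * x + z * h_hat."""

    def __init__(self, d_model, gate_bias=2.0):
        super().__init__()
        self.w_r = tf.keras.layers.Dense(d_model, use_bias=False)
        self.u_r = tf.keras.layers.Dense(d_model, use_bias=False)
        self.w_z = tf.keras.layers.Dense(d_model, use_bias=False)
        self.u_z = tf.keras.layers.Dense(d_model, use_bias=False)
        self.w_g = tf.keras.layers.Dense(d_model, use_bias=False)
        self.u_g = tf.keras.layers.Dense(d_model, use_bias=False)
        # a positive bias keeps the gate close to the identity early in training
        self.gate_bias = gate_bias

    def call(self, x, y):
        # x: skip-connection input, y: output of the attention / feed-forward sub-layer
        r = tf.sigmoid(self.w_r(y) + self.u_r(x))
        z = tf.sigmoid(self.w_z(y) + self.u_z(x) - self.gate_bias)
        h_hat = tf.tanh(self.w_g(y) + self.u_g(r * x))
        return (1.0 - z) * x + z * h_hat
```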
Interestingly, the training does not proceed at all when the gating layer is not used.
Relative Position Encodings
Finally, we need to add the relative position encodings to the input data. As you may remember, we also used positional encodings to find the relations between the objects of one screen. The difference is that here we add them to each frame of the sequence.
You can see below how to add the position encoding to the input of the model.
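Here is a hedged sketch of the standard sinusoidal encoding added per frame; the original code may instead feed the relative encodings directly into the attention scores, as Transformer-XL does.

```python
import numpy as np
import tensorflow as tf


def positional_encoding(seq_len, d_model):
    """Standard sinusoidal position encoding, shape [1, seq_len, d_model]."""
    positions = np.arange(seq_len)[:, None]          # [seq_len, 1]
    dims = np.arange(d_model)[None, :]               # [1, d_model]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])        # sine on even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])        # cosine on odd dimensions
    return tf.cast(angles[None, ...], tf.float32)


# frame_embeddings assumed to have shape [batch, seq_len, d_model]
# frame_embeddings = frame_embeddings + positional_encoding(seq_len, d_model)
```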
Using the Stabilizing Transformers with A2C RL
The trickiest part was adapting the Transformer-XL from the NLP task to Reinforcement Learning. The original paper uses V-MPO, one of the most advanced model-free RL algorithms currently available. However, we can use A2C for a simpler environment. You can see how to use the Gated Transformer-XL model explained above with A2C.
It is really similar to DNN- and LSTM-based A2C code, except for the dimensions of the A2C logits and baselines.
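To make the wiring concrete, here is a rough sketch (not the full training file) of the two A2C heads on top of the transformer trunk, plus the usual A2C loss. The trunk is assumed to be the Gated Transformer-XL built above, returning features and an updated memory; its interface here is an assumption.

```python
import tensorflow as tf


class A2CTransformerAgent(tf.keras.Model):
    def __init__(self, trunk, action_num):
        super().__init__()
        self.trunk = trunk                                  # Gated Transformer-XL from above
        self.policy_head = tf.keras.layers.Dense(action_num)  # logits over box positions
        self.value_head = tf.keras.layers.Dense(1)             # state-value baseline

    def call(self, frames, memory):
        # frames: [batch, seq_len, d_model] embedded observations
        features, new_memory = self.trunk(frames, memory)   # trunk interface is assumed
        last = features[:, -1]                              # act on the newest frame
        return self.policy_head(last), self.value_head(last), new_memory


def a2c_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Standard A2C loss: policy gradient + value regression - entropy bonus."""
    advantages = returns - tf.squeeze(values, -1)
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    policy_loss = tf.reduce_mean(neg_log_prob * tf.stop_gradient(advantages))
    value_loss = tf.reduce_mean(tf.square(advantages))
    entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.nn.softmax(logits), logits=logits))
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```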
Training Results of DNN, LSTM, and Transformer
For the first evaluation, action_num is set to 3 and max_episode is set to 5. The Transformer has a head size of 32 and 12 layers. Like the experiment in Synthetic Returns for Long-Term Credit Assignment, I split the reward into green-box collection and joint red/blue-box collection.
In the first test environment, the DNN and the Stabilizing Transformer collect the green boxes well, but the LSTM fails to do so. In the delayed-reward task, only the Stabilizing Transformer succeeds. This result is really similar to the result of the reference paper, even though the algorithm is a little different.
We can assume that the LSTM and DNN lack the ability to cope when the delays between actions and rewards are long and when intervening, unrelated events contribute variance to the long-term return.
I am still not sure why the LSTM performs badly at the green-box collection task (the equivalent of collecting apples in the reference paper). It works really well in a single-task environment.
Training Results on a longer environment
So far, we know that only the Stabilizing Transformer has the memory capacity to remember the long-term reward. Next, it is tested on a box environment with a sequence length of 8. You can check my testing file here.
We can see that the agent takes a little longer to learn this task. We should keep that in mind when using a heavier environment, because the training time will be multiplied several times over.
Training Results on a wider environment
One more thing we need to check about the Stabilizing Transformer is the action size. Does it still work well when the action size is increased? As I said in the first part of this post, the action size is the number of positions where boxes can be placed. Let's change it from 3 to 4, keeping the environment length at 5.
As you can see, the training time is affected by the action size. However, the agent's model manages to learn the task.
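With the environment sketch from earlier, this is only a change of the constructor arguments:

```python
# assuming the BoxWorldEnvMemory sketch above: 4 box positions, 5 frames per episode
env = BoxWorldEnvMemory(action_num=4, max_episode=5)
```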
Conclusion
In this post, DeepMind's Stabilizing Transformer was tested on a simple toy environment with various parameters. The ability to integrate information over long time horizons and to scale to massive amounts of data is crucial for AI agents. We see that the Transformer can also be used for sequentially relational environments if we add techniques such as relative position encodings, gating layers, and Identity Map Reordering to the NLP Transformer-XL.
Future work is to apply this to the Obstacle Tower environment and compare the results with the PPO- and BC-based SOTA algorithm.
Thank you for reading (: