Image to Latex using Vision Transformer

Dohyeong Kim
Feb 13, 2024


Introduction

Recent advances in the Transformer architecture allow us to transcribe the content of an image into text more easily than before.

For example, image captioning with visual attention is one of its most popular applications.

Another possible use of image captioning is the im2markup application, where an image of a LaTeX formula is converted into LaTeX source that can be used directly when writing a paper.

Captured from https://im2markup.yuntiandeng.com/

For that purpose, we can build a model like the one below. It consists of two Transformer parts: the first extracts features from the image, and the second generates the LaTeX tokens.

Before running the source code of this post, please check that you can run the TensorFlow code from https://www.tensorflow.org/text/tutorials/image_captioning. Furthermore, you can find the original Vision Transformer code at this link. I combine the two codebases with minimal changes.

Additionally, I use the preprocessed LaTeX dataset from https://github.com/ritheshkumar95/im2latex-tensorflow. The preprocessing code is also available there.

Finally, the notebook files and dataset that I use for training and evaluation are also available from the links below.

Below is a pre-trained model. Please use TensorFlow 2.13.1 to run it.

Prepare Dataset

After downloading the dataset and notebook file, please run the code below to check whether everything is set up correctly.
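The exact check lives in the notebook; as a rough stand-in, a minimal sketch like the following verifies the TensorFlow version and the dataset location (the folder and file names here are placeholders of my own, not the ones from the notebook):

```python
import os
import tensorflow as tf

# Confirm the runtime matches what the pre-trained model expects (2.13.1)
# and that a GPU is visible.
print("TensorFlow version:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))

# Placeholder paths -- point these at wherever you extracted the
# preprocessed im2latex dataset.
dataset_root = "./im2latex"
formula_file = os.path.join(dataset_root, "formulas.norm.lst")
image_dir = os.path.join(dataset_root, "images_processed")

print("Formula list found:", os.path.exists(formula_file))
print("Image folder found:", os.path.isdir(image_dir))
```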

After that, we can build the data pipeline used to train and test the TensorFlow model.
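The pipeline in the notebook pairs each preprocessed image with its formula line; as a rough sketch (the function and variable names here are my own, and the 160x160 size matches the patching step later):

```python
import tensorflow as tf

IMG_SIZE = (160, 160)   # matches the 160x160 input used for patching later
BATCH_SIZE = 32

def load_image(path):
    """Read a formula image, convert it to float32 RGB, and resize it."""
    img = tf.io.read_file(path)
    img = tf.io.decode_png(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    return tf.cast(img, tf.float32) / 255.0

def make_dataset(image_paths, formulas, training=True):
    """Pair each image path with its LaTeX formula string."""
    ds = tf.data.Dataset.from_tensor_slices((image_paths, formulas))
    ds = ds.map(lambda p, f: (load_image(p), f),
                num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.shuffle(1000)
    return ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# image_paths / formulas are lists built from the im2latex split files,
# which map each image file to a line in the formula list.
```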

Finally, check that each image and its label match using the code below.
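A quick sanity check might look like this, assuming the train_ds pipeline built with make_dataset from the sketch above:

```python
import matplotlib.pyplot as plt

# Take one batch from the training pipeline and show the first image
# together with its LaTeX label, so we can eyeball that they match.
for images, formulas in train_ds.take(1):
    plt.imshow(images[0].numpy())
    plt.axis("off")
    plt.title(formulas[0].numpy().decode("utf-8"), fontsize=8)
    plt.show()
```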

If the previous steps were done correctly, you will see output like the image below.

Build the model

In the original TensorFlow image captioning tutorial, MobileNetV3-Small is used as the feature extractor, and ImageNet pre-trained weights are used to reduce training time.

However, we cannot use that model because our data does not contain the common objects of everyday photos. I found a TensorFlow 2.0 model for LaTeX images at https://github.com/harishB97/Im2Latex-TensorFlow-2.

It consists of several Conv, MaxPool, and BatchNormalization layers.
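In this post, however, the image encoder is a Vision Transformer that works on patches. The exact implementation is in the notebook; below is a minimal sketch in the style of the Keras ViT example, with the projection dimension set to 512 to match the feature shape shown later (the other hyperparameters are illustrative assumptions of mine):

```python
import tensorflow as tf
from tensorflow.keras import layers

PATCH_SIZE = 6
NUM_PATCHES = 26 * 26       # a 160x160 image with 6x6 patches (VALID padding)
PROJECTION_DIM = 512        # matches the 512-dim features printed later
NUM_HEADS = 4               # illustrative choice
NUM_BLOCKS = 4              # illustrative choice

class Patches(layers.Layer):
    """Split an image into flattened, non-overlapping patches."""
    def call(self, images):
        batch = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
            strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
            rates=[1, 1, 1, 1],
            padding="VALID")
        return tf.reshape(patches, [batch, -1, patches.shape[-1]])

class PatchEncoder(layers.Layer):
    """Linearly project each patch and add a learned position embedding."""
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(num_patches, projection_dim)

    def call(self, patches):
        positions = tf.range(self.num_patches)
        return self.projection(patches) + self.position_embedding(positions)

def build_vit_encoder(image_shape=(160, 160, 3)):
    inputs = layers.Input(shape=image_shape)
    x = PatchEncoder(NUM_PATCHES, PROJECTION_DIM)(Patches()(inputs))
    for _ in range(NUM_BLOCKS):
        # Pre-norm Transformer block: self-attention + MLP, with residuals.
        y = layers.LayerNormalization(epsilon=1e-6)(x)
        y = layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=PROJECTION_DIM // NUM_HEADS)(y, y)
        x = layers.Add()([x, y])
        y = layers.LayerNormalization(epsilon=1e-6)(x)
        y = layers.Dense(PROJECTION_DIM * 2, activation="gelu")(y)
        y = layers.Dense(PROJECTION_DIM)(y)
        x = layers.Add()([x, y])
    return tf.keras.Model(inputs, x, name="vit_encoder")
```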

Vision Transformer for LaTeX image encoding

It is also a good idea to visualize the generated patches. Below is sample code for that.
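A sketch of such a visualization, reusing the Patches layer and the train_ds pipeline from the earlier sketches (both are assumptions, not the notebook's exact code):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# Take one 160x160 formula image, cut it into 6x6 patches with the Patches
# layer above, and draw the patches on a 26x26 grid.
for images, _ in train_ds.take(1):
    image = images[0]
    patches = Patches()(image[tf.newaxis, ...])          # (1, 676, 108)
    n = int(tf.shape(patches)[1].numpy() ** 0.5)          # 26

    plt.figure(figsize=(6, 6))
    for i in range(n * n):
        ax = plt.subplot(n, n, i + 1)
        patch = tf.reshape(patches[0, i], (PATCH_SIZE, PATCH_SIZE, 3))
        ax.imshow(patch.numpy())
        ax.axis("off")
    plt.show()
```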

Code for visualizing the generated patches from a LaTeX image

When the image size is 160x160 and the patch size is 6x6, there are a total of 26x26=676 patches.

Patch visualization

After that, let's verify that our feature extractor can actually extract features from a LaTeX image. We can use the previously defined TensorFlow Dataset object for this purpose.
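Assuming the build_vit_encoder sketch and the train_ds pipeline from earlier, the check can be as simple as:

```python
# Build the encoder and run a single image through it to inspect the
# shape of the encoded patch sequence.
vit_encoder = build_vit_encoder()

for images, _ in train_ds.take(1):
    features = vit_encoder(images[:1])
    print(features.shape)   # expected: (1, 676, 512)
```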

If the ViT model was built correctly, the extracted feature shape will be printed as in the image below.

Encoded patch information

The value 1 is the batch size, 676 is the total number of patches, and 512 is the patch embedding size.

We can feed these encoded features into the CNN-extractor-based image captioning model with only a minimal code change in the Captioner class.
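In the tutorial, the Captioner flattens the CNN feature map from (batch, h, w, channels) to (batch, h*w, channels) before the decoder attends to it; the ViT encoder already outputs a (batch, 676, 512) sequence, so the only change needed is to skip that flattening when the features are already three-dimensional. A sketch of that logic (a helper of my own, not the tutorial's code):

```python
import tensorflow as tf

def prepare_image_features(image, feature_extractor):
    """Run the feature extractor and return a (batch, seq_len, dim) tensor,
    which is what the Captioner's cross-attention expects.

    A CNN extractor returns a 4-D map (batch, h, w, channels) that must be
    flattened; the ViT encoder already returns (batch, 676, 512)."""
    features = feature_extractor(image)
    if len(features.shape) == 4:
        batch = tf.shape(features)[0]
        features = tf.reshape(features, (batch, -1, features.shape[-1]))
    return features
```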

Train the model

Additionally, we need custom training and testing functions because we cannot use the Keras fit function from the original code.

However, we can still use the masked_loss and masked_acc functions. Using them, we can define custom training and testing functions like the ones below.
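The exact functions are in the notebook; a minimal sketch, assuming masked_loss and masked_acc from the captioning tutorial and a teacher-forced (input tokens, shifted label tokens) split, might look like this:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function
def train_step(model, images, input_tokens, label_tokens):
    """One optimization step: teacher-forced decoding + masked loss/accuracy."""
    with tf.GradientTape() as tape:
        preds = model((images, input_tokens), training=True)
        loss = masked_loss(label_tokens, preds)   # from the captioning tutorial
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    acc = masked_acc(label_tokens, preds)
    return loss, acc

@tf.function
def test_step(model, images, input_tokens, label_tokens):
    """Evaluation step: no gradient update, same masked metrics."""
    preds = model((images, input_tokens), training=False)
    return masked_loss(label_tokens, preds), masked_acc(label_tokens, preds)
```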

Using the above functions and the TensorFlow dataset built earlier, we can train and test the Transformer model on the LaTeX data.

The TensorBoard logs, such as loss and perplexity, are also saved in the tensorboard folder.
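The logging itself can be done with a tf.summary writer; a sketch of how such logs might be written from the training loop (the folder layout and metric names are assumptions of mine):

```python
import time
import tensorflow as tf

log_dir = "tensorboard/" + time.strftime("%Y%m%d-%H%M%S")
writer = tf.summary.create_file_writer(log_dir)

def log_metrics(step, loss, acc):
    """Write loss, perplexity (exp of the loss), and accuracy for TensorBoard."""
    with writer.as_default():
        tf.summary.scalar("masked_loss", loss, step=step)
        tf.summary.scalar("perplexity", tf.exp(loss), step=step)
        tf.summary.scalar("masked_acc", acc, step=step)
```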

Training loss of ViT extractor-based model

From the charts above, we can confirm that the training process works well.

Evaluate the model performance

From the TensorBoard chart, we can confirm that the model's accuracy reached around 96%. That is decent performance considering we trained it from scratch.

Accuracy of ViT extractor-based model

Now, let's print the real and predicted labels to check the model's performance more clearly. Below is a sample of that.
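A sketch of such a comparison, assuming the simple_gen greedy-decoding helper from the captioning tutorial is kept and test_ds is the evaluation pipeline built with make_dataset:

```python
# Compare the ground-truth formula with the model's greedy prediction
# for a few test images. simple_gen is the decoding helper defined in the
# captioning tutorial; adjust the name if it differs in your notebook.
for images, formulas in test_ds.take(3):
    pred = model.simple_gen(images[0], temperature=0.0)
    print("real :", formulas[0].numpy().decode("utf-8"))
    print("pred :", pred)
    print("-" * 60)
```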

Below are a few evaluation samples. We can confirm that the model transcribes the LaTeX images well.

Evaluate case 1
Evaluate case 2
Evaluate case 3

You can also visualize the predicted result as an image. Below is the code for that.
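The notebook's version may differ; one simple approach is to render the predicted string with matplotlib's mathtext (complex formulas may need a full LaTeX install via plt.rc('text', usetex=True)):

```python
import matplotlib.pyplot as plt

def render_latex(formula, path=None):
    """Render a LaTeX string so the prediction can be compared visually
    against the original formula image."""
    fig = plt.figure(figsize=(6, 1))
    fig.text(0.05, 0.5, f"${formula}$", fontsize=16)
    plt.axis("off")
    if path:
        plt.savefig(path, bbox_inches="tight")
    plt.show()

render_latex(r"\frac{1}{2} \sum_{i=1}^{n} x_i^2")
```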

The images below show the visualized prediction results.

As in the original image captioning example, it is also a good idea to visualize the attention maps to understand where the ViT-based model attends for each predicted token.

We can reuse the default code from the original example. Below is a sample attention map. As you can see, it is a little hard to fit on one screen because the number of tokens is higher than in the original example.
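For reference, the plotting logic amounts to reshaping each token's attention weights onto the 26x26 patch grid. The sketch below is my own paraphrase of that idea (not the tutorial's exact code), assuming an attention map of shape (num_tokens, 676) averaged over heads:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_patch_attention(image, tokens, attention_map, grid=26, cols=6):
    """Overlay each token's attention weights (one row of length grid*grid)
    on the input image, reshaped to the 26x26 patch grid."""
    rows = int(np.ceil(len(tokens) / cols))
    fig = plt.figure(figsize=(2 * cols, 2 * rows))
    for i, token in enumerate(tokens):
        ax = fig.add_subplot(rows, cols, i + 1)
        attn = np.asarray(attention_map[i]).reshape(grid, grid)
        ax.imshow(image)
        ax.imshow(attn, extent=(0, image.shape[1], image.shape[0], 0),
                  cmap="viridis", alpha=0.5)
        ax.set_title(token, fontsize=8)
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```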

Attention maps of ViT-based model

If we pick a few attention maps from the image above, we can see that the model attends to the appropriate parts of the image when generating each LaTeX token.

Compare the ViT extractor to the CNN extractor

Because the original image captioning method used a CNN as the image feature extractor, it is helpful to compare the performance of the two models.

Below is the code for the CNN feature extractor. It is the same as in the previously released im2latex work.
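The exact layer configuration is in the linked repository; the sketch below only illustrates the general Conv/MaxPool/BatchNormalization stack described earlier (filter counts are illustrative, not copied from the repo):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cnn_encoder(input_shape=(160, 160, 3)):
    """A Conv/MaxPool/BatchNorm stack in the spirit of the im2latex encoder."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return Model(inputs, x, name="cnn_encoder")
```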

Below is the TensorBoard display of the loss and accuracy of both models.

Performance comparison of the ViT-based and CNN-based models for LaTeX transcription.

We can confirm that the ViT-based model learns faster than the CNN-based model, and we can expect its accuracy to be higher as well.

Conclusion

The Vision Transformer works better than the CNN for the image-to-LaTeX application.


Written by Dohyeong Kim

I am a Deep Learning researcher. Currently, I am trying to make an AI agent for various situations such as MOBA, RTS, and Soccer games.
