Understanding Einsum Operations in Machine Learning with PyTorch and Self-Attention Mechanisms: A Step-by-Step Guide
If you are a machine learning researcher/engineer nowadays you should definitely be aware of einsum operations!
Personally, I used to give up on understanding GitHub repos because of einsum operations. The reason: even though I felt pretty comfortable with tensor operations, einsum was not in my arsenal.
Long story short, I decided I wanted to get familiar with the einsum notation. Since I am particularly interested in transformers and self-attention in computer vision, I have a huge playground.
In this article, I will work extensively with einsum (in PyTorch) and, in parallel, implement the famous self-attention layer, and finally a vanilla Transformer.
The code is totally educational! I haven’t trained any large self-attention model yet but I plan to. Truthfully speaking, I learned much more in the process than I initially expected.
If you want to delve into the theory first, feel free to check my articles on attention and transformer.
If not, let the game begin!
The code of this tutorial is available on GitHub. Show your support with a star!
**Why einsum?**
First, einsum notation is all about elegant and clean code. Many AI industry specialists and researchers use it consistently.
To convince you even more, let’s see an example:
Say you want to merge two dims of a 4D tensor: the first (batch) and the last (width).
```
x = einops.rearrange(x, 'b c h w -> (b w) c h')
```
Neat and clean!
Second reason: if you care about batched implementations of custom layers with multi-dimensional tensors, einsum should definitely be in your arsenal!
Third reason: translating code from PyTorch to TensorFlow or NumPy becomes trivial.
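To see why porting is trivial, here is a tiny illustration: the very same equation string drives both `np.einsum` and `torch.einsum` (a minimal sketch with a plain matrix multiplication; the array names are just for demonstration).

```python
import numpy as np
import torch

eq = 'ij,jk->ik'  # a plain matrix multiplication, written once

a = np.random.rand(3, 4)
b = np.random.rand(4, 5)

# The identical equation string works in NumPy and PyTorch.
out_np = np.einsum(eq, a, b)
out_torch = torch.einsum(eq, torch.from_numpy(a), torch.from_numpy(b))

assert np.allclose(out_np, out_torch.numpy())
```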
I am completely aware that it takes time to get used to it. That’s why I decided to implement some self-attention mechanisms.
**The einsum and einops notation basics**
If you know the basics of einsum and einops you may skip this section.
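As a quick refresher, einsum names each tensor axis with a letter; a letter that appears in both inputs but not in the output is summed over. A minimal sketch with batched matrix multiplication (the tensor names and shapes here are illustrative):

```python
import torch

# 'bik,bkj->bij': k is shared by both inputs and absent from the
# output, so it is summed over -> batched matrix multiplication.
a = torch.randn(4, 3, 5)
b = torch.randn(4, 5, 2)
out = torch.einsum('bik,bkj->bij', a, b)  # shape [4, 3, 2]

# Sanity check: equivalent to torch.bmm
assert torch.allclose(out, torch.bmm(a, b), atol=1e-5)
```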
**Scaled dot product self-attention**
Implementation details and code explanation.
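The core of the layer can be sketched as follows: project the input into queries, keys, and values, take scaled dot products between every query and every key with einsum, softmax over the keys, and use the resulting weights to mix the values. This is a minimal educational sketch (the class and variable names are my own, not a library API):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal scaled dot-product self-attention sketch.

    Expects input of shape [batch, tokens, dim].
    """
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5          # 1 / sqrt(dim)
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)

    def forward(self, x):
        # Single projection, then split into q, k, v: each [b, t, dim]
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Dot product of every query i with every key j: [b, t, t]
        scores = torch.einsum('bid,bjd->bij', q, k) * self.scale
        attn = scores.softmax(dim=-1)
        # Attention-weighted sum of values: back to [b, t, dim]
        return torch.einsum('bij,bjd->bid', attn, v)

x = torch.randn(2, 8, 64)
out = SelfAttention(64)(x)   # shape [2, 8, 64]
```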
**Multi-Head Self-Attention**
Introduction to multiple heads in computations and implementation details.
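The multi-head version runs several independent attention "heads" in parallel on slices of the embedding, then merges them back. A hedged sketch under the assumption that `dim` divides evenly by `heads` (names are illustrative, not a library API):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of multi-head self-attention with einsum.

    Requires dim % heads == 0; input shape [batch, tokens, dim].
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        h = self.heads
        # [b, t, 3*d] -> [3, b, h, t, head_dim], then unpack q, k, v
        qkv = self.to_qkv(x).reshape(b, t, 3, h, d // h)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Per-head attention scores: [b, h, t, t]
        scores = torch.einsum('bhid,bhjd->bhij', q, k) * self.scale
        attn = scores.softmax(dim=-1)
        out = torch.einsum('bhij,bhjd->bhid', attn, v)
        # Merge heads back: [b, h, t, hd] -> [b, t, d]
        out = out.permute(0, 2, 1, 3).reshape(b, t, d)
        return self.proj(out)

x = torch.randn(2, 8, 64)
out = MultiHeadSelfAttention(64, heads=4)(x)  # shape [2, 8, 64]
```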
**TransformerEncoder**
Building Transformer blocks and Transformer Encoder using the implemented modules.
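Stacking attention with a feed-forward MLP and residual connections gives a Transformer block; repeating the block gives the encoder. A compact sketch using a pre-norm layout and, for brevity, PyTorch's built-in `nn.MultiheadAttention` (the class names and default sizes here are illustrative choices, not the article's exact implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of one pre-norm Transformer encoder block."""
    def __init__(self, dim, heads=4, mlp_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        # Residual connection around self-attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Residual connection around the MLP
        return x + self.mlp(self.norm2(x))

class TransformerEncoder(nn.Module):
    """Stack of Transformer blocks."""
    def __init__(self, dim, depth=2, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            TransformerBlock(dim, heads) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

x = torch.randn(2, 8, 64)
out = TransformerEncoder(64, depth=2, heads=4)(x)  # shape [2, 8, 64]
```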
**Conclusion**
It took me some time to solidify my understanding of self-attention and einsum, but it was a fun ride. In the next article, I will try to implement more advanced self-attention blocks for computer vision. Meanwhile, use our GitHub repository in your next project and let us know how it goes.
Don’t forget to star our repository to show us your support!
If you feel like your PyTorch fundamentals need some extra practice, learn from the best out there. Use the code aisummer35 to get an exclusive 35% discount from your favorite AI blog 🙂
**Acknowledgments**
A huge shout out to Alex Rogozhnikov (@arogozhnikov) for the awesome einops lib.
Here is a list of other resources that significantly accelerated my learning on einsum operations, attention, or transformers.
Deep Learning in Production Book: Learn how to build, train, deploy, scale and maintain deep learning models. Understand ML infrastructure and MLOps using hands-on examples.
Overall, understanding einsum operations and utilizing them in your machine learning projects can greatly improve the efficiency and readability of your code. It may take some time to get used to, but the benefits are worth the effort. Happy coding!