Implementation of the "Attention Is All You Need" paper (Vaswani et al., 2017)
This is the architecture I implemented in PyTorch 👇
The model has more than 7 million parameters, and its hyperparameters are listed below (a rough sketch of how they can be wired together follows the list):
embedding size = 256
vocab size = 1000
sequence length = 64
batch size = 64
head size = 4
total blocks in encoder = 4
total blocks in decoder = 4
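
The snippet below is a minimal sketch of plugging these hyperparameters into `torch.nn.Transformer`, not the implementation in this repo. It assumes 4 attention heads (reading "head size = 4" as the head count), a feed-forward width of 4 × embedding size (1024), learned positional embeddings, and a shared source/target vocabulary; none of these are stated above.

```python
# Sketch only: assumed hyperparameter interpretation, not this repo's model.
import torch
import torch.nn as nn

EMBEDDING_SIZE = 256
VOCAB_SIZE = 1000
SEQUENCE_LENGTH = 64
BATCH_SIZE = 64
NUM_HEADS = 4                    # assumption: "head size = 4" read as number of heads
NUM_ENCODER_BLOCKS = 4
NUM_DECODER_BLOCKS = 4
FFN_SIZE = 4 * EMBEDDING_SIZE    # assumption: the paper's usual 4x expansion


class TransformerModel(nn.Module):
    """Encoder-decoder Transformer over a shared source/target vocabulary."""

    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB_SIZE, EMBEDDING_SIZE)
        self.tgt_embed = nn.Embedding(VOCAB_SIZE, EMBEDDING_SIZE)
        self.pos_embed = nn.Embedding(SEQUENCE_LENGTH, EMBEDDING_SIZE)  # learned positions (assumption)
        self.transformer = nn.Transformer(
            d_model=EMBEDDING_SIZE,
            nhead=NUM_HEADS,
            num_encoder_layers=NUM_ENCODER_BLOCKS,
            num_decoder_layers=NUM_DECODER_BLOCKS,
            dim_feedforward=FFN_SIZE,
            batch_first=True,
        )
        self.lm_head = nn.Linear(EMBEDDING_SIZE, VOCAB_SIZE)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        src_pos = torch.arange(src.size(1), device=src.device)
        tgt_pos = torch.arange(tgt.size(1), device=tgt.device)
        # Causal mask so each target position only attends to earlier positions.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        src_emb = self.src_embed(src) + self.pos_embed(src_pos)
        tgt_emb = self.tgt_embed(tgt) + self.pos_embed(tgt_pos)
        out = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)
        return self.lm_head(out)  # (batch, sequence, vocab) logits


if __name__ == "__main__":
    model = TransformerModel()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e6:.1f}M parameters")

    src = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQUENCE_LENGTH))
    tgt = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQUENCE_LENGTH))
    print(model(src, tgt).shape)  # torch.Size([64, 64, 1000])
```

Under these assumptions the parameter count printed by the script lands in the same "more than 7 million" range quoted above; with a different feed-forward width or head interpretation the number will differ.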