# A Summary of Problems in Pytorch Tutorial of Translation with a Sequence to Sequence Network and Attention

Published:

I summarise some errors of the Pytorch tutorial of Translation with a Sequence to Sequence Network and Attention.

When I read the Pytorch tutorial for the first time, I followed the tutorial of Translation with a Sequence to Sequence Network and Attention to reproduce the Sequence-to-Seqence model with attention. Although the tutorial did not explicitly mention that it reproduced the work of Neural Machine Translation by Jointly Learning to Align and Translate, the tutorial mentioned that

there are other forms of attention that work around the length limitation by using a relative position approach. Read about “local attention” in Effective Approaches to Attention-based Neural Machine Translation.

This implied that this implementation was for Neural Machine Translation by Jointly Learning to Align and Translate, because this paper was listed in the tutorial. However, I found that the attention mechanism in this tutorial actually was not the version of Neural Machine Translation by Jointly Learning to Align and Translate. We could see the difference from the code in the tutorial:


class AttnDecoderRNN(nn.Module):
def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
super(AttnDecoderRNN, self).__init__()
self.hidden_size = hidden_size
self.output_size = output_size
self.dropout_p = dropout_p
self.max_length = max_length

self.embedding = nn.Embedding(self.output_size, self.hidden_size)
self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
self.dropout = nn.Dropout(self.dropout_p)
self.gru = nn.GRU(self.hidden_size, self.hidden_size)
self.out = nn.Linear(self.hidden_size, self.output_size)

def forward(self, input, hidden, encoder_outputs):
embedded = self.embedding(input).view(1, 1, -1)
embedded = self.dropout(embedded)

attn_weights = F.softmax(
self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
encoder_outputs.unsqueeze(0))

output = torch.cat((embedded[0], attn_applied[0]), 1)
output = self.attn_combine(output).unsqueeze(0)

output = F.relu(output)
output, hidden = self.gru(output, hidden)

output = F.log_softmax(self.out(output[0]), dim=1)
return output, hidden, attn_weights

def initHidden(self):



We could find the

attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)


which is different from the equaion $$e _ { i j } = a \left( s _ { i - 1 } , h _ { j } \right)$$ in the Neural Machine Translation by Jointly Learning to Align and Translate.

The attention is calculated based on the hidden state and output of decoder and the encoder_outputs is not included during the calculation of attention. However, the above equation shows that the attention is calculated based on the hidden state of decoder and relevant hidden state of encoder.

This problem has been proposed from GitHub, such as #84.

Although the author of this Pytorch tutorial has updated his jupyter notebook in his GitHub, the BahdanauAttnDecoderRNN(nn.Module) still has some bugs, such as #23.

Therefore, I searched other version of reproduction, like mini seq2seq. The author refered to three implementations and combined them together. His project is a standard implementation of Neural Machine Translation by Jointly Learning to Align and Translate, though there is a small bug.

Tags: