PyTorch NLP sequence length of target in Transformer

Question

I'm trying to understand the code of Transformer (https://github.com/SamLynnEvans/Transformer).

Looking at the train_model function in the "train" script, I wonder why trg_input needs a different sequence length from trg:

trg_input = trg[:, :-1]

In this case, the sequence length of trg_input is "seq_len(trg) - 1". That means trg looks like:

<sos> tok1 tok2 tokn <eos>

and trg_input looks like:

<sos> tok1 tok2 tokn    (no eos token)

Please let me know the reason.

Thank you.

The related code is below:

    for i, batch in enumerate(opt.train):
        src = batch.src.transpose(0, 1).to('cuda')
        trg = batch.trg.transpose(0, 1).to('cuda')

        trg_input = trg[:, :-1]
        src_mask, trg_mask = create_masks(src, trg_input, opt)
        preds = model(src, trg_input, src_mask, trg_mask)
        ys = trg[:, 1:].contiguous().view(-1)
        opt.optimizer.zero_grad()
        loss = F.cross_entropy(preds.view(-1, preds.size(-1)), ys, ignore_index=opt.trg_pad)
        loss.backward()
        opt.optimizer.step()


def create_masks(src, trg, opt):
    
    src_mask = (src != opt.src_pad).unsqueeze(-2)

    if trg is not None:
        trg_mask = (trg != opt.trg_pad).unsqueeze(-2)
        size = trg.size(1) # get seq_len for matrix
        np_mask = nopeak_mask(size, opt)
        if trg.is_cuda:
            np_mask = np_mask.cuda()  # .cuda() returns a copy, so keep the result
        trg_mask = trg_mask & np_mask
        
    else:
        trg_mask = None
    return src_mask, trg_mask

Sean · Accepted Answer · 2020-09-13 04:55:40Z

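That's because the aim is to generate the next token based on the tokens seen so far. Look at what goes into the model when we get our predictions: we feed not only the source sequence but also the target sequence up to the current step. Since trg_input = trg[:, :-1] drops the last token and the labels ys = trg[:, 1:] drop the first, the output at each position is scored against the token that follows it. <eos> is never needed as an input because nothing comes after it. The model inside Models.py looks like: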

class Transformer(nn.Module):
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads, dropout):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)
        self.decoder = Decoder(trg_vocab, d_model, N, heads, dropout)
        self.out = nn.Linear(d_model, trg_vocab)

    def forward(self, src, trg, src_mask, trg_mask):
        # Encode the full source sequence.
        e_outputs = self.encoder(src, src_mask)
        # Decode from the target tokens seen so far (trg is trg_input during training).
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        # Project each decoder position to scores over the target vocabulary.
        output = self.out(d_output)
        return output

So you can see that the forward method receives src and trg, which are fed into the encoder and decoder respectively. This is a bit easier to grasp if you take a look at the model architecture diagram in the original paper ("Attention Is All You Need").

The "Outputs (shifted right)" input at the bottom of the decoder corresponds to trg[:, :-1] in the code.
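
To make the shift concrete, here is a minimal sketch on a toy batch. The token ids (PAD, SOS, EOS and the numbers standing in for words) are made up for illustration, and the lower-triangular mask is only my stand-in for what nopeak_mask presumably builds; I have not copied it from the repository.

```python
import torch

# Hypothetical token ids, chosen only for this illustration.
PAD, SOS, EOS = 0, 1, 2

# A batch of two target sentences; the second one is padded.
trg = torch.tensor([
    [SOS, 11, 12, 13, EOS],
    [SOS, 21, 22, EOS, PAD],
])

trg_input = trg[:, :-1]  # what the decoder is given: drops the last column
ys = trg[:, 1:]          # what the decoder must predict: drops the first column

# Position i of trg_input lines up with position i of ys, i.e. the next token.
for seen, target in zip(trg_input[0].tolist(), ys[0].tolist()):
    print(f"latest input token {seen:>2} -> target {target:>2}")

# Assumed stand-in for the no-peek mask: row i may attend to columns 0..i only,
# so the prediction at position i cannot look at the token it has to predict.
seq_len = trg_input.size(1)
no_peek = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(no_peek)
```

In the second sentence the last pair is (<eos>, <pad>); pairs whose target is the pad id are the ones that ignore_index=opt.trg_pad leaves out of the loss in the training loop from the question.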
