I'm trying to understand the code of Transformer (https://github.com/SamLynnEvans/Transformer).
If seeing the train_model function in "train" script, I wonder why need to use the different sequence length of trg_input from trg:
trg_input = trg[:, :-1]
In this case, the sequence length of trg_input is "seq_len(trg) - 1". It means that trg is like:
<sos> tok1 tok2 tokn <eos>
and trg_input is like:
<sos> tok1 tok2 tokn (no eos token)
Please let me know the reason.
Thank you.
The related code is like below:
for i, batch in enumerate(opt.train):
src = batch.src.transpose(0, 1).to('cuda')
trg = batch.trg.transpose(0, 1).to('cuda')
trg_input = trg[:, :-1]
src_mask, trg_mask = create_masks(src, trg_input, opt)
preds = model(src, trg_input, src_mask, trg_mask)
ys = trg[:, 1:].contiguous().view(-1)
opt.optimizer.zero_grad()
loss = F.cross_entropy(preds.view(-1, preds.size(-1)), ys, ignore_index=opt.trg_pad)
loss.backward()
opt.optimizer.step()
def create_masks(src, trg, opt):
src_mask = (src != opt.src_pad).unsqueeze(-2)
if trg is not None:
trg_mask = (trg != opt.trg_pad).unsqueeze(-2)
size = trg.size(1) # get seq_len for matrix
np_mask = nopeak_mask(size, opt)
if trg.is_cuda:
np_mask.cuda()
trg_mask = trg_mask & np_mask
else:
trg_mask = None
return src_mask, trg_mask
