Hmm…
Short version: your trainer evaluation is not using the same generation setup as your manual model.generate() loop, and there is a small risk that compute_metrics is decoding the wrong tensor shape. Both will destroy an alignment-based metric like identity similarity.
I will walk through:
- How Seq2SeqTrainer evaluates a T5 model.
- What is probably different from your manual evaluation.
- Concrete causes for the 85–100% vs 14% gap.
- Exact configuration and code changes to fix it.
- A small debugging checklist.
1. What Seq2SeqTrainer.evaluate() actually does
For seq2seq tasks, the intended pattern is:
- Use Seq2SeqTrainer.
- Set predict_with_generate=True.
- Trainer then calls model.generate() inside prediction_step.
- The generated token IDs and labels are passed to compute_metrics, where you decode and compute ROUGE, BLEU, etc. (Hugging Face)
The Hugging Face docs show exactly this pattern with T5 summarization:
- Seq2SeqTrainer + predict_with_generate=True.
- compute_metrics(eval_preds) gets (preds, labels) as integer token IDs.
- It replaces -100 with pad_token_id, decodes both, then runs a text metric. (Hugging Face)
So in the "happy path" your compute_metrics should be seeing generated IDs from model.generate, not logits.
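For reference, here is a minimal sketch of that happy path. It is not your exact setup: model, tokenizer, eval_dataset, and compute_metrics stand for the objects you already have, and the argument values are placeholders.
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

# Evaluation goes through model.generate() because predict_with_generate=True.
args = Seq2SeqTrainingArguments(
    output_dir="./tmp",
    predict_with_generate=True,
    per_device_eval_batch_size=8,
)

trainer = Seq2SeqTrainer(
    model=model,                                    # your fine-tuned T5 model
    args=args,
    eval_dataset=eval_dataset,                      # tokenized eval split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,                # receives generated ids + labels
)

metrics = trainer.evaluate()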
Where generation parameters come from
In evaluation, Seq2SeqTrainer builds gen_kwargs for model.generate:
- If Seq2SeqTrainingArguments.generation_max_length is set, it becomes max_length for generate.
- If not, it falls back to model.generation_config.max_length.
- Similarly, generation_num_beams overrides num_beams; otherwise it uses model.generation_config.num_beams. (GitHub)
The official summarization script clarifies that generation_max_length is used specifically to override model.generate(..., max_length=...) during evaluate and predict. (GitHub)
So: trainer.evaluate() does not magically reuse whatever arguments you used in your own manual model.generate calls. It only uses what you encode into Seq2SeqTrainingArguments or the model's generation_config.
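A quick sanity check before calling evaluate() is to print the model's generation config and see exactly which defaults the trainer would fall back to:
# These are the values Seq2SeqTrainer falls back to when
# generation_max_length / generation_num_beams are not set in the arguments.
print(model.generation_config)              # full config, including max_length and num_beams
print(model.generation_config.max_length)   # T5 checkpoints often default to 20
print(model.generation_config.num_beams)    # usually 1 unless you changed it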
2. What is different from your manual evaluation
Your current setup:
training_args = Seq2SeqTrainingArguments(
    output_dir=f"./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to='tensorboard',
    logging_dir='./finetuning/logs/eval/' + args.ext,
    fp16=True,
    greater_is_better=True,
    # no generation_max_length
    # no generation_num_beams
)
In your manual evaluation you do something like (pseudo):
outputs = model.generate(
    input_ids,
    attention_mask=attn_mask,
    max_length=MANUAL_MAX,
    # or max_new_tokens=...
    num_beams=MANUAL_BEAMS,
    # maybe other options
)
But in the trainer:
- generation_max_length is not set.
- generation_num_beams is not set.
So during evaluate():
- max_length will come from model.generation_config.max_length (which is often a small default).
- num_beams will come from model.generation_config.num_beams (often 1 if you never changed it).
- If you used sampling or different beam settings manually, none of that is reflected in trainer evaluation. (Hugging Face Forums)
This is exactly the class of mismatch described in an HF forum thread where evaluation metrics during training and final trainer.evaluate() differed because generation_num_beams and generation_max_length were not aligned with the num_beams and max_length used elsewhere. (Hugging Face Forums)
Your identity similarity metric is a global alignment score. It is very sensitive to:
- Truncated outputs.
- Different decoding search (greedy vs multi-beam).
- Outputs that diverge after a few tokens.
So if trainer.evaluate() is generating shorter or greedier outputs than your manual loop, you can easily drop from ~90% per sample to something like 10–20%.
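As a toy illustration of how much truncation alone costs, here is a difflib-based stand-in for your aligner (your get_global_alignment_score will differ in detail, but the trend is the same): an exact prefix that covers only part of the reference already scores below 0.5.
from difflib import SequenceMatcher

reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # toy sequence
truncated = reference[:20]   # what a too-small max_length produces

print(SequenceMatcher(None, reference, reference).ratio())  # 1.0
print(SequenceMatcher(None, reference, truncated).ratio())  # ~0.47, despite a perfect prefix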
3. Other likely issues in compute_metrics
3.1 Decoding logits instead of token IDs
Your metric function:
predictions, label_ids = eval_pred
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
This assumes predictions is an integer array of shape (batch, seq_len).
However, in general Trainer semantics, EvalPrediction.predictions is "everything your model returns apart from loss". If the model returns logits, past key values, etc., predictions can be a tuple with multiple arrays, often including (batch, seq_len, vocab_size) logits. This is documented in the forums: Trainer packs into predictions all non-loss outputs from the forward pass. (Hugging Face Forums)
- If you decode logits as IDs, you get garbage text and metrics close to random.
- With seq2seq and predict_with_generate=True you should get generated IDs, but a misconfiguration, a different trainer, or a custom prediction_step can still leave you with logits in predictions.
A robust pattern is:
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
if predictions.ndim == 3:
    # (batch, seq_len, vocab_size) logits
    predictions = predictions.argmax(-1)
There is a very similar bug report in the Unsloth repo: the model generates coherent text in normal inference, but compute_metrics receives jumbled text during evaluation. The root cause there is also about how predictions are prepared and decoded in compute_metrics. (GitHub)
3.2 Multi-GPU and _state.is_main_process
You do:
if not _state.is_main_process:
    return {'GAS': 0, 'Levensgtein_score': 0, 'Identity_Similarity_Score': 0}
Using accelerate.PartialState().is_main_process is correct: Accelerate gathers all predictions and labels across ranks, and then compute_metrics is called on each process. The usual pattern is to compute metrics only on rank 0 and return {} on others. (Hugging Face)
Your zeros from non-main ranks normally do not get logged, so this guard is not the cause of the 14% score. It is fine to change it to return {} for clarity.
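For completeness, the setup that guard assumes looks roughly like this (the rest of the function is sketched in section 5.2):
from accelerate import PartialState

# Shared process state; is_main_process is True only on rank 0.
_state = PartialState()

def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {}   # non-main ranks contribute nothing to the logged metrics
    ...             # decoding and scoring happen only on the main process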
4. Concrete causes for your "85–100 vs 14" gap
Putting this together:
- Generation config mismatch (most likely)
  - Trainer uses generation_max_length / generation_num_beams (or model defaults) during evaluation. (GitHub)
  - Your manual loop uses different max_length / max_new_tokens / num_beams.
  - This yields different output texts, especially truncation and beam differences, which destroy a global identity similarity metric.
- Potential decoding of logits
  - If for any reason predictions is (batch, seq_len, vocab_size) and you decode without argmax, you get nonsense text.
  - That would turn a good model into low identity similarity numbers inside compute_metrics. HF docs explicitly show that in non-generation setups, predictions is logits and you must argmax before decoding. (Hugging Face)
- Dataset / preprocessing mismatch
  - Manual evaluation might run on a different split or with extra cleaning steps applied; trainer uses eval_dataset as passed.
  - This could explain smaller deviations, but usually not a gap as drastic as 85–100 vs 14 by itself.
The first point closely matches the T5 summarization thread where ROUGE scores differed between training-time and final evaluation because generation_num_beams and generation_max_length did not match the num_beams and max_length used elsewhere. (Hugging Face Forums)
5. How to fix it: code changes
5.1 Make Trainer use the same generation config as your manual code
Take whatever you use in manual model.generate and set the equivalents in Seq2SeqTrainingArguments:
training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to="tensorboard",
    logging_dir="./finetuning/logs/eval/" + args.ext,
    fp16=True,
    greater_is_better=True,
    # match your manual generate()
    generation_max_length=MANUAL_MAX_LENGTH,  # e.g. 128 or 256
    generation_num_beams=MANUAL_NUM_BEAMS,    # e.g. 4
)
If you rely on max_new_tokens instead of max_length, set it in the model's generation config:
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS
Then either:
- Let generation_max_length=None and rely on the model config, or
- Compute generation_max_length explicitly. Note that for an encoder-decoder like T5, max_length counts decoder tokens only, so the equivalent value is roughly max_new_tokens plus the decoder start token, not max_input_len + max_new_tokens (that rule of thumb applies to decoder-only models); see the sketch below.
The summarization example and the forum thread both confirm this is the correct way to keep evaluation behavior consistent. (GitHub)
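A sketch of those two options; MANUAL_MAX_NEW_TOKENS and MANUAL_NUM_BEAMS are stand-ins for whatever your manual generate() call actually uses:
MANUAL_MAX_NEW_TOKENS = 256   # assumption: matches your manual generate()
MANUAL_NUM_BEAMS = 4          # assumption: matches your manual generate()

# Option A: put the decoding settings on the model's generation config and
# leave generation_max_length / generation_num_beams unset in the arguments.
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS
model.generation_config.num_beams = MANUAL_NUM_BEAMS

# Option B: set them on the training arguments instead. For an encoder-decoder
# like T5, max_length counts decoder tokens only, so the rough equivalent is
# max_new_tokens plus the decoder start token.
# training_args.generation_max_length = MANUAL_MAX_NEW_TOKENS + 1
# training_args.generation_num_beams = MANUAL_NUM_BEAMS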
5.2 Make compute_metrics robust to logits
Update the metric function:
def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {}

    predictions, label_ids = eval_pred
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]

    # If we ever get logits, collapse to ids
    if predictions.ndim == 3:
        predictions = predictions.argmax(-1)

    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    gas = []
    i_sim_score = []
    lev_score = []
    for t, p in zip(label_seq, pred_seq):
        global_alignment_score, identity_similarity_score = get_global_alignment_score(t, p, aligner)
        gas.append(global_alignment_score)
        i_sim_score.append(identity_similarity_score)
        lev_score.append(get_levenshtein_score(t, p))

    avg_gas = sum(gas) / len(gas)
    avg_lev = sum(lev_score) / len(lev_score)
    avg_i_sim_score = sum(i_sim_score) / len(i_sim_score)

    return {
        "GAS": avg_gas,
        "Levensgtein_score": avg_lev,
        "Identity_Similarity_Score": avg_i_sim_score,
    }
This preserves your logic but guards against "logits instead of IDs" issues that are documented in Trainer discussions and real bug reports. (Hugging Face Forums)
5.3 Verify with trainer.predict
Run once:
predictions, label_ids, _ = trainer.predict(trainer.eval_dataset)
pred_ids = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
pred_seq = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_seq = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
Then compute identity similarity manually on these pred_seq and label_seq.
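Continuing the snippet above with the same helpers your compute_metrics already uses (get_global_alignment_score and aligner), something like:
# Score the trainer-generated outputs exactly like the manual loop does.
identity_scores = []
for t, p in zip(label_seq, pred_seq):
    _, identity_similarity = get_global_alignment_score(t, p, aligner)
    identity_scores.append(identity_similarity)

print("identity similarity:", sum(identity_scores) / len(identity_scores))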
- If you still get ~14%, the difference is entirely from generation config or dataset.
- If you get ~85–100%, then compute_metrics has an implementation error and you should focus there.
6. Minimal debugging checklist
- Print shapes inside compute_metrics on the main process: print(predictions.shape, label_ids.shape) (see the snippet after this list).
  - If predictions.ndim == 3, you are seeing logits and must argmax.
- Inspect a few decoded pairs:
  - Print the first 3 pairs of (label_seq[i], pred_seq[i]).
  - Compare them to what you see in the manual loop for the same examples.
- Set generation parameters explicitly:
  - Add generation_max_length and generation_num_beams that match your manual generate.
  - Or configure model.generation_config and let the Trainer reuse it.
- Ensure the same dataset:
  - Confirm that manual evaluation uses the same eval_dataset passed into the trainer.
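A drop-in debug block for the first two checklist items, meant to be pasted inside compute_metrics right after the batch_decode calls (it reuses the predictions, label_ids, pred_seq, and label_seq variables from section 5.2):
# Temporary debug output; remove once the shapes and texts look right.
print("predictions shape:", np.shape(predictions), "labels shape:", np.shape(label_ids))
for i in range(min(3, len(pred_seq))):
    print("label:", label_seq[i])
    print("pred :", pred_seq[i])
    print("---")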
Summary
- Seq2SeqTrainer with predict_with_generate=True uses model.generate() during evaluation, but the exact behavior depends on generation_max_length, generation_num_beams, and the model's generation_config, not on your separate manual generate calls. (Hugging Face)
- If you do not set those explicitly, trainer evaluation often produces shorter or different outputs than your manual evaluation, which is enough to drop a sensitive alignment-based metric like identity similarity from ~90% to ~14%. (Hugging Face Forums)
- There is an additional risk of decoding logits or complex tuples as IDs inside compute_metrics; guarding with if predictions.ndim == 3: predictions = predictions.argmax(-1) prevents this and matches documented Trainer behavior. (Hugging Face Forums)
- Align the generation parameters, add the logits guard, and sanity-check the trainer.predict output. After that, trainer.evaluate() should report identity similarity close to what you observe with your manual model.generate() loop.