Hmm…
Short version: your trainer evaluation is not using the same generation setup as your manual model.generate() loop, and there is a small risk that compute_metrics is decoding the wrong tensor shape. Both will destroy an alignment-based metric like identity similarity.
I will walk through:
- How Seq2SeqTrainer evaluates a T5 model.
- What is probably different from your manual evaluation.
- Concrete causes for the 85–100% vs 14% gap.
- Exact configuration and code changes to fix it.
- A small debugging checklist.
1. What Seq2SeqTrainer.evaluate() actually does
For seq2seq tasks, the intended pattern is:
- Use Seq2SeqTrainer.
- Set predict_with_generate=True.
- Trainer then calls model.generate() inside prediction_step.
- The generated token IDs and labels are passed to compute_metrics, where you decode and compute ROUGE, BLEU, etc. (Hugging Face)
The Hugging Face docs show exactly this pattern with T5 summarization:
- Seq2SeqTrainer + predict_with_generate=True.
- compute_metrics(eval_preds) gets (preds, labels) as integer token IDs.
- It replaces -100 with pad_token_id, decodes both, then runs a text metric. (Hugging Face)
So in the "happy path" your compute_metrics should be seeing generated IDs from model.generate, not logits.
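For reference, here is a minimal sketch of that happy path. It is not your exact setup: model, tokenizer, eval_dataset, and compute_metrics stand for the objects you already have, and the argument values are placeholders.
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

# Evaluation goes through model.generate() because predict_with_generate=True.
args = Seq2SeqTrainingArguments(
    output_dir="./tmp",
    predict_with_generate=True,
    per_device_eval_batch_size=8,
)

trainer = Seq2SeqTrainer(
    model=model,                                    # your fine-tuned T5 model
    args=args,
    eval_dataset=eval_dataset,                      # tokenized eval split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,                # receives generated ids + labels
)

metrics = trainer.evaluate()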
Where generation parameters come from
In evaluation, Seq2SeqTrainer builds gen_kwargs for model.generate:
- If Seq2SeqTrainingArguments.generation_max_length is set, it becomes max_length for generate.
- If not, it falls back to model.generation_config.max_length.
- Similarly, generation_num_beams overrides num_beams; otherwise it uses model.generation_config.num_beams. (GitHub)
The official summarization script clarifies that generation_max_length is used specifically to override model.generate(..., max_length=...) during evaluate and predict. (GitHub)
So: trainer.evaluate() does not magically reuse whatever arguments you used in your own manual model.generate calls. It only uses what you encode into Seq2SeqTrainingArguments or the model's generation_config.
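A quick sanity check before calling evaluate() is to print the model's generation config and see exactly which defaults the trainer would fall back to:
# These are the values Seq2SeqTrainer falls back to when
# generation_max_length / generation_num_beams are not set in the arguments.
print(model.generation_config)              # full config, including max_length and num_beams
print(model.generation_config.max_length)   # T5 checkpoints often default to 20
print(model.generation_config.num_beams)    # usually 1 unless you changed it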
2. What is different from your manual evaluation
Your current setup:
training_args = Seq2SeqTrainingArguments(
    output_dir=f"./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to='tensorboard',
    logging_dir='./finetuning/logs/eval/' + args.ext,
    fp16=True,
    greater_is_better=True,
    # no generation_max_length
    # no generation_num_beams
)
In your manual evaluation you do something like (pseudo):
outputs = model.generate(
    input_ids,
    attention_mask=attn_mask,
    max_length=MANUAL_MAX,
    # or max_new_tokens=...
    num_beams=MANUAL_BEAMS,
    # maybe other options
)
But in the trainer:
- generation_max_length is not set.
- generation_num_beams is not set.
So during evaluate():
- max_length will come from model.generation_config.max_length (which is often a small default).
- num_beams will come from model.generation_config.num_beams (often 1 if you never changed it).
- If you used sampling or different beam settings manually, none of that is reflected in trainer evaluation. (Hugging Face Forums)
This is exactly the class of mismatch described in an HF forum thread where evaluation metrics during training and final trainer.evaluate() differed because generation_num_beams and generation_max_length were not aligned with the num_beams and max_length used elsewhere. (Hugging Face Forums)
Your identity similarity metric is a global alignment score. It is very sensitive to:
- Truncated outputs.
- Different decoding search (greedy vs multi-beam).
- Outputs that diverge after a few tokens.
So if trainer.evaluate() is generating shorter or greedier outputs than your manual loop, you can easily drop from ~90% per sample to something like 10–20%.
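As a toy illustration of how much truncation alone costs, here is a difflib-based stand-in for your aligner (your get_global_alignment_score will differ in detail, but the trend is the same): an exact prefix that covers only part of the reference already scores below 0.5.
from difflib import SequenceMatcher

reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # toy sequence
truncated = reference[:20]   # what a too-small max_length produces

print(SequenceMatcher(None, reference, reference).ratio())  # 1.0
print(SequenceMatcher(None, reference, truncated).ratio())  # ~0.47, despite a perfect prefix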
3. Other likely issues in compute_metrics
3.1 Decoding logits instead of token IDs
Your metric function:
predictions, label_ids = eval_pred
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
This assumes predictions is an integer array of shape (batch, seq_len).
However, in general Trainer semantics, EvalPrediction.predictions is "everything your model returns apart from loss". If the model returns logits, past key values, etc., predictions can be a tuple with multiple arrays, often including (batch, seq_len, vocab_size) logits. This is documented in the forums: Trainer packs into predictions all non-loss outputs from the forward pass. (Hugging Face Forums)
- If you decode logits as IDs, you get garbage text and metrics close to random.
- With seq2seq and predict_with_generate=True you should get generated IDs, but a misconfiguration, a different trainer, or a custom prediction_step can still leave you with logits in predictions.
A robust pattern is:
if isinstance(predictions, (tuple, list)):
    predictions = predictions[0]
if predictions.ndim == 3:
    # (batch, seq_len, vocab_size) logits
    predictions = predictions.argmax(-1)
There is a very similar bug report in the Unsloth repo: the model generates coherent text in normal inference, but compute_metrics receives jumbled text during evaluation. The root cause there is also about how predictions are prepared and decoded in compute_metrics. (GitHub)
3.2 Multi-GPU and _state.is_main_process
You do:
if not _state.is_main_process:
    return {'GAS': 0, 'Levensgtein_score': 0, 'Identity_Similarity_Score': 0}
Using accelerate.PartialState().is_main_process is correct: Accelerate gathers all predictions and labels across ranks, and then compute_metrics is called on each process. The usual pattern is to compute metrics only on rank 0 and return {} on others. (Hugging Face)
Your zeros from non-main ranks normally do not get logged, so this guard is not the cause of the 14% score. It is fine to change it to return {} for clarity.
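For completeness, the setup that guard assumes looks roughly like this (the rest of the function is sketched in section 5.2):
from accelerate import PartialState

# Shared process state; is_main_process is True only on rank 0.
_state = PartialState()

def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {}   # non-main ranks contribute nothing to the logged metrics
    ...             # decoding and scoring happen only on the main process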
4. Concrete causes for your "85–100 vs 14" gap
Putting this together:
- Generation config mismatch (most likely)
  - Trainer uses generation_max_length / generation_num_beams (or model defaults) during evaluation. (GitHub)
  - Your manual loop uses different max_length / max_new_tokens / num_beams.
  - This yields different output texts, especially truncation and beam differences, which destroy a global identity similarity metric.
- Potential decoding of logits
  - If for any reason predictions is (batch, seq_len, vocab_size) and you decode without argmax, you get nonsense text.
  - That would turn a good model into low identity similarity numbers inside compute_metrics. HF docs explicitly show that in non-generation setups, predictions is logits and you must argmax before decoding. (Hugging Face)
- Dataset / preprocessing mismatch
  - Manual evaluation might run on a different split or with extra cleaning steps applied; trainer uses eval_dataset as passed.
  - This could explain smaller deviations, but usually not a gap as drastic as 85–100 vs 14 by itself.
The first point closely matches the T5 summarization thread where ROUGE scores differed between training-time and final evaluation because generation_num_beams and generation_max_length did not match the num_beams and max_length used elsewhere. (Hugging Face Forums)
5. How to fix it: code changes
5.1 Make Trainer use the same generation config as your manual code
Take whatever you use in manual model.generate and set the equivalents in Seq2SeqTrainingArguments:
training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuning/temp",
    predict_with_generate=True,
    per_device_eval_batch_size=args.batch_size,
    report_to="tensorboard",
    logging_dir="./finetuning/logs/eval/" + args.ext,
    fp16=True,
    greater_is_better=True,
    # match your manual generate()
    generation_max_length=MANUAL_MAX_LENGTH,  # e.g. 128 or 256
    generation_num_beams=MANUAL_NUM_BEAMS,    # e.g. 4
)
If you rely on max_new_tokens instead of max_length, set it in the model's generation config:
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS
Then either:
- Let generation_max_length=None and rely on the model config, or
- Compute generation_max_length explicitly. Note that for an encoder-decoder like T5, max_length counts decoder tokens only, so the equivalent value is roughly max_new_tokens plus the decoder start token, not max_input_len + max_new_tokens (that rule of thumb applies to decoder-only models); see the sketch below.
The summarization example and the forum thread both confirm this is the correct way to keep evaluation behavior consistent. (GitHub)
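A sketch of those two options; MANUAL_MAX_NEW_TOKENS and MANUAL_NUM_BEAMS are stand-ins for whatever your manual generate() call actually uses:
MANUAL_MAX_NEW_TOKENS = 256   # assumption: matches your manual generate()
MANUAL_NUM_BEAMS = 4          # assumption: matches your manual generate()

# Option A: put the decoding settings on the model's generation config and
# leave generation_max_length / generation_num_beams unset in the arguments.
model.generation_config.max_new_tokens = MANUAL_MAX_NEW_TOKENS
model.generation_config.num_beams = MANUAL_NUM_BEAMS

# Option B: set them on the training arguments instead. For an encoder-decoder
# like T5, max_length counts decoder tokens only, so the rough equivalent is
# max_new_tokens plus the decoder start token.
# training_args.generation_max_length = MANUAL_MAX_NEW_TOKENS + 1
# training_args.generation_num_beams = MANUAL_NUM_BEAMS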
5.2 Make compute_metrics robust to logits
Update the metric function:
def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {}

    predictions, label_ids = eval_pred
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]

    # If we ever get logits, collapse to ids
    if predictions.ndim == 3:
        predictions = predictions.argmax(-1)

    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_seq = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    gas = []
    i_sim_score = []
    lev_score = []
    for t, p in zip(label_seq, pred_seq):
        global_alignment_score, identity_similarity_score = get_global_alignment_score(t, p, aligner)
        gas.append(global_alignment_score)
        i_sim_score.append(identity_similarity_score)
        lev_score.append(get_levenshtein_score(t, p))

    avg_gas = sum(gas) / len(gas)
    avg_lev = sum(lev_score) / len(lev_score)
    avg_i_sim_score = sum(i_sim_score) / len(i_sim_score)

    return {
        "GAS": avg_gas,
        "Levensgtein_score": avg_lev,
        "Identity_Similarity_Score": avg_i_sim_score,
    }
This preserves your logic but guards against "logits instead of IDs" issues that are documented in Trainer discussions and real bug reports. (Hugging Face Forums)
5.3 Verify with trainer.predict
Run once:
predictions, label_ids, _ = trainer.predict(trainer.eval_dataset)
pred_ids = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
pred_seq = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_seq = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
Then compute identity similarity manually on these pred_seq and label_seq.
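Continuing the snippet above with the same helpers your compute_metrics already uses (get_global_alignment_score and aligner), something like:
# Score the trainer-generated outputs exactly like the manual loop does.
identity_scores = []
for t, p in zip(label_seq, pred_seq):
    _, identity_similarity = get_global_alignment_score(t, p, aligner)
    identity_scores.append(identity_similarity)

print("identity similarity:", sum(identity_scores) / len(identity_scores))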
- If you still get ~14%, the difference is entirely from generation config or dataset.
- If you get ~85–100%, then compute_metrics has an implementation error and you should focus there.
6. Minimal debugging checklist
- Print shapes inside compute_metrics on the main process: print(predictions.shape, label_ids.shape) (see the snippet after this list).
  - If predictions.ndim == 3, you are seeing logits and must argmax.
- Inspect a few decoded pairs:
  - Print the first 3 pairs of (label_seq[i], pred_seq[i]).
  - Compare them to what you see in the manual loop for the same examples.
- Set generation parameters explicitly:
  - Add generation_max_length and generation_num_beams that match your manual generate.
  - Or configure model.generation_config and let the Trainer reuse it.
- Ensure the same dataset:
  - Confirm that manual evaluation uses the same eval_dataset passed into the trainer.
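A drop-in debug block for the first two checklist items, meant to be pasted inside compute_metrics right after the batch_decode calls (it reuses the predictions, label_ids, pred_seq, and label_seq variables from section 5.2):
# Temporary debug output; remove once the shapes and texts look right.
print("predictions shape:", np.shape(predictions), "labels shape:", np.shape(label_ids))
for i in range(min(3, len(pred_seq))):
    print("label:", label_seq[i])
    print("pred :", pred_seq[i])
    print("---")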
Summary
- Seq2SeqTrainer with predict_with_generate=True uses model.generate() during evaluation, but the exact behavior depends on generation_max_length, generation_num_beams, and the model's generation_config, not on your separate manual generate calls. (Hugging Face)
- If you do not set those explicitly, trainer evaluation often produces shorter or different outputs than your manual evaluation, which is enough to drop a sensitive alignment-based metric like identity similarity from ~90% to ~14%. (Hugging Face Forums)
- There is an additional risk of decoding logits or complex tuples as IDs inside compute_metrics; guarding with if predictions.ndim == 3: predictions = predictions.argmax(-1) prevents this and matches documented Trainer behavior. (Hugging Face Forums)
- Align the generation parameters, add the logits guard, and sanity-check the trainer.predict output. After that, trainer.evaluate() should report identity similarity close to what you observe with your manual model.generate() loop.