LLM alignment typically relies on large, expensive reward models. What if a simple metric could replace them?

In a new #NeurIPS2025 paper, Lambda’s Amir Zadeh and Chuan Li introduce BLEUBERI, which uses BLEU scores as the reward for instruction following: https://lnkd.in/eV3XHFQz

With high-quality synthetic references, BLEU, a surprisingly simple metric, agrees with human preferences about 74 percent of the time, close to the performance of 20B-scale reward models. BLEUBERI-trained models achieve competitive results on MT-Bench, ArenaHard, and WildBench, and their responses are often more factually grounded. The result: significantly cheaper alignment with strong output quality.
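To make the idea concrete, here is a minimal sketch of how a BLEU-based reward could be computed against reference responses. This is an illustration only, not the authors' implementation: the `bleu_reward` helper and the use of the sacrebleu library are assumptions, and the actual paper plugs such a reward into a reinforcement-learning training loop rather than scoring responses in isolation.

```python
# Sketch (assumed, not the paper's code): score a model response with
# sentence-level BLEU against high-quality reference responses, then use
# that score as a scalar reward for instruction-following training.
from sacrebleu.metrics import BLEU

# effective_order=True avoids zero scores on short responses
_bleu = BLEU(effective_order=True)

def bleu_reward(response: str, references: list[str]) -> float:
    """Return a reward in [0, 1]: sentence BLEU of `response` vs. the references."""
    score = _bleu.sentence_score(response, references)
    return score.score / 100.0  # sacrebleu reports BLEU on a 0-100 scale

# Example: the response that stays closer to the reference earns a higher reward.
refs = ["To reset your password, open Settings and choose 'Reset password'."]
print(bleu_reward("Open Settings and choose 'Reset password' to reset it.", refs))
print(bleu_reward("I cannot help with that request.", refs))
```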