RLHF, also called reinforcement learning from human preferences, is uniquely suited to tasks with goals that are complex, ill-defined or difficult to specify. For example, it would be impractical (or even impossible) for an algorithmic solution to define “funny” in mathematical terms, but easy for humans to rate jokes generated by a large language model (LLM). That human feedback, distilled into a reward function, could then be used to improve the LLM’s joke-writing abilities.
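To make that idea concrete, the sketch below shows one common way pairwise ratings such as “joke A is funnier than joke B” can be distilled into a scalar reward model. This is a minimal illustration in PyTorch, not code from any of the systems discussed here; the RewardModel class, the random placeholder features and all hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch: distilling pairwise human preferences into a reward model.
# Everything here (model size, feature dimension, data) is an illustrative
# assumption, not the setup used by any paper cited in this article.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size representation of a response to a scalar reward."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the response the human preferred should
    # receive a higher scalar reward than the one they rejected.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training loop on random "features" standing in for encoded joke pairs.
torch.manual_seed(0)
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    chosen = torch.randn(8, 16)    # placeholder for the joke the rater preferred
    rejected = torch.randn(8, 16)  # placeholder for the joke the rater rejected
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained this way, the reward model can score new outputs automatically, so the policy can be improved without asking humans to rate every single generation.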
In a 2017 paper, OpenAI’s Paul F. Christiano, alongside other researchers from OpenAI and DeepMind, detailed RLHF’s success in training AI models to play Atari games and perform simulated robotic locomotion.1 Following this breakthrough, video games continued to serve as an important proving ground for RLHF: by 2019, RLHF-trained AI systems such as OpenAI Five and DeepMind’s AlphaStar had defeated top professional players in the far more complex Dota 22 and StarCraft3, respectively.
Perhaps most importantly, OpenAI’s 2017 paper noted that its methodology, which pairs a learned reward model with a policy-optimization algorithm such as proximal policy optimization (PPO) for updating model weights, greatly reduced the cost of gathering and distilling the necessary human feedback. This paved the way for the eventual integration of RLHF with natural language processing (NLP), an advance that helped usher both LLMs and RLHF into the vanguard of AI research.
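For readers unfamiliar with PPO, the sketch below shows its core clipped surrogate objective, which keeps each update close to the policy that generated the data. The tensor shapes and random placeholder values are assumptions for illustration; in a real RLHF setup the log-probabilities would come from the language model and the advantages would be derived from the learned reward model’s scores.

```python
# Minimal sketch of PPO's clipped surrogate objective. The placeholder tensors
# below are illustrative assumptions, not values from any real model.
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the updated policy and the policy that
    # generated the data.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Clipping the ratio keeps each update small, which stabilizes training.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize; PPO maximizes the objective.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random placeholder values.
torch.manual_seed(0)
new_lp = torch.randn(32, requires_grad=True)   # log-probs under current policy
old_lp = new_lp.detach() + 0.1 * torch.randn(32)  # log-probs under old policy
adv = torch.randn(32)                          # reward-model-derived advantages
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```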
The first release of code detailing the use of RLHF on language models came in 2019 from OpenAI4, which went on to release the RLHF-trained InstructGPT in early 2022.5 This was a crucial step in bridging the gap between GPT-3 and the GPT-3.5-turbo models that powered the launch of ChatGPT.
RLHF has since been used in the training of state-of-the-art LLMs from OpenAI, DeepMind, Google6 and Anthropic.7