RLHF, also called reinforcement learning from human preferences, is uniquely suited to tasks with goals that are complex, ill-defined or difficult to specify. For example, it would be impractical (or even impossible) for an algorithmic solution to define “funny” in mathematical terms, but easy for humans to rate jokes generated by a large language model (LLM). That human feedback, distilled into a reward function, could then be used to improve the LLM’s joke-writing abilities.
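To make that idea concrete, the sketch below shows one common way pairwise ratings such as “joke A is funnier than joke B” can be distilled into a scalar reward model. This is a minimal illustration in PyTorch, not code from any of the systems discussed here; the RewardModel class, the random placeholder features and all hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch: distilling pairwise human preferences into a reward model.
# Everything here (model size, feature dimension, data) is an illustrative
# assumption, not the setup used by any paper cited in this article.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) fixed-size representation of a response to a scalar reward."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the response the human preferred should
    # receive a higher scalar reward than the one they rejected.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training loop on random "features" standing in for encoded joke pairs.
torch.manual_seed(0)
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    chosen = torch.randn(8, 16)    # placeholder for the joke the rater preferred
    rejected = torch.randn(8, 16)  # placeholder for the joke the rater rejected
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained this way, the reward model can score new outputs automatically, so the policy can be improved without asking humans to rate every single generation.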
In a 2017 paper, OpenAI’s Paul F. Christiano, alongside other researchers from OpenAI and DeepMind, detailed RLHF’s success in training AI models to play Atari games and perform simulated robotic locomotion.1 Following this breakthrough, video games continued to serve as an important proving ground for RLHF: by 2019, RLHF-trained AI systems such as OpenAI Five and DeepMind’s AlphaStar had defeated top professional players in the far more complex Dota 22 and StarCraft3, respectively.
Perhaps most importantly, OpenAI’s 2017 paper noted that its methodology, which pairs a learned reward model with a policy-optimization algorithm such as proximal policy optimization (PPO) for updating model weights, greatly reduced the cost of gathering and distilling the necessary human feedback. This paved the way for the eventual integration of RLHF with natural language processing (NLP), an advance that helped usher both LLMs and RLHF into the vanguard of AI research.
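For readers unfamiliar with PPO, the sketch below shows its core clipped surrogate objective, which keeps each update close to the policy that generated the data. The tensor shapes and random placeholder values are assumptions for illustration; in a real RLHF setup the log-probabilities would come from the language model and the advantages would be derived from the learned reward model’s scores.

```python
# Minimal sketch of PPO's clipped surrogate objective. The placeholder tensors
# below are illustrative assumptions, not values from any real model.
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the updated policy and the policy that
    # generated the data.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Clipping the ratio keeps each update small, which stabilizes training.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize; PPO maximizes the objective.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random placeholder values.
torch.manual_seed(0)
new_lp = torch.randn(32, requires_grad=True)   # log-probs under current policy
old_lp = new_lp.detach() + 0.1 * torch.randn(32)  # log-probs under old policy
adv = torch.randn(32)                          # reward-model-derived advantages
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```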
The first release of code detailing the use of RLHF on language models came in 2019 from OpenAI4, which went on to release the RLHF-trained InstructGPT in early 2022.5 This was a crucial step in bridging the gap between GPT-3 and the GPT-3.5-turbo models that powered the launch of ChatGPT.
RLHF has since been used in the training of state-of-the-art LLMs from OpenAI, DeepMind, Google6 and Anthropic.7