Reinforcement learning with human feedback (RLHF) uses a separate reward model to fine-tune a pre-trained model for complex, subjective tasks. An ML model cannot judge on its own whether a piece of writing is evocative, but humans can, and those humans can teach a model to mimic their preferences.
With RLHF, humans train a reward model for the new task. The reward model’s job is to predict how a human would rate a given output. Whereas standard model training penalizes errors, reward training incentivizes the outputs that humans prefer.
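For a concrete picture, here is a minimal sketch of reward-model training in PyTorch, assuming each response is already encoded as a toy feature vector; the RewardModel class, the random placeholder data, and the pairwise preference loss are illustrative choices, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an (input, response) feature vector to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

dim = 16
model = RewardModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Each pair holds features for the response a human preferred ("chosen")
# and one they did not ("rejected"); the data here is a random placeholder.
chosen = torch.randn(32, dim)
rejected = torch.randn(32, dim)

for step in range(100):
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    # Pairwise (Bradley-Terry style) loss: push the preferred response's
    # reward above the rejected one's, rather than penalizing an absolute error.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point is that the reward model never sees a "correct answer"; it only learns to score one output above another, mirroring the human comparisons it was trained on.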
The reward model in turn teaches the foundation model how to behave, based on the preferences of the human trainers. Once the reward model is trained, it can guide the foundation model's fine-tuning without a human in the loop.
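Continuing the sketch, the snippet below shows how a trained reward model could score the foundation model's outputs during fine-tuning, with no human labels in the loop. The small linear policy stands in for the foundation model, the REINFORCE-style update is a simplification of the PPO step typically used in practice, and the reward_model_score helper is a hypothetical placeholder for the model trained above.

```python
import torch
import torch.nn as nn

vocab = 8                              # toy "vocabulary" of possible responses
policy = nn.Linear(16, vocab)          # stand-in for the foundation model
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_model_score(prompts: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    # Placeholder for the trained reward model's scalar score per response.
    return torch.randn(prompts.shape[0])

for step in range(100):
    prompts = torch.randn(32, 16)
    logits = policy(prompts)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                          # model generates responses
    rewards = reward_model_score(prompts, actions)   # reward model judges them
    # Policy-gradient update: raise the probability of responses the reward
    # model scored highly; no human is consulted anywhere in this loop.
    loss = -(dist.log_prob(actions) * rewards).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```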
As with all types of machine learning, the model is not thinking critically, or even thinking at all. Rather, it is mathematically choosing the outcome that is most likely to match the preferences of its human trainers.