Glossary Entry

Reinforcement Learning from Human Feedback

A post-training method that aligns a language model to human preferences by training a reward model on human comparisons, then optimising the policy against that reward with RL.

RL Training LLMs

Also called: RLHF

Reinforcement learning from human feedback (RLHF) is a widely used recipe for turning a pre-trained language model into a helpful assistant. People rank or compare model outputs, those comparisons train a reward model that predicts which response a person would prefer, and a policy optimisation algorithm such as PPO then updates the language model toward responses the reward model scores highly.
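The reward-model step can be made concrete with a small sketch. It assumes the common Bradley-Terry formulation, in which the reward model is trained to minimise -log sigmoid(r_chosen - r_rejected) for each human comparison; the entry itself does not specify a loss, and the function name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison under an assumed Bradley-Terry model.

    Returns -log(sigmoid(reward_chosen - reward_rejected)). The loss is small
    when the reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model agrees with the human label: low loss.
agree = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
# The reward model disagrees with the human label: high loss.
disagree = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

In a real pipeline the two rewards would come from a neural network and the loss would be averaged over a batch and minimised by gradient descent; plain floats are used here only to show the shape of the objective.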

Training pipelines for reasoning models reuse the machinery of RLHF but swap the reward source. Instead of a learned human-preference reward, they often optimise against automatically verifiable rewards (a correct final answer or a valid output format), which are cheaper to compute, less noisy, and harder to game on tasks with checkable solutions.
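A verifiable reward of this kind can be as simple as a string check. The sketch below is illustrative only: the `<answer>...</answer>` tag convention, the function name `verifiable_reward`, and the reward values 1.0, 0.1, and 0.0 are all assumptions, not something this entry or any particular system prescribes.

```python
import re


def verifiable_reward(response: str, expected_answer: str) -> float:
    """Score a response with rules only, with no learned reward model.

    Assumed convention: the final answer is wrapped in <answer>...</answer>.
    The reward values are arbitrary illustrative choices.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # output format is invalid: no answer tags found
    if match.group(1).strip() == expected_answer.strip():
        return 1.0  # valid format and correct final answer
    return 0.1  # valid format, wrong answer: small credit for format only


correct = verifiable_reward("6 * 7 = 42, so <answer>42</answer>", "42")
wrong = verifiable_reward("I think <answer>41</answer>", "42")
unformatted = verifiable_reward("The answer is 42.", "42")
```

Because the check is deterministic, the same response always receives the same reward, which is one reason such rewards tend to be less noisy than a learned preference model.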