Reinforcement learning from human feedback (RLHF) is the standard recipe for turning a raw pre-trained model into a helpful assistant. People rank or compare model outputs, those comparisons train a reward model, and a policy optimisation algorithm such as PPO then nudges the language model toward responses the reward model scores highly.
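To make the reward-model step concrete, here is a minimal sketch of a pairwise preference loss. The text does not say which loss is used; this assumes the commonly used Bradley-Terry form, where the loss is the negative log-sigmoid of the score gap between the preferred and the rejected response. The function name and the use of plain floats (rather than tensors from a real model) are illustrative assumptions.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration; a real reward model would compute
    this over batches of tensors and backpropagate through it.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the human-preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_agree = pairwise_preference_loss(2.0, 0.0)
loss_disagree = pairwise_preference_loss(0.0, 2.0)
```

Minimising this loss over many human comparisons pushes the reward model to score preferred responses above rejected ones, which is the signal the policy optimiser later climbs.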
Reasoning models reuse the machinery of RLHF but swap the reward source. Instead of a reward model learned from human preferences, they often optimise against automatically verifiable rewards, such as a correct final answer or a valid output format. On tasks with checkable solutions, such rewards are cheaper, less noisy, and harder to game than a learned reward.
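A verifiable reward of this kind can be as simple as a string check. The sketch below is one possible shape, not a description of any particular system: the `<answer>...</answer>` tag convention, the function name, and the reward values (1.0, 0.1, 0.0) are all assumptions chosen for illustration.

```python
import re

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(response: str, reference_answer: str) -> float:
    """Score a response by checking its format and its final answer.

    Hypothetical reward values:
      1.0  valid format and the answer matches the reference
      0.1  valid format but the answer is wrong
      0.0  no answer tag found
    """
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    if predicted == reference_answer.strip():
        return 1.0
    return 0.1


reward = verifiable_reward("6 * 7 is 42. <answer>42</answer>", "42")
```

Because the check is deterministic, no human labelling or learned reward model is needed at training time. Exact string matching is the simplest possible verifier; real tasks may need answer normalisation or a test harness instead.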
