reinforcement learning from human feedback (RLHF)

Reinforcement learning from human feedback (RLHF) is a training technique that aligns a large language model with human preferences by using human comparisons of model outputs to shape the model’s behavior.

A typical RLHF pipeline runs in three stages:

  • Supervised fine-tuning on human-written prompts and responses to give the base model instruction-following behavior.
  • Reward model training, where annotators rank pairs of model outputs and a separate model learns to predict those preferences as a scalar reward.
  • Policy optimization with a reinforcement learning algorithm, typically proximal policy optimization (PPO), that maximizes the reward while a Kullback-Leibler divergence penalty discourages drift from the supervised baseline.
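The second and third stages can be sketched numerically. The snippet below is a minimal illustration, not a production implementation: it assumes a Bradley-Terry style pairwise loss for the reward model, which is a common choice but not one this entry specifies, and a per-sample log-probability difference as the Kullback-Leibler estimate. The function names and the `beta` coefficient are hypothetical, chosen only for this example.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one annotated pair: -log(sigmoid(r_chosen - r_rejected)).

    Assumes a Bradley-Terry style preference model (an assumption of this sketch).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward the policy optimizer would maximize for one sampled response."""
    # Single-sample estimate of the KL term: log pi(y|x) - log pi_ref(y|x).
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # The reward model is penalized more when it scores the rejected output higher.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
    # A response the policy likes more than the supervised baseline does is discounted.
    print(kl_penalized_reward(1.0, logprob_policy=-2.0, logprob_reference=-3.0))
```

In practice these quantities are computed over batches of token-level log-probabilities with a deep learning framework. The scalar version only shows how the preference loss and the penalty term fit together.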

RLHF was popularized by InstructGPT and is used to align assistants such as ChatGPT, Claude, and Gemini. Variants include direct preference optimization (DPO), which skips the explicit reward model, and reinforcement learning from AI feedback (RLAIF), which replaces human annotators with another model. Known limitations include reward hacking, where the policy exploits flaws in the reward model instead of genuinely improving, sensitivity to disagreement among annotators, and bias inherited from the preference data.
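To make the DPO variant concrete, here is a rough sketch of its per-pair loss, which works directly on log-probabilities from the policy and a frozen reference model instead of on a learned reward. The function name, argument names, and the `beta` value are assumptions made for illustration, and the inputs are taken to be summed sequence log-probabilities.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical temperature on the implicit reward
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * log-ratio gap))."""
    # How far the policy has moved from the reference on each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

The reference log-probabilities play the role that the Kullback-Leibler penalty plays in the PPO stage: the loss only rewards the policy for preferring the chosen response more strongly than the reference already does.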

Tutorial

Build an LLM RAG Chatbot With LangChain

Large language models (LLMs) have taken the world by storm, demonstrating unprecedented capabilities in natural language tasks. In this step-by-step tutorial, you'll leverage LLMs to build your own retrieval-augmented generation (RAG) chatbot using synthetic data with LangChain and Neo4j.


By Martin Breuss • Updated May 28, 2026