Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm where an agent interacts with an environment by observing states, executing actions, and receiving reward feedback, with the goal of improving its policy to maximize cumulative (often discounted) reward.
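To make the interaction loop concrete, here is a minimal Python sketch. The `ToyEnv` class and the random policy are illustrative stand-ins invented for this example, not part of any particular library; the helper at the end computes the discounted return G = r₀ + γ·r₁ + γ²·r₂ + ⋯.

```python
import random

class ToyEnv:
    """A five-state corridor: move left or right, reward 1.0 on reaching the end."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done

def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

env = ToyEnv()
state, rewards, done = env.reset(), [], False
while not done:
    action = random.choice([0, 1])        # a placeholder policy: act at random
    state, reward, done = env.step(action)
    rewards.append(reward)
print("discounted return:", discounted_return(rewards))
```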
Many RL problems are formalized as Markov decision processes (MDPs), defined by a set of states, a set of actions, transition dynamics, and a reward function. When the agent’s observations are partial or noisy, the problem may instead be modeled as a partially observable Markov decision process (POMDP).
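As a concrete (and entirely made-up) illustration, those four ingredients can be written out as plain tables; the state and action names below are hypothetical:

```python
# A tiny two-state MDP written out as explicit tables; the names and
# numbers are illustrative, not drawn from any benchmark.
states = ["low_battery", "charged"]
actions = ["wait", "recharge"]

# transitions[state][action] -> list of (next_state, probability)
transitions = {
    "low_battery": {"wait":     [("low_battery", 1.0)],
                    "recharge": [("charged", 0.9), ("low_battery", 0.1)]},
    "charged":     {"wait":     [("charged", 0.8), ("low_battery", 0.2)]},
}
transitions["charged"]["recharge"] = [("charged", 1.0)]

# rewards[state][action] -> expected immediate reward
rewards = {
    "low_battery": {"wait": -1.0, "recharge": 0.0},
    "charged":     {"wait": 1.0,  "recharge": -0.5},
}
```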
RL techniques include:
- Value-based methods, such as Q-learning and SARSA, which learn action-value or state-value functions and derive policies from them (a tabular Q-learning update is sketched after this list).
- Policy-based methods that directly optimize a parameterized policy.
- Actor–critic algorithms that combine both: the critic estimates values, and the actor updates the policy.
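To give one concrete instance of a value-based method, here is a tabular Q-learning sketch. It reuses the `ToyEnv` class from the interaction-loop example above; the learning rate, discount factor, and exploration rate are illustrative defaults, not recommended values.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters
Q = defaultdict(lambda: [0.0, 0.0])      # Q[state] -> [value of action 0, value of action 1]

env = ToyEnv()                           # the toy corridor from the loop sketch above
for episode in range(500):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy exploration: usually exploit, sometimes act at random
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        next_state, r, done = env.step(action)
        # Q-learning target: r + gamma * max_a' Q(s', a'), with zero value past terminal states
        target = r + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
```

SARSA differs only in the target: it bootstraps from the action the policy actually takes next rather than the greedy maximum.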
In practice, modern RL systems combine exploration strategies such as ε-greedy, entropy bonuses, upper confidence bounds (UCB), and intrinsic motivation with experience replay (for off-policy learning), function approximation, and stability enhancements such as target networks, gradient clipping, and trust regions.
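Experience replay, for instance, is often just a bounded buffer of past transitions sampled uniformly at random. A minimal sketch follows; the capacity and batch size are arbitrary choices for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past (s, a, r, s', done) transitions and samples random minibatches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive
        # transitions, which is one reason replay helps off-policy learners.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```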
Some RL methods are model-based, where the agent learns or uses an internal model of environment dynamics to plan actions.
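As a rough sketch of the model-based idea (not any specific published algorithm), an agent can keep counts of observed transitions and rewards, then plan with a one-step lookahead against that learned model; the `value` table passed in is assumed to come from elsewhere:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times s' followed (s, a)
reward_sum = defaultdict(float)                  # total reward observed for each (s, a)

def record(s, a, r, s_next):
    """Update the learned model with one observed transition."""
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r

def plan_one_step(s, actions, value, gamma=0.99):
    """Pick the action whose model-predicted one-step return is highest."""
    def q(a):
        n = sum(counts[(s, a)].values())
        if n == 0:
            return 0.0                           # unvisited pairs get a neutral estimate
        expected_r = reward_sum[(s, a)] / n
        expected_v = sum(c / n * value.get(s_next, 0.0)
                         for s_next, c in counts[(s, a)].items())
        return expected_r + gamma * expected_v
    return max(actions, key=q)
```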
Applications include game playing, robotics, autonomous control, recommendation systems, operations research, and more. RL systems are evaluated using metrics like return, sample efficiency, robustness, and adherence to safety constraints.