Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm in which an agent interacts with an environment by observing states, executing actions, and receiving reward feedback, with the goal of improving its policy to maximize cumulative (often discounted) reward.
Many RL problems are formalized as Markov decision processes (MDPs), defined by a set of states, a set of actions, transition dynamics, and a reward function. When the agent’s observations are partial or noisy, the problem may instead be modeled as a partially observable Markov decision process (POMDP).
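In the usual textbook notation (a standard formulation, not specific to any one source), an MDP is the tuple below, and the agent maximizes the expected discounted return:

```latex
% MDP tuple and discounted-return objective in conventional notation
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[G_0\right]
```

Here γ ∈ [0, 1) trades off immediate against future reward; a POMDP extends the tuple with an observation space and an observation model.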
RL techniques include:
- Value-based methods, such as Q-learning and SARSA, that learn action-value or state-value functions and derive policies from them (see the tabular Q-learning sketch after this list).
- Policy-based methods that directly optimize a parameterized policy.
- Actor–critic algorithms that combine both: a critic estimates values while an actor updates the policy.
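As a concrete illustration of the value-based family, here is a minimal tabular Q-learning sketch in Python. The tiny corridor environment, its `step` function, and the hyperparameters are hypothetical, chosen only to make the update rule visible:

```python
import random
from collections import defaultdict

# Hypothetical 1-D "corridor" environment used only for illustration:
# states 0..4, actions 0 (left) and 1 (right), reward 1 for reaching state 4.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

alpha, gamma, epsilon = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate
Q = defaultdict(float)                   # (state, action) -> estimated action value

for episode in range(500):
    state = 0
    for _ in range(100):  # cap episode length to keep the demo short
        # ε-greedy action selection with random tie-breaking
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            actions = list(range(N_ACTIONS))
            random.shuffle(actions)
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning (off-policy TD) update toward the greedy bootstrap target
        best_next = max(Q[(next_state, a)] for a in range(N_ACTIONS))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

print({s: round(max(Q[(s, a)] for a in range(N_ACTIONS)), 3) for s in range(N_STATES)})
```

Each update nudges Q(s, a) toward the reward plus the discounted value of the best next action; bootstrapping from the greedy action rather than the one actually taken is what makes Q-learning off-policy, whereas SARSA bootstraps from the action the behavior policy takes next.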
In practice, modern RL systems combine several ingredients: exploration strategies such as ε-greedy, entropy bonuses, UCB, and intrinsic motivation; experience replay for off-policy learning; function approximation; and stability enhancements such as target networks, gradient clipping, and trust regions.
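To make two of those ingredients concrete, the sketch below shows a uniform experience replay buffer and ε-greedy action selection in Python; the class and function names are illustrative, not taken from any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted automatically

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive
        # transitions, which helps stabilize off-policy updates with function approximation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Illustrative usage with dummy transitions:
buffer = ReplayBuffer(capacity=1_000)
for i in range(100):
    buffer.push((i, 0, 0.0, i + 1, False))
batch = buffer.sample(batch_size=32)
action = epsilon_greedy([0.1, 0.4, 0.2], epsilon=0.1)
```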
Some RL methods are model-based: the agent learns or is given a model of the environment’s dynamics and uses it to plan actions.
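A minimal sketch of that idea, assuming a small tabular problem: the agent estimates transition probabilities and mean rewards from counts, then plans with value iteration on the learned model (all names here are hypothetical):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {next_state: visit count}
rewards = defaultdict(float)                    # (s, a) -> running mean reward
visits = defaultdict(int)                       # (s, a) -> total visits

def record(s, a, r, s_next):
    """Update the learned model from one observed transition."""
    visits[(s, a)] += 1
    counts[(s, a)][s_next] += 1
    rewards[(s, a)] += (r - rewards[(s, a)]) / visits[(s, a)]

def plan(states, actions, gamma=0.95, sweeps=50):
    """Value iteration on the learned model; returns a state-value estimate."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            q_values = []
            for a in actions:
                n = visits[(s, a)]
                if n == 0:
                    continue  # no data yet for this state-action pair
                expected_next = sum(c / n * V[s2] for s2, c in counts[(s, a)].items())
                q_values.append(rewards[(s, a)] + gamma * expected_next)
            if q_values:
                V[s] = max(q_values)
    return V
```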
Applications span game playing, robotics, autonomous control, recommendation systems, operations research, and beyond. RL systems are evaluated by metrics like return, sample efficiency, robustness, and adherence to safety constraints.