evaluation
Evaluation is the process of systematically measuring how well an AI system satisfies its intended objectives using well-defined datasets, metrics, and procedures.
In practice, evaluation typically begins offline with held-out or cross-validation splits and task-specific metrics such as accuracy, precision, recall, and F1. It then extends to production with online methods, such as A/B tests, that measure real user impact, business outcomes, and safety.
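To make the offline step concrete, here's a minimal sketch that scores predictions on a held-out split using scikit-learn's metric functions. The `y_true` and `y_pred` values are illustrative, not from a real model:

```python
# Offline evaluation sketch: score predictions on a held-out split.
# Assumes scikit-learn is installed; the labels below are made up.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # Ground-truth labels from the held-out split
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # Model predictions on the same examples

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1:        {f1_score(y_true, y_pred):.2f}")
```

Each metric answers a different question, so reporting several together gives a fuller picture than accuracy alone, especially on imbalanced datasets.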
For generative and language models, evaluation combines automatic metrics, such as BLEU, ROUGE, and perplexity, with human or model-graded judgments of properties like factuality, coherence, helpfulness, and toxicity.
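As an illustration of the automatic side, here's a minimal sketch that computes sentence-level BLEU with NLTK. The reference and candidate sentences are made up, and automatic scores like this complement, rather than replace, human or model-graded judgments of properties like factuality:

```python
# Automatic text-generation metric sketch using sentence-level BLEU.
# Assumes NLTK is installed; the sentences below are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]  # Tokenized reference text
candidate = ["the", "cat", "is", "on", "the", "mat"]   # Tokenized model output

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(
    [reference],  # BLEU supports multiple references per candidate
    candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.2f}")
```

BLEU rewards n-gram overlap with the references, which is why it works reasonably well for translation but correlates poorly with quality on open-ended generation tasks.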
By Leodanis Pozo Ramos • Updated Nov. 17, 2025