evaluation
Evaluation is the process of systematically measuring how well an AI system satisfies its intended objectives using well-defined datasets, metrics, and procedures.
In practice, evaluation typically begins offline with held-out or cross-validation splits and task-specific metrics such as accuracy, precision, recall, and F1. It then extends to production, where online methods like A/B tests assess real user impact, business outcomes, and safety.
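As a rough illustration of the offline metrics named above, the following sketch computes accuracy, precision, recall, and F1 for a binary classifier from held-out labels and predictions. The function name `binary_classification_metrics`, the `positive` parameter, and the sample data are assumptions made for this example, not part of any particular library:

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Compute basic offline metrics for binary labels (illustrative helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells that the metrics depend on.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero (one common convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

In real projects, a maintained library implementation is usually preferable to hand-rolled metric code. The version above only makes the definitions explicit.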
For generative and language models, evaluation often combines standardized task benchmarks, such as MMLU, HellaSwag, ARC, and HumanEval, with task-specific automatic metrics like BLEU, ROUGE, and perplexity for translation, summarization, and language modeling, respectively. It also draws on human or model-graded judgments of properties like factuality, coherence, helpfulness, and toxicity.
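Of the automatic metrics mentioned above, perplexity is the simplest to sketch: it's the exponential of the average negative log-likelihood that a model assigns to each token. The example below assumes you already have per-token natural-log probabilities from some model. The `perplexity` function and the sample values are hypothetical and only illustrate the formula:

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    avg_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_negative_log_likelihood)


# Made-up example: the model assigns probability 0.25 to each of four tokens,
# which is equivalent to choosing uniformly among four options at every step.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))
```

Lower perplexity means the model found the text less surprising. Values are only comparable between models that share the same tokenization and evaluation text.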
By Leodanis Pozo Ramos • Updated July 2, 2026