evaluation
Evaluation is the process of systematically measuring how well an AI system satisfies its intended objectives using well-defined datasets, metrics, and procedures.
In practice, evaluation typically begins offline with held-out or cross-validation splits and task-specific metrics such as accuracy, precision, recall, and F1. It then extends to production, where online methods like A/B tests assess real user impact, business outcomes, and safety.
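To make the offline metrics concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 for a binary classifier using only the standard library. The function name `binary_classification_metrics` and the sample labels are illustrative assumptions, and the code treats `1` as the positive class:

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1, treating 1 as positive."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # True positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # False positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # False negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # True negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (an assumed convention)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical held-out labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

With these sample labels, the model produces three true positives, one false positive, one false negative, and three true negatives, so all four metrics come out to 0.75. In real projects, a tested library implementation is usually a better choice than hand-rolled code like this.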
For generative and language models, evaluation usually combines standardized task benchmarks (such as MMLU, HellaSwag, ARC, and HumanEval), task-specific automatic metrics (like BLEU and ROUGE for translation and summarization, and perplexity for language modeling), and human- or model-graded judgments of properties like factuality, coherence, helpfulness, and toxicity.
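Of the automatic metrics mentioned above, perplexity is the simplest to compute by hand: it's the exponential of the average negative log-likelihood that a model assigns to each token. The sketch below assumes you already have per-token natural-log probabilities from a model. The `perplexity` helper and the sample values are hypothetical, not part of any specific library:

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given natural-log probabilities of each token."""
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    # Average negative log-likelihood per token
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical case: the model assigns probability 0.25 to every token
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(round(ppl, 6))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which you can read as the model being as uncertain as if it were choosing uniformly among four options at each step. Lower values indicate that the model predicts the text more confidently.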
By Leodanis Pozo Ramos • Updated May 12, 2026