evaluation
Evaluation is the process of systematically measuring how well an AI system satisfies its intended objectives using well-defined datasets, metrics, and procedures.
In practice, evaluation typically begins offline with held-out or cross-validation splits and task-specific metrics such as accuracy, precision, recall, and F1. It then extends to production with online methods, such as A/B tests, that measure real user impact, business outcomes, and safety.
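To make the offline step concrete, here's a minimal sketch that scores predictions on a held-out split using scikit-learn's metric functions. The `y_true` and `y_pred` values are illustrative, not from a real model:

```python
# Offline evaluation sketch: score predictions on a held-out split.
# Assumes scikit-learn is installed; the labels below are made up.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # Ground-truth labels from the held-out split
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # Model predictions on the same examples

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1:        {f1_score(y_true, y_pred):.2f}")
```

Each metric answers a different question, so reporting several together gives a fuller picture than accuracy alone, especially on imbalanced datasets.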
For generative and language models, evaluation combines automatic metrics, such as BLEU, ROUGE, and perplexity, with human or model-graded judgments of properties like factuality, coherence, helpfulness, and toxicity.
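As an illustration of the automatic side, here's a minimal sketch that computes sentence-level BLEU with NLTK. The reference and candidate sentences are made up, and automatic scores like this complement, rather than replace, human or model-graded judgments of properties like factuality:

```python
# Automatic text-generation metric sketch using sentence-level BLEU.
# Assumes NLTK is installed; the sentences below are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]  # Tokenized reference text
candidate = ["the", "cat", "is", "on", "the", "mat"]   # Tokenized model output

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(
    [reference],  # BLEU supports multiple references per candidate
    candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.2f}")
```

BLEU rewards n-gram overlap with the references, which is why it works reasonably well for translation but correlates poorly with quality on open-ended generation tasks.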
By Leodanis Pozo Ramos • Updated Nov. 17, 2025