Awesome Text Generation Evaluation: a curated list of evaluation metrics for Natural Language Generation (NLG)
This repository, Awesome Text Generation Evaluation, collects resources and papers on reference-based and reference-free evaluation metrics for Natural Language Generation (NLG).
"If you can't measure it, you can't improve it." - British physicist William Thomson (Lord Kelvin)
You are welcome to share your papers, thoughts, and ideas by submitting an issue!
- Survey
- Human Judgement Datasets
- Lexical Overlap as Evaluator
- Learned Metrics as Evaluator
- Explainability-driven Metrics as Evaluator
- Citation
Reference-free Evaluation Metrics for Text Generation: A Survey
Takumi Ito, Kees van Deemter, Jun Suzuki
arXiv 2025, [Paper]
21 Jan 2025
Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges
Jonas Becker, Jan Philip Wahle, Bela Gipp, Terry Ruas
arXiv 2024, [Paper]
29 Aug 2024
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger
arXiv 2022, [Paper] [GitHub]
21 Mar 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam
Journal of Artificial Intelligence Research 2022, [Paper]
14 Feb 2022
Evaluation of Text Generation: A Survey
Asli Celikyilmaz, Elizabeth Clark, Jianfeng Gao
arXiv 2021, [Paper]
18 May 2021
People Overtrust AI-Generated Medical Advice despite Low Accuracy
Shruthi Shekar, Pat Pataranutaporn, Chethan Sarabu, Guillermo A. Cecchi, Pattie Maes
NEJM AI 2025, [Paper] [GitHub] [Dataset]
11 Aug 2024
AlpacaEval: An Automatic Evaluator for Instruction-following Language Models
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto
Online Leaderboard, [Leaderboard] [GitHub] [Dataset]
2023
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu, Yixiao Song, Mohit Iyyer, Eunsol Choi
ACL 2023, [Paper] [GitHub] [Dataset]
29 May 2023
WebGPT: Browser-assisted Question-answering with Human Feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman
arXiv 2022, [Paper] [Official Dataset] [Processed Dataset]
1 Jun 2022
Learning to Summarize From Human Feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
NeurIPS 2020, [Paper] [GitHub] [Dataset]
15 Feb 2022
Evaluating Question Answering Evaluation
Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner
Proceedings of the Second Workshop on Machine Reading for Question Answering 2019, [Workshop Paper] [Symposium Paper] [Presentation] [Dataset]
2019
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
CVPR 2015, [Paper]
2015
chrF: Character N-gram F-score for Automatic MT Evaluation
Maja Popović
Workshop on Statistical Machine Translation 2015, [Paper] [GitHub] [Hugging Face]
2015
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee, Alon Lavie
ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization 2005, [Paper]
2005
ROUGE: A Package for Automatic Evaluation of Summaries
Chin-Yew Lin
Proceedings of Workshop on Text Summarization Branches Out 2004, [Paper]
2004
BLEU: A Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
ACL 2002, [Paper]
2002
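The n-gram metrics above (BLEU, ROUGE, chrF) all have widely used open-source implementations. Below is a minimal sketch, assuming the `sacrebleu` and `rouge-score` Python packages; these are common re-implementations chosen for illustration, not the original scripts released with the papers.

```python
# Sketch: scoring one hypothesis against one reference with BLEU, chrF, and ROUGE.
import sacrebleu
from rouge_score import rouge_scorer

hypothesis = "the cat sat on the mat"
references = ["the cat is sitting on the mat"]

# sacrebleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu([hypothesis], [references])
chrf = sacrebleu.corpus_chrf([hypothesis], [references])

# ROUGE-1 and ROUGE-L F-measures from Google's rouge-score package
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], hypothesis)

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  ROUGE-L: {rouge['rougeL'].fmeasure:.3f}")
```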
Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith
ACL 2019, [Paper] [GitHub]
2019
From Word Embeddings To Document Distances
Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger
ICML 2015, [Paper] [GitHub]
2015
BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur P. Parikh
ACL 2020, [Paper]
21 May 2020
COMET: A Neural Framework for MT Evaluation
Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie
EMNLP 2020, [Paper]
19 Oct 2020
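Learned metrics such as COMET ship pretrained checkpoints with their official packages. As one example, here is a minimal sketch assuming the `unbabel-comet` package and the `Unbabel/wmt22-comet-da` checkpoint; both names are illustrative assumptions rather than choices prescribed by the paper.

```python
# Sketch: segment-level quality scores for MT outputs with a pretrained COMET model.
from comet import download_model, load_from_checkpoint

# Download and load a pretrained checkpoint (checkpoint name is an assumption)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Hund bellt.", "mt": "The dog is barking.", "ref": "The dog barks."},
]

# gpus=0 runs on CPU; each src/mt/ref triple receives a quality score, higher is better
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```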
BEER: BEtter Evaluation as Ranking
Miloš Stanojević, Khalil Sima’an
Workshop on Statistical Machine Translation 2014, [Paper]
2014
Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity
Brian Thompson, Matt Post
WMT 2020, [Paper]
28 Oct 2020
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
ICLR 2020, [Paper] [GitHub] [Hugging Face]
24 Feb 2020
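BERTScore has a pip-installable implementation released by the authors (`bert-score`). The sketch below uses its high-level API; the language tag simply selects the package's default backbone model, which may vary across versions.

```python
# Sketch: BERTScore precision/recall/F1 between candidates and references.
from bert_score import score

candidates = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# Returns one tensor entry per candidate-reference pair
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```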
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger
EMNLP 2019, [Paper] [GitHub]
26 Sep 2019
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan, Graham Neubig, Pengfei Liu
NeurIPS 2021, [Paper] [GitHub]
27 Oct 2021
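BARTScore is distributed as a script in the paper's GitHub repository rather than as a regular pip package; the class name and arguments below follow that repository's README and should be treated as assumptions.

```python
# Sketch: BARTScore as the average log-likelihood of generating the target text
# from the source text, so scores are negative and higher is better.
from bart_score import BARTScorer  # bart_score.py copied from the BARTScore repo

scorer = BARTScorer(device="cpu", checkpoint="facebook/bart-large-cnn")

scores = scorer.score(
    ["This is a long source document about cats."],  # source (or reference) texts
    ["A document about cats."],                       # generated texts
    batch_size=4,
)
print(scores)
```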
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li
EMNLP 2023, [Paper] [GitHub]
26 Oct 2023
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong, Dacheng Tao
EMNLP 2023, [Paper] [GitHub]
20 Dec 2022
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
EMNLP 2022, [Paper] [GitHub]
26 Oct 2022
MaTESe: Machine Translation Evaluation as a Sequence Tagging Problem
Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Niccolò Campolungo, Roberto Navigli
WMT 2022, [Paper] [GitHub]
2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger
arXiv 2022, [Paper] [GitHub]
21 Mar 2022
Large Language Models are not Fair Evaluators
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui
ACL 2024, [Paper] [GitHub]
30 Aug 2023
RAGAs: Automated Evaluation of Retrieval Augmented Generation
Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert
EACL 2024, [Paper] [GitHub]
28 Apr 2025
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou
New Frontiers in Summarization Workshop 2023, [Paper] [GitHub]
24 Oct 2023
GPTScore: Evaluate as You Desire
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu
NAACL 2024, [Paper] [GitHub]
13 Feb 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
EMNLP 2023, [Paper] [GitHub]
23 May 2023
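The LLM-based evaluators above share a common recipe: prompt a strong model with the source, the candidate output, and a scoring rubric, then parse the returned rating. Below is a minimal illustrative sketch in that spirit; the prompt wording, model name, and 1-5 scale are assumptions, not the exact protocols of GPTScore or G-Eval.

```python
# Sketch: rating the coherence of a summary with an LLM judge (OpenAI chat API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_coherence(source: str, summary: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "You will rate the coherence of a summary on a 1-5 scale.\n\n"
        f"Source document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer with a single integer from 1 (incoherent) to 5 (perfectly coherent)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


print(judge_coherence("Long source article ...", "Candidate summary ..."))
```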
If you find our list useful, please consider citing our repositories and toolkit in your publications. BibTeX entries are provided below.
@misc{JiaAwesomeNLGEvaluation25,
author = {Jia, Shuyue},
title = {Awesome Text Generation Evaluation},
year = {2025},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/SuperBruceJia/Awesome-Text-Generation-Evaluation}},
}
@misc{JiaAwesomeSTS23,
author = {Jia, Shuyue},
title = {Awesome Semantic Textual Similarity},
year = {2023},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/SuperBruceJia/Awesome-Semantic-Textual-Similarity}},
}
@misc{JiaAwesomeLLM23,
author = {Jia, Shuyue},
title = {Awesome {LLM} Self-Consistency},
year = {2023},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/SuperBruceJia/Awesome-LLM-Self-Consistency}},
}
@misc{JiaPromptCraft23,
author = {Jia, Shuyue},
title = {{PromptCraft}: A Prompt Perturbation Toolkit},
year = {2023},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/SuperBruceJia/promptcraft}},
}