hallucination-leaderboard - Assessing LLM Hallucination Rates with HHEM-2.1 on a Continuous Basis

An Introduction to the Hallucination Leaderboard

The Hallucination Leaderboard offers a comprehensive insight into how often different Large Language Models (LLMs) generate inaccurate or fabricated information—what industry experts refer to as "hallucinations"—when they are tasked with summarizing documents. This leaderboard provides valuable data on various LLMs, showcasing their accuracy, reliability, and overall ability to maintain factual consistency.

Purpose of the Hallucination Leaderboard

The primary goal of the Hallucination Leaderboard is to evaluate and rank LLMs based on their tendency to introduce inaccuracies when summarizing content. This is achieved using Vectara's specialized tool, the Hughes Hallucination Evaluation Model (HHEM-2.1). Regular updates ensure that the leaderboard remains current as both the evaluation model and the LLMs evolve.

How the Rankings Work

LLMs are assessed using a benchmark created by HHEM-2.1, focusing on their hallucination rate, factual consistency, answer reliability, and how concisely they summarize documents. The table featured in the leaderboard displays models with varying hallucination rates, from the highly accurate Zhipu AI GLM-4-9B-Chat to others that show room for improvement, like Google Gemma-1.1-2B-it.

In Memory of Simon Mark Hughes

The leaderboard dedicates its efforts in memory of Simon Mark Hughes, highlighting the connection between the leaderboard's mission and his legacy. The work aims to ensure technology maintains accuracy and reliability, crucial aspects that reflect Hughes's values.

Data and Methodology

The evaluations utilize summaries drawn from open-source datasets, particularly those focusing on factual consistency in summary generation. Specifically, the documents used come from sources like the CNN/Daily Mail Corpus, and models are tested by asking them to create summaries that align closely with source material.

For accuracy in results, only 831 documents, which were accepted by every model, contributed to the final assessment. Each model's response to these documents was meticulously analyzed to determine their hallucination and factual consistency rates.

The Evaluation Model

HHEM-2.1 is the cornerstone model powering this leaderboard, a tool developed to withstand rigorous testing against state-of-the-art models. To gather data, each LLM received a set of documents via public APIs, tasked with producing summaries strictly based on the presented facts.

Prompt and API Usage

The models were prompted using a standardized request, ensuring they only rely on the content from the provided passage. Integration with various models, such as OpenAI's GPT-3.5 and GPT-4, as well as others from Anthropic and Mistral AI, was meticulously set through specific API endpoints, enabling precise evaluation of their summarization capabilities.

Contribution to Research

The Hallucination Leaderboard stands on the shoulders of prior research, borrowing from a pool of significant papers on inconsistency detection and factual accuracy evaluations. It follows established protocols from these studies, integrating their methodologies into a system that appraises modern LLMs' accuracy and reliability.

Conclusion

The Hallucination Leaderboard is a crucial resource for developers, researchers, and tech enthusiasts who aim to understand the intricacies of LLM performance. It lays bare the strengths and shortcomings of various models, fostering an environment where continuous improvement and transparency in artificial intelligence are prioritized. As LLMs continue to advance, the leaderboard plays an essential role in guiding their development towards greater factual accuracy and dependability.