
babilong

Evaluate LLMs with the challenging long-context BABILong benchmark

BABILong evaluates NLP models on their ability to reason over long documents in which the relevant facts are scattered: it hides sentences from the bAbI tasks inside long passages of irrelevant PG19 book text, so a model must locate and combine the needed facts across the document. The benchmark's 20 tasks, including fact chaining and deduction, challenge even advanced models such as GPT-4. Contributions to the benchmark are encouraged to build collective insight into LLM capabilities.
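
To make the setup concrete, here is a minimal sketch of loading and inspecting one BABILong sample via the Hugging Face `datasets` library. The dataset path (`RMT-team/babilong`), the context-length config (`"4k"`), the task split (`"qa1"`), and the field names (`input`, `question`, `target`) are assumptions based on typical BABILong releases; check the repository for the exact identifiers.

```python
# Minimal sketch, assuming the BABILong release on the Hugging Face Hub.
from datasets import load_dataset

# Config selects the target context length (e.g. "4k" tokens); each split is
# one bAbI task (qa1 = single supporting fact, ...) padded with PG19 filler.
# Dataset path, config, split, and field names below are assumptions.
data = load_dataset("RMT-team/babilong", "4k", split="qa1")

sample = data[0]
print(sample["input"][:500])  # long document: PG19 text with facts interleaved
print(sample["question"])     # the bAbI question about the hidden facts
print(sample["target"])       # gold answer string

# A model is scored on whether its answer to `question`, given `input` as
# context, matches `target` -- the core difficulty is retrieval amid distractors.
```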