babilong
BABILong evaluates how well NLP models reason over facts scattered across very long documents. Each sample embeds sentences from a bAbI task in background text drawn from PG19 books, so a model must locate the relevant facts before it can reason over them. The benchmark's 20 tasks, which include fact chaining and deduction, challenge even advanced models such as GPT-4. Contributions to the benchmark are welcome and help build a shared understanding of LLM capabilities.
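
As a rough illustration of how bAbI facts and PG19 text could be combined into one long input, the sketch below scatters a few task sentences among background sentences and appends the question. It is not the benchmark's actual generation code: the function name `build_long_sample`, the sentence-level insertion strategy, and the toy sentences are all assumptions made for illustration.

```python
import random


def build_long_sample(facts, background_sentences, question, seed=0):
    """Scatter task facts, in their original order, among background sentences.

    Hypothetical helper for illustration only; not part of BABILong itself.
    """
    rng = random.Random(seed)
    # Pick one insertion slot per fact; sorting the slots keeps the facts in order.
    slots = sorted(rng.randint(0, len(background_sentences)) for _ in facts)
    merged = list(background_sentences)
    # Insert from the last slot backwards so earlier slot indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        merged.insert(slot, fact)
    return " ".join(merged) + " " + question


# Toy stand-ins: real samples would use bAbI sentences and PG19 book text.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
background = [
    "The rain had not stopped for three days.",
    "A carriage rolled slowly down the lane.",
    "Nobody in the village spoke of the old mill.",
]
question = "Where is the apple?"

sample = build_long_sample(facts, background, question, seed=42)
print(sample)
```

Lengthening `background` while keeping `facts` fixed is what makes the task harder in this sketch: the reasoning required stays the same, but the facts become harder to find.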