# BABILong

BABILong evaluates how well NLP models reason over long documents in which the relevant facts are scattered throughout the text. It combines bAbI data with PG19 texts to build diverse reasoning tasks. The benchmark's 20 tasks, which include fact chaining and deduction, challenge even advanced models such as GPT-4. Contributions to the benchmark are welcome and help build a shared understanding of LLM capabilities.
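To make the idea of facts spread through a long document concrete, here is a minimal sketch in Python. It is not the benchmark's own generation code: the function name `scatter_facts`, the example sentences, and the random placement strategy are all assumptions made for illustration. It places a few bAbI-style fact sentences at random positions among filler sentences while keeping the facts in their original order.

```python
import random


def scatter_facts(facts, background, seed=0):
    """Insert each fact at a random position among the background sentences.

    Hypothetical helper for illustration only. The relative order of the
    facts and of the background sentences is preserved.
    """
    rng = random.Random(seed)
    # One insertion slot per fact, sorted so the facts keep their order.
    slots = sorted(rng.randint(0, len(background)) for _ in facts)
    result = list(background)
    # Insert from the last slot backwards so earlier indices stay valid.
    for slot, fact in reversed(list(zip(slots, facts))):
        result.insert(slot, fact)
    return result


# Made-up example data: two facts and a question in the bAbI style,
# plus placeholder filler standing in for long book text.
facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
background = [f"Filler sentence number {i}." for i in range(20)]
question = "Where is the apple?"

document = " ".join(scatter_facts(facts, background, seed=42))
prompt = f"{document}\n\nQuestion: {question}"
```

Increasing the amount of filler lengthens the document without changing the question or its answer, which is the property a long-context evaluation of this kind relies on.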
