LongBench

Evaluate Long Context Understanding for Bilingual Language Models

Product Description

LongBench is an open benchmark that evaluates large language models' ability to understand long contexts in both Chinese and English. It spans six task categories with 21 tasks in total, including single- and multi-document QA and summarization. With a cost-effective automated evaluation process, LongBench covers 14 English tasks, 5 Chinese tasks, and 2 code tasks, comprising 4,750 test instances with contexts ranging from 5k to 15k words. A companion test set, LongBench-E, provides a more even distribution of context lengths, helping to analyze how model performance varies with input length.
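
For reference, below is a minimal sketch of how one might load a single LongBench task for evaluation, assuming the benchmark is distributed on the Hugging Face Hub as "THUDM/LongBench" with per-task configurations (and an "_e" suffix for the LongBench-E variants); the task name and field names shown are assumptions for illustration, not part of this description.

```python
# Minimal sketch: iterate over one LongBench task with Hugging Face datasets.
# Assumes the dataset id "THUDM/LongBench" and the config name "hotpotqa"
# (use e.g. "hotpotqa_e" for the length-balanced LongBench-E split).
from datasets import load_dataset

data = load_dataset("THUDM/LongBench", "hotpotqa", split="test")

for sample in data:
    context = sample["context"]    # long input document(s), roughly 5k-15k words
    question = sample["input"]     # the task query posed over the context
    answers = sample["answers"]    # reference answers used by the automated scorer
    # ... feed context + question to the model under evaluation here ...
    break
```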
Project Details