LongBench
LongBench is an open bilingual benchmark that evaluates large language models' ability to comprehend long contexts in English and Chinese. It comprises 21 tasks spanning six categories: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. Evaluation is fully automated and cost-effective. The 21 tasks break down into 14 English tasks, 5 Chinese tasks, and 2 code tasks, for a total of 4,750 test instances whose contexts mostly range from 5k to 15k words in length. A companion test set, LongBench-E, samples instances evenly across context-length buckets, making it easier to analyze how performance varies with input length.
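The dataset is distributed on the Hugging Face Hub, so individual tasks can be pulled directly with the `datasets` library. Below is a minimal loading sketch: the `THUDM/LongBench` repo name and per-task config names (e.g. `hotpotqa`, with an `_e` suffix for the LongBench-E variant) follow the official release, while the exact record fields shown are assumptions based on the published dataset card.

```python
# Minimal sketch of loading LongBench tasks from the Hugging Face Hub.
# Assumes the `datasets` library; config names like "hotpotqa" and the
# "_e" suffix for LongBench-E follow the official dataset card.
from datasets import load_dataset

# Load one English multi-document QA task from the standard test split.
# trust_remote_code may be required on recent `datasets` versions,
# since this dataset uses a loading script.
data = load_dataset("THUDM/LongBench", "hotpotqa", split="test",
                    trust_remote_code=True)

# LongBench-E variants append "_e" and distribute instances evenly
# across context-length buckets (0-4k, 4-8k, 8k+).
data_e = load_dataset("THUDM/LongBench", "hotpotqa_e", split="test",
                      trust_remote_code=True)

# Field names below are assumptions from the dataset card.
sample = data[0]
print(sample["input"])          # the task query/instruction
print(len(sample["context"]))   # size of the long context to condition on
print(sample["answers"])        # reference answer(s) for automated scoring
```

Because scoring is automated, a typical harness only needs to format `context` and `input` into the model prompt, generate a response, and compare it against `answers` with the task's metric.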