#evaluation
ChainForge
This open-source platform simplifies comparative prompt engineering and LLM response evaluation. It lets users query multiple LLMs simultaneously, offering quick comparisons of response quality across prompts and models. Supporting model providers such as OpenAI and Google PaLM 2, the platform provides robust tools for defining evaluation metrics and visualizing results. Features like prompt permutations, chat turns, and evaluation nodes enable thorough analysis of prompt and model effectiveness. It also encourages experimentation and sharing, with functionality for exporting results and integrating evaluations into research projects, making it a practical tool for researchers.
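To make the comparison workflow concrete, here is a minimal sketch of the "same prompt, many models" loop that ChainForge automates in its visual interface. It uses the OpenAI Python client; the model names and prompt are placeholders, not anything taken from the project itself.

```python
# Minimal sketch of comparing one prompt across several models.
# Model names and prompt are placeholders; ChainForge automates this
# kind of loop (and much more) in its flow-based UI.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Summarize the causes of the 2008 financial crisis in two sentences."
models = ["gpt-4o-mini", "gpt-3.5-turbo"]  # placeholder model choices

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(f"=== {model} ===\n{response.choices[0].message.content}\n")
```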
Medical_NLP
A detailed repository of medical NLP resources, including evaluations, competitions, datasets, papers, and pre-trained models, maintained by third-party contributors. It features Chinese and English benchmarks such as CMB, CMExam, and PromptCBLUE; highlights ongoing and past events such as the BioNLP Workshop and MedVidQA; and catalogs diverse datasets like Huatuo-26M and MedMentions. The repository also provides access to open-source models such as BioBERT and BlueBERT, as well as large language models including ApolloMoE, serving researchers working in medical NLP.
VBench
This toolkit evaluates video generative models with a structured approach covering 16 quality dimensions. Each dimension is scored with methods designed to align with human perception, and the suite is updated regularly to keep pace with new models. A leaderboard tracks ongoing performance improvements, making it a useful resource for understanding model capabilities and trustworthiness in text-to-video and image-to-video settings.
reward-bench
RewardBench offers a comprehensive framework for assessing the capabilities and safety of reward models, including models trained with Direct Preference Optimization (DPO). It features inference scripts for models such as Starling and PairRM, along with tools for analyzing and visualizing results, and handles model testing and logging efficiently. A command-line interface makes setup straightforward, and compatibility with various dataset formats makes it a valuable tool for researchers who need accurate reward model evaluation.
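As a rough illustration (not RewardBench's actual code), the core check such a benchmark performs is whether a reward model scores the preferred response above the rejected one. A minimal sketch with the HuggingFace transformers library, using an example reward model and toy preference pairs, might look like this:

```python
# Illustrative sketch of reward-model evaluation: does the model score the
# preferred ("chosen") response above the rejected one? The model name is an
# example choice and the pairs are toy data, not a RewardBench subset.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example reward model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

pairs = [  # toy (prompt, chosen, rejected) triples
    ("What is 2 + 2?", "2 + 2 equals 4.", "2 + 2 equals 5."),
    ("Name a prime number.", "7 is a prime number.", "8 is a prime number."),
]

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to a prompt/response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

correct = sum(score(p, c) > score(p, r) for p, c, r in pairs)
print(f"pairwise accuracy: {correct / len(pairs):.2f}")
```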
instruct-eval
InstructEval is a platform for evaluating instruction-tuned LLMs such as Alpaca and Flan-T5 on benchmarks like MMLU and BBH. It supports many HuggingFace Transformer models, allows qualitative comparisons, and assesses generalization on difficult tasks. User-friendly scripts and detailed leaderboards highlight each model's strengths and weaknesses. Additional datasets like Red-Eval and IMPACT extend the suite to safety and writing assessments, giving researchers in-depth performance insights.
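For a sense of what MMLU-style evaluation involves, here is a simplified sketch of the multiple-choice scoring loop: build a prompt listing options A-D, ask the model for a letter, and compare it to the gold label. The ask_model function is a hypothetical stand-in for a real HuggingFace generation call, and the example item is invented.

```python
# Simplified MMLU-style multiple-choice scoring. `ask_model` is a placeholder
# for a real generation call on an instruction-tuned LLM.
import re

def build_prompt(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter:")
    return "\n".join(lines)

def ask_model(prompt: str) -> str:
    # Placeholder: a real run would call model.generate() and decode the output.
    return "Answer: B"

def extract_choice(text: str) -> str | None:
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None

examples = [
    {"question": "Which planet is known as the Red Planet?",
     "options": ["Venus", "Mars", "Jupiter", "Mercury"], "answer": "B"},
]

correct = 0
for ex in examples:
    pred = extract_choice(ask_model(build_prompt(ex["question"], ex["options"])))
    correct += pred == ex["answer"]
print(f"accuracy: {correct / len(examples):.2f}")
```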
Telechat
TeleChat, a semantic large language model developed by China Telecom's AI unit, includes the open-source TeleChat-1B, 7B, and 12B models trained on large multilingual corpora. TeleChat-12B incorporates changes to model structure and training that improve performance on tasks such as Q&A, coding, and mathematics. The models support modern deep learning techniques and perform well on reasoning, understanding, and long-text generation across a range of uses.
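Since the checkpoints are open-sourced, a minimal sketch of loading one with HuggingFace Transformers might look like the following; the repository id is an assumption, and the official model card's chat template should be preferred in practice.

```python
# Minimal sketch of loading an open TeleChat checkpoint with Transformers.
# The repo id is assumed; check the official model card for the exact id
# and the recommended chat/prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tele-AI/telechat-7B"  # assumed HuggingFace repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

prompt = "用一句话介绍量子计算。"  # "Introduce quantum computing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```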
unified-io-2
Unified-IO 2 integrates vision, language, audio, and action into a single multimodal model and toolset, with demo, training, and inference capabilities. Recent updates add PyTorch code, improved audio processing, and VIT-VQGAN integration, supporting complex datasets with robust pre-processing. Designed for both TPU and GPU use, it enables efficient training and evaluation with JAX. Built on the T5X architecture, it provides clear data visualization and tooling for task-specific model optimization, making it a notable reference point for autoregressive multimodal model research.
LawBench
LawBench is a benchmark for evaluating large language models (LLMs) within the Chinese legal system. It covers tasks such as legal entity recognition and crime amount calculation across three cognitive dimensions: memory, understanding, and application. Metrics such as the waiver rate capture how models respond to legal queries, and evaluations of 51 LLMs offer insights into the performance of multilingual and Chinese LLMs in various legal contexts.
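Read loosely, the waiver rate measures how often a model declines to answer a legal query. The toy computation below follows that reading; the refusal markers and responses are invented, and the exact rule should be taken from the LawBench paper.

```python
# Toy illustration of a waiver-rate style metric: the fraction of legal queries
# a model answers with a refusal rather than a substantive response. This is a
# simplified reading of the metric, not LawBench's official implementation.
REFUSAL_MARKERS = ("无法回答", "不能提供法律建议", "I cannot answer", "I'm not able to")

def is_waiver(response: str) -> bool:
    """Return True if the response looks like a refusal to answer."""
    return any(marker in response for marker in REFUSAL_MARKERS)

responses = [  # invented model outputs for illustration
    "根据《刑法》第二百六十六条,诈骗数额较大的,处三年以下有期徒刑。",
    "I cannot answer questions that require specific legal advice.",
    "被告人的行为构成盗窃罪。",
]

waiver_rate = sum(is_waiver(r) for r in responses) / len(responses)
print(f"waiver rate: {waiver_rate:.2f}")
```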
carla_garage
This CARLA-based research codebase investigates hidden biases in end-to-end autonomous driving models. The repository provides efficient, configurable code, thorough documentation, and pre-trained models, offering a solid foundation for autonomous driving research. Key features include dataset generation, model evaluation, and training pipelines designed for parallel execution to speed up experiments, making it a practical starting point for developers working on demanding autonomous driving benchmarks.
AGIEval
AGIEval is a benchmark crafted to evaluate the problem-solving and cognitive capabilities of foundation models using tasks drawn from human exams such as the Chinese Gaokao and the American SAT. With the version 1.1 update, AGIEval offers both multiple-choice and cloze tasks and reports performance for models such as GPT-3.5-Turbo and GPT-4o, helping researchers identify where models succeed and where they fall short.
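Because the benchmark mixes multiple-choice and cloze items, scoring needs two modes. The sketch below shows plain letter-match accuracy for MCQ and a simple normalized exact match for cloze answers; the normalization rules are illustrative, not AGIEval's official ones.

```python
# Two illustrative scoring modes for an exam-style benchmark: letter accuracy
# for multiple-choice items and normalized exact match for cloze items.
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.strip().lower())
    return re.sub(r"\s+", " ", text).strip()

def score_mcq(pred: str, gold: str) -> bool:
    return pred.strip().upper() == gold.strip().upper()

def score_cloze(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

print(score_mcq("b", "B"))           # True
print(score_cloze(" 42 % ", "42%"))  # True
```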
starcoder2-self-align
StarCoder2-15B-Instruct-v0.1 introduces a transparent pipeline for code generation through self-alignment, without relying on human annotations or proprietary data. This open-source project uses StarCoder2-15B itself to create instruction-response pairs, focusing on Python code generation and validating each pair by executing it. The project documents known biases and format limitations, notes the GPU resources involved, and reports key evaluation outcomes along with potential constraints, which is useful for Python-focused work by developers and researchers.
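The execution-based validation step can be pictured with the simplified filter below: a generated solution is kept only if its paired tests run without error. This is a stand-in for the project's actual harness, it is not sandboxed, and untrusted generated code should never be executed this way outside an isolated environment.

```python
# Simplified execution-based filter: keep an instruction-response pair only if
# the generated solution passes its tests. Not sandboxed; run generated code
# only in an isolated environment.
import subprocess
import sys
import tempfile

def passes_execution(solution: str, test_code: str, timeout: float = 10.0) -> bool:
    """Write solution + tests to a temp file and execute them in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_execution(solution, tests))  # True -> keep this pair
```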
GPTEval3D
GPTEval3D is a method for evaluating text-to-3D generative models with GPT-4V, aiming for closer alignment with human judgment. The project introduces 110 evaluation prompts and facilitates background removal with dedicated tools. Key components include clear installation steps, evaluation guidelines for text-to-3D models, and a structured scoring system for competitions. Planned additions include visualization tools and a Text-to-3D Leaderboard. For researchers who want a comprehensive evaluation approach, the metrics allow direct comparison of competing models.
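The leaderboard-style scoring can be understood as aggregating pairwise judgments into per-model ratings. The sketch below applies a textbook Elo update to toy outcomes; it illustrates that idea only and is not the project's exact formula.

```python
# Generic Elo-style aggregation of pairwise comparison outcomes into ratings.
# Toy data; the constants and update rule are a standard Elo sketch.
from collections import defaultdict

K = 32  # update step size
ratings = defaultdict(lambda: 1000.0)

def update(winner: str, loser: str) -> None:
    """Shift ratings toward the observed outcome of one pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Toy pairwise outcomes, e.g. a judge preferring one model's 3D result.
for winner, loser in [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]:
    update(winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```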
babilong
BABILong evaluates NLP models on their ability to handle long documents filled with disparate facts, incorporating bAbI data and PG19 texts for diverse reasoning tasks. The benchmark's 20 tasks, including fact chaining and deduction, challenge even advanced models like GPT-4. Contributions to the benchmark are encouraged to further collective insights into LLM capabilities.
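Conceptually, a BABILong-style example scatters a few task-relevant facts through long stretches of unrelated filler text (PG19 passages in the actual benchmark) so the model must locate and chain them. The sketch below builds such a sample with a repeated placeholder sentence standing in for real book text.

```python
# Sketch of assembling a BABILong-style long-context example: bAbI-like facts
# are inserted at random positions inside long distractor text. The filler is
# a repeated placeholder sentence, not actual PG19 data.
import random

random.seed(0)

facts = [
    "Mary moved to the kitchen.",
    "Mary picked up the apple.",
    "Mary travelled to the garden.",
]
question = "Where is the apple?"
answer = "garden"

# Distractor sentences; the real benchmark samples passages from PG19 books.
filler = ["The rain kept falling over the quiet town that evening."] * 500

# Insert the facts, in order, at random positions in the filler stream.
positions = sorted(random.sample(range(len(filler)), k=len(facts)))
document = list(filler)
for pos, fact in zip(positions, facts):
    document.insert(pos, fact)

prompt = " ".join(document) + f"\nQuestion: {question}\nAnswer:"
print(f"{len(prompt.split())} words of context; gold answer: {answer}")
```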
Feedback Email: [email protected]