#benchmark
YOLOX
YOLOX provides an efficient anchor-free object detection model that balances accuracy and speed. Bridging research and practice, it ships PyTorch and MegEngine implementations, includes JIT-compiled operations, and lists planned enhancements such as YOLOX-P6 on its roadmap. Explore the GitHub repository for demos and further insights.
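The repository also provides export paths for deployment backends such as ONNX. Below is a minimal sketch of running an exported model with onnxruntime; the file name yolox_s.onnx, the 640x640 input size, and the dummy input are assumptions, and real pre- and post-processing should follow the demo utilities in the repository.

```python
# Minimal sketch: run an exported YOLOX model with onnxruntime.
# "yolox_s.onnx" and the 640x640 input shape are assumed; the repo's demo
# code shows the proper image preprocessing and box decoding.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolox_s.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # stand-in for a preprocessed image
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])  # raw predictions, still to be decoded into boxes
```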
Baichuan2
This open-source language model is trained on 2.6 trillion high-quality tokens and delivers top performance on Chinese, English, and multilingual benchmarks. With Baichuan2-13B-Chat v2, it excels in mathematical and logical reasoning. Both 7B and 13B sizes are offered in Base and Chat editions, free for academic research and for commercial use upon official approval. Access detailed technical insights and download links for the latest versions.
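A minimal loading sketch with Hugging Face transformers is shown below. The model ID and the chat() helper exposed through trust_remote_code are assumptions based on the project's published checkpoints, so check the official model cards for the exact names.

```python
# Sketch: load a Baichuan2 chat checkpoint from the Hugging Face Hub.
# The model ID and the remote-code chat() helper are assumptions; verify
# both against the official model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan2-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize federated learning in one sentence."}]
print(model.chat(tokenizer, messages))  # chat() is provided via remote code (assumed)
```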
FedScale
FedScale is an open-source platform for federated learning, featuring scalable deployment and extensive evaluation tools for diverse environments. It supports FL experiments with advanced APIs and diverse datasets, including tasks such as image classification and language modeling. FedScale offers scalable and extensible solutions for realistic FL training, proving itself as a vital resource for research and development in federated learning.
undici
Undici is a high-performance HTTP/1.1 client for Node.js that supports advanced features like streaming and pipelining. Benchmarks show it handling large volumes of requests faster than comparable libraries. It accepts diverse request body types, offers multiple response formats, and exposes flexible connection-pool configuration. Its fetch implementation follows the WHATWG standard, while the lower-level APIs offer more control by relaxing certain header rules. Ideal for developers looking to improve application efficiency in Node.js.
opencv_zoo
Explore a curated selection of models optimized for OpenCV DNN, featuring detailed performance benchmarks on platforms like x86-64, ARM, and RISC-V. This guide provides hardware setup insights for devices from Intel, NVIDIA, and ARM, detailing inference times and showcasing usage examples from face detection to QR code parsing. Ideal for those interested in machine learning, these models offer accessible high-performance solutions for image processing and recognition, ensuring compatibility and efficient application.
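As a minimal usage sketch, the snippet below runs the zoo's YuNet face detector through OpenCV's built-in wrapper; the ONNX file name and the input image are assumptions, so substitute whichever model file you download from the zoo.

```python
# Sketch: face detection with a model from opencv_zoo via OpenCV's DNN-based
# FaceDetectorYN wrapper. The .onnx file name and input image are assumed.
import cv2

img = cv2.imread("group_photo.jpg")
h, w = img.shape[:2]

detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx",  # model file downloaded from the zoo (assumed name)
    "",                                    # no separate config file
    (w, h),                                # input size matching the image
)
_, faces = detector.detect(img)
print(0 if faces is None else len(faces), "face(s) detected")
```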
GmSSL
GmSSL is an open-source cryptographic library developed at Peking University that supports the Chinese national (SM) cryptographic algorithms, standards, and secure communication protocols. It is compatible with major operating systems and processors, as well as cryptographic hardware. Version 3 introduces lower memory requirements, support for the national algorithms only, enhanced security including TLS 1.3, and improved cross-platform builds via CMake. The library features extensive command-line tools and bindings for multiple programming languages, promoting easy integration and hardware compatibility.
Baichuan-7B
This open-source project introduces a commercially viable language model with 7 billion parameters based on the transformer architecture. It's optimized for both Chinese and English, demonstrating superior performance on benchmarks like C-Eval and MMLU. With 1.2 trillion tokens and a context length of 4096, the model employs advanced tokenization to enhance language compression efficiency and computational throughput. Compatible with Hugging Face and other platforms, this project provides a comprehensive training guide.
LLaMA-Factory
LLaMA-Factory streamlines the fine-tuning of large language models with advanced algorithms and scalable resources. It supports a wide range of models such as LLaMA, LLaVA, and Mistral. Offering full-parameter tuning, freeze-tuning, and various quantization methods, it improves training speed and GPU memory efficiency. The platform facilitates experiment tracking and offers fast inference through an intuitive API and interface, suitable for developers building text generation projects.
instruct-eval
InstructEval is a platform designed to evaluate instruction-tuned LLMs including Alpaca and Flan-T5, using benchmarks like MMLU and BBH. It supports many HuggingFace Transformer models, allows qualitative comparisons, and assesses generalization on tough tasks. With user-friendly scripts and detailed leaderboards, InstructEval shows model strengths. Additional datasets like Red-Eval and IMPACT enhance safety and writing assessments, providing researchers with in-depth performance insights.
tiktoken-go
The project offers a fast BPE tokeniser implemented in Go for use with OpenAI models. It supports efficient tokenization with features like cache management and alternative BPE loaders, including an offline loader that avoids downloading encoding dictionaries at runtime. With multiple encodings available, it allows flexible and accurate token counting for diverse applications. Benchmark tests show performance comparable to the original tiktoken, making it a reliable tool for developers.
llmperf-leaderboard
This evaluation provides insights on the performance, reliability, and efficiency of LLM inference providers. Key metrics such as output tokens throughput and time to first token are analyzed to assist developers and users in making informed decisions about model integrations. Transparent results and reproducible configurations support the optimization of streaming applications such as chatbots. Note that results may vary due to system load and provider traffic, with data updated as of December 19, 2023, providing a current overview of provider capabilities.
AutoWebGLM
AutoWebGLM enhances web navigation with the ChatGLM3-6B model, featuring HTML simplification and hybrid AI-human training for better browsing comprehension. It employs reinforcement learning to optimize real-world tasks, supported by the AutoWebBench bilingual benchmark. Open evaluation tools offer robust frameworks for testing and improving the agent's efficiency in web interactions.
MeViS
MeViS is a large-scale benchmark focused on improving video segmentation through motion expressions, specifically for complex environments. Concentrating on motion descriptors, it supports the advancement of language-guided technologies and enhances the ability to identify targets using diverse expressions where static imagery is insufficient.
llama2.mojo
This project accelerates Llama2 model inference using Mojo's SIMD and vectorization primitives, reporting a roughly 250x speedup over a naive Python implementation. It outperforms llama2.c by about 30% and llama.cpp by about 20% in multithreaded CPU inference. Supported models include the Stories checkpoints (260K to 110M parameters) and TinyLlama-1.1B-Chat-v0.2. Benchmarks on an Apple M1 Max demonstrate its performance. Suitable for developers exploring efficient transformer inference in Mojo.
AgentBench
AgentBench provides a framework for evaluating LLMs as agents in different settings. Version v0.2 features architecture updates, new tasks, and broader model testing. VisualAgentBench is introduced for training visual agents with large multimodal models in five environments. Together, these tools aid the development and evaluation of visual and language agents in diverse scenarios, enhancing autonomous capabilities.
MMMU
MMMU and MMMU-Pro provide robust benchmarks essential for evaluating multimodal models, focusing on tasks that require college-level knowledge and reasoning across diverse disciplines. These benchmarks consist of 11.5K questions covering six core fields, designed to test the models' proficiency in integrating visual and textual information. Specifically, MMMU-Pro introduces more stringent assessments by implementing vision-only input and expanding candidate options. Despite the challenging nature of these tasks, even advanced models like GPT-4V achieve only moderate accuracy, indicating significant potential for improvement in multimodal AI, moving towards expert AGI. For detailed evaluations, visit EvalAI, and access the datasets through Hugging Face.
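The datasets are distributed via the Hugging Face Hub; a minimal loading sketch with the datasets library is shown below. The repository ID, the subject config, and the field names are assumptions, so verify them against the dataset card.

```python
# Sketch: load one MMMU subject split from the Hugging Face Hub.
# The repo ID "MMMU/MMMU", the "Math" config, and the field names are assumed;
# check the dataset card for the exact identifiers.
from datasets import load_dataset

mmmu_math = load_dataset("MMMU/MMMU", "Math", split="validation")
example = mmmu_math[0]
print(example["question"])
print(example["options"])  # multiple-choice candidates; images come as separate fields
```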
EmbodiedScan
EmbodiedScan enhances embodied AI with a robust multi-modal 3D dataset, supporting effective visual grounding and scene interaction tasks in varied environments. The dataset offers over 5k scans, 1M ego-centric RGB-D views, and 160k categorized 3D boxes, bridging scene perception and language interaction. The Embodied Perceptron, a baseline framework, advances input processing for both structured tasks and real-world applications, with improvements such as dense semantic occupancy mapping and LVIS category compatibility.
datacomp
This competition aims to design effective datasets for pre-training CLIP models, prioritizing dataset curation. Participants focus on achieving high accuracy in downstream tasks by selecting optimal image-text pairs, with a fixed model setup. The competition offers two tracks, allowing varying computational resources: one with a provided data pool and another that accepts additional external data. With scales from small to xlarge, it covers different computational demands. The project offers tools for downloading, selecting subsets, training, and evaluation to support flexible and robust participation.
VBench
This toolkit evaluates video generative models with a structured approach covering 16 quality dimensions. It utilizes advanced evaluation techniques to provide objective assessments reflective of human perception, with regular updates to capture the latest advancements. A leaderboard is included for tracking continuous performance improvements, making it an essential resource for understanding model capabilities and trustworthiness in text-to-video and image-to-video contexts.
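A minimal evaluation sketch is shown below; the constructor arguments, dimension names, and paths are assumptions based on the project's documented interface, so confirm them against the VBench README.

```python
# Sketch: score a folder of generated videos on selected VBench dimensions.
# The JSON spec path, output directory, and dimension names are assumed.
import torch
from vbench import VBench

vb = VBench(torch.device("cuda"), "VBench_full_info.json", "evaluation_results")
vb.evaluate(
    videos_path="sampled_videos/",
    name="my_t2v_model",
    dimension_list=["subject_consistency", "temporal_flickering"],
)
```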
benchmark
Discover this open-source library that benchmarks C++ code snippets, akin to unit tests, and requires C++14 or later. It integrates with Google Test and offers guidance on CMake installation, Python bindings, and configuration options. Explore stable and experimental APIs to improve code performance, ideal for developers seeking efficient benchmarking solutions.
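Since the entry mentions Python bindings, here is a minimal sketch of registering a micro-benchmark through the google_benchmark package; the decorator-based API is an assumption drawn from the bindings' example code, so confirm the names against the repository.

```python
# Sketch: register and run a micro-benchmark via the Python bindings
# (pip package "google_benchmark", API assumed from the bindings' examples).
import google_benchmark as benchmark


@benchmark.register
def sum_of_squares(state):
    # Timed loop: keeps running until the library has a stable measurement.
    while state:
        sum(i * i for i in range(1_000))


if __name__ == "__main__":
    benchmark.main()
```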
AGIEval
AGIEval is a benchmark crafted to evaluate the problem-solving and cognitive capabilities of foundation models using tasks from exams like the Chinese Gaokao and American SAT. With the latest update to version 1.1, AGIEval offers MCQ and cloze tasks and provides performance evaluations across models such as GPT-3.5-Turbo and GPT-4o. This benchmark enables objective assessments and ensures researchers can identify model strengths and weaknesses.
autolabel
Autolabel is a Python library for labeling text datasets with large language models (LLMs) such as GPT-4, delivering high accuracy while reducing the time and cost of manual labeling. It integrates with multiple LLM providers and supports model benchmarking on Refuel's platform. Designed for efficiency, Autolabel offers confidence estimation, caching, and state management, enabling well-calibrated labeling workflows. Its three-step setup keeps NLP tasks like sentiment analysis, classification, and named entity recognition accessible.
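The three-step workflow (configure, plan, run) might look like the sketch below; the class names, config file, and CSV path are assumptions based on the project's documented usage, so consult the docs for the exact schema.

```python
# Sketch of the configure -> plan -> run workflow. The config file
# "sentiment_config.json" and "movie_reviews.csv" are assumed inputs;
# the config describes the task type, LLM provider/model, and prompt.
from autolabel import LabelingAgent, AutolabelDataset

config = "sentiment_config.json"
agent = LabelingAgent(config)
dataset = AutolabelDataset("movie_reviews.csv", config=config)

agent.plan(dataset)            # dry run: estimated cost and a few example prompts
dataset = agent.run(dataset)   # queries the configured LLM and attaches labels
```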
faker
Faker is a versatile Go library that generates fake data from structs, helping developers test their applications efficiently. It supports common data types, including integers, booleans, strings, and time values, while skipping unexported fields and unsupported complex types. Developers can tailor the generated data using struct tags that specify parameters like length, bounds, and uniqueness. Custom types are partially supported, and further customization is possible through the AddProvider function. By populating exported struct fields, Faker makes it easy to simulate realistic test scenarios.
calvin
CALVIN provides an open-source benchmark for learning long-horizon robotic manipulation tasks conditioned on human language. It features configurable sensor suites and a wide variety of language instructions, going beyond the limitations of earlier datasets. With flexible multi-GPU training, accelerated data loading, customizable language models, and rich sensory inputs such as RGB and depth maps, CALVIN is a key resource for advancing language-conditioned policy learning in robotics.
Codec-SUPERB
Codec-SUPERB offers a rigorous platform for evaluating audio codec models across diverse speech tasks. It focuses on preserving speech information quality and promotes community collaboration through an easy-to-use codec interface and a transparent, multi-perspective leaderboard. Its standardized testing environment and unified datasets ensure fair comparisons, making it a valuable resource for advancing research on sound codec models.
loft
Discover LOFT, a benchmark to assess long-context language models in retrieval, reasoning, and additional tasks. With 35 datasets across different modalities, the benchmark evaluates capabilities in retrieval, RAG, SQL, and multi-hop reasoning. Resources include datasets, installation instructions, and evaluation scripts available from a central repository. Gain insights into each dataset, recognize task types, and use scripts for inference and assessment with VertexAI's gemini-1.5-flash-002 model. Understand how these models advance retrieval and reasoning approaches.
ksql
ksql is a Golang library aimed at simplifying SQL database interactions through a well-designed API. Rather than introducing new features, it emphasizes ease of use and efficient database management on top of proven backends like pgx and database/sql. The library provides adapters for databases including PostgreSQL, MySQL, SQL Server, and SQLite, along with intuitive debugging and error handling. Its documentation covers basic and advanced operations, with an emphasis on efficient querying and scanning results into structs.
babilong
BABILong evaluates NLP models on their ability to handle long documents filled with disparate facts, incorporating bAbI data and PG19 texts for diverse reasoning tasks. The benchmark's 20 tasks, including fact chaining and deduction, challenge even advanced models like GPT-4. Contributions to the benchmark are encouraged to further collective insights into LLM capabilities.
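The benchmark's tasks are also distributed through the Hugging Face Hub; the sketch below loads one task at the shortest context length. The dataset ID, context-length config, task split, and field names are all assumptions, so check the repository for the exact layout.

```python
# Sketch: sample a BABILong task via the `datasets` library. The dataset ID
# "RMT-team/babilong", the "0k" context-length config, the "qa1" split, and
# the field names are assumptions; consult the repo for the exact layout.
from datasets import load_dataset

babilong = load_dataset("RMT-team/babilong", "0k", split="qa1")
sample = babilong[0]
print(sample["input"][:200])   # haystack text containing the scattered facts
print(sample["question"], "->", sample["target"])
```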
Feedback Email: [email protected]