LLM-Zoo: A Comprehensive Guide
In the evolving world of natural language processing (NLP), new large language models (LLMs) are emerging rapidly. The LLM-Zoo project aims to serve as a repository for these models, akin to a zoo where various species are housed and studied. This initiative gathers essential information about open-source and closed-source LLMs released after the debut of ChatGPT.
Project Objectives
The primary goal of LLM-Zoo is to compile a detailed list of LLMs developed around the world, providing insights into their features and capabilities. For each model, the project tracks the following attributes (a sketch of one such catalog entry follows this list):
- Release Time: Each model's launch date.
- Model Size: The scale of each model, typically given as its parameter count.
- Languages Supported: The linguistic capabilities of each model.
- Domain: Specific fields or applications the model is designed to address.
- Training Data: Types of datasets used to train these models.
- Resource Links: Pointers to further information, including GitHub repositories, HuggingFace models, demos, academic papers, and official blogs.
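To make these attributes concrete, here is a minimal sketch of how one catalog entry could be represented in Python. The `LLMEntry` dataclass and its field names are illustrative assumptions rather than the repository's actual data format; the example values are drawn from the LLaMA description later in this guide.

```python
from dataclasses import dataclass, field

@dataclass
class LLMEntry:
    """One catalog entry mirroring the attributes LLM-Zoo tracks (field names are illustrative)."""
    name: str                  # model name, e.g. "LLaMA"
    release_date: str          # launch date
    model_size: str            # scale, e.g. parameter count "7B-65B"
    languages: list[str]       # supported languages
    domain: str                # target domain, e.g. "general" or "medical"
    training_data: str         # short description of the training corpus
    links: dict[str, str] = field(default_factory=dict)  # GitHub, HuggingFace, demo, paper, blog

# Example entry populated from the LLaMA description in this guide.
llama = LLMEntry(
    name="LLaMA",
    release_date="2023-02-27",
    model_size="7B-65B",
    languages=["English"],
    domain="general",
    training_data="~1T tokens from sources such as CommonCrawl and Wikipedia",
    links={"paper": "https://arxiv.org/abs/2302.13971"},
)
```

Structuring entries this way is one possible convention; it makes the catalog easy to sort, filter, and extend as new models appear.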
Project News and Updates
LLM-Zoo published its first release on May 3, 2023, and is updated regularly to stay current. Community involvement is encouraged, with open invitations for contributions through issues or pull requests; this collaborative model enriches the repository with new models and keeps existing entries up to date.
Open-Sourced LLMs
The repository lists a range of open-source LLMs, each with detailed specifications (a brief example of querying such a catalog appears after this list):
- LLaMA: Released on February 27, 2023, it supports English and was trained on roughly 1 trillion tokens drawn from sources such as CommonCrawl and Wikipedia.
- Alpaca and Vicuna: Available in several sizes, these LLaMA-based models are fine-tuned on datasets such as machine-generated instruction-following demonstrations (Alpaca) and user-shared conversations from ShareGPT (Vicuna).
- ChatGLM and Guanaco: These multilingual models are fine-tuned for performance across multiple languages, including English, Chinese, Japanese, and German.
- Dolly and ChatDoctor: Dolly provides general-purpose instruction following, while ChatDoctor showcases a domain-specific application, focusing on the medical field and drawing on diverse healthcare datasets.
- BELLE and Linly: Catering to general and Chinese-language NLP needs using extensive multilingual and instruction-following datasets.
- Baize, Koala, and Firefly: These serve a variety of conversational and Chinese-language NLP applications, drawing on training resources such as Quora-seeded dialogue data (Baize) and OpenAI WebGPT comparison data (Koala).
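As an illustration of how such a catalog could be put to use, the sketch below filters entries by language or domain. It reuses the hypothetical `LLMEntry` dataclass from the earlier example; the `filter_models` helper and the `catalog` variable are assumptions for demonstration, not part of the LLM-Zoo repository itself.

```python
def filter_models(entries: list[LLMEntry], *,
                  language: str | None = None,
                  domain: str | None = None) -> list[LLMEntry]:
    """Return the entries that match the requested language and/or domain."""
    matches = []
    for entry in entries:
        if language is not None and language not in entry.languages:
            continue  # skip models that do not list the requested language
        if domain is not None and entry.domain != domain:
            continue  # skip models targeting a different domain
        matches.append(entry)
    return matches

# "catalog" is an assumed list of LLMEntry records built from the descriptions above.
catalog = [llama]  # extend with Alpaca, Vicuna, ChatDoctor, and so on
medical_models = filter_models(catalog, domain="medical")    # e.g. ChatDoctor
chinese_models = filter_models(catalog, language="Chinese")  # e.g. BELLE, Firefly
```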
Conclusion
The LLM-Zoo project stands as an essential resource in the NLP community, providing a comprehensive overview of emerging language models. By collecting and freely sharing this data, LLM-Zoo empowers researchers, developers, and enthusiasts to better understand the capabilities and potential of LLMs. This initiative not only tracks technological advancements but also encourages community engagement, fostering growth and innovation in the field of NLP.