Introduction to Awesome Japanese LLM Project
The "Awesome Japanese LLM" project provides a comprehensive compilation of information about Japanese language models (LLMs) that have been primarily trained using Japanese text. This project also offers insights into the benchmarks used for evaluating these models. The information has been gathered by volunteers and sourced from research papers and publicly available resources.
Understanding Japanese LLMs
Japanese LLMs are models built to understand, generate, and manipulate Japanese text, accommodating characteristics of the language such as its lack of explicit word boundaries and its mixed use of kanji, hiragana, and katakana. This guide introduces several models and projects that have shaped the Japanese AI landscape.
Model Overview
Full-Scratch Learning Models
These models are pretrained from scratch rather than fine-tuned or continually trained from an existing checkpoint, using a variety of architectures and training corpora. The goal is to build capable LLMs that understand and generate Japanese text.
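To make the distinction from fine-tuning concrete, the following is a minimal sketch of a from-scratch setup: it instantiates a small Llama-architecture model with randomly initialized weights using the Hugging Face Transformers library. The configuration values are illustrative assumptions, far smaller than the 100B-class models described below, and do not correspond to any specific model in this list.

```python
# Minimal sketch: a Llama-architecture model initialized with random weights,
# i.e., a starting point for from-scratch pretraining rather than fine-tuning.
# All sizes here are illustrative assumptions, not the settings of a real model.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=50_000,             # would come from a Japanese-aware tokenizer
    hidden_size=1024,
    intermediate_size=4096,
    num_hidden_layers=12,
    num_attention_heads=16,
    max_position_embeddings=4096,  # context length comparable to the models below
)

model = LlamaForCausalLM(config)   # random initialization; no pretrained weights loaded
print(f"Initialized ~{model.num_parameters() / 1e6:.0f}M parameters from scratch")
```

Full-scratch projects then pretrain such a randomly initialized model on trillions of tokens of (largely Japanese) text. Representative models in this category include: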
1. LLM-jp-3 172B Series:
- Architecture: Based on Llama.
- Context Length: Up to 4,096 tokens.
- Training Data: Datasets such as llm-jp-corpus-v3, with released variants pretrained on roughly 0.7 to 1.4 trillion tokens.
- Development Entity: Large-Scale Language Model Research and Development Center (LLMC).
- Licensing: Terms of use specific to this model.
2. Stockmark-100b:
- Architecture: Llama-based.
- Training Data: Includes collections like RedPajama and Japanese Wikipedia, amounting to 910 billion tokens.
- Development Entity: Stockmark.
- Licensing: MIT License.
3. PLaMo-100B-Pretrained:
- Architecture: Llama.
- Training Data: A proprietary mix including Japanese CommonCrawl and RefinedWeb, totaling 2 trillion tokens.
- Development Entity: Preferred Elements.
- Licensing: Non-Commercial License.
4. Sarashina Series:
- Architecture: GPT-NeoX and Llama for different versions.
- Training Data: Varied datasets, including Japanese CommonCrawl, totaling up to 2.1 trillion tokens.
- Development Entity: SB Intuitions.
- Licensing: MIT License.
5. Tanuki-8×8B:
- Architecture: MoE (Mixture of Experts) Model.
- Training Data: Diverse web-based data, amounting to 1.7 trillion tokens.
- Development Entity: Matsuo Lab LLM Development Project.
- Licensing: Apache 2.0.
Other notable models include CyberAgentLM3, Fugaku-LLM, and various iterations of LLM-jp-13B, each contributing uniquely to Japanese language processing through different architectures and data strategies.
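Many of the models above are distributed as checkpoints on the Hugging Face Hub. The sketch below shows how such a checkpoint might be loaded for inference with the Transformers library; the model identifier is a placeholder (check each model's actual repository name and license before use), and the generation settings are only examples.

```python
# Minimal inference sketch for an openly licensed Japanese LLM, assuming it is
# published on the Hugging Face Hub. The identifier below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/japanese-llm"  # hypothetical; substitute a real repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # place weights on available GPU(s) or the CPU
)

# "Please briefly explain Japanese large language models."
prompt = "日本語の大規模言語モデルについて簡単に説明してください。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that instruction-tuned variants typically expect a chat template rather than a raw prompt, so consult each model's documentation for the recommended input format.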
Contribution and Collaboration
The Japanese LLMs catalogued in the "Awesome Japanese LLM" project are the product of collaborative efforts by research institutions and corporations, which have pooled large-scale corpora and compute to build models suited to Japanese.
Open Source and Licensing
Most models in this project are released under open-source licenses such as MIT or Apache 2.0, promoting transparency and community engagement. However, some models are released under non-commercial licenses or under terms of use defined by the developing organization.
Community Involvement
The project is maintained on GitHub, encouraging community contributions. Users can report errors or propose new models through GitHub Issues, ensuring the project stays updated with the latest advancements in Japanese LLM research.
Conclusion
The "Awesome Japanese LLM" project serves as a valuable resource for those interested in Japanese language models. It highlights the significant progress made in AI language processing for Japanese and reflects the collaborative spirit in the AI community. By leveraging various architectures and training datasets, these models showcase the potential of AI in enhancing language understanding and interaction.