Introduction to the Skywork Project
Skywork is a series of large language models developed by the Tiangong team at Kunlun Tech (Kunlun Wanwei). The project has recently open-sourced several models: Skywork-13B-Base, Skywork-13B-Chat, Skywork-13B-Math, and Skywork-13B-MM. Each model is also available in a quantized version to facilitate deployment and inference on consumer-grade graphics cards.
These open-source Skywork models are available for commercial use, provided users adhere to the project’s guidelines, ensuring no harmful activities are conducted.
Key Features of the Skywork Project
- Skywork-13B-Base Model: Pre-trained on 3.2 trillion tokens of high-quality multilingual data (primarily Chinese and English) and code, this model leads its class across a range of assessments and benchmarks. A minimal loading sketch follows this list.
- Skywork-13B-Chat Model: Fine-tuned for dialogue tasks with a particular focus on creative writing. Trained on more than 10,000 high-quality instruction examples covering ten creative tasks, it approaches ChatGPT-level capability on those tasks; a benchmark of roughly 500 samples for them is also available.
- Skywork-13B-Math Model: Specialized for mathematical reasoning, it achieves top scores on the GSM8K evaluation and performs strongly on the MATH dataset and CMATH, ranking among the best models at the 13-billion-parameter scale.
- Skywork-13B-MM Model: A multi-modal model that accepts image inputs for question-answering and dialogue tasks.
- Skywork/SkyPile-150B Dataset: Approximately 600 GB of data (around 150 billion tokens) sourced from high-quality Chinese web pages, currently the largest open-source Chinese dataset available.
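To make the list above concrete, here is a minimal sketch of loading the base model for inference with Hugging Face transformers. The repo id "Skywork/Skywork-13B-base" and the use of trust_remote_code are assumptions; consult the official model card for the exact usage.

```python
# Minimal sketch: loading Skywork-13B-Base for greedy text generation.
# NOTE: the repo id and trust_remote_code usage are assumptions; verify
# them against the official model card before relying on this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Skywork/Skywork-13B-base"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory relative to fp32
    device_map="auto",           # places weights on available GPUs/CPU
    trust_remote_code=True,
)

prompt = "The capital of Shaanxi province is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The quantized variants mentioned above would follow the same loading pattern while fitting on consumer-grade GPUs.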
The Skywork project also shares insights on the evaluation methods, data composition studies, and infrastructure optimizations used during the training of the Skywork-13B models, fostering a deeper understanding in the community and advancing the development of Artificial General Intelligence (AGI).
For more detailed information about the training schemes and evaluation methodologies, one can refer to the Skywork technical report, the SkyMath paper, and the Skywork-MM paper.
Recent Updates
- December 7, 2023: Following a safety review, the 150-billion-token Chinese pretraining corpus has been reopened to the public, available on Hugging Face (international) and Wisemodel (domestic).
- November 2, 2023: The evaluation datasets MOCK_GSM8K_TEST and ChineseDomainModelingEval have been uploaded to Hugging Face for model evaluation purposes.
- October 31, 2023: The technical report, "Skywork: A More Open Bilingual Foundation Model," is available on arXiv, providing more insight into the evaluation methods and technical details.
- October 30, 2023: The Skywork-13B-Base and Skywork-13B-Math models and the Skywork/SkyPile-150B dataset are open-sourced. With over 150 billion high-quality Chinese tokens, the dataset is the largest known open-source Chinese dataset.
Resources for Download
Skywork models and datasets are available on multiple platforms, including Hugging Face, ModelScope, Wisemodel, and OpenXLab; users can download the models, datasets, and evaluation sets from whichever platform suits them.
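Programmatic downloads are straightforward on platforms that expose an API. Below is a small sketch using huggingface_hub; the repo ids are assumptions and should be confirmed on the Hub before use.

```python
# Sketch: fetching Skywork artifacts from the Hugging Face Hub.
# The repo ids below are assumptions; confirm the exact names on the Hub.
from huggingface_hub import snapshot_download

# Download a snapshot of the SkyPile corpus (several hundred GB in total).
data_dir = snapshot_download(
    repo_id="Skywork/SkyPile-150B",  # assumed dataset repo id
    repo_type="dataset",
    local_dir="./skypile-150b",
)
print(f"Dataset saved to {data_dir}")

# Model weights can be fetched the same way.
model_dir = snapshot_download(repo_id="Skywork/Skywork-13B-base")  # assumed
print(f"Model weights saved to {model_dir}")
```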
Beyond the final models, Skywork also releases intermediate checkpoints from various stages of training, which are invaluable for research into how the capabilities of large models evolve over the course of pretraining.
Model Architecture and Training
Skywork models adopt a deeper, thinner variant of the Llama architecture. The Skywork-13B model has 52 layers (compared with 40 in Llama-2-13B) with reduced FFN and hidden dimensions, keeping the total parameter count roughly equivalent while achieving better generalization during training. The model employs a Byte-Pair Encoding (BPE) tokenizer with a vocabulary extended to handle both natural language (Chinese and English) and code.
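For readers who want to see the shape differences side by side, here is an illustrative configuration built with transformers' LlamaConfig, since Skywork-13B follows a Llama-style decoder. The 52-layer depth comes from the text above; the remaining dimensions are assumptions drawn from the technical report and should be checked against the released config.json.

```python
# Illustrative only: an approximate Skywork-13B shape expressed as a
# LlamaConfig. All values except the layer count are assumptions; verify
# them against the model's published config.json.
from transformers import LlamaConfig

skywork_13b_config = LlamaConfig(
    num_hidden_layers=52,     # deeper than Llama-2-13B's 40 layers
    hidden_size=4608,         # assumed; narrower than Llama-2-13B's 5120
    intermediate_size=12288,  # assumed; smaller FFN than Llama-2-13B's 13824
    num_attention_heads=36,   # assumed
    vocab_size=65536,         # assumed; BPE vocabulary covering zh/en/code
)
print(skywork_13b_config)
```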
Training follows a two-phase approach: a first phase of general-ability learning on a comprehensive corpus, then a second phase of skill enhancement on specialized STEM data, culminating in the Skywork-13B-Base model.
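The phase ordering can be summarized in a few lines of schematic Python; the corpus names and the train_one_phase helper are hypothetical placeholders, included only to show the structure of the schedule rather than the actual Skywork pipeline.

```python
# Schematic only: the two-phase pretraining schedule described above.
# Corpus names and train_one_phase are hypothetical placeholders.
def pretrain_skywork_13b(model, train_one_phase):
    # Phase 1: general-ability learning on the broad bilingual + code corpus.
    model = train_one_phase(model, corpus="general_bilingual_and_code")
    # Phase 2: continued pretraining on STEM-focused data to sharpen skills.
    model = train_one_phase(model, corpus="stem_enhancement")
    return model
```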
Ultimately, the Skywork project presents a significant leap in open-source models and data, promising extensive use and development in AI-driven applications.