RULM: Russian Language Models Project
The RULM (Russian Language Models) project implements and compares language models tailored for the Russian language. It spans dataset creation, model training, deployment, and evaluation, aiming to strengthen natural language processing (NLP) capabilities for Russian.
RuTurboAlpaca
Dataset Overview
RULM uses a dataset named RuTurboAlpaca, hosted on HuggingFace, consisting of ChatGPT-generated instructions in Russian. It follows the approach of the original Alpaca but uses the gpt-3.5-turbo model instead of text-davinci-003. The dataset is generated with the script generate_instructions.py; prompt examples are available in ru_instruct.txt.
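Records in an Alpaca-style dataset are typically JSON objects with instruction, input, and output fields. The sketch below shows how one such record might be turned into a model prompt; the exact field names and prompt template are assumptions for illustration, not the project's confirmed format.

```python
import json

# Hypothetical RuTurboAlpaca-style record; the field names
# (instruction / input / output) follow the original Alpaca
# convention and are an assumption here.
record_jsonl = json.dumps({
    "instruction": "Write a short story about two best friends.",
    "input": "Katya and Lena.",
    "output": "A story about Katya and Lena's friendship...",
}, ensure_ascii=False)

def build_prompt(raw: str) -> str:
    """Assemble an Alpaca-style prompt from one JSONL record."""
    rec = json.loads(raw)
    if rec.get("input"):
        return f"{rec['instruction']}\nInput: {rec['input']}\nResponse:"
    return f"{rec['instruction']}\nResponse:"

prompt = build_prompt(record_jsonl)
```

Records without an input field would simply omit the Input line, which is why the template branches on its presence.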
Example prompt:
- Instruction: Write a short story about two best friends.
- Given names: Katya and Lena.
- Response: A short story describing Katya and Lena's enduring friendship from childhood through adulthood.
Model Details
Two primary models are adapted from the RuTurboAlpaca data: llama_7b_ru_turbo_alpaca_lora and llama_13b_ru_turbo_alpaca_lora. Although these models are functional, the Saiga models are recommended instead, as they receive ongoing support and show stronger performance. These models have been trained on both Russian and English datasets.
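The _lora suffix on these checkpoints refers to LoRA (low-rank adaptation): instead of updating a full weight matrix, training learns two small factor matrices whose scaled product is added to the frozen weights. The numpy sketch below illustrates the core arithmetic only; the dimensions and scaling constant are illustrative, not the project's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 8, 2, 16   # illustrative sizes; r is the low rank

W = rng.standard_normal((d, k))         # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # zero-initialized, so the adapter
                                        # starts as a no-op

# Effective weight after adaptation: W' = W + (alpha / r) * B @ A
W_adapted = W + (alpha / r) * B @ A
```

Because B starts at zero, W_adapted equals W before any training step; only the small A and B matrices (2 * 8 + 8 * 2 = 32 values here, versus 64 for W) need to be stored and shared per adapter.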
Saiga Models
Dataset
The Saiga models are trained on ru_turbo_saiga, a dataset of ChatGPT-generated chat interactions. Following the approach of the Baize paper, the dataset consists of prompted chats on varied topics, assembled with scripts such as generate_chat.py. An example interaction involves questions about knitting needles, with the bot describing different needle types and recommending brands.
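A chat dataset like this typically stores each conversation as a list of role-tagged messages that must be flattened into a single training string. The sketch below shows one plausible rendering; the role names and separator tokens are assumptions for illustration, not the project's exact chat template.

```python
# Hypothetical shape of a ru_turbo_saiga-style chat record: a list of
# role-tagged messages (role names are an assumption here).
chat = [
    {"role": "user", "content": "Which knitting needles should I buy?"},
    {"role": "bot", "content": "It depends on the yarn: metal needles..."},
]

def render_chat(messages, system="You are a helpful assistant."):
    """Flatten a chat into one training string with role markers."""
    parts = [f"<s>system\n{system}</s>"]
    for m in messages:
        parts.append(f"<s>{m['role']}\n{m['content']}</s>")
    return "\n".join(parts)

text = render_chat(chat)
```

During fine-tuning, the loss would normally be computed only on the bot turns, but that masking step is omitted in this sketch.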
Model Variants
The Saiga models are available in several sizes:
saiga_7b_lora
saiga_13b_lora
saiga_30b_lora
saiga2_7b_lora
saiga2_13b_lora
saiga2_70b_lora
Training draws on six datasets covering diverse linguistic and chat scenarios, combined with the script create_chat_set.py. These models aim to advance Russian language processing through robust datasets and modern model architectures.
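Combining several chat datasets into one training set usually means tagging each record with its source and shuffling the union. The sketch below illustrates that pattern; the dataset names are placeholders, not the six datasets the project actually uses.

```python
import random

# Placeholder datasets standing in for the project's six chat sources.
datasets = {
    "dataset_a": [{"messages": ["hi", "hello"]}],
    "dataset_b": [{"messages": ["privet", "zdravstvuyte"]}],
}

def merge_chat_sets(sets, seed=42):
    """Merge named chat datasets, keeping provenance, then shuffle."""
    merged = []
    for name, records in sets.items():
        for rec in records:
            merged.append({**rec, "source": name})  # record provenance
    random.Random(seed).shuffle(merged)  # deterministic shuffle
    return merged

train_set = merge_chat_sets(datasets)
```

Keeping a source field per record makes it easy to rebalance or ablate individual datasets later.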
GPT Role-play Realm
This component is a dataset of over 200 GPT-generated characters, each with its own example dialogues. One scenario features a "Cyber-Granny" who shares culinary expertise while blending traditional wisdom with modern technology.
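A character in such a role-play dataset is typically a small card of descriptive fields that gets turned into a system prompt. The structure and field names below are assumptions for illustration only.

```python
# Hypothetical structure of one role-play character card; the field
# names (name / context / greeting) are assumptions, not the dataset's
# confirmed schema.
character = {
    "name": "Cyber-Granny",
    "context": ("An elderly woman who combines traditional recipes "
                "with smart kitchen technology."),
    "greeting": "Hello, dearie! Shall we bake something clever today?",
}

def system_prompt(card):
    """Turn a character card into a role-play system prompt."""
    return f"You are {card['name']}. {card['context']}"

sp = system_prompt(character)
```

The greeting field would typically seed the first bot turn of each generated dialogue.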
Evaluations and Comparisons
The project runs systematic evaluations to compare RULM models, including performance on RussianSuperGLUE benchmarks such as RUSSE, RWSD, and RuCoS. Comparative analyses against ChatGPT and across the Saiga variants show how each model performs on different linguistic tasks.
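At its simplest, such a comparison scores each model's predictions against gold labels, task by task. The toy sketch below uses synthetic data, not the project's actual benchmark results.

```python
# Synthetic gold labels and model predictions for one toy task.
gold = ["yes", "no", "yes", "yes"]
preds = {
    "model_a": ["yes", "no", "no", "yes"],
    "model_b": ["yes", "yes", "yes", "yes"],
}

def accuracy(pred, ref):
    """Fraction of predictions that exactly match the reference."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

scores = {name: accuracy(p, gold) for name, p in preds.items()}
```

Real RussianSuperGLUE tasks use task-specific metrics (e.g. F1 for RuCoS), so plain accuracy is only a stand-in here.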
Donations
The RULM project accepts donations from both international contributors and supporters residing in Russia, with details provided for PayPal and Cloudtips respectively.
This project showcases a significant leap forward in developing robust, efficient, and versatile language processing models for the Russian language, driving innovation in multilingual NLP systems.