Introduction to DeepSeek-MoE
DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4 billion total parameters, trained on English and Chinese text. What sets it apart is an architecture built on two key strategies: fine-grained expert segmentation and shared expert isolation. Because only a fraction of the experts is activated for each token, the model delivers strong performance at a much lower computational cost than a comparable dense model.
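To make the routing idea concrete, below is a minimal, self-contained sketch of an MoE layer that combines many small routed experts (fine-grained segmentation) with always-on shared experts (shared expert isolation). The layer sizes, expert counts, and class names are illustrative assumptions and do not reflect the actual DeepSeekMoE implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    # Illustrative only: dimensions and expert counts are made up,
    # not the real DeepSeekMoE configuration.
    def __init__(self, d_model=64, n_routed_experts=8, n_shared_experts=1, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Fine-grained routed experts: many small FFNs, only top_k fire per token.
        self.routed_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model * 2), nn.GELU(), nn.Linear(d_model * 2, d_model))
            for _ in range(n_routed_experts)
        ])
        # Shared experts: applied to every token regardless of routing.
        self.shared_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model * 2), nn.GELU(), nn.Linear(d_model * 2, d_model))
            for _ in range(n_shared_experts)
        ])
        self.router = nn.Linear(d_model, n_routed_experts)

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)        # routing probabilities per token
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = sum(e(x) for e in self.shared_experts)      # shared experts: every token
        for slot in range(self.top_k):                    # routed experts: top-k per token
            for e_id, expert in enumerate(self.routed_experts):
                mask = (topk_idx[:, slot] == e_id)
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 64)
print(ToyMoELayer()(x).shape)  # torch.Size([4, 64])

Only the selected experts process each token, which is why the active parameter count per token is far below the total parameter count.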
DeepSeekMoE 16B was trained from scratch on 2 trillion English and Chinese tokens. It achieves performance comparable to DeepSeek 7B and LLaMA2 7B while requiring only about 40% of their computation. The creators have released the base and chat variants to the public, and notably, both can be deployed on a single GPU with 40GB of memory without quantization.
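As a quick sanity check on the single-GPU claim, a rough back-of-the-envelope estimate of the bfloat16 weight footprint is sketched below; activation and KV-cache memory are not counted, so this is only a lower bound.

# Rough weight-memory estimate for inference in bfloat16 (2 bytes per parameter).
total_params = 16.4e9
bytes_per_param_bf16 = 2
weight_gb = total_params * bytes_per_param_bf16 / 1024**3
print(f"~{weight_gb:.1f} GB of weights")  # ~30.5 GB, which fits within a 40GB GPU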
Evaluation Results
The performance of DeepSeekMoE 16B has been measured across multiple benchmarks:
DeepSeekMoE 16B Base Model
- Open LLM Leaderboard: The model consistently outperforms other open-source models with a similar number of activated parameters and remains competitive with LLaMA2 7B, which requires significantly more computation.
- Internal Benchmarks: Compared to DeepSeek 7B, a dense model trained on the same corpus, DeepSeekMoE 16B matches its performance while using only 40.5% of the computation. Compared to LLaMA2 7B, it performs better on most tasks while using just 39.6% of the computation.
DeepSeekMoE 16B Chat Model
Evaluated against DeepSeek 7B Chat and LLaMA2 7B SFT, the chat model achieves comparable or better results at the same reduced computational cost, confirming the efficiency of the architecture.
Model Downloads
DeepSeekMoE 16B's base and chat models are available for download, supporting research and applications in both academic and commercial settings. Users must abide by the model's licensing terms, particularly for commercial use.
- Huggingface Downloads: deepseek-ai/deepseek-moe-16b-base (base model) and deepseek-ai/deepseek-moe-16b-chat (chat model)
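For example, the checkpoints can be fetched ahead of time with the huggingface_hub library, a standard way to pre-download Huggingface repositories into the local cache:

from huggingface_hub import snapshot_download

# Download both checkpoints into the local Huggingface cache.
for repo_id in ("deepseek-ai/deepseek-moe-16b-base", "deepseek-ai/deepseek-moe-16b-chat"):
    local_path = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_path}")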
Quick Start Guide
Installation
With Python 3.8 or later, the necessary dependencies can be installed with:
pip install -r requirements.txt
Utilizing Huggingface's Transformers for Inference
Text Completion
Once set up, the model can be employed for text generation tasks using Huggingface's Transformers:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/deepseek-moe-16b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
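The generate call above uses the generation settings loaded from GenerationConfig; sampling behaviour can be adjusted with standard Huggingface generation arguments. The snippet below continues from the code above and uses arbitrary illustrative values, not recommended settings:

# Sampled generation instead of the default settings.
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=100,
    do_sample=True,      # enable stochastic sampling
    temperature=0.7,     # flatten/sharpen the token distribution
    top_p=0.95,          # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))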
Chat Completion
The model can also serve chat applications:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
{"role": "user", "content": "Who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
Note that the tokenizer automatically prepends a beginning-of-sentence (BOS) token to the input by default.
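To see this in action, the token IDs can be inspected directly; a small check along these lines (reusing the tokenizer loaded above) shows the BOS token at position 0:

# Inspect how the tokenizer handles special tokens.
encoded = tokenizer("Who are you?")
print(tokenizer.bos_token, tokenizer.bos_token_id)        # the start-of-sentence token and its ID
print(encoded["input_ids"][0] == tokenizer.bos_token_id)  # True: BOS is prepended by default
print(tokenizer("Who are you?", add_special_tokens=False)["input_ids"])  # no BOS here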
Fine-tuning DeepSeekMoE
DeepSeekMoE can be adapted to specific tasks through fine-tuning. Scripts and documentation are provided to facilitate the process, including support for DeepSpeed for efficient training; a minimal generic sketch follows below.
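The repository ships its own fine-tuning scripts, which are not reproduced here. The following is only a minimal, generic sketch of supervised fine-tuning of the base checkpoint with the Huggingface Trainer; the toy in-memory dataset, hyperparameters, and output directory are illustrative assumptions, and a real run would use the provided scripts with DeepSpeed and a proper dataset.

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

model_name = "deepseek-ai/deepseek-moe-16b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Toy dataset for illustration; replace with real training data.
texts = ["An attention function maps a query and key-value pairs to an output."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./deepseek-moe-finetune",  # illustrative path
        per_device_train_batch_size=1,
        num_train_epochs=1,
        bf16=True,
        logging_steps=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
trainer.save_model("./deepseek-moe-finetune")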
License
The DeepSeekMoE code repository is released under the MIT License, while the models are covered by a separate model license that permits commercial use under the stated conditions.
For further questions, users can contact the team at the designated email: [email protected].