DeepSeek-Coder Project Overview
Introduction to DeepSeek Coder
DeepSeek Coder is a series of code language models, each trained from scratch on 2 trillion tokens. The training corpus is 87% source code and 13% natural language in English and Chinese. The models come in sizes from 1.3 billion to 33 billion parameters, covering a range of user needs and computational budgets.
DeepSeek Coder is pre-trained on a project-level code corpus with a 16K context window and an additional fill-in-the-blank objective, which enables project-level code completion and code infilling. Among open-source code models, it achieves state-of-the-art performance across multiple programming languages and benchmarks.
Key features of DeepSeek Coder include:
- Extensive Training Data: Trained from scratch on 2 trillion tokens, 87% code and 13% English and Chinese natural language.
- Scalable and Flexible: Available in 1.3B, 5.7B, 6.7B, and 33B parameter sizes to match diverse user requirements.
- Exceptional Performance: State-of-the-art among open-source code models on standard programming benchmarks.
- Advanced Code Capabilities: Supports project-level code completion and fill-in-the-middle code infilling.
Supported programming languages span a comprehensive list, from mainstream languages such as C and Python to specialized ones such as Solidity and Assembly.
Evaluation Results
DeepSeek Coder performs strongly on benchmarks such as HumanEval, MBPP, and DS-1000. DeepSeek-Coder-Base-33B outperforms prominent open-source code models such as CodeLlama-34B by clear margins on several benchmarks, while the much smaller DeepSeek-Coder-Base-7B reaches CodeLlama-34B's level of performance. After instruction tuning, DeepSeek-Coder-Instruct-33B matches or exceeds GPT-3.5-Turbo on several coding tests.
Data Creation and Training Process
Data Creation
Training data is gathered from public repositories on GitHub and then filtered methodically to ensure quality. The steps include:
- Collecting code and filtering it with established rules for open code datasets.
- Parsing dependencies between files in the same repository and reordering the files accordingly.
- Concatenating dependent files into single examples and removing near-duplicates with repo-level MinHash (a minimal deduplication sketch follows this list).
This process also screens out low-quality code, such as snippets with syntax errors or poor readability, so that only high-quality data is used for training.
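To make the deduplication step concrete, here is a minimal sketch using the datasketch library (an illustrative choice; the project does not prescribe specific tooling). It hashes word-level shingles of each file into a MinHash signature and uses MinHash LSH to skip near-duplicates; the shingling scheme and threshold are assumptions.

```python
# Minimal near-duplicate filtering sketch with MinHash + LSH.
# datasketch, word-level shingles, and the 0.85 threshold are illustrative
# assumptions, not the project's actual deduplication pipeline.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):  # crude word-level shingles for brevity
        m.update(token.encode("utf-8"))
    return m

def deduplicate(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for path, content in files.items():
        sig = minhash_of(content)
        if lsh.query(sig):  # a near-duplicate was already kept
            continue
        lsh.insert(path, sig)
        kept.append(path)
    return kept
```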
Model Training
Model training proceeds in three phases:
- Initial Pre-training: Uses a corpus that is mostly source code, with smaller shares of code-related natural language and non-code Chinese text. Models are trained on 1.8 trillion tokens with a 4K context window in this phase.
- Further Pre-training: Extends the context window to 16K and processes roughly 200 billion additional tokens, producing the foundational DeepSeek-Coder-Base models.
- Instruction Fine-tuning: Fine-tunes the base models on 2 billion tokens of instruction data, producing the DeepSeek-Coder-Instruct models.
How to Use
To begin using DeepSeek Coder, ensure all necessary dependencies are installed. The installation command is:
```bash
pip install -r requirements.txt
```
A demo is available online, or you can run it locally for hands-on experimentation.
Sample usage covers code completion and code insertion, with practical Python examples for both tasks; a combined sketch is shown below.
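As an illustration, the sketch below loads the 6.7B base checkpoint with Hugging Face Transformers, runs a plain completion, and then a fill-in-the-middle insertion. The generation settings are illustrative, and the FIM sentinel-token spellings are assumptions that should be verified against the released tokenizer.

```python
# Sketch of code completion and fill-in-the-middle insertion with the base model.
# Generation settings and the FIM sentinel-token spellings below are assumptions;
# verify the special tokens against the released tokenizer.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

# 1) Plain code completion from a comment prompt.
prompt = "#write a quick sort algorithm"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 2) Code insertion: wrap the missing region with the FIM sentinel tokens.
fim_prompt = (
    "<｜fim▁begin｜>def add(a, b):\n"
    "<｜fim▁hole｜>\n"
    "    return result<｜fim▁end｜>"
)
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))
```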
How to Fine-tune DeepSeek-Coder
DeepSeek-Coder models can be fine-tuned on downstream tasks using the provided scripts. The guidelines cover setting up the training environment, formatting the data, and running the training; a sketch of the expected data layout is shown below.
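The fine-tuning guide expects instruction data as JSON lines; the "instruction"/"output" field names below follow the repository's sample dataset format and should be treated as an assumption to check against your checkout.

```python
# Sketch of an instruction-tuning data file: one JSON object per line.
# The "instruction"/"output" field names are assumed from the repo's sample
# dataset format; check the fine-tuning guide in your checkout.
import json

examples = [
    {
        "instruction": "Write a Python function that returns the factorial of n.",
        "output": "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)",
    },
]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```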
Detailed Evaluation Results
Comprehensive evaluation results are available, covering multiple benchmarks and performance comparisons, including multilingual benchmarks and math-reasoning tasks.
Inference with vLLM
DeepSeek Coder models can be served with vLLM for high-throughput inference. Examples cover both text completion and chat completion; a minimal completion sketch follows.
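Here is a minimal offline text-completion sketch with vLLM; the sampling values and context-length setting are illustrative assumptions.

```python
# Minimal vLLM offline-inference sketch; sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-base", max_model_len=16384)
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["#write a quick sort algorithm"]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```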
Q&A
The FAQ section addresses common queries, including tokenizer compatibility for model quantization and the adjustments needed when using instruction-tuned models for code completion. For chat-style use of the instruction-tuned checkpoints, a minimal sketch is included below.
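This sketch assumes the 6.7B instruct checkpoint and a Transformers version that supports apply_chat_template; the message content and generation settings are illustrative.

```python
# Chat-style generation with an instruction-tuned checkpoint via the tokenizer's
# chat template (requires a Transformers release with apply_chat_template).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

messages = [{"role": "user", "content": "Write a quick sort algorithm in Python."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(input_ids[0]):], skip_special_tokens=True))
```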
In summary, DeepSeek-Coder represents a powerful and comprehensive toolset for code modeling, emphasizing flexibility, performance, and ease of use across a range of programming tasks.