Medusa Project Overview
Medusa is a framework that accelerates language model generation by adding multiple decoding heads to an existing model. It offers a simple yet powerful way to significantly boost generation efficiency without the complexity of other acceleration techniques.
Tackling Common Issues
Medusa aims to address the challenges commonly found in existing acceleration methods, such as speculative decoding. These challenges include:
- The need for a robust draft model.
- The complexity of system implementations.
- Inefficiencies encountered in sampling-based generation methods.
Key Features and Advantages
- Multi-Decoding Heads: Medusa's core idea is to augment an existing language model with extra "heads" that predict multiple future tokens in parallel. At each step, the heads propose several candidate tokens for upcoming positions; the candidates are verified together through a tree-based attention mechanism, and the longest plausible candidate sequence is accepted, effectively speeding up generation (see the sketch after this list).
- Minimal Model Alteration: The original model remains intact; only the new heads are trained. This parameter-efficient training process makes Medusa accessible even to those with limited computational resources.
- Enhanced Performance: By removing the need to exactly match the original model's distribution, Medusa's typical-acceptance scheme lets non-greedy generation outperform traditional greedy decoding (see the acceptance sketch after this list).
- Initial Release Focus: The initial release prioritizes batch-size-1 inference, the common case in local model hosting, and already achieves roughly a 2x speedup across several Vicuna models.
- Expansion Efforts: Medusa is actively being integrated into more inference frameworks, with the goal of further improving performance and broadening applicability across environments.
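
To make the multi-head design concrete, here is a minimal PyTorch sketch of the idea. The class names, residual-block design, and head count are illustrative assumptions, not the project's exact implementation:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: a linear projection plus SiLU, added back to the input."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """K extra decoding heads on top of the base model's last hidden states.

    Head k predicts a token several positions ahead of the current one, so a
    single forward pass proposes multiple future tokens at once.
    """
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(ResBlock(hidden_size),
                          nn.Linear(hidden_size, vocab_size, bias=False))
            for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq_len, hidden_size)
        # returns:     (num_heads, batch, seq_len, vocab_size)
        return torch.stack([head(last_hidden) for head in self.heads])

# Candidate generation: take top-k tokens from each head at the last position;
# the candidates are then verified jointly via tree attention and the longest
# accepted prefix is kept.
hidden = torch.randn(1, 16, 4096)          # stand-in for base-model hidden states
logits = MedusaHeads(4096, 32000)(hidden)  # speculative logits per head
topk = logits[:, :, -1, :].topk(5, dim=-1).indices  # top-5 candidates per head
```

Because the heads only read the base model's final hidden states, the base weights can stay frozen while the heads are trained, which is what keeps the training parameter-efficient.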
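The acceptance step referenced above can use the paper's "typical acceptance" criterion instead of exactly matching the original model's distribution: a candidate token is accepted when its probability under the original model clears an entropy-dependent threshold. A minimal sketch, with illustrative epsilon and delta values:

```python
import torch

def typical_accept(probs: torch.Tensor, candidate: int,
                   epsilon: float = 0.3, delta: float = 0.09) -> bool:
    """Accept `candidate` if its probability under the original model exceeds
    min(epsilon, delta * exp(-H(probs))), where H is the entropy of the
    distribution. Flat (high-entropy) distributions get a lower bar, so more
    speculative tokens survive; the epsilon/delta values are assumed here.
    """
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    threshold = torch.minimum(torch.tensor(epsilon), delta * torch.exp(-entropy))
    return probs[candidate].item() > threshold.item()
```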
What's New in Medusa-2
The latest update, Medusa-2, introduces full-model training: the base model is trained jointly with the heads, preserving its original quality while adding speculative prediction capability. Medusa-2 also introduces self-distillation, which allows Medusa to be added to any fine-tuned language model without access to its original training data.
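
As a rough illustration of what joint training can look like, the sketch below combines the base model's next-token loss with a decayed cross-entropy term for each head; head k (0-indexed) learns to predict the token k + 2 positions ahead, since the base head already covers the next token. The function name, weighting scheme, and hyperparameter values are assumptions for illustration, not the project's exact recipe:

```python
import torch
import torch.nn.functional as F

def medusa2_style_loss(lm_logits, head_logits, labels,
                       lm_weight: float = 1.0, head_decay: float = 0.8):
    """Joint objective sketch: lm_logits is (batch, seq, vocab) from the base
    LM head; head_logits is a list of (batch, seq, vocab) tensors, one per
    Medusa head; labels is (batch, seq). All weights are assumed values.
    """
    # Standard next-token cross-entropy keeps the base model's quality.
    loss = lm_weight * F.cross_entropy(
        lm_logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten()
    )
    # Head k at position t is supervised with the label at t + k + 2,
    # with geometrically decaying weight for farther-ahead heads.
    for k, logits_k in enumerate(head_logits):
        shift = k + 2
        loss = loss + (head_decay ** (k + 1)) * F.cross_entropy(
            logits_k[:, :-shift].flatten(0, 1), labels[:, shift:].flatten()
        )
    return loss
```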
Installation and Usage
Medusa can be installed via pip or from source for the latest version. It currently supports single-GPU inference, typically with a batch size of 1, and support for larger configurations is being expanded.
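
A minimal usage sketch for batch-size-1 inference, assuming the repository's MedusaModel entry point and the FasterDecoding/medusa-vicuna-7b-v1.3 checkpoint name (check the repository for the current API):

```python
import torch
from medusa.model.medusa_model import MedusaModel  # assumed import path

# Load a Medusa-augmented Vicuna checkpoint on a single GPU.
model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # assumed checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = model.get_tokenizer()

input_ids = tokenizer(
    "What is speculative decoding?", return_tensors="pt"
).input_ids.to(model.device)

# medusa_generate is assumed to stream the text generated so far at each step.
text = ""
for step in model.medusa_generate(input_ids, temperature=0.7, max_steps=256):
    text = step["text"]
print(text)
```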
For training new heads, Medusa builds on the Axolotl library, which manages the training process and carries the code updates needed to support Medusa's specific requirements.
Community and Contributions
Medusa has already been adopted in numerous open-source projects, including NVIDIA's TensorRT-LLM and Hugging Face's Text Generation Inference (TGI). The team encourages community engagement and contributions, offering avenues for discussion and collaboration to improve and extend the framework.
Acknowledgements
The development of Medusa is influenced by several prominent projects in the language model community, including FastChat and TinyChat, among others. The project also receives support from industry leaders like Together AI and MyShell AI.
Medusa offers a simple, practical route to faster language model generation, paving the way for broader applications and more streamlined AI development.