Introduction to llama.onnx
Overview of the llama.onnx Project
The llama.onnx project provides efficient onnx models for running the LLaMa and RWKV architectures. ONNX (Open Neural Network Exchange) is a format designed to let deep learning models be used across different frameworks, which facilitates model interoperability. The project aims to make LLaMa and RWKV models easier to deploy and use across a range of hardware environments.
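Because the models are plain onnx files, they can be executed by any ONNX-compatible runtime. As a minimal illustration of this interoperability (the file name below is a placeholder, not a file shipped by the project), an onnx model can be loaded and run with onnxruntime like so:

    import onnxruntime as ort
    import numpy as np

    # Load a model file (placeholder name) and inspect its declared inputs.
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    first = sess.get_inputs()[0]

    # Build a dummy tensor matching the first input, treating dynamic
    # dimensions as size 1. float32 is an assumption about the input dtype.
    shape = [d if isinstance(d, int) else 1 for d in first.shape]
    dummy = np.zeros(shape, dtype=np.float32)

    outputs = sess.run(None, {first.name: dummy})
    print(first.name, "->", [o.shape for o in outputs])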
Model Downloads
The project features downloadable onnx models available in different precisions to suit varying hardware capabilities:
- LLaMa-7B: Available in float32 (fp32) and float16 (fp16) precision. The fp32 model is 26GB, while the fp16 model is 13GB. Both can be downloaded from Hugging Face or from hardware-vendor model platforms (a download sketch follows this list).
- RWKV-4-palm-430M: Offered in fp16 precision, this model is a more compact 920MB download.
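A typical way to fetch the files is with the huggingface_hub package; the repository id below is a placeholder and should be replaced with the actual model repo:

    from huggingface_hub import snapshot_download

    # Download every file in the model repo to a local cache directory.
    # "user/llama-7b-onnx" is a placeholder repo id, not the project's real one.
    local_dir = snapshot_download(repo_id="user/llama-7b-onnx")
    print("model files in:", local_dir)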
Recent Developments
The project is continually updated with notable advancements:
- Released RWKV-4 models in onnx format.
- Addressed issues with TensorRT outputs.
- Implemented optimizations like mixed-precision quantization and memory pooling, enabling use on devices with as little as 2GB RAM.
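The memory-pool idea can be illustrated with a small sketch: instead of keeping every part of the model resident, a fixed-size pool loads onnx sub-models on demand and evicts the least recently used one. This is a conceptual illustration only, not the project's actual implementation:

    from collections import OrderedDict
    import onnxruntime as ort

    class SessionPool:
        """Keep at most `capacity` onnx sessions alive; evict least recently used."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.sessions = OrderedDict()

        def get(self, path):
            if path in self.sessions:
                self.sessions.move_to_end(path)        # mark as recently used
            else:
                if len(self.sessions) >= self.capacity:
                    self.sessions.popitem(last=False)  # drop the oldest session
                self.sessions[path] = ort.InferenceSession(
                    path, providers=["CPUExecutionProvider"])
            return self.sessions[path]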
Key Features
- Model Support: Comes with LLaMa-7B and RWKV-430M onnx models, complete with standalone demos that require neither PyTorch nor transformers.
- Visualization and Quantization: Supports visualization tools and partial quantization for efficient deployment (see the quantization sketch after this list).
- Compatibility with Various Devices: Designed to function on embedded devices and in distributed systems leveraging diverse technologies like FPGA, NPU, and GPGPU.
- Utilization of ONNX Tools: Leverages the robust manufacturer support available for onnx formats, making it a practical choice for various applications.
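Partial quantization typically means converting only weight-heavy operators to int8 while leaving the rest of the graph in floating point. A hedged sketch using onnxruntime's dynamic quantizer (the file names are placeholders; the project's own quantization flow may differ):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Dynamically quantize weights to int8; activations remain in float.
    # Input/output paths are placeholders.
    quantize_dynamic(
        model_input="decoder_fp32.onnx",
        model_output="decoder_int8.onnx",
        weight_type=QuantType.QInt8,
    )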
How to Use
To utilize the llama.onnx models, users can follow these steps:
- Install Necessary Packages: Use pip to install the required Python packages.
  $ python3 -m pip install -r requirements.txt
- Execute the LLaMa Demo: Run the LLaMa demo without needing PyTorch (a conceptual sketch of its decoding loop follows these steps).
  $ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour"
  If low on memory, use the --poolsize option:
  $ python3 demo_llama.py ${FP16_ONNX_DIR} "bonjour" --poolsize 4
- Try the RWKV Demo: Run the RWKV demo using:
  $ python3 demo_rwkv.py ${FP16_ONNX_DIR}
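Conceptually, the LLaMa demo runs an autoregressive decoding loop over the exported onnx graph. The sketch below shows the general pattern with greedy sampling; the file name and the tensor names "input_ids" and "logits" are assumptions for illustration and will differ from the project's actual graphs:

    import numpy as np
    import onnxruntime as ort

    # Placeholder file and tensor names; the real exported graph differs.
    sess = ort.InferenceSession("llama_decoder.onnx",
                                providers=["CPUExecutionProvider"])

    def generate(token_ids, max_new_tokens=16):
        tokens = list(token_ids)
        for _ in range(max_new_tokens):
            ids = np.array([tokens], dtype=np.int64)
            (logits,) = sess.run(["logits"], {"input_ids": ids})
            next_id = int(np.argmax(logits[0, -1]))   # greedy pick of next token
            tokens.append(next_id)
        return tokens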
Model Export Procedures
For those interested in exporting their models, the project outlines a clear process:
- RWKV Export: Clone the necessary repository and run the provided conversion script to generate onnx files.
- LLaMa Export: Convert models to Hugging Face format before exporting with torch.onnx.export. Steps include using conversion scripts and running basic model inference checks (see the export sketch below).
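A minimal sketch of the export step, assuming a LLaMa checkpoint already converted to Hugging Face format. The checkpoint path, dummy input, and opset version are illustrative assumptions, not the project's exact script, and a real 7B export needs substantial RAM:

    import torch
    from transformers import AutoModelForCausalLM

    # Placeholder path to a checkpoint already in Hugging Face format.
    model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")
    model.eval()
    model.config.return_dict = False   # return plain tuples for tracing
    model.config.use_cache = False     # skip past-key-value outputs

    dummy_ids = torch.ones(1, 8, dtype=torch.long)  # dummy 8-token prompt
    torch.onnx.export(
        model,
        (dummy_ids,),
        "llama.onnx",
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "logits": {0: "batch", 1: "seq"}},
        opset_version=17,
    )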
Additional Notes
- Accuracy is checked by comparing onnx runtime outputs against reference outputs, with detailed comparisons provided (a comparison sketch follows these notes).
- Configuration examples and adjustments for different deployment scenarios are available.
- Mixed-precision kernel optimization is ongoing work toward further efficiency gains.
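One simple way to perform such an accuracy comparison, assuming reference logits have been saved from the original model (the file names and tolerances are placeholders):

    import numpy as np

    # Placeholder files: logits saved from the reference model and from onnx.
    ref = np.load("logits_torch.npy")
    out = np.load("logits_onnx.npy")

    # fp16 export tolerates small numeric drift; tune tolerances as needed.
    np.testing.assert_allclose(out, ref, rtol=1e-2, atol=1e-3)
    print("outputs match within tolerance")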
Acknowledgements and License
The llama.onnx project builds upon the work of a variety of open-source repositories and technologies. It is released under the GPLv3 license, emphasizing its open and collaborative nature.
For further detailed guidance or collaboration, contributors and interested users are invited to delve into the project's GitHub repository.