Introduction to PPLNN
PPLNN, a recursive acronym for "PPLNN is a Primitive Library for Neural Network," is a high-performance deep-learning inference engine for efficient AI inferencing. It can run various ONNX models and has first-class support for OpenMMLab, making it a valuable tool for developers and researchers in the AI field.
Key Features
- High-Performance Inference: PPLNN is engineered to execute deep-learning models swiftly, supporting the demands of modern AI applications.
- ONNX Model Compatibility: The engine is capable of running diverse ONNX models, broadening its utility across different AI projects.
- OpenMMLab Support: Enhanced support for OpenMMLab is a standout feature, allowing seamless operation with this popular open-source deep-learning toolbox.
Important Updates
- As of April 25, 2024, PMX has transitioned to OPMX. Users of PPLNN are advised to rename pmx_params.json to opmx_params.json and re-export their models (a minimal rename sketch follows this list).
- Note that ChatGLM1 is no longer supported in the OPMX format.
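For users migrating existing exports, the parameter-file rename itself is a one-liner; here is a minimal sketch in Python, where the model directory path is a placeholder for wherever your export lives:

```python
import os

model_dir = "path/to/exported_model"  # placeholder: your export directory
old = os.path.join(model_dir, "pmx_params.json")
new = os.path.join(model_dir, "opmx_params.json")

# Rename the parameter file in place. Per the update above, re-exporting
# the model is still required after the PMX -> OPMX transition.
if os.path.exists(old):
    os.rename(old, new)
```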
Known Issues
PPLNN users should be aware of potential issues with some devices related to NCCL (NVIDIA Collective Communications Library). Devices such as the L40S and H800 may encounter illegal-memory-access errors; setting the environment variable NCCL_PROTO=^Simple can help mitigate these issues.
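Because NCCL reads its environment at initialization, the variable must be set before any NCCL-backed setup runs: export it in the launching shell, or set it at the top of your script. A minimal sketch, assuming the engine is imported and initialized after the assignment:

```python
import os

# "^Simple" tells NCCL to use every protocol except Simple, which is the
# workaround suggested for illegal-memory-access errors on L40S/H800.
os.environ["NCCL_PROTO"] = "^Simple"

# ...import and initialize the PPLNN LLM engine only after this point,
# otherwise NCCL may have already read its environment.
```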
LLM Features
PPLNN supports several features for Large Language Models (LLMs):
- New LLM Engine: Offers advanced features such as Flash Attention and Split-k Attention, which speed up the decoding phase.
- Dynamic Batching: Also known as Continuous Batching or In-flight Batching, this feature interleaves requests that arrive at different times to maximize throughput (see the sketch after this list).
- Tensor Parallelism and Graph Optimization: These allow efficient scaling across multiple GPUs.
- INT8 Quantization: Groupwise KV-cache quantization and per-token, per-channel quantization achieve numerical accuracy very close to FP16.
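PPLNN's own scheduler is internal to the engine, but the idea behind continuous batching can be illustrated with a toy decode loop: finished sequences leave the batch immediately and queued requests take their slots, rather than the whole batch waiting for its longest sequence. A minimal, engine-agnostic sketch, where all names are illustrative and not PPLNN API:

```python
from collections import deque

def decode_step(batch):
    """Placeholder for one fused decoding step over the current batch."""
    for req in batch:
        req["generated"] += 1

def continuous_batching(requests, max_batch_size):
    waiting = deque(requests)  # requests not yet scheduled
    running = []               # requests currently being decoded
    while waiting or running:
        # Refill free slots immediately instead of waiting for the whole
        # batch to drain: the core idea of in-flight batching.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire sequences that have reached their generation budget.
        running = [r for r in running if r["generated"] < r["max_tokens"]]

requests = [{"generated": 0, "max_tokens": n} for n in (3, 8, 5, 2)]
continuous_batching(requests, max_batch_size=2)
```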
LLM Model Zoo
PPLNN's LLM model zoo covers a range of popular model families:
- LLaMA 1/2/3
- ChatGLM 2/3
- Baichuan 1/2 7B
- InternLM 1/2
- Mixtral, Qwen 1/1.5
- Falcon, Bigcode
Getting Started with PPLNN
To get started with PPLNN, install the prerequisites: CMake, Git, and the Python development packages. The source code is freely available on GitHub, and building from source takes only a few commands. A Python demo is also included so users can quickly evaluate the engine with an ONNX model, as sketched below.
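The authoritative steps live in the repository's build documentation, but the typical flow is: clone, run the bundled build script with the backend you want, then point the bundled demo at an ONNX model. A hedged sketch driving those steps from Python follows; the repository URL, build flags, and demo paths follow the upstream README for the x86 backend and may differ across versions:

```python
import os
import subprocess

# Clone the repository and build with the x86 backend plus the Python API
# (adjust the -D options for CUDA, ARM, or RISC-V backends).
subprocess.run(
    ["git", "clone", "https://github.com/openppl-public/ppl.nn.git"],
    check=True,
)
subprocess.run(
    ["./build.sh", "-DPPLNN_USE_X86_64=ON", "-DPPLNN_ENABLE_PYTHON_API=ON"],
    cwd="ppl.nn",
    check=True,
)

# Run the bundled Python demo against a sample ONNX model from the repo.
env = dict(os.environ, PYTHONPATH="./pplnn-build/install/lib")
subprocess.run(
    ["python3", "tools/pplnn.py", "--use-x86",
     "--onnx-model", "tests/testdata/conv.onnx"],
    cwd="ppl.nn",
    env=env,
    check=True,
)
```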
Documentation
PPLNN provides comprehensive documentation to support users:
- Guides on building from source and integrating PPLNN.
- API references for both C++ and Python.
- Development guides for adding new engine operations and benchmarking tools across multiple platforms such as X86, CUDA, RISC-V, and ARM.
Contact and Contributions
Users can connect with the PPLNN community through platforms like WeChat and QQ Group. Contributions to improve the project are welcomed, adhering to the Contributor Covenant code of conduct.
Acknowledgements and License
PPLNN acknowledges the contributions of projects like onnxruntime, openvino, and TensorRT, among others. The project is licensed under the Apache License, Version 2.0, ensuring that it's free and open to modifications by the community.
For more details, users can visit the official website of PPLNN.