# Inference

lorax
LoRAX is a cost-effective framework for serving fine-tuned large language models efficiently on a single GPU while maintaining high throughput and low latency. It loads and merges LoRA adapters dynamically from sources such as HuggingFace and Predibase, so requests for many different adapters can be batched and served concurrently against one base model. With heterogeneous batching, optimized inference, and production-ready tooling such as Docker images and Prometheus metrics, LoRAX fits a wide range of deployment scenarios. It supports models like Llama and Mistral and is free for commercial use under the Apache 2.0 License.
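As a rough illustration of how requests target individual adapters, the sketch below assumes the `lorax-client` Python package and a LoRAX server already running locally on port 8080; the adapter ID is a placeholder, not a real fine-tune.

```python
from lorax import Client

# Assumes a LoRAX server started locally (e.g. via the project's Docker image).
client = Client("http://127.0.0.1:8080")

# Prompt the base model directly.
response = client.generate("Why is the sky blue?", max_new_tokens=64)
print(response.generated_text)

# Prompt through a dynamically loaded LoRA adapter (placeholder adapter ID).
response = client.generate(
    "Why is the sky blue?",
    adapter_id="some-org/some-lora-adapter",
    max_new_tokens=64,
)
print(response.generated_text)
```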
tract
Tract is a versatile neural network inference engine supporting ONNX and NNEF model optimization and execution. It efficiently converts models from TensorFlow and Keras, making it suitable for both embedded systems and larger devices. Supporting models like Inception v3 and Snips, it runs efficiently on Raspberry Pi. As an open-source project under Apache/MIT licenses, it invites community contributions for custom application development.
OmniQuant
OmniQuant is a comprehensive quantization technique designed for large language models. It performs well in both weight-only and weight-activation quantization, delivering accurate results under configurations such as W4A16 and W3A16. Users can take pre-trained models like LLaMA and Falcon from the OmniQuant model zoo and generate quantized weights from them. The repository also releases related algorithms such as PrefixQuant and EfficientQAT, which improve static activation quantization and time-memory efficiency. OmniQuant's weight compression reduces memory requirements, enabling efficient inference on GPUs and mobile devices, for example running LLaMa-2-Chat with W3A16g128 quantization. Detailed resources and scripts walk through the quantization process for specific computational settings.
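In this notation, W4A16 means weights are quantized to 4 bits while activations stay in 16-bit, and a suffix like g128 means quantization scales are computed per group of 128 weights. The snippet below is only a generic illustration of group-wise round-to-nearest weight quantization to make the notation concrete; it is not OmniQuant's learned-clipping algorithm.

```python
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Asymmetric round-to-nearest quantization with per-group scales.

    A plain RTN baseline for illustration; OmniQuant additionally learns
    clipping/transform parameters to reduce the quantization error.
    """
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)

    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    qmax = 2 ** bits - 1

    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = (-w_min / scale).round()

    q = (w_groups / scale + zero).round().clamp(0, qmax)   # integer codes
    w_deq = (q - zero) * scale                              # dequantized weights
    return q.to(torch.uint8), scale, zero, w_deq.reshape(out_features, in_features)

# Example: quantize a random 4096x4096 weight matrix to W4 with group size 128.
w = torch.randn(4096, 4096)
q, scale, zero, w_deq = quantize_groupwise(w, bits=4, group_size=128)
print("mean abs quantization error:", (w - w_deq).abs().mean().item())
```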
JetMoE
JetMoE-8B, an open-source AI model, exceeds the performance of Meta AI's LLaMA2-7B while costing less than $0.1 million to train. Trained only on public datasets with modest computational resources, JetMoE-8B is accessible to academic groups and keeps inference cheap thanks to its 2.2B active parameters. It also scores higher on the Open LLM Leaderboard and MT-Bench, showing that large language models can be trained efficiently without large budgets. Technical details and access options are available via MyShell.ai and the associated resources.
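A minimal sketch of loading the model with Hugging Face transformers, assuming the `jetmoe/jetmoe-8b` checkpoint on the Hub and a recent transformers release; depending on the version, `trust_remote_code` may or may not be required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name as published on the Hugging Face Hub (assumption; adjust if it differs).
model_id = "jetmoe/jetmoe-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # place layers on GPU if one is available
    trust_remote_code=True,  # may be needed on older transformers releases
)

inputs = tokenizer("Mixture-of-experts models are efficient because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```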
intel-extension-for-transformers
Intel Extension for Transformers improves Transformer model efficiency across platforms such as Intel Gaudi2, CPU, and GPU. Offering seamless Hugging Face API integration for model compression and software optimizations, it enhances models like GPT-J, BLOOM, and T5 for faster inference. The toolkit includes a flexible chatbot framework and expands low-bit inference capabilities, offering robust support for developers working with GenAI/LLM technologies.
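A minimal sketch of the Hugging Face-style API, assuming the extension's `AutoModelForCausalLM` wrapper with weight-only low-bit loading as described in its documentation; GPT-J is used here only as an example model.

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Any causal LM from the Hugging Face Hub should work; GPT-J is just an example.
model_name = "EleutherAI/gpt-j-6b"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit triggers the extension's weight-only low-bit quantization for faster CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```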
TensorRT-YOLO
The TensorRT-YOLO project supports enhanced inference for YOLOv3 to YOLO11 and PP-YOLOE models through NVIDIA TensorRT optimization. It integrates TensorRT plugins, CUDA kernels, and CUDA Graphs to deliver a fast object detection solution compatible with C++ and Python. Key features include ONNX export, command-line model export, and Docker deployment.
chat.petals.dev
A chatbot web app with LLM inference exposed through WebSocket and HTTP APIs. It can be hosted on your own server and supports models such as Llama 2 and StableBeluga2 served over the Petals network. The WebSocket API offers the best speed, while the HTTP API offers flexibility for simpler integrations. Designed for research, it provides features like token streaming and sampling control, making it a customizable building block for chatbot applications.
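A rough sketch of calling the HTTP API, assuming a self-hosted deployment and the `/api/v1/generate` endpoint described in the project's README; the host, port, and model name are placeholders that depend on how the server is configured.

```python
import requests

# Placeholder host/port for a self-hosted chat.petals.dev instance.
resp = requests.post(
    "http://localhost:5000/api/v1/generate",
    data={
        "model": "meta-llama/Llama-2-70b-chat-hf",  # one of the models configured on the server
        "inputs": "A cat sat on",
        "max_new_tokens": 32,
    },
    timeout=120,
)
result = resp.json()
print(result.get("outputs", result))
```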
flutter-tflite
The TensorFlow Lite Flutter plugin is designed to integrate machine learning capabilities seamlessly into Flutter applications. It offers efficient inference for Android and iOS by utilizing TensorFlow Lite's API and supports acceleration with NNAPI and GPU delegates. The plugin's structure is consistent with TensorFlow Lite Java and Swift APIs, ensuring smooth integration and low-latency performance. Contributions are encouraged to meet evolving standards and improve support for the Flutter community in machine learning development.
languagemodels
This Python package makes it practical to run large language models on systems with as little as 512MB of RAM, supporting tasks such as instruction following and semantic search while keeping data local. Performance can be improved through GPU acceleration and int8 quantization. Easy to install and suited to learners and professionals alike, it is useful for building chatbots, retrieving information, and educational projects, as well as potential commercial use cases.
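A short sketch of the package's high-level API, assuming the `languagemodels` PyPI package and its `do`, `store_doc`, and `get_doc_context` helpers as shown in the project's examples; the document texts are placeholders.

```python
import languagemodels as lm

# Instruction following on a small local model that fits in a few hundred MB of RAM.
print(lm.do("Translate 'good morning' to French"))

# Simple semantic search: store documents, then retrieve relevant context for a query.
lm.store_doc("The Eiffel Tower is located in Paris and was completed in 1889.", "Eiffel Tower")
lm.store_doc("Mount Everest is the highest mountain above sea level.", "Everest")
context = lm.get_doc_context("When was the Eiffel Tower finished?")

# Combine retrieved context with an instruction for a grounded answer.
print(lm.do(f"Answer from the context: {context}\nWhen was the Eiffel Tower finished?"))
```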
chatllm.cpp
Chat in real time with models ranging from under 1B to over 300B parameters through a pure C++ implementation. Designed for optimized CPU performance, it features int4/int8 quantization, KV-cache enhancements, and parallel computing, and supports retrieval-augmented generation for continuous conversations. It keeps pace with recent model releases such as Llama 3.2 and offers Python, JavaScript, and C bindings for integration. Models can be converted to quantized formats for better performance, and the project provides step-by-step instructions for building and deploying an interactive AI chat application.
UniCATS-CTX-vec2wav
CTX-vec2wav is a vocoder from the AAAI-2024 paper 'UniCATS: A Unified Context-Aware Text-to-Speech Framework,' offering an advanced approach to text-to-speech enhancement through contextual VQ-diffusion and vocoding. Compatible with Linux and optimized for Python 3.9, this project provides clear guidance for both inference and training, suitable for various datasets and conditions. It supports high-fidelity output at 16kHz and 24kHz, utilizing resources such as ESPnet, Kaldi, and ParallelWaveGAN, and offers pre-trained models to advance speech synthesis development.
FastSAM
FastSAM is an image segmentation model that runs roughly 50 times faster than SAM while being trained on only a small fraction of its training data. It supports text, box, and point prompts for a user-friendly experience. The model is lightweight and memory-efficient, and recent updates improve edge quality and add semantic labels. Demos on HuggingFace and Replicate showcase applications such as anomaly detection.
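A minimal sketch of the prompt-based workflow, assuming the repository's `fastsam` package with the `FastSAM` and `FastSAMPrompt` classes from its examples; the checkpoint file and image paths are placeholders.

```python
from fastsam import FastSAM, FastSAMPrompt

# Placeholder checkpoint and image paths; download weights from the repository's links.
model = FastSAM("FastSAM-x.pt")
image = "images/dogs.jpg"

# Segment everything in the image, then filter the results with a prompt.
everything_results = model(image, device="cpu", retina_masks=True, imgsz=1024, conf=0.4, iou=0.9)
prompt_process = FastSAMPrompt(image, everything_results, device="cpu")

ann = prompt_process.text_prompt(text="a photo of a dog")                    # text prompt
# ann = prompt_process.box_prompt(bbox=[200, 200, 300, 300])                 # or a box prompt
# ann = prompt_process.point_prompt(points=[[620, 360]], pointlabel=[1])     # or a point prompt

prompt_process.plot(annotations=ann, output_path="output/dog.jpg")
```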
vits_chinese
Discover a cutting-edge TTS project that combines BERT and VITS to improve prosody and sound quality. The project uses Microsoft's natural speech features to create natural pauses and reduce sound errors through innovative loss techniques. Module-wise distillation is employed to speed up processing, resulting in high-quality audio outputs perfect for experimentation and research. Please note, this project is not intended for direct production use but serves as a valuable tool for TTS technological exploration.
wonnx
A GPU-accelerated ONNX inference runtime written entirely in Rust and designed with the web in mind. It supports Vulkan, Metal, and DX12, and models can be run from the CLI, from Rust or Python, or in the browser via WebGPU and WebAssembly. Available on Windows, Linux, macOS, and Android, it ships with examples, CLI tools, and extensive documentation for developers who need an efficient, cross-platform inference solution in Rust. Tested models include Squeezenet, MNIST, and BERT.
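A rough sketch of the Python binding, assuming the `wonnx` PyPI package exposes a `Session` with `from_path` and `run` methods as in its examples; the model file and the input/output tensor names are placeholders.

```python
from wonnx import Session

# Load an ONNX model and run it on the GPU via WebGPU/wgpu.
# The model path and tensor names ("x", "y") are placeholders for a tiny ReLU model.
session = Session.from_path("single_relu.onnx")
outputs = session.run({"x": [-1.0, 2.0]})
print(outputs)  # e.g. {"y": [0.0, 2.0]} for a single-ReLU graph
```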
Qwen2
Qwen2.5 offers developers multilingual models with long-context support, improving application performance across diverse deployment scenarios. Detailed performance metrics and enhanced fine-tuning support help optimize projects built on it.
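A minimal sketch of running an instruct model with Hugging Face transformers, assuming the `Qwen/Qwen2.5-7B-Instruct` checkpoint; other sizes follow the same pattern.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the benefits of long-context models in one sentence."},
]

# Build the chat prompt with the model's chat template, then generate.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```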
YOLOv8-TensorRT
YOLOv8-TensorRT boosts YOLOv8 performance by using TensorRT for faster inference. It leverages CUDA and C++ for engine construction and supports ONNX model export with NMS integration. The project offers flexible deployment options using Python and trtexec on various platforms, including Jetson, and its comprehensive setup guide makes it an efficient alternative to plain PyTorch inference.
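The repository's own export scripts handle the details (including the NMS plugin), but as a generic sketch of the underlying step, turning an exported ONNX file into a TensorRT engine with NVIDIA's `tensorrt` Python API (TensorRT 8.x) looks roughly like this; the file names are placeholders and this is not the repo's tooling.

```python
import tensorrt as trt

# Generic ONNX-to-TensorRT engine build; "yolov8s.onnx" / "yolov8s.engine" are placeholders.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov8s.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # half precision, if the GPU supports it

engine_bytes = builder.build_serialized_network(network, config)
with open("yolov8s.engine", "wb") as f:
    f.write(engine_bytes)
```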