#inference
GFPGAN
GFPGAN offers a versatile tool for real-world blind face restoration by leveraging pretrained face GAN priors, such as StyleGAN2, to deliver natural-looking results even on low-resolution images. Recent updates add the V1.3 and V1.4 models for finer restoration and a demo on Hugging Face Spaces. The tool also supports background enhancement via Real-ESRGAN, runs on major operating systems, and is well suited to projects requiring combined face and image restoration.
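As a rough illustration of the workflow described above, the sketch below restores a single face image with the GFPGANer helper; the weight path, model version, and upscale factor are assumptions about a local setup.

```python
# A rough sketch of single-image restoration with the GFPGANer helper; the
# weight path, version, and upscale factor are assumptions about a local setup.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="experiments/pretrained_models/GFPGANv1.4.pth",  # assumed local weights
    upscale=2,
    arch="clean",             # architecture used by the v1.3/v1.4 models
    channel_multiplier=2,
    bg_upsampler=None,        # plug in Real-ESRGAN here for background enhancement
)

img = cv2.imread("inputs/low_res_face.jpg", cv2.IMREAD_COLOR)
# enhance() returns cropped faces, restored faces, and the full restored image
_, _, restored_img = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("results/restored.jpg", restored_img)
```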
onnxruntime
ONNX Runtime accelerates machine learning inference and training across platforms. It runs models exported from frameworks such as PyTorch and TensorFlow as well as classical libraries like scikit-learn and XGBoost, with optimizations targeted at the underlying hardware. For training, it can noticeably cut time on multi-node NVIDIA GPU setups with minimal changes to existing PyTorch scripts. Compatible with all major operating systems, ONNX Runtime improves performance while reducing cost.
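For context, a minimal Python inference call with ONNX Runtime looks roughly like the following; the model file name and dummy input shape are placeholders for a real exported model.

```python
# A minimal inference sketch with the ONNX Runtime Python API; "model.onnx"
# and the dummy input shape are placeholders for a real exported model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)  # shape depends on the model
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```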
lightseq
LightSeq is a CUDA-based library that accelerates training and inference for sequence models such as BERT and GPT. Using fp16 and int8 precision, it reports up to 15x speedups over standard implementations. It integrates with frameworks like Fairseq and Hugging Face and covers tasks including machine translation and text generation.
exui
ExUI is a lightweight, browser-based UI for running local inference with the ExLlamaV2 framework. It offers a minimal, responsive interface with persistent sessions and multiple instruction formats, and it supports EXL2, GPTQ, and FP16 models. A notepad mode is included, and the UI can run on Google Colab. Installation is straightforward with prebuilt wheels, and a recent Flash Attention build is recommended.
AnyDoor
AnyDoor presents a zero-shot approach to object-level image customization, personalizing images without per-subject fine-tuning. Key features include released training and inference code, online demos on ModelScope and Hugging Face, and applications such as virtual try-on and face swapping. Installation goes through Conda or Pip and builds on the ControlNet codebase, with community contributions extending its capabilities. The project aims to simplify otherwise intricate image generation and editing tasks.
ortex
Ortex, a wrapper for ONNX Runtime, enhances the deployment of ONNX models by supporting concurrent and distributed execution with Nx.Serving. This tool caters to various backends, including CUDA and Core ML, for efficient inference and easy model handling. Designed for models exported from PyTorch and TensorFlow, it offers a storage-only tensor implementation suitable for integration within Elixir applications. Installation involves adding Ortex to dependencies in mix.exs, with Rust required for compilation.
StableCascade
Stable Cascade builds on the Würstchen architecture, working in a highly compressed latent space to speed up inference and lower training costs. The model supports extensions such as finetuning and ControlNet and delivers strong prompt alignment and aesthetic quality. It shows notable efficiency gains over models like Stable Diffusion, making it well suited to applications that prioritize speed and cost. Flexible options for image compression and generation are provided, along with comprehensive resources for training and inference.
LLamaSharp
LLamaSharp is a versatile library offering efficient inference of LLaMA and LLaVA models across platforms on local devices, leveraging CPU and GPU capabilities. Its high-level APIs and RAG support facilitate seamless integration of large language models. With a variety of backends such as CUDA and Vulkan, LLamaSharp eases deployment without requiring native library compilation. It integrates well with libraries like semantic-kernel, and its comprehensive documentation assists in developing AI solutions.
llama.go
LLaMA.go is a framework for LLaMA model inference in Golang, reducing GPU dependencies and offering cross-platform support. It emphasizes performance and includes features like multi-threading and a standalone server mode. Future updates will enhance architecture support, performance optimizations, and compatibility with additional AI models.
aikit
AIKit is an adaptable platform for hosting, deploying, and fine-tuning large language models (LLMs). It exposes an OpenAI API-compatible endpoint, uses LocalAI for inference, and provides a flexible fine-tuning interface through Unsloth. Its minimal image size reduces the attack surface, and it supports multimodal models and any OpenAI API client. AIKit works in air-gapped environments, can host multiple models from a single image, deploys on Kubernetes, and supports AMD64, ARM64, and NVIDIA GPUs for faster inference.
segment-anything-fast
Segment Anything Fast accelerates image segmentation with a rewritten Segment Anything inference path. It combines bfloat16 inference, torch.compile with max-autotune, Triton kernels for long sequence lengths, scaled dot product attention, NestedTensors, and int8 quantization, delivering large speedups, particularly on A100 GPUs, with little loss of accuracy. Installation is straightforward, and the package slots into existing Segment Anything workflows.
OLMo
OLMo is AI2's repository of open language models, developed to advance the science of language modeling. It provides detailed instructions for a PyTorch-based setup and offers models such as OLMo 1B and 7B, trained on the Dolma dataset. Checkpoints for training and inference are available, with Hugging Face integration for easy loading. The repository aims to support clear and open research in language modeling.
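As an illustration of the Hugging Face integration mentioned above, the sketch below loads an OLMo checkpoint for generation; the model id "allenai/OLMo-1B" and the trust_remote_code path are assumptions about the released checkpoints.

```python
# A minimal sketch of running inference on an OLMo checkpoint through its
# Hugging Face integration; the model id and trust_remote_code path are
# assumptions about the released checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)

inputs = tokenizer("Language modeling is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```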
ort
The Rust wrapper for ONNX Runtime v1.19 enhances machine learning inference and training efficiency on both CPU and GPU. Building upon the inactive onnxruntime-rs project, it offers smooth migration paths and is utilized by projects like Twitter and Supabase for recommendation improvements and reduced serverless function cold starts. Comprehensive guides and community support via Discord and GitHub are available for seamless project integration.
ao
Torchao provides effective solutions for PyTorch users to optimize inference and training through quantization and sparsity, enhancing model efficiency. It enables significant speed and memory improvements with weight and activation quantization. For training, it introduces Float8 data types and sparse training, ensuring resource efficiency. Its compatibility with PyTorch's `torch.compile()` and FSDP2 facilitates integration into existing workflows while supporting custom kernel development and experimental features. Suitable for researchers and developers looking to enhance performance while maintaining accuracy.
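To make the quantization workflow concrete, here is a rough sketch of weight-only int8 quantization with torchao's quantize_ API composed with torch.compile; exact names and availability can vary between torchao releases.

```python
# A rough sketch of weight-only int8 quantization with torchao, based on its
# documented quantize_ API; exact names can differ between torchao releases.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval().to(torch.bfloat16).cuda()  # assumes a CUDA device; bf16 is the usual baseline

quantize_(model, int8_weight_only())               # swap Linear weights to int8
model = torch.compile(model, mode="max-autotune")  # torchao composes with torch.compile

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    print(model(x).shape)
```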
MegEngine
MegEngine is a versatile deep learning framework known for its unified approach to training and inference, prioritizing efficiency and usability. It significantly reduces GPU memory usage and runs inference with low hardware requirements on platforms such as x86, Arm, and CUDA. It supports major operating systems, installs via pip, and ships comprehensive documentation and tooling for optimizing models across platforms.
serving
TensorFlow Serving provides a stable, scalable platform for deploying machine learning models in production. It integrates tightly with TensorFlow while accommodating other model types and can serve multiple model versions simultaneously. Notable features include gRPC and HTTP inference endpoints, model version updates without client-side code changes, low-latency inference, and efficient GPU batching of requests, making it well suited to environments that need reliable model lifecycle and version management.
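As a concrete example of the HTTP inference endpoint, the sketch below posts a JSON request to a running TensorFlow Serving instance; the host, port (8501 is the conventional REST port), model name, and input shape are placeholders for a real deployment.

```python
# A minimal sketch of calling a TensorFlow Serving HTTP endpoint; the host,
# port, model name, and input shape are placeholders for a real deployment.
import json
import requests

payload = {"instances": [[1.0, 2.0, 5.0]]}  # must match the SavedModel's serving signature
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["predictions"])
```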
ScaleLLM
ScaleLLM is an inference system for large language models that uses techniques such as tensor parallelism, Flash Attention, and Paged Attention, and exposes OpenAI-compatible APIs. It supports leading open-source models such as Llama 3.1 and GPT-NeoX and targets efficient production deployment. The project is under active development, with enhancements like CUDA Graph, Prefix Cache, and Speculative Decoding planned. It installs easily from PyPI and offers a customizable, flexible server for performance- and scalability-sensitive workloads.
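Since the server exposes OpenAI-compatible APIs, a client call might look like the sketch below; the base URL, port, and model name are assumptions that depend on how the server was launched.

```python
# A sketch of querying a locally running ScaleLLM server through its
# OpenAI-compatible chat API; the port, base URL, and model name here are
# assumptions and depend on how the server was launched.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # whichever model the server is serving
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```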
DiT-MoE
DiT-MoE offers a scalable and efficient solution with its PyTorch implementation of Sparse Diffusion Transformers, designed to handle up to 16 billion parameters. Featuring advanced techniques like rectified flow-based training and expert routing, it aids in reducing computational load while enhancing model accuracy and convergence, with practical support from DeepSpeed. This project provides valuable assets such as pre-trained models and detailed scripts, catering to researchers requiring flexible and high-performing AI frameworks.
TensorRT_Tutorial
This tutorial shows how to improve model efficiency and speed with NVIDIA TensorRT's high-performance inference capabilities, with a particular focus on INT8 optimization. It collects translations of the user guide, sample code analysis, and practical usage experience, along with related videos and blog posts. Aimed at developers who want to get the most out of TensorRT, it addresses gaps in the official documentation and demonstrates good practices for deploying deep learning models.
ml-pen-and-paper-exercises
Discover a wide array of machine learning exercises focusing on linear algebra, graphical models, and inference methods, each complemented by detailed solutions. The topics include optimisation, factor graphs, hidden Markov models, and variational inference. Accessible as a compiled PDF on arXiv, this collection welcomes community input for enhancement. Perfect for enthusiasts of model-based learning and Monte-Carlo integration, offering in-depth comprehension through a pen-and-paper approach.
llama2.mojo
This project implements Llama 2 inference in Mojo, using SIMD and vectorization to reach roughly 250x the speed of the pure-Python baseline. It outperforms llama2.c by about 30% and llama.cpp by about 20% on multithreaded CPU inference. Supported models include the Stories checkpoints (260K to 110M parameters) and TinyLlama-1.1B-Chat-v0.2, with benchmarks reported on an Apple M1 Max. It is a useful reference for developers exploring efficient transformer inference in Mojo.
maxtext
MaxText is an open-source LLM training and inference codebase that runs efficiently on Google Cloud TPUs and GPUs. It supports models such as Llama2, Mistral, and Gemma, and achieves strong scalability and high Model Flops Utilization through Jax and the XLA compiler, making it adaptable to diverse LLM applications.
llm
The 'llm' project is archived, and its maintainers point users to modern Rust libraries for LLM inference instead: Ratchet for web ML, Candle for a broad range of models, plus wrappers like drama_llama and API aggregators for integration across the Rust ecosystem.
awesome-rust-llm
A curated collection of Rust libraries, frameworks, and tools for large language models (LLMs). It covers inference frameworks such as llm and rust-bert, tools like aichat and browser-agent, and core libraries including tiktoken-rs and polars, along with resources for LLM memory management and application development. Contributions are encouraged to keep the list up to date.
parler-tts
Parler-TTS is an open-source text-to-speech model that generates high-quality speech in a variety of speaker styles. Datasets, training code, and model weights are all released under permissive licenses. Trained on extensive audiobook data, it supports fast synthesis and lets users control speech characteristics through simple text prompts, making it a suitable framework for researchers and developers.
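As an example of prompt-driven control over speech characteristics, the sketch below follows the project's documented generation flow; the checkpoint name ("parler-tts/parler-tts-mini-v1") and argument names are assumptions that may differ by release.

```python
# A generation sketch following the Parler-TTS README; the checkpoint name
# and argument names may differ between releases.
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers her words calmly, at a slightly fast pace, in very clear audio."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```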
ltu
Discover how the LTU and LTU-AS models bridge audio and language processing, achieving state-of-the-art results in both closed-ended and open-ended audio question tasks. Access their PyTorch implementations, pretrained checkpoints, and comprehensive datasets crucial for audio and speech AI research. Try interactive demos on HuggingFace to explore their capabilities. These models demonstrate major advancements in audio and speech understanding, offering efficient inference methods such as APIs and local setups.
PickScore
The Pick-a-Pic project offers open-source datasets and a model to explore text-to-image user preferences. Available datasets include over a million examples in v2 and the original v1, along with the PickScore model. The repository includes a web application, installation instructions, and guides for inference, training, evaluation, and dataset download. A demo is available on HF Spaces, facilitating advanced AI research.
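For a sense of how inference works, the sketch below ranks candidate images for a prompt using CLIP-style similarity with the released PickScore checkpoint; the processor and model ids are assumptions about the published weights.

```python
# A rough sketch of ranking candidate images for a prompt with PickScore via
# transformers; the processor and model ids below are assumptions about the
# released checkpoints, and scoring follows standard CLIP-style similarity.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

prompt = "a watercolor painting of a lighthouse at dusk"
images = [Image.open("candidate_a.png"), Image.open("candidate_b.png")]

image_inputs = processor(images=images, return_tensors="pt").to(device)
text_inputs = processor(text=prompt, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    image_embs = model.get_image_features(**image_inputs)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    text_embs = model.get_text_features(**text_inputs)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    scores = model.logit_scale.exp() * (text_embs @ image_embs.T)

print(scores)  # higher score means the image better matches user preferences
```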
YOLOv8-TensorRT-CPP
This C++ implementation of YOLOv8 on TensorRT covers object detection, semantic segmentation, and body pose estimation. Optimized for GPU inference, it uses the TensorRT C++ API and works with ONNX models exported from PyTorch. The project targets Ubuntu and requires CUDA, cuDNN, and OpenCV built with CUDA support. Users will find comprehensive setup instructions, model conversion guidance, and INT8 inference optimization tips, making it a solid starting point for high-performance vision applications on NVIDIA GPUs.
Feedback Email: [email protected]