# CUDA
lorax
LoRAX is a cost-effective framework for serving fine-tuned large language models efficiently on a single GPU, maintaining high throughput and low latency. It enables dynamic adapter loading and merging from various sources such as HuggingFace and Predibase, ensuring seamless concurrent processing. With support for heterogeneous batching, optimized inference, and ready-for-production tools like Docker images and Prometheus metrics, LoRAX is well-suited for diverse deployment scenarios. This platform supports models like Llama and Mistral and is free for commercial use under the Apache 2.0 License.
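Dynamic adapter loading means the adapter to use can be chosen per request. A minimal sketch using the lorax-client Python package; the endpoint URL and adapter name below are placeholders:

```python
from lorax import Client

# Endpoint and adapter ID are placeholders; point these at your own
# LoRAX deployment and a LoRA adapter hosted on HuggingFace or Predibase.
client = Client("http://127.0.0.1:8080")
response = client.generate(
    "Why is the sky blue?",
    adapter_id="some-org/some-lora-adapter",  # hypothetical adapter name
    max_new_tokens=64,
)
print(response.generated_text)
```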
lightseq
Explore a library that significantly boosts sequence processing speed for training and inference, with CUDA-based support for models like BERT and GPT. LightSeq achieves up to 15x faster performance than conventional implementations by using fp16 and int8 precision. Compatible with frameworks like Fairseq and Hugging Face, it offers efficient computation for machine translation, text generation, and more.
gocv
The GoCV package provides OpenCV 4 support for Go developers on Linux, macOS, and Windows, enabling efficient image and video processing with hardware acceleration via CUDA for NVIDIA GPUs and Intel OpenVINO support. It includes examples and installation guides to streamline integration and leverage the latest OpenCV capabilities. The package is designed to track the newest Go releases, offering a reliable foundation for high-performance computer vision applications in Go.
k2
k2 aims to integrate Finite State Automaton (FSA) and Finite State Transducer (FST) into autograd-based platforms such as PyTorch and TensorFlow. This is particularly advantageous for speech recognition, allowing diverse training objectives and joint system optimization. The focus on pruned FSA composition facilitates efficient ASR decoding and training, utilizing a codebase largely in C++ and CUDA to support parallel execution. Progressing towards production, k2 offers Python integration with pybind11 and has speech recognition recipes in related repositories.
ppl.llm.serving
This project provides a scalable solution for deploying Large Language Models using gRPC on the PPL.NN platform. Key features include model exporting and configuration for optimal performance on x86_64 and arm64 systems with CUDA. The environment supports inference, benchmarking, and seamless client-server interactions. Designed for Linux, it requires GCC, CMake, and CUDA, ensuring compatibility and enhanced performance.
koila
Koila provides an efficient method to resolve 'CUDA error: out of memory' issues in PyTorch with minimal code changes. By dynamically adjusting batch sizes to GPU availability and using lazy evaluation, it enhances resource management and performance. Its lightweight design supports large batch operations and eases debugging, integrating seamlessly with existing PyTorch setups. Available via PyPI, Koila plans enhancements such as multi-GPU support, though it is not yet fully production-ready.
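A minimal sketch of the lazy-wrapping pattern from Koila's README; the model and loss here are stand-ins:

```python
import torch
import torch.nn as nn
from koila import lazy

model = nn.Linear(128, 10)
loss_fn = nn.CrossEntropyLoss()

# batch=0 marks dimension 0 as the batch axis; evaluation is deferred,
# and koila splits the batch to fit whatever GPU memory is available.
inputs, labels = lazy(torch.randn(64, 128), torch.randint(0, 10, (64,)), batch=0)
loss = loss_fn(model(inputs), labels)
loss.backward()
```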
tiny-tensorrt
Discover a user-friendly NVIDIA TensorRT wrapper for deploying ONNX models in C++ and Python. Although no longer actively maintained, tiny-tensorrt emphasizes efficient deployment with minimal code. Dependencies include CUDA, cuDNN, and TensorRT, easily set up through NVIDIA's Docker images. With support for multiple CUDA and TensorRT versions, it integrates smoothly into projects. Documentation and installation guidance are available on its GitHub wiki.
pytorch_scatter
Enhance your PyTorch experience with this extension offering efficient sparse update operations such as scatter and segment reductions. These operations fill a gap in PyTorch's native API and are well suited to segment-wise reductions over irregularly structured data. They run on CPU and GPU, support multiple data types, and include composite functions such as scatter_std and scatter_softmax; all operations provide backward passes and are traceable. Installation is simple via Anaconda or pip across various OS and CUDA versions.
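A short example of the core scatter reduction (the segment_* and composite operations follow the same index-based pattern):

```python
import torch
from torch_scatter import scatter

src = torch.tensor([1.0, 3.0, 2.0, 4.0, 5.0])
index = torch.tensor([0, 0, 1, 1, 1])  # segment id of each source element

# All elements sharing an index are reduced into one output slot.
sums = scatter(src, index, dim=0, reduce="sum")  # tensor([ 4., 11.])
maxs = scatter(src, index, dim=0, reduce="max")  # tensor([3., 5.])
```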
accelerated-scan
The Accelerated Scan project implements efficient GPU-based forward and backward associative scans, improving the processing of first-order recurrences, particularly in state space models and linear RNNs. It utilizes a C++ CUDA kernel for chunked processing and takes advantage of advanced GPU communication techniques like warp shuffling and shared memory use. Implementations are available in both CUDA and Triton, ensuring faster performance with maintained numerical accuracy. Benchmarks highlight notable improvements over conventional methods, making it a suitable option for developers requiring dependable associative scanning capabilities.
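A sketch of the intended usage, assuming the scan entry points shown in the project README (gates and tokens shaped batch x dim x seqlen, with a power-of-two sequence length):

```python
import torch
from accelerated_scan.warp import scan  # accelerated_scan.triton offers the same interface

batch, dim, seqlen = 2, 32, 1024
gates = 0.999 + 0.001 * torch.rand(batch, dim, seqlen, device="cuda")
tokens = torch.rand(batch, dim, seqlen, device="cuda")

# First-order recurrence: h[t] = gates[t] * h[t-1] + tokens[t]
hidden = scan(gates, tokens)
```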
llm.c
llm.c enables efficient pretraining of GPT-2 and GPT-3 in plain C/CUDA, circumventing large frameworks such as PyTorch. The project is developed collaboratively, highlighting both educational and practical perspectives for large model training, and supports further language adaptations, making it suitable for a diverse range of deep learning practitioners.
stable-diffusion-webui-forge
Stable Diffusion WebUI Forge facilitates development through efficient resource management, rapid inference, and innovative features. Taking inspiration from Minecraft Forge, this platform enhances Stable Diffusion WebUI by integrating popular extensions and supporting sophisticated image editing. It features an easy setup compatible with multiple CUDA and PyTorch versions, allowing for seamless updates and effective GPU usage. Users can access comprehensive guides and various extensions, and report performance issues or request enhancements, making it a reliable platform for image creation and enhancement.
face-alignment
The project provides an accurate method for detecting 2D and 3D facial landmarks through Python, utilizing FAN's deep learning techniques. It is compatible with several face detectors such as SFD, Dlib, and BlazeFace and can handle batch processing for directories. Operating efficiently on both CPU and GPU, it is optimized for devices with CUDA capabilities. Users can select different precision settings to improve performance. The installation is simple via pip or conda, with options for source builds and Docker support. User contributions and feedback are welcomed to enhance the project.
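A minimal detection sketch based on the project README (the landmark-type enum spelling varies across versions, and the image path is a placeholder):

```python
import face_alignment
from skimage import io

# Recent releases use LandmarksType.TWO_D; older ones used LandmarksType._2D.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cuda")

image = io.imread("face.jpg")        # placeholder path
landmarks = fa.get_landmarks(image)  # one (68, 2) array per detected face
```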
LLM-Kit
This open-source project provides a versatile WebUI toolkit designed to manage language model workflows effortlessly. Users can create custom models and applications without writing code; the toolkit itself runs on Python and CUDA. It features robust modules, including APIs for prominent language models such as OpenAI and Baidu's Wenxin Yiyan, and supports chat, image generation, dataset processing, and embedding models. Key features include role-play settings with memory and background libraries, and compatibility with large-scale models like ChatGLM and Phoenix-Chat. Operating under the AGPL-3.0 license, it encourages community involvement and shared development.
TensorRT-YOLO
The TensorRT-YOLO project supports enhanced inference for YOLOv3 to YOLO11 and PP-YOLOE models through NVIDIA TensorRT optimization. It integrates TensorRT plugins, CUDA kernels, and CUDA Graphs to deliver a fast object detection solution compatible with C++ and Python. Key features include ONNX export, command-line model export, and Docker deployment.
tutel
Tutel MoE provides an efficient implementation of Mixture-of-Experts, including 'No-penalty Parallelism' for adaptable training and inference. It is compatible with PyTorch and supports CUDA and ROCm GPUs as well as CPU execution. Recent updates feature new benchmarks, tensor-core options, and improved communication. Tutel enables switching parallelism configurations at runtime without extra overhead and offers straightforward installation and testing. It supports distributed modes across multi-node and multi-GPU setups, making it suitable for developers looking to improve performance and scalability in machine learning frameworks.
efficient-dl-systems
This repository provides the comprehensive 2024 course materials for Efficient Deep Learning Systems taught at HSE University and Yandex School of Data Analysis. Topics covered include core GPU architecture, CUDA API, experiment management, distributed training, and Python web deployment. Detailed week-by-week content supports learning of both theoretical foundations and practical applications, emphasizing real-world examples and project-based studies.
jax-triton
The jax-triton repository facilitates effective JAX and Triton integration for optimized GPU computations. It utilizes 'jax_triton.triton_call' to implement Triton functions within 'jax.jit'-compiled routines. Users can begin with examples like Triton's vector addition tutorial and progress to advanced tasks such as fused attention. Installation is straightforward, supporting both stable and nightly Triton releases, with prerequisite CUDA-compatible JAX. Developers can participate by cloning the repository and conducting editable installs, supported by tests using 'pytest'.
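A condensed version of the vector-addition example from the repository, showing a Triton kernel invoked through 'jax_triton.triton_call' inside a jitted function:

```python
import jax
import jax.numpy as jnp
import jax_triton as jt
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, block_size: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * block_size + tl.arange(0, block_size)
    tl.store(out_ptr + offsets, tl.load(x_ptr + offsets) + tl.load(y_ptr + offsets))

def add(x, y):
    # Describe the output buffer so JAX can allocate it for the kernel.
    out_shape = jax.ShapeDtypeStruct(shape=x.shape, dtype=x.dtype)
    grid = (triton.cdiv(x.size, 8),)
    return jt.triton_call(x, y, kernel=add_kernel, out_shape=out_shape,
                          grid=grid, block_size=8)

x = jnp.arange(8, dtype=jnp.float32)
print(jax.jit(add)(x, x))  # triton_call composes with jax.jit
```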
llama3.np
Explore the pure NumPy implementation of the Llama 3 model for a straightforward approach to deep learning architectures without the need for CUDA. Running inference with the stories15M weights, this resource is ideal for developers and researchers interested in understanding how Llama 3 works. It offers an easy-to-follow guide to performing model inference in plain Python, serving as a valuable tool for both learning and research.
bitsandbytes
Bitsandbytes library provides an efficient Python interface for CUDA functions, featuring 8-bit optimizers, matrix multiplication, and quantization for 8-bit and 4-bit operations. It extends support to multiple backends like AMD GPUs and Intel processors, improving cross-platform functionality. The recent alpha release showcases its commitment to expanding hardware compatibility, with ongoing efforts for Windows and future Apple Silicon support, inviting constructive community feedback for continual enhancement.
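The 8-bit optimizers are drop-in replacements for their torch.optim counterparts; a small sketch:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512).cuda()

# Keeps Adam's state in 8 bits, shrinking optimizer memory for large models.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

loss = model(torch.randn(16, 512, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```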
sd-webui-reactor
This open-source project extends Stable Diffusion with rapid face-swapping capabilities, supporting multiple faces, gender detection, and image enhancements like restoration and upscaling. It offers broad compatibility across SD WebUIs and systems such as Mac M1/M2, with features including API access and CUDA acceleration for enhanced performance; responsibility for ethical use of the tool rests with the user.
pyg-lib
This project provides a selection of pre-built Python wheels that support multiple PyTorch and CUDA versions and are designed for Linux, Windows, and macOS. Although Windows support is still being enhanced, the Linux options are extensive, supporting PyTorch 1.12 through 2.4. The library can be easily installed using pip, with both stable and nightly builds available. Supporting Python 3.9 to 3.12, it offers a versatile solution for various scientific computing applications.
YOLOv8-TensorRT
YOLOv8-TensorRT boosts YOLOv8 performance by employing TensorRT for faster inference. It leverages CUDA and C++ for engine construction and facilitates ONNX model export with NMS integration. The project provides flexible deployment options using Python and trtexec on various platforms, including Jetson. A comprehensive setup guide helps adapt it to different AI deployment needs, offering an efficient alternative to plain PyTorch inference.
gorgonia
Gorgonia provides a suite of tools for developing and testing machine learning models using graph computation in Go. It is competitive in speed with TensorFlow and Theano, supports CUDA for GPU computation, and aims to support distributed systems. The library suits developers familiar with Go who want to build effective ML systems, offering automatic differentiation, symbolic differentiation, and gradient descent optimization. Gorgonia fosters experimentation with alternative deep learning approaches and is backed by a committed community.
rtp-llm
Created by Alibaba's Foundation Model Inference Team, the rtp-llm inference engine is engineered for high-performance acceleration of large language models across Alibaba platforms such as Taobao and Tmall. It features optimized CUDA kernels and broad hardware support, including AMD ROCm and Intel CPUs, and integrates seamlessly with HuggingFace models. The engine supports multi-machine, multi-GPU parallelism and introduces features like contextual prefix caches and speculative decoding, enhancing deployment efficiency on Linux with NVIDIA GPUs. Explore its proven reliability and broad usage in Alibaba's AI projects.
clip-guided-diffusion
This project uses CLIP-powered diffusion models for text-to-image transformation, offering options for prompt complexity and image size, compatible with CPU and GPU. It includes features for blending images, timestep adjustment, and support for both CLI and Python API. Straightforward installation and wandb integration for output logging are also available.
CCTag
This project provides a library for detecting and localizing CCTag markers formed by concentric circles, using both CPU and GPU technologies. Based on research from the CVPR 2016 conference, it is designed to operate under challenging conditions and requires CUDA compatibility. Offering continuous integration across Windows and Linux, it ensures updated builds and smooth integration. Resources such as printable markers and comprehensive documentation are available for enhanced deployment. Developed through the European Union’s Horizon 2020 program, CCTag is licensed under MPL v2.
ThunderKittens
ThunderKittens streamlines creating high-performance deep learning kernels with CUDA, soon supporting MPS and ROCm. It focuses on simplicity, extensibility, and performance to optimize tile manipulation specific to modern GPU architectures. Key features include tensor core optimization, asynchronous copy techniques to reduce latency, and distributed shared memory usage for efficient GPU usage. Supporting CUDA 12.3+ and C++20, ThunderKittens is powerful yet straightforward to incorporate, offering pre-built PyTorch kernels and an active developer community.
neuralangelo
Explore neural surface reconstruction with Neuralangelo, built on NVIDIA's Imaginaire library. The project walks through setting up environments with Docker and Conda, preparing video frame data, CUDA-accelerated mesh processing, and optimizing isosurface extraction to improve 3D model quality with adaptable configuration options.
3d-ken-burns
This implementation uses the 3D Ken Burns Effect to animate static images with PyTorch, incorporating virtual camera pans and zooms. It utilizes CuPy for efficient CUDA processing and offers both automated and manual camera path adjustments. The tool can be run locally or on Colab with community notebooks, and includes depth estimation features for refined animations. Benchmarking scripts are provided for performance verification. Licensed for non-commercial use under Creative Commons, this tool provides an innovative approach to adding dynamic depth and motion to images.
CUDA-GEMM-Optimization
Explore performance enhancement methods for GEMM using CUDA kernels optimized for NVIDIA GPUs, specifically the GeForce RTX 3090. The project ensures compatibility with GPUs with compute capability 7.0 or above, using the NVIDIA NGC CUDA Docker container for efficient build and execution. Utilize techniques like 2D block tiling and vectorized memory access to optimize FP32 and FP16 calculations with or without Tensor Cores for significant performance gains.
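To convey the 2D block-tiling idea without the CUDA boilerplate, here is an illustrative NumPy sketch (not the repository's kernel code): each output tile accumulates partial products over k-tiles, mirroring how a thread block stages sub-tiles of A and B in shared memory:

```python
import numpy as np

def tiled_gemm(A, B, tile=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # Loop over output tiles; the innermost loop accumulates over k-tiles,
    # the axis a CUDA kernel would stage through shared memory.
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 160).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```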
xtts-streaming-server
XTTS streaming server provides a streamlined setup for audio streaming demos with Docker and CUDA support. Not intended for production, it facilitates audio task execution using pre-built or custom Docker images and allows model fine-tuning with specific files. It highlights the COQUI TOS agreement requirement, with all models licensed under CPML. Testing the server is simplified through scripts or Gradio demos, ensuring ease of use for demonstration setups.
slowllama
Explore how slowllama facilitates fine-tuning of Llama2 and CodeLlama models on Apple M1/M2 and NVIDIA GPUs without quantization. Learn about SSD and RAM offloading for efficient model management, with a focus exclusively on LoRA-based fine-tuning, ensuring effective parameter updates on consumer-grade hardware. Review experimental results to understand GPU and memory optimization for large-model fine-tuning.
stable-fast
Stable-fast provides top-tier inference performance for diffuser models, such as the StableVideoDiffusionPipeline, and compiles models in seconds rather than the minutes heavier toolchains like TensorRT can take. It natively supports dynamic shapes, LoRA, and ControlNet. Built for HuggingFace Diffusers on NVIDIA GPUs, the framework leverages techniques like cuDNN convolution fusion and low-precision fused GEMM. Designed for compatibility with multiple PyTorch versions and acceleration tools, Stable-fast requires minimal adjustments for maximum performance.
TensorRT
Discover NVIDIA's TensorRT open-source components, including plugins and ONNX parser support. The repository provides sample apps showcasing platform capabilities, enhancements, and fixes. Developers will find coding guidelines and contribution instructions helpful. The Python package facilitates installation, compatible with CUDA, cuDNN, and vital tools for smooth deployment. Engage with the TensorRT community for updates and enterprise support through NVIDIA AI Enterprise. Access detailed developer guides and forums for further assistance.
cutlass
CUTLASS 3.6.0 provides a versatile framework for CUDA matrix operations with modular templates and the CuTe library facilitating efficient tensor manipulation. It accommodates mixed-precision computations such as FP16, BF16, and TF32, optimized for NVIDIA platforms from Volta to Hopper. Updates feature structured sparse GEMM improvements, a refined convolution API, and expanded support for additional data types and architectures, promoting exceptional performance and wide compatibility.
flowframes
Flowframes is a user-friendly Windows interface for video interpolation, supporting interpolation models such as RIFE and DAIN through both PyTorch and NCNN implementations. Designed for Vulkan-compatible GPUs, this open-source donationware project provides free builds, with additional features available on Patreon. It offers customizable settings for frame de-duplication and scene changes, alongside automated configurations for quick setup. With Auto-Encode and loop interpolation, it caters to both 2D animations and high-resolution video projects.
jittor
Jittor is a dynamic and efficient deep learning framework utilizing JIT compiling and meta-operators for specialized code generation. It supports a wide range of models, such as image recognition and reinforcement learning, with an easy-to-use Python front-end and a powerful CUDA and C++ back-end. Flexible installation options are available via pip, Docker, or manual setup, compatible with Linux, macOS, and Windows. The framework is continuously improved by contributions from its active community.
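A minimal model following the pattern in Jittor's README; note that modules implement execute() rather than forward():

```python
import jittor as jt
from jittor import Module, nn

jt.flags.use_cuda = 1  # set to 0 to run on CPU

class Model(Module):
    def __init__(self):
        self.fc1 = nn.Linear(1, 10)
        self.relu = nn.Relu()
        self.fc2 = nn.Linear(10, 1)

    def execute(self, x):  # Jittor's equivalent of forward()
        return self.fc2(self.relu(self.fc1(x)))

model = Model()
pred = model(jt.array([[0.5]]))
```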
gaussian-splatting
This project presents an innovative method for real-time rendering of radiance fields, emphasizing 1080p high-resolution novel-view synthesis without speed sacrifice. It utilizes 3D Gaussian models based on sparse data from camera calibration to enhance scene representation with anisotropic covariance. The framework incorporates a visibility-aware rendering algorithm supporting anisotropic splatting to improve both training efficiency and real-time rendering. Recent updates have introduced accelerated training and OpenXR support, making this tool valuable for both researchers and those new to efficient real-time rendering.
lectures
This series provides a thorough exploration of CUDA, PyTorch, and parallel processing techniques through lectures led by industry professionals. Topics include profiling CUDA kernels, optimizing PyTorch, and understanding advanced GPU programming concepts. The collection includes practical notebooks and slides, offering valuable resources for both beginners and experienced developers to enhance their skills and optimize GPU-driven applications.
hidet
Hidet is an open-source deep learning compiler optimizing DNN models from PyTorch and ONNX to CUDA kernels, tailored for efficient inference on NVIDIA GPUs. It supports Linux with CUDA 11.6+ and Python 3.8+, applying graph and operator optimizations for enhanced performance. Comprehensive documentation and community engagement facilitate ongoing development, with straightforward installation and usage to integrate into workflows.
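Hidet plugs into torch.compile as a backend, so adopting it is typically a one-line change:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
).cuda().eval()

# After `pip install hidet`, hidet registers itself as a torch.compile backend.
compiled = torch.compile(model, backend="hidet")
out = compiled(torch.randn(1, 64, device="cuda"))
```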
csprng
torchcsprng delivers AES 128-bit encryption in ECB and CTR modes, along with secure pseudorandom number generators for PyTorch, via a C++/CUDA extension that supports both CPU and CUDA. It offers APIs for flexible tensor encryption and decryption, allowing the choice of seed-based or crypto-secure random devices, thus ensuring high security across applications. Optimized for PyTorch, it enhances performance and security in parallel random number generation on CUDA and CPU, suitable for data manipulation and deep learning. The newest version supports Python 3.7-3.9 and CUDA-enabled setups.
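A small sketch of the generator API, assuming the entry points documented in the torchcsprng README:

```python
import torch
import torchcsprng as csprng

# Crypto-secure generator backed by the OS random device; for reproducible
# streams, csprng.create_mt19937_generator(seed) is the seed-based option.
gen = csprng.create_random_device_generator("/dev/urandom")
sample = torch.empty(8, device="cuda").normal_(generator=gen)
```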
apex
This repository provides NVIDIA tools that facilitate advanced mixed precision and distributed training in PyTorch. While some modules are deprecated in favor of equivalent native PyTorch solutions, Apex still offers streamlined training with ImageNet examples, synchronized batch normalization, and integration with NVIDIA's NCCL library. Available for Linux, with experimental Windows support, it includes custom C++/CUDA extensions for enhanced performance.
tiny-cuda-nn
Discover an efficient framework for training and querying neural networks, featuring a fast fully fused multi-layer perceptron and a versatile multiresolution hash encoding. Targeting NVIDIA GPUs, it offers a C++/CUDA API and a PyTorch extension, supporting various encodings, losses, and optimizers. Benchmarked on an RTX 3090, it also ships useful utilities, performance benchmarks, and examples.
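A sketch of the PyTorch extension, with hash-grid encoding and fully fused MLP configs loosely following the documented JSON options (the exact keys are assumptions drawn from the project's examples):

```python
import torch
import tinycudann as tcnn

encoding_cfg = {"otype": "HashGrid", "n_levels": 16, "n_features_per_level": 2,
                "log2_hashmap_size": 19, "base_resolution": 16, "per_level_scale": 2.0}
network_cfg = {"otype": "FullyFusedMLP", "activation": "ReLU",
               "output_activation": "None", "n_neurons": 64, "n_hidden_layers": 2}

# Fused encoding + MLP: maps 3D coordinates to a single scalar output.
model = tcnn.NetworkWithInputEncoding(n_input_dims=3, n_output_dims=1,
                                      encoding_config=encoding_cfg,
                                      network_config=network_cfg)
out = model(torch.rand(128, 3, device="cuda"))
```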
GaussianFlow
This overview presents a new method for 4D content creation built on Gaussian dynamics, improving training efficiency through selective management of temporal variables without degrading performance. It covers applying Gaussian parameters over sequential timesteps and computational techniques such as SVD and flow supervision that enable efficient 4D rendering. The guide is a useful resource for developers and researchers working to improve 4D graphics rendering.
Feedback Email: [email protected]