# GPU
FlexGen
FlexLLMGen (formerly FlexGen) enables efficient large language model inference on a single GPU by offloading weights and the KV cache to CPU memory and disk and by scheduling large batches for high throughput. Designed for throughput-oriented tasks such as benchmarking and batch data processing, it trades latency for lower cost; while less suited to small-batch, interactive serving, it remains a practical option for scalable deployments.
helix
Helix provides a secure platform for hosting open source AI models directly in data centers or VPCs, featuring RAG, API-calling, and model fine-tuning with a user-friendly drag-and-drop interface. This solution optimizes GPU usage and reduces latency for scalable applications. Install effortlessly with Docker and Kubernetes, guided by complete documentation. Helix is designed for personal, educational, and small business applications, promoting innovation while safeguarding data security and control. Engage with the Helix community to explore and contribute to cutting-edge AI solutions.
paxml
PaxML (Pax) is a JAX-based framework for configuring and running machine learning experiments at scale on Cloud TPUs, with GPU support as well. Available from PyPI or GitHub, it includes example configurations for large models such as GPT-3, along with detailed documentation and Jupyter Notebook tutorials. NVIDIA contributes optimizations that improve GPU performance across a range of computational scenarios.
HierarchicalKV
HierarchicalKV, a component of NVIDIA Merlin, provides a hierarchical key-value storage system designed for recommender systems. It manages feature embeddings across GPU high-bandwidth memory and host memory, addressing single-GPU memory limits and the cost of CPU-GPU data movement. By keeping lookups on the GPU and applying configurable eviction strategies, HierarchicalKV improves the performance of large recommendation models, offering high load factors and customizable management strategies for building, evaluating, and deploying them.
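HierarchicalKV itself is a C++/CUDA library, but the tiering idea can be sketched in a few lines. The snippet below is a conceptual illustration only (not the library's API): a two-tier store that keeps hot entries in a small "device" tier and evicts least-recently-used entries to a larger "host" tier.

```python
# Conceptual sketch only: a two-tier key-value store with LRU eviction,
# standing in for device-resident embeddings spilling over to host memory.
from collections import OrderedDict

class TwoTierKV:
    def __init__(self, device_capacity: int):
        self.device = OrderedDict()   # stands in for GPU-resident embeddings
        self.host = {}                # stands in for host-memory overflow
        self.device_capacity = device_capacity

    def put(self, key, value):
        if key in self.device:
            self.device.move_to_end(key)
        self.device[key] = value
        if len(self.device) > self.device_capacity:
            old_key, old_val = self.device.popitem(last=False)  # evict LRU entry
            self.host[old_key] = old_val

    def get(self, key):
        if key in self.device:
            self.device.move_to_end(key)
            return self.device[key]
        value = self.host.pop(key, None)
        if value is not None:
            self.put(key, value)      # promote back to the device tier
        return value

store = TwoTierKV(device_capacity=2)
for k in range(4):
    store.put(k, [0.1 * k] * 4)
print(store.get(0), len(store.device), len(store.host))
```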
rsl_rl
This project provides a fast, GPU-optimized implementation of reinforcement learning algorithms. Initially derived from the `rl-pytorch` code shipped with NVIDIA Isaac Gym, it currently supports PPO, with algorithms such as SAC and DDPG planned. Maintained by the Robotic Systems Lab at ETH Zurich together with NVIDIA researchers, the framework supports logging via TensorBoard, Weights & Biases, and Neptune. It targets researchers extending reinforcement learning capabilities, welcomes community contributions, and follows the Google Style Guide for documentation. To set up, clone the repository and follow the installation instructions for integration into your environment.
lightning-thunder
The Lightning Thunder project speeds up PyTorch models with a source-to-source compiler. Supporting single-GPU and multi-GPU setups, it dispatches work to executors such as nvFuser, torch.compile, and cuDNN. With reported training speedups of up to 40%, Thunder offers substantial efficiency gains for machine learning development. The tool is currently in alpha, and contributions and experimentation are encouraged.
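A minimal sketch of the compiler's entry point, assuming the `thunder.jit` API described in the project README; the toy model and tensor shapes are placeholders.

```python
import torch
import thunder

# A small placeholder model; in practice Thunder targets full transformer stacks.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# thunder.jit wraps the module; the first call traces the forward pass and hands
# fused regions to executors such as nvFuser, torch.compile, or cuDNN when present.
compiled_model = thunder.jit(model)

x = torch.randn(8, 1024)
y = compiled_model(x)   # subsequent calls reuse the cached trace
print(y.shape)
```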
mergekit
MergeKit offers an effective solution for merging pre-trained language models with support for algorithms like Linear, SLERP, and Task Arithmetic. It is suitable for resource-constrained settings, functioning on both CPU and GPU with low VRAM requirements. Features include lazy tensor loading and layer-based model assembly. Compatible with models like Llama, Mistral, and GPT-NeoX, it also provides an intuitive GUI on Arcee's platform and supports sharing on the Hugging Face Hub. A versatile YAML configuration enables custom merge strategies.
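As a sketch of that YAML-driven workflow, the snippet below writes a small SLERP merge configuration and calls the `mergekit-yaml` command; the model names, layer ranges, and interpolation factor are placeholders to adapt to your own models.

```python
# Minimal sketch: write a SLERP merge config and invoke mergekit's CLI.
import subprocess
from pathlib import Path

config = """\
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1        # placeholder model A
        layer_range: [0, 32]
      - model: HuggingFaceH4/zephyr-7b-beta     # placeholder model B
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1
parameters:
  t: 0.5            # interpolation factor between the two models
dtype: bfloat16
"""

Path("merge-config.yml").write_text(config)

# mergekit-yaml is the CLI installed with the package; --cuda uses the GPU if available.
subprocess.run(["mergekit-yaml", "merge-config.yml", "./merged-model", "--cuda"], check=True)
```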
PowerInfer
PowerInfer, a high-speed inference engine for Large Language Models, leverages consumer-grade GPUs for enhanced performance. Utilizing activation locality and a hybrid CPU/GPU model, it optimizes resource demands while maintaining efficiency. PowerInfer offers up to 11 times faster performance than llama.cpp, generating an average of 13.20 tokens per second, with peaks of 29.08 tokens per second, nearly matching professional servers. This architecture incorporates adaptive predictors and sparse operators, facilitating integration, backward compatibility, and efficient deployment on models like Falcon-40B and Bamboo-7B.
insanely-fast-whisper-api
This open-source API offers rapid audio transcription built on OpenAI's Whisper Large v3. The project combines Transformers and flash-attn for speed and deploys readily on any GPU cloud provider. With built-in speaker diarization and admin-authenticated access, it provides task management through cancellation and status endpoints. Optimized for concurrency, it supports asynchronous tasks and webhooks, can be deployed via Docker to Fly.io or other VM environments, and is also available as a fully managed, cost-efficient API through JigsawStack.
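A hedged client-side illustration follows; the deployment URL, endpoint path, auth header, and payload fields are assumptions made for the example, so consult the project README for the actual API contract.

```python
# Illustrative client call only: endpoint, header name, and payload shape are assumed.
import requests

API_URL = "https://your-deployment.example.com/"      # placeholder deployment URL
headers = {"x-admin-api-key": "YOUR_ADMIN_KEY"}       # assumed admin auth header

payload = {
    "url": "https://example.com/sample-audio.mp3",    # audio file to transcribe
    "task": "transcribe",
    "diarise_audio": False,                           # assumed field name for diarization
}
response = requests.post(API_URL, json=payload, headers=headers, timeout=600)
print(response.json())
```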
RecBole
RecBole provides a diverse collection of 91 recommendation algorithms within a flexible Python and PyTorch framework. This tool supports research across general, sequential, context-aware, and knowledge-based recommendations, alongside 43 benchmark datasets. Version 2.0 enhances functionality with 8 packages targeting advanced topics like debiasing and fairness, optimizing for multi-GPU and mixed precision training. With robust documentation and evaluation protocols, RecBole serves as a valuable asset for both novice and experienced researchers in recommendation systems.
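A minimal quick-start sketch, assuming the `run_recbole` helper and the bundled `ml-100k` benchmark from the project documentation.

```python
from recbole.quick_start import run_recbole

# Train and evaluate a BPR baseline on MovieLens-100k with default settings;
# use_gpu enables single-GPU training when CUDA is available.
run_recbole(model="BPR", dataset="ml-100k", config_dict={"use_gpu": True})
```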
stable-diffusion-nvidia-docker
The project facilitates Stable Diffusion deployment using Docker, allowing GPU-based image generation without the need for coding skills. Features include a UI built with Gradio, support for the Stable Diffusion 2.0 model, and functionalities like img2img and image inpainting. Its Data Parallel approach enables multi-GPU support, optimizing inference speed for art and design tasks with straightforward installation for Ubuntu and Windows users.
blackjax
BlackJAX is a sampling library for JAX that runs efficiently on both CPUs and GPUs. It gives developers and researchers modular, composable algorithms for probabilistic programming and exposes reusable building blocks that support Bayesian inference research, filling the need for a standalone sampling library within the JAX ecosystem.
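A short NUTS sampling sketch on a toy log-density; the step size and inverse mass matrix are fixed by hand here rather than tuned with BlackJAX's window adaptation.

```python
import jax
import jax.numpy as jnp
import blackjax

# Log-density of a standard 2D Gaussian.
def logdensity(x):
    return -0.5 * jnp.sum(x ** 2)

nuts = blackjax.nuts(logdensity, step_size=0.5, inverse_mass_matrix=jnp.ones(2))
state = nuts.init(jnp.zeros(2))

rng_key = jax.random.PRNGKey(0)
for _ in range(100):
    rng_key, sample_key = jax.random.split(rng_key)
    state, info = nuts.step(sample_key, state)   # one NUTS transition per call

print(state.position)
```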
femtoGPT
femtoGPT is a minimal Rust implementation of GPT that supports both inference and training on CPUs and GPUs via OpenCL, avoiding heavyweight CUDA installations. The open-source project serves as an accessible research and learning platform: it installs with the standard Rust toolchain and invites AI enthusiasts to learn hands-on, with developer insights to follow in a planned book.
vidur
Vidur supports LLM deployment planning through simulation, requiring little GPU time. It covers a range of models and lets users test new ideas, scheduling algorithms, and performance under various workloads. Features such as pipeline parallelism support and detailed performance tracing make it a valuable tool for improving system deployments.
lerf
LERF embeds language features into radiance fields and is straightforward to install. It offers scaled configurations, including 'lerf-lite' for low-memory GPUs and 'lerf-big' for high-performance systems. Users can query scenes through relevancy maps driven by text prompts and swap in different image encoders for flexible visualization. Support for a range of GPUs ensures broad usability.
lectures
This series provides a thorough exploration of CUDA, PyTorch, and parallel processing techniques through lectures led by industry professionals. Topics include profiling CUDA kernels, optimizing PyTorch, and understanding advanced GPU programming concepts. The collection includes practical notebooks and slides, offering valuable resources for both beginners and experienced developers to enhance their skills and optimize GPU-driven applications.
carefree-creator
Discover this AI-powered backend designed for versatile content creation, integrating various Stable Diffusion versions for broad use. Based on 'carefree-learn' and compatible with Python 3.8+ and PyTorch 1.12.0+, it ensures efficient GPU usage through lazy and partial loading. Deploy easily with 'cfcreator serve', optimizing resources with '--limit' and '--lazy' options. Docker setup is supported, with a tailored build process for China-based users. Suitable for creators needing adaptable and efficient digital content production solutions.
ipex-llm
Explore a library designed for accelerating LLMs on Intel CPUs, GPUs, and NPUs. Seamlessly integrating with frameworks such as transformers and vLLM, it optimizes over 70 models for better performance. Latest updates feature GraphRAG support on GPUs and comprehensive multimodal capabilities like StableDiffusion. With low-bit optimizations, it enhances processing efficiency on Intel hardware for large models. Discover new LLM finetuning and pipeline parallel inference advancements with ipex-llm.
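A brief sketch of the transformers-style loader with 4-bit weights; the model id is a placeholder, and running it assumes the ipex-llm package plus an Intel CPU or GPU runtime.

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"   # placeholder; any supported HF model
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# model = model.to("xpu")  # optionally move to an Intel GPU when the XPU runtime is installed

inputs = tokenizer("What is low-bit quantization?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```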
Finetune_LLMs
The project provides an in-depth guide to fine-tuning Large Language Models (LLMs) on a famous-quotes dataset, with support for methods such as DeepSpeed, LoRA, and QLoRA. It includes a comprehensive Docker walkthrough for setting up nvidia-docker GPU acceleration on Linux systems with modern NVIDIA GPUs. The repository offers both updated and legacy code for users at different familiarity levels, and professional assistance is available if needed.
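As a generic illustration of the LoRA approach (not the repository's own scripts), the snippet below attaches low-rank adapters to a small causal LM with Hugging Face peft; the base model and hyperparameters are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model

lora_config = LoraConfig(
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,       # scaling factor applied to the update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```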
taichi
Taichi Lang is an open-source, imperative programming language that excels in high-performance numerical computation. Integrated with Python, it leverages just-in-time (JIT) compiler frameworks such as LLVM to optimize CPU and GPU tasks. Its versatility spans real-time simulations, AI, and visual effects, offering flexible data structuring with features like SNode. The `@ti.kernel` decorator boosts performance by compiling Python functions for parallel execution, making it adaptable to various computing backends.
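A minimal Taichi kernel showing the `@ti.kernel` decorator and the automatically parallelized outermost loop; `ti.init(arch=ti.gpu)` picks a GPU backend when one is available and otherwise falls back to the CPU.

```python
import taichi as ti

ti.init(arch=ti.gpu)          # GPU backend when present, CPU fallback otherwise

n = 1_000_000
x = ti.field(dtype=ti.f32, shape=n)

@ti.kernel
def fill_and_sum() -> ti.f32:
    total = 0.0
    for i in x:               # the outermost struct-for is parallelized by Taichi
        x[i] = i * 0.001
        total += x[i] * x[i]  # reduction into a scalar becomes an atomic add
    return total

print(fill_and_sum())
```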
TransformerEngine
Transformer Engine uses FP8 precision to accelerate Transformer models on NVIDIA Hopper GPUs, facilitating enhanced memory efficiency during training and inference. It includes optimized modules and a mixed-precision API for integration with deep learning frameworks, supporting architectures like BERT, GPT, and T5. With accessible Python and C++ APIs, Transformer Engine enables mixed-precision training, offering speed improvements with minimal accuracy changes. Compatible with major LLM libraries and supporting various GPU architectures, it is a versatile tool for NLP projects.
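A condensed PyTorch sketch in the spirit of the project's quick-start examples, assuming a GPU with FP8 support (Hopper or newer) and the `transformer_engine` package; the shapes are arbitrary.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A drop-in replacement for torch.nn.Linear with FP8-aware kernels.
layer = te.Linear(1024, 1024, bias=True).cuda()

# Delayed-scaling recipe controlling how FP8 scaling factors are updated.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

x = torch.randn(16, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)            # forward pass runs FP8 GEMMs where supported

out.sum().backward()          # mixed-precision backward pass
```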
accelerated-scan
The Accelerated Scan project implements efficient GPU-based forward and backward associative scans, improving the processing of first-order recurrences, particularly in state space models and linear RNNs. It utilizes a C++ CUDA kernel for chunked processing and takes advantage of advanced GPU communication techniques like warp shuffling and shared memory use. Implementations are available in both CUDA and Triton, ensuring faster performance with maintained numerical accuracy. Benchmarks highlight notable improvements over conventional methods, making it a suitable option for developers requiring dependable associative scanning capabilities.
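For intuition, here is a conceptual PyTorch reference (not the project's CUDA/Triton kernels) of the first-order recurrence the library accelerates, together with the associative combine that makes a parallel chunked scan possible.

```python
import torch

def sequential_scan(gates: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Naive O(T) loop over the sequence dimension: h[t] = gates[t] * h[t-1] + tokens[t]."""
    h = torch.zeros_like(tokens[..., 0])
    outputs = []
    for t in range(tokens.shape[-1]):
        h = gates[..., t] * h + tokens[..., t]
        outputs.append(h)
    return torch.stack(outputs, dim=-1)

# Associative combine used by parallel scans: two recurrence segments
# (a1, b1) and (a2, b2) compose into (a2 * a1, a2 * b1 + b2).
def combine(a1, b1, a2, b2):
    return a2 * a1, a2 * b1 + b2

gates = torch.rand(2, 4, 8)     # (batch, dim, seqlen)
tokens = torch.randn(2, 4, 8)
print(sequential_scan(gates, tokens).shape)
```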
Feedback Email: [email protected]