# GPU acceleration

video-subtitle-extractor
Video-subtitle-extractor extracts embedded (hardcoded) subtitles from videos into separate SRT files. It includes keyframe extraction, subtitle-area localization, and text recognition, and it can filter out non-subtitle regions and remove watermarks. Batch extraction is supported in 87 languages, with three modes available: Fast, Auto, and Precise. OCR runs entirely locally, with no online APIs, preserving privacy while GPU acceleration improves performance. Compatible with Windows, macOS, and Linux, the tool offers both GUI and CLI interfaces.
rtp-llm
Created by Alibaba's Foundation Model Inference Team, the rtp-llm inference engine delivers high-performance acceleration of large language models across Alibaba platforms such as Taobao and Tmall. It features optimized CUDA kernels and broad hardware support, including AMD ROCm and Intel CPUs, and integrates seamlessly with HuggingFace models. The engine supports multi-machine, multi-GPU parallelism and adds features like contextual prefix caching and speculative decoding, improving deployment efficiency on Linux with NVIDIA GPUs. Its reliability is proven by broad production use across Alibaba's AI projects.
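As a rough sketch of serving a HuggingFace checkpoint, the snippet below follows the shape of the project's quick-start; the module paths, the `ModelFactory.from_huggingface` call, and the model id are assumptions that may differ between rtp-llm versions, so verify against the current README.

```python
# Sketch of rtp-llm's Python quick-start (assumed API; confirm
# module paths and signatures against the current README).
from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory

# Model id is illustrative; any supported HuggingFace checkpoint works.
model = ModelFactory.from_huggingface("Qwen/Qwen-1_8B-Chat")
pipeline = Pipeline(model, model.tokenizer)

# Generation streams partial results as they are produced.
for output in pipeline(["What is rtp-llm?"], max_new_tokens=64):
    print(output.batch_response)
pipeline.stop()
```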
Omega-AI
Omega-AI is a robust deep learning framework built in Java that makes neural network setup and model training straightforward. It supports GPU acceleration and a diverse range of models, including CNN, RNN, VGG16, ResNet, YOLO, LSTM, Transformer, and GPT2, with multi-threaded performance optimized for CUDA and cuDNN. The framework is a natural fit for Java developers and includes comprehensive guides for GPU configuration. Connect with the community for insights and contributions, and visit Omega-AI's repositories on Gitee and GitHub for more information.
distrifuser
DistriFusion parallelizes diffusion model inference across multiple GPUs, delivering significant speedups in high-resolution image synthesis while preserving image quality. It splits each image into patches and hides the resulting communication cost behind computation through asynchronous strategies. The work was highlighted at CVPR 2024 and is integrated with ColossalAI.
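A minimal sketch of multi-GPU generation with DistriFusion's SDXL pipeline, based on the project's published usage; the prompt and file names are illustrative, and the script is assumed to be launched with `torchrun` so that each GPU gets one process.

```python
# Launch with: torchrun --nproc_per_node=2 generate.py
import torch
from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

# warmup_steps trades a few synchronous steps for later asynchrony.
distri_config = DistriConfig(height=1024, width=1024, warmup_steps=4)
pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    use_safetensors=True,
)
image = pipeline(
    prompt="an astronaut riding a horse on the moon",  # illustrative
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
if distri_config.rank == 0:  # only one process writes the result
    image.save("astronaut.png")
```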
kompute
Kompute is a versatile framework for general-purpose GPU compute that works across graphics cards from any vendor. Built on Vulkan for asynchronous and parallel processing, it serves machine learning, mobile, and game development applications. Backed by the Linux Foundation, it offers a Python module and a C++ SDK, with comprehensive documentation and community support.
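A sketch of the Python module in action, multiplying two vectors on the GPU; exact signatures vary between Kompute versions, and the SPIR-V file is assumed to be compiled offline (e.g. with glslangValidator) from a small GLSL compute shader.

```python
import numpy as np
import kp

mgr = kp.Manager()  # selects a Vulkan-capable device

tensor_a = mgr.tensor(np.array([2.0, 4.0, 6.0], dtype=np.float32))
tensor_b = mgr.tensor(np.array([1.0, 2.0, 3.0], dtype=np.float32))
tensor_out = mgr.tensor(np.zeros(3, dtype=np.float32))
params = [tensor_a, tensor_b, tensor_out]

# Precompiled SPIR-V for an element-wise multiply shader; the file
# name is illustrative and the shader is compiled offline.
spirv = open("multiply.comp.spv", "rb").read()
algo = mgr.algorithm(params, spirv)

(mgr.sequence()
    .record(kp.OpTensorSyncDevice(params))  # host -> GPU
    .record(kp.OpAlgoDispatch(algo))        # run the shader
    .record(kp.OpTensorSyncLocal(params))   # GPU -> host
    .eval())

print(tensor_out.data())  # expected: [2., 8., 18.]
```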
LARS
LARS is a locally executable application for Large Language Models (LLMs) that offers advanced citation features to enhance response accuracy. Utilizing Retrieval Augmented Generation (RAG) technology, it reduces AI inaccuracies by basing responses on user-uploaded documents, including detailed citations such as document names and page numbers. Supporting formats like PDFs and Word files, LARS provides a built-in document reader and customizable settings, making it ideal for a wide range of tasks.
lms
The 'lms' command line tool integrates seamlessly with LM Studio, enabling users to manage local API servers and load models with GPU acceleration. Supporting Linux, macOS, and Windows, it offers model listing in JSON format, project creation, model status checks, and log streaming, making it a staple for developers who manage models with LM Studio.
ai00_server
AI00 RWKV Server is an inference API server for the RWKV language model that runs on Vulkan GPUs, eliminating the need for PyTorch or CUDA. Its compact design supports AMD and integrated graphics, and its API is compatible with the OpenAI ChatGPT API, suiting applications like chatbots, text generation, translation, and Q&A. Open-source under the MIT license, it offers a streamlined LLM API experience.
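Because the server exposes an OpenAI-compatible chat endpoint, any HTTP client can drive it. In the sketch below the host, port, route, and model name are all assumptions; check your ai00_server configuration for the actual values.

```python
import requests

# Host, port, path, and model name are assumptions; adjust them to
# match your ai00_server configuration.
url = "http://localhost:65530/v1/chat/completions"
payload = {
    "model": "rwkv",  # illustrative model name
    "messages": [
        {"role": "user", "content": "Translate 'good morning' into French."}
    ],
    "max_tokens": 64,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```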
NVTabular
NVTabular is a library for feature engineering and preprocessing of large-scale tabular datasets, optimized for recommender systems. It handles terabyte-scale data efficiently with GPU acceleration via RAPIDS Dask-cuDF. As part of NVIDIA's Merlin framework, it integrates with the other Merlin tools to streamline model training and deployment. By processing datasets far larger than available memory, NVTabular removes the usual bottlenecks of huge datasets and complex preprocessing, enabling quick experimentation and smooth, resource-efficient data pipelines.
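A minimal sketch of the workflow API, assuming illustrative column names and parquet paths: operators are chained onto column selections, then fitted and applied out-of-core over the dataset.

```python
import nvtabular as nvt
from nvtabular import ops

# Column names and paths are illustrative.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price", "age"] >> ops.FillMissing() >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

# nvt.Dataset reads data in GPU-sized chunks, so the dataset can be
# far larger than device or host memory.
dataset = nvt.Dataset("data/*.parquet")
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("processed/")
```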
DirectML
DirectML is a hardware-accelerated DirectX 12 library optimized for machine learning tasks on GPUs from AMD, Intel, NVIDIA, and Qualcomm. It integrates with Direct3D 12, minimizing latency and maximizing performance across platforms. Available on Windows 10 and Windows Subsystem for Linux, and as a standalone package, DirectML supports frameworks such as Windows ML and ONNX Runtime, facilitating model training and inference for PyTorch and TensorFlow applications.
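For PyTorch workloads, the companion torch-directml package surfaces DirectML as a regular device object. A minimal sketch, assuming torch-directml has been installed via pip:

```python
import torch
import torch_directml

dml = torch_directml.device()  # default DirectML adapter

# Tensors placed on the DirectML device run on any supported GPU
# (AMD, Intel, NVIDIA, or Qualcomm) with no CUDA required.
x = torch.randn(1024, 1024, device=dml)
y = torch.randn(1024, 1024, device=dml)
z = x @ y  # matrix multiply dispatched through DirectML
print(z.shape, z.device)
```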
onnxruntime
ONNX Runtime accelerates machine learning inference and training across platforms. It supports models exported from frameworks like PyTorch and TensorFlow, as well as classical libraries such as scikit-learn and XGBoost, with a focus on hardware-specific optimization. Using multi-node NVIDIA GPUs, it notably reduces training time with minimal changes to existing PyTorch scripts. Compatible with a wide range of operating systems, ONNX Runtime improves performance while cutting costs.
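A minimal inference sketch; the model path and input shape are illustrative, and the provider list is tried in priority order, so the session falls back to CPU when no CUDA device is available.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order: use CUDA if available, else CPU.
session = ort.InferenceSession(
    "model.onnx",  # illustrative path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = session.run(None, {input_name: x})  # None = fetch all outputs
print([o.shape for o in outputs])
```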
dfdx
Explore the capabilities of a Rust-based deep learning library with GPU acceleration and compile-time checks on tensor operations. Featuring neural network components like Linear and Conv2D, it includes standard optimizers such as Adam and RMSprop. Currently in pre-alpha, it offers compile-time shape validation and type checking, catching whole classes of bugs before the code runs. Integrate it into your Rust project and consult the comprehensive documentation to get the most out of it.
nanodl
Explore a Jax-based library that streamlines the creation and training of transformer models, minimizing the complexity often found in model development. It includes customizable components for various AI tasks and offers distributed training support, along with intuitive dataloaders and distinct layers for optimized model building. Suitable for AI professionals developing smaller yet capable models, with community support available on Discord.
ThunderKittens
ThunderKittens streamlines writing high-performance deep learning kernels in CUDA, with MPS and ROCm support planned. It focuses on simplicity, extensibility, and performance, built around the tile-based primitives that modern GPU architectures handle best. Key features include tensor core optimization, asynchronous copy techniques that reduce latency, and distributed shared memory for efficient inter-block communication. Requiring CUDA 12.3+ and C++20, ThunderKittens is powerful yet straightforward to adopt, offering pre-built PyTorch kernels and an active developer community.
react-native-fast-tflite
react-native-fast-tflite brings efficient TensorFlow Lite inference to React Native through JSI and zero-copy ArrayBuffers. It supports GPU-accelerated delegates such as CoreML and Metal and allows models to be swapped dynamically at runtime. Integration with VisionCamera enables advanced imaging use cases, optimizing AI model deployment across iOS and Android. Under the hood it uses the low-level C/C++ TensorFlow Lite API for direct memory access, speeding up model execution.
DALI
NVIDIA's DALI library improves deep learning workflows by moving data loading and preprocessing tasks from CPU to GPU, thus overcoming CPU bottlenecks. It enhances performance for complex tasks like image classification and object detection. With compatibility across popular frameworks such as TensorFlow, PyTorch, and PaddlePaddle, DALI ensures smooth integration and application portability. It supports a wide range of data formats and offers multi-GPU scalability features, making it suitable for research and production. Additionally, DALI integrates with NVIDIA Triton Inference Server, facilitating efficient deployment of optimized inference models.
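A minimal pipeline sketch with an illustrative data directory: the "mixed" decoder splits JPEG decoding between CPU and GPU, and each run() call returns a batch that is already resident on the GPU.

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def image_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")  # CPU parse, GPU decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = image_pipeline("/data/train")  # illustrative path
pipe.build()
images, labels = pipe.run()  # one GPU-resident batch per call
```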
stable-diffusion-docker
The project runs Stable Diffusion in Docker containers with GPU acceleration, simplifying text-to-image and image-to-image generation using models from Huggingface. It requires a CUDA-capable GPU with 8 GB+ VRAM and supports depth-guided diffusion, inpainting, and upscaling. A Huggingface user token is required for model access, and the pipeline is managed through an intuitive wrapper script. Configurable options let it run on both high-end and more modest systems, making resource-efficient image rendering practical for developers and artists.
InternEvo
InternEvo is an open-source, lightweight framework for model pre-training with minimal dependencies. It supports both large-scale GPU cluster training and single-GPU fine-tuning, achieving nearly 90% scaling efficiency on 1,024 GPUs. Models trained with it are released regularly, including the InternLM series, which surpasses many notable open-source LLMs. Installation is straightforward, with support for torch, torch-scatter, and flash-attention to accelerate training. Comprehensive tutorials and tooling support efficient model development, and community contributions are welcome.
gpytorch
GPyTorch is a scalable and flexible library for Gaussian processes built on PyTorch. It performs efficient inference through modern numerical linear algebra and supports GPU acceleration, making it a natural fit alongside deep learning frameworks. The library incorporates research advances such as SKI/KISS-GP and stochastic variational deep kernel learning. Python 3.8 and higher are supported, with easy installation via pip or conda. Developed by an active community of contributors, it delivers strong Gaussian process solutions for diverse applications.
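A minimal exact-GP regression sketch with synthetic data; since GPyTorch models are ordinary PyTorch modules, GPU acceleration amounts to placing the data and modules on a CUDA device.

```python
import math
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        # A GP prior: mean function plus covariance (kernel) function.
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic training data: a noisy sine wave.
train_x = torch.linspace(0, 1, 100, device=device)
train_y = torch.sin(train_x * 2 * math.pi) + 0.1 * torch.randn(100, device=device)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(device)
model = ExactGPModel(train_x, train_y, likelihood).to(device)

# Posterior prediction (hyperparameters left untrained for brevity).
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51, device=device)
    preds = likelihood(model(test_x))
print(preds.mean.shape, preds.variance.shape)
```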
Deep-Live-Cam
Deep-Live-Cam creates real-time face swaps and video deepfakes from just a single source image. The tool aids character animation and clothing modeling, with built-in ethical safeguards that require consent and block inappropriate content. Available both as a pre-built package and as a manual installation, it supports GPU acceleration and offers dynamic face mapping and resizable previews for a range of creative projects.