# Distributed Training

## ray
Ray provides a streamlined way to scale AI and Python applications from a single machine to a cluster without extra infrastructure. It combines a core distributed runtime with a set of scalable AI libraries for training, tuning, and serving, while the Ray Dashboard and Distributed Debugger aid in monitoring and troubleshooting. Ray runs on major cloud providers and Kubernetes and is backed by an active community ecosystem. Installation is a single `pip install ray` away.
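As a quick orientation, the snippet below is a minimal sketch of Ray's task API, which is what the scaling story rests on; the `square` workload is purely illustrative, and joining an existing cluster instead of starting a local one is noted as an assumption in the comments.

```python
# Minimal sketch of parallelizing a plain Python function with Ray tasks.
# Assumes `pip install ray`; the workload here is illustrative only.
import ray

ray.init()  # starts a local Ray runtime; pass address="auto" to join an existing cluster

@ray.remote
def square(x: int) -> int:
    """CPU-bound work that Ray schedules across available workers."""
    return x * x

# Launch tasks in parallel and gather the results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```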
## awesome-huge-models
A curated survey of the large-model landscape after GPT-4, where much of the work now happens in open GitHub collaborations rather than in closed research labs. The list focuses on open-source Large Language Models (LLMs) and links to training and inference code and datasets where available. It tracks LLM advances across language, vision, speech, and science, alongside surveys and major model releases, making it a useful reference on modern model architectures and licenses.
## mpi-operator
The MPI Operator facilitates distributed training on Kubernetes by simplifying configuration and deployment. It allows for efficient resource management and scalability in machine learning tasks, supporting diverse MPI implementations such as Intel MPI and MPICH. Key features include job monitoring and logging, enhancing manageability in high-performance computing applications. This setup is optimized for environments demanding efficient orchestration and resource usage.
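For a sense of what an MPIJob looks like in practice, here is a hedged sketch that submits one through the generic Kubernetes Python client. The image, command, replica counts, and the `kubeflow.org/v2beta1` API version are assumptions and should be checked against the operator version actually installed.

```python
# Hedged sketch: create an MPIJob custom resource with the Kubernetes Python client.
# The container image, command, and replica counts are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

mpijob = {
    "apiVersion": "kubeflow.org/v2beta1",  # assumption: recent MPI Operator CRD version
    "kind": "MPIJob",
    "metadata": {"name": "demo-mpijob", "namespace": "default"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "mpi-launcher",
                    "image": "example.com/my-mpi-image:latest",  # placeholder image
                    "command": ["mpirun", "-np", "2", "python", "/app/train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [{
                    "name": "mpi-worker",
                    "image": "example.com/my-mpi-image:latest",  # placeholder image
                }]}},
            },
        },
    },
}

# Submit the custom resource; the operator then creates the launcher and worker pods.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v2beta1",
    namespace="default", plural="mpijobs", body=mpijob,
)
```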
## composer
An open-source PyTorch library for scalable, flexible deep learning training. Composer simplifies distributed workflows on large clusters and supports models such as LLMs, CNNs, and transformers. Features include parallel data loading, memory-efficient training, and customizable training loops that automate routine boilerplate without sacrificing flexibility or performance. Aimed at users comfortable with Python and PyTorch, it integrates with streaming datasets and experiment-tracking tools for efficient training configurations.
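The sketch below shows a minimal Composer training run on synthetic data; exact class signatures (for example `ComposerClassifier`'s `num_classes` argument) can vary between Composer releases, and the toy linear model stands in for a real workload.

```python
# Minimal sketch of a Composer training run, assuming `pip install mosaicml`.
# Synthetic data keeps the example self-contained; swap in a real dataloader in practice.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Synthetic 10-class classification data.
X = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32)

# Wrap a plain torch module so Composer can attach its default loss and metrics.
model = ComposerClassifier(torch.nn.Linear(32, 10), num_classes=10)

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration="2ep",  # Composer's time-string format: train for 2 epochs
)
trainer.fit()
```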
## relora
ReLoRA is a pretraining method that improves training efficiency by replacing full-rank weight updates with repeated low-rank updates. The implementation exposes configurable reset frequency and optimizer state management, making it suitable for large-scale models, and batch sizes and learning rates are fully customizable. It supports distributed training via PyTorch DDP and emphasizes reproducibility when continuing from pre-trained models.
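To make the idea concrete, the sketch below illustrates low-rank updates with a periodic merge-and-reset in plain PyTorch. It is not the repository's implementation; the rank, the reset schedule, and the naive optimizer re-creation (ReLoRA itself manages optimizer state and learning-rate restarts more carefully) are assumptions for illustration only.

```python
# Conceptual illustration of low-rank updates with periodic merge-and-reset,
# in the spirit of ReLoRA. Not the repository's code; names and schedule are assumed.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """A frozen dense layer plus a trainable low-rank delta (B @ A)."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)        # frozen full-rank weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ (self.B @ self.A).T

    @torch.no_grad()
    def merge_and_reinit(self):
        """Fold the low-rank delta into the frozen weight, then restart it."""
        self.base.weight += self.B @ self.A
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)


# Toy loop: merge and reset the low-rank factors every `reset_every` steps.
layer = LowRankLinear(64, 64, rank=4)
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
reset_every = 100
for step in range(300):
    x = torch.randn(16, 64)
    loss = layer(x).pow(2).mean()   # stand-in objective
    loss.backward()
    opt.step()
    opt.zero_grad()
    if (step + 1) % reset_every == 0:
        layer.merge_and_reinit()
        # Simplification: recreate the optimizer to clear its state after a merge.
        opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
```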
## TonY
TonY is a framework for running deep learning jobs on Apache Hadoop, supporting TensorFlow, PyTorch, MXNet, and Horovod. It handles both single-node and distributed training and manages machine learning tasks reliably and flexibly, with GPU isolation supported on Hadoop 2.6.0 onward given the required configurations. Jobs can ship their dependencies as zipped Python environments or as Docker containers, so TonY integrates into diverse Hadoop ecosystems with a straightforward setup.
## FedML
FedML is a unified, scalable machine learning library for running AI workloads across decentralized GPUs, multi-cloud environments, and edge servers. Built on TensorOpera AI, it streamlines model training, deployment, and federated learning; TensorOpera Launch simplifies environment management by matching AI jobs to cost-effective GPU resources. FedML covers use cases such as on-device training and cross-cloud deployment, and provides MLOps capabilities through TensorOpera Studio and the Job Store, along with serverless deployment and vector-database search at a range of scales.
## lite-transformer
Lite Transformer applies Long-Short Range Attention to make transformer models for NLP more efficient while reducing computational demands. Implemented in Python and PyTorch and optimized for training on NVIDIA GPUs, it ships pretrained models for datasets such as WMT'14 En-Fr and WMT'16 En-De. The repository includes guides for installation, data preparation, and training, covering both local and distributed setups, and is aimed at researchers and developers who want to train or evaluate models with efficient attention mechanisms.
## training-operator
Kubeflow Training Operator offers a Kubernetes-based system for scalable, distributed training of machine learning models. Compatible with frameworks like PyTorch, TensorFlow, and XGBoost, it also supports HPC tasks through MPI. It simplifies model training via Kubernetes Custom Resources API and a Python SDK, aiding in efficient resource management. Explore integration and performance enhancement with comprehensive guides and community resources.
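As an illustration of the Custom Resources route, the sketch below creates a PyTorchJob with the generic Kubernetes Python client; the Kubeflow Training Python SDK offers a higher-level client for the same task. The image, command, and replica counts are placeholders, and the `kubeflow.org/v1` API version is an assumption to verify against the installed operator.

```python
# Hedged sketch: create a PyTorchJob custom resource with the Kubernetes Python client.
# The container image, command, and replica counts are placeholders.
from kubernetes import client, config

config.load_kube_config()

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",  # assumption: current PyTorchJob CRD version
    "kind": "PyTorchJob",
    "metadata": {"name": "demo-pytorchjob", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "example.com/my-training-image:latest",  # placeholder
                    "command": ["python", "/app/train.py"],
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": "example.com/my-training-image:latest",  # placeholder
                    "command": ["python", "/app/train.py"],
                }]}},
            },
        },
    },
}

# The operator watches for this resource and spins up master and worker pods.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="default", plural="pytorchjobs", body=pytorch_job,
)
```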
## efficient-dl-systems
This repository provides the comprehensive 2024 course materials for Efficient Deep Learning Systems taught at HSE University and Yandex School of Data Analysis. Topics covered include core GPU architecture, CUDA API, experiment management, distributed training, and Python web deployment. Detailed week-by-week content supports learning of both theoretical foundations and practical applications, emphasizing real-world examples and project-based studies.