HBox - Versatile Scheduling for Big Data and AI on Hadoop

HBox Project Introduction

HBox is a scheduling platform that brings big data and artificial intelligence workloads together and supports multiple machine learning and deep learning frameworks. The project was originally named XLearning and is now called HBox. Users who cloned the repository under the old name should update their remote URL to the new address.

Overview

HBox runs on Hadoop YARN and scales with the cluster. It supports machine learning and deep learning frameworks including TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost, among others. HBox also manages GPU resources, can run applications in Docker, and exposes a RESTful API for management. This suits data scientists and developers who want to run training workloads on an existing Hadoop cluster.

Architecture

The architecture of HBox consists of three main components:

Client: Responsible for starting and retrieving the status of applications.
ApplicationMaster (AM): Manages internal scheduling, the application lifecycle, input data distribution, and containers.
Container: Executes the application's processes, such as Worker or Parameter Server (PS), monitors progress, reports status, saves output, and may start a TensorBoard service for TensorFlow applications.
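
The division of labor among the three components can be sketched in a few lines of Python. This is an illustrative toy model, not HBox code: the class names mirror the components above, but the State values, the method names, and the rule that one failed container fails the whole application are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    # Hypothetical states; the real state names are not given here.
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class Container:
    """Runs one role (worker or ps) and records its own state."""
    role: str
    index: int
    state: State = State.RUNNING

    def run(self, task):
        # Execute the user task and keep the outcome for the AM to collect.
        try:
            task(self.role, self.index)
            self.state = State.SUCCEEDED
        except Exception:
            self.state = State.FAILED
        return self.state


@dataclass
class ApplicationMaster:
    """Creates containers and aggregates their states."""
    worker_num: int
    ps_num: int = 0
    containers: list = field(default_factory=list)

    def launch(self, task):
        self.containers = (
            [Container("worker", i) for i in range(self.worker_num)]
            + [Container("ps", i) for i in range(self.ps_num)]
        )
        # Containers run sequentially here; a real cluster runs them in parallel.
        for container in self.containers:
            container.run(task)

    def status(self):
        # Assumed rule: any failure fails the application.
        states = {c.state for c in self.containers}
        if State.FAILED in states:
            return State.FAILED
        if states == {State.SUCCEEDED}:
            return State.SUCCEEDED
        return State.RUNNING


class Client:
    """Starts an application and hands back the AM for status queries."""

    def submit(self, worker_num, ps_num, task):
        am = ApplicationMaster(worker_num, ps_num)
        am.launch(task)
        return am


if __name__ == "__main__":
    am = Client().submit(2, 1, lambda role, index: None)
    for c in am.containers:
        print(c.role, c.index, c.state.value)
    print("application:", am.status().value)
```

The Client only starts the job and reads status; all per-container bookkeeping lives in the ApplicationMaster, which matches the split of responsibilities listed above.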

Key Features

1. Support for Multiple Frameworks

HBox supports both distributed and standalone modes for various deep learning frameworks. Users can also customize a framework and run multiple versions of the same framework.

2. Unified Data Management on HDFS

Training data and model results are stored on HDFS or on supported S3 storage. HBox provides several strategies for reading input and writing output:

Download: Assigns HDFS files to workers, and each worker downloads its files to local disk.
Placeholder: Provides file lists to workers, who then read data directly from HDFS.
InputFormat and OutputFormat: Uses MapReduce InputFormat and OutputFormat classes for data input and output; users can specify which implementations to use.
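
Both the Download and Placeholder strategies start from the same step: deciding which input files belong to which worker. A minimal sketch of that step follows, assuming a simple round-robin split. The function name assign_files and the example paths are hypothetical, and the splitting rule HBox actually uses may differ.

```python
def assign_files(files, worker_num):
    """Split a list of HDFS paths into one list per worker, round-robin.

    Assumption: files are sorted first so every run produces the same split.
    """
    if worker_num < 1:
        raise ValueError("worker_num must be at least 1")
    assignments = [[] for _ in range(worker_num)]
    for i, path in enumerate(sorted(files)):
        assignments[i % worker_num].append(path)
    return assignments


if __name__ == "__main__":
    # Hypothetical input paths, for illustration only.
    paths = [
        "/tmp/data/part-00000",
        "/tmp/data/part-00001",
        "/tmp/data/part-00002",
    ]
    for worker, share in enumerate(assign_files(paths, 2)):
        print(f"worker {worker}: {share}")
```

Under Download, each worker would copy its share to local disk before training starts. Under Placeholder, the worker receives only the list and opens each path on HDFS itself.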

3. Visualization

The application interface provides:

Container List: Displays information like host, role, state, and progress.
TensorBoard Link: Direct access for real-time TensorFlow application monitoring.
Model Saving: Enables uploading of intermediate results to HDFS during execution.
Worker Metrics: Shows resource usage information for each worker.

4. Compatibility with Native Framework Code

HBox runs standalone-mode programs directly, without custom modifications, so existing projects are easy to move onto the platform.

Compilation and Deployment

HBox is built with Java (JDK 1.8+) and Maven (3.6.3+). The build produces a distributable package that includes job management scripts, support libraries, and common configuration scripts. Running it requires CentOS 7.2 or newer, Java 8 or newer, and Hadoop 2.6 through 3.2; GPU support requires Hadoop 3.1 or later.
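
The Hadoop version constraint can be checked before deployment with a short script. The sketch below is a hypothetical helper, not part of HBox. It assumes the supported range is Hadoop 2.6 through 3.2 and that GPU scheduling needs Hadoop 3.1 or later; parse_version and check_hadoop are names invented for this example.

```python
def parse_version(text):
    """Turn a string such as '3.1.4' into (3, 1) for numeric comparison.

    Expects at least 'major.minor'; anything after the minor part is ignored.
    """
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def check_hadoop(version, need_gpu=False):
    """Return True if the Hadoop version fits the assumed requirements."""
    v = parse_version(version)
    # Assumed supported range: 2.6 through 3.2 inclusive.
    if not (2, 6) <= v <= (3, 2):
        return False
    # Assumed GPU requirement: Hadoop 3.1 or later.
    if need_gpu and v < (3, 1):
        return False
    return True


if __name__ == "__main__":
    for version, gpu in [("2.7.3", False), ("2.7.3", True), ("3.1.0", True)]:
        print(version, "gpu" if gpu else "cpu", check_hadoop(version, gpu))
```

Comparing (major, minor) tuples avoids the usual string-comparison trap, where "2.10" would sort before "2.6".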

Quick Start Guide

Users submit applications with the hbox-submit command. The project provides an example TensorFlow setup that shows how to upload data and configure a job, including parameters such as worker memory, the number of workers, and the files the application needs.
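
As a rough illustration of what a submission looks like, the sketch below assembles an hbox-submit command line in Python. Every flag name in it (--app-type, --worker-num, --worker-memory, --files), the placement of the launch command at the end, and the script names demo.py and dataDeal.py are assumptions made for this example; check them against the project's documentation before use.

```python
import shlex


def build_submit_command(app_type, worker_num, worker_memory, files, launch_cmd):
    """Build an argv list for hbox-submit.

    All flag names are assumed, not confirmed against the real CLI.
    """
    argv = [
        "hbox-submit",
        "--app-type", app_type,            # assumed flag
        "--worker-num", str(worker_num),   # assumed flag
        "--worker-memory", worker_memory,  # assumed flag
        "--files", ",".join(files),        # assumed flag, comma-separated
    ]
    # Assumption: the program to run follows the options.
    argv += shlex.split(launch_cmd)
    return argv


if __name__ == "__main__":
    cmd = build_submit_command(
        app_type="tensorflow",
        worker_num=2,
        worker_memory="10G",
        files=["demo.py", "dataDeal.py"],  # hypothetical script names
        launch_cmd="python demo.py --data_path=./data",
    )
    # Print the command for review instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting problems once the flags have been verified.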

Support and Community

HBox welcomes contributions and improvements from the community. Documentation and contact information for the development team are available for users who need help or want to collaborate.

For further details on configuration, data management, and submission parameters, consult the detailed documentation in the project resources.