Introduction to TonY
TonY is a framework for natively running deep learning jobs on Apache Hadoop. It supports popular machine learning libraries, including TensorFlow, PyTorch, MXNet, and Horovod, and lets both single-node and distributed training run as first-class Hadoop applications. This native integration, together with its other features, aims to make machine learning jobs both reliable and flexible.
Compatibility
TonY is compatible with Hadoop 2.6.0 and above. If you need GPU isolation, you need Hadoop 2.10 or above (for Hadoop 2) or Hadoop 3.1.0 or above (for Hadoop 3).
Building TonY
TonY is built with Gradle. To build, run:
./gradlew build
To build without running tests, run:
./gradlew build -x test
After building, the TonY jar will be located in the ./tony-cli/build/libs/ directory.
Usage
TonY offers two main methods for launching deep learning jobs:
- Zipped Python Virtual Environment: This method does not require Docker support on the Hadoop cluster and avoids a dependency on a Docker registry. However, the zipped environment must be built on the same OS version as the cluster nodes.
- Docker Container: This method requires a Docker-enabled Hadoop cluster and a Docker image prepared with the necessary Python dependencies, such as TensorFlow or PyTorch.
Zipped Python Virtual Environment
With this setup, you prepare a zipped Python virtual environment and an XML configuration file (tony.xml). Here's a basic configuration example, followed by a sketch of preparing the zipped environment:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>
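One way to prepare the zipped environment is sketched below. This is only an illustration: the environment name and package list are examples, and the environment must be built on a machine running the same OS version as the cluster nodes.
# Build the virtual environment on a machine with the same OS as the cluster nodes
python3 -m venv my-venv
# Install the training dependencies (TensorFlow here is just an example)
my-venv/bin/pip install tensorflow
# Zip the environment so TonY can ship it to the cluster
zip -r my-venv.zip my-venv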
Then launch the job from the Java command line, passing components such as the zipped Python environment, the Python binary path, and the path to your training script.
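For illustration, a launch command looks roughly like the following. The jar version, paths, and flag names here are examples drawn from typical TonY usage and should be checked against the documentation for your TonY version:
java -cp "`hadoop classpath`:/path/to/tony-cli-x.x.x-all.jar" \
    com.linkedin.tony.cli.ClusterSubmitter \
    -executes src/mnist_distributed.py \
    -src_dir src \
    -python_venv my-venv.zip \
    -python_binary_path my-venv/bin/python \
    -conf_file /path/to/tony.xml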
Docker Container
In this configuration, you need a Docker image with the necessary dependencies. Configuration again uses a tony.xml, similar to the following:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
  <property>
    <name>tony.docker.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tony.docker.containers.image</name>
    <value>YOUR_DOCKER_IMAGE_NAME</value>
  </property>
</configuration>
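With the Docker image specified in tony.xml, the launch command is similar, except no zipped environment is shipped and the Python binary path points inside the image. As above, the jar version, paths, and flag names are illustrative rather than definitive:
java -cp "`hadoop classpath`:/path/to/tony-cli-x.x.x-all.jar" \
    com.linkedin.tony.cli.ClusterSubmitter \
    -executes src/mnist_distributed.py \
    -src_dir src \
    -python_binary_path /usr/bin/python3 \
    -conf_file /path/to/tony.xml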
TonY Arguments
TonY provides several command-line arguments for configuring training jobs, such as the script entry point, source directories, Python environments, and more.
TonY Configurations
Configurations for TonY jobs can be set in an XML file or passed directly on the command line, and command-line settings can override those in the file.
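For example, assuming the launcher accepts per-key overrides via a -conf flag (an assumption of this sketch; verify the exact flag name in the TonY documentation), individual settings from tony.xml could be overridden at submission time:
java -cp "`hadoop classpath`:/path/to/tony-cli-x.x.x-all.jar" \
    com.linkedin.tony.cli.ClusterSubmitter \
    -executes src/mnist_distributed.py \
    -conf_file /path/to/tony.xml \
    -conf tony.worker.instances=2 \
    -conf tony.worker.memory=8g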
Examples and Resources
Examples of distributed deep learning tasks with TensorFlow, PyTorch, and other frameworks are available. Additional resources such as presentations and papers provide deeper insights into TonY's capabilities and applications.
TonY is an open-source project aimed at leveraging Hadoop's potential for deep learning, providing flexibility and simplifying the integration of machine learning workloads into existing cluster and cloud-based infrastructure.