Bits & Bytes online Edition




Data-Analytics on MPCDF HPC systems

Andreas Marek, Giuseppe Di Bernardo, John Alan Kennedy, Luka Stanisic

In recent years, the demand to use HPC systems for automatically processing and analyzing large amounts of scientific data has grown significantly. In many scientific domains there is rapidly growing interest in applying analysis methods from the field of machine learning, especially so-called deep learning with artificial neural networks (ANNs), an approach which is extremely powerful but not easily portable to HPC systems.

In order to address this trend and the future demands of our users, a new data-analytics team has been formed to support scientists with their data-intensive projects on the HPC systems of the MPCDF. Tightly linked with the well-established application and data groups of the MPCDF, the data-analytics team assists users in deploying their data-analysis projects in an efficient and scalable way on current and future HPC systems. The support ranges from providing optimized variants of software frameworks to the implementation and parallelization of algorithms and software.

Parallel data analysis

Apache Spark

Apache Spark is an open-source cluster-computing framework that supports big-data and machine-learning applications in a scalable way. It provides a generic data-analytics platform with many plugins and libraries, offers the possibility to parallelize many kinds of data-analysis software, and additionally supports machine-learning algorithms.

The MPCDF provides the ability to run Spark applications on the HPC systems Draco and Cobra via the Slurm batch system. For details and example submission scripts we refer the user to the previous Bits & Bytes article "Spark on Draco" or to the webpage about data-analytics software.
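To give an impression of what such an application can look like, the following is a minimal PySpark sketch that counts word occurrences; the file name word_count.py and the input path input.txt are placeholders chosen only for this illustration, and the script would typically be launched via spark-submit from within a Slurm job:

    # word_count.py -- minimal PySpark sketch; the input path is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-example").getOrCreate()

    # Read a plain-text file and split each line into words.
    lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Print the first few (word, count) pairs.
    for word, count in counts.take(10):
        print(word, count)

    spark.stop()
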

Machine-learning frameworks

The MPCDF provides support for several data-analytics and machine-learning frameworks on the HPC systems Draco and Cobra. In the following sections we introduce the currently supported software and provide basic information to get you started.

1. Scikit-learn

Scikit-learn is a Python-based package which provides numerous tools and algorithms for machine learning. Extensive documentation about the package itself can be found on the webpage http://scikit-learn.org. Scikit-learn is provided on the HPC systems Draco and Cobra and can be loaded via the commands

    $ module load anaconda/3/5.1
    $ module load scikit-learn/0.19.1
    

Please note that although our installation makes efficient use of a single node of an HPC system, scikit-learn does not natively support the parallelization of machine-learning tasks across multiple nodes. It is, however, well suited for prototyping and testing different machine-learning algorithms. We advise users to test such algorithms on the interactive Slurm partitions and not on the login nodes of the HPC systems.
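As a starting point, the following sketch trains and evaluates a random-forest classifier on one of scikit-learn's built-in toy datasets; the dataset and parameter choices are arbitrary and serve only as an illustration of single-node usage:

    # Minimal scikit-learn sketch: random forest on the built-in digits dataset.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # n_jobs=-1 uses all cores of the node, but never more than a single node.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
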

2. Tensorflow

Tensorflow is the de facto standard framework for machine-learning and especially deep-learning applications. On the HPC systems Draco and Cobra we provide a highly optimized version of Tensorflow. It is possible to use Tensorflow directly via the Python Tensorflow API (documentation is available at https://www.tensorflow.org) or as a backend of the higher-level Keras Python package (documentation is available at https://keras.io/). The Tensorflow installation at the MPCDF provides functionality for training and/or inference of a neural network on a single node of the HPC systems. The parallelization of the training over multiple nodes is supported via the Horovod extension to Tensorflow (see Section 3).
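As a minimal illustration of the Keras API with the Tensorflow backend (Keras 2.2 / Tensorflow 1.12), the following sketch trains a small fully-connected network on random dummy data; the layer sizes and the data are arbitrary and serve only as an example:

    # Minimal Keras sketch (Tensorflow backend); dummy data for illustration.
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Random training data: 1000 samples, 20 features, binary labels.
    x_train = np.random.random((1000, 20))
    y_train = np.random.randint(2, size=(1000, 1))

    model = Sequential([
        Dense(64, activation="relu", input_shape=(20,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=32)
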

2.1 Tensorflow/GPU

Tensorflow achieves its best performance when using GPUs, owing to the large number of matrix-matrix multiplications it performs internally. Thus, if the Tensorflow computations fit into the memory of a GPU (8 GB on Draco, 32 GB on Cobra), it is desirable to use the GPU version we provide.

To use the GPU version of Tensorflow 1.12 on Draco or Cobra, load the software as follows:

    # on DRACO
    $ module load gcc/6.3.0
    # on COBRA
    $ module load gcc/6
    
    $ module load cuda
    $ module load anaconda/3/5.1
    # for cuda/9.1 (DRACO)
    $ module load cudnn/7.0
    $ module load nccl/2.2.13
    # for cuda/10.0 (COBRA)
    $ module load cudnn/7.4
    $ module load nccl/2.3.7
    
    $ module load tensorflow/gpu/1.12.0
    

If Keras is required, the following extra command is needed:

    $ module load keras/2.2.4
    

For more details on how to use Tensorflow with GPUs on our HPC systems, please have a look at the data-analytics webpage, where more extensive documentation and examples are provided.
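A quick way to verify from within a job that Tensorflow actually sees a GPU is the following sketch, which uses the Tensorflow 1.x API to query GPU availability and to log the device placement of a small computation:

    # Check GPU visibility and device placement (Tensorflow 1.x API).
    import tensorflow as tf

    print("GPU available:", tf.test.is_gpu_available())

    # Log on which device each operation is placed.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
        print(sess.run(tf.matmul(a, b)))
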

2.2 Tensorflow/CPU

In case a Tensorflow application needs more memory than is available on the GPUs of the HPC systems, one can use the CPU version, which we also provide on Draco and Cobra. The CPU version allows applications to use all the memory of a node (between 96 GB and 768 GB, depending on the HPC system and the node). It should be pointed out, however, that the CPU version of Tensorflow runs significantly slower than the GPU version.

To use the CPU version of Tensorflow 1.12, load the software as follows:

    $ module load gcc/8
    $ module load anaconda/3/5.1
    $ module load tensorflow/cpu/1.12.0
    

If Keras is required, the following extra command is needed:

    $ module load keras/2.2.4
    
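With the CPU version it can be useful to control how many threads Tensorflow uses on a node. The following sketch shows the relevant options of the Tensorflow 1.x API; the thread counts are examples only and should be adapted to the cores requested in the Slurm job:

    # Controlling CPU thread usage (Tensorflow 1.x API); counts are examples.
    import tensorflow as tf

    config = tf.ConfigProto(
        intra_op_parallelism_threads=40,  # threads within a single operation
        inter_op_parallelism_threads=2,   # threads across independent operations
    )

    with tf.Session(config=config) as sess:
        a = tf.random_normal((2048, 2048))
        print(sess.run(tf.reduce_sum(tf.matmul(a, a))))
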

3. Tensorflow-Horovod

Horovod is an extension to Tensorflow which allows neural networks to be trained in parallel on multiple nodes of our HPC systems with the "data-parallelism" approach. A detailed description of how to use Horovod is beyond the scope of this article. However, we wish to mention here that to use Horovod one has to modify the plain Tensorflow (or Keras) Python code and adapt the Slurm submission scripts for parallel computations.
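To give an impression of the required code changes, the following sketch shows the typical Horovod modifications for a Keras model (Horovod with the Tensorflow 1.x / Keras API); the model, the random dummy data and the learning rate are arbitrary and serve only as an illustration, and the script would be started with one process per GPU via the Slurm submission script:

    # Minimal Horovod/Keras sketch; model and data are placeholders.
    import numpy as np
    import tensorflow as tf
    import keras
    from keras import backend as K
    import horovod.keras as hvd

    hvd.init()  # initialize Horovod (one process per GPU)

    # Pin each process to a single GPU of the node.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))

    # Random dummy data; a real application would shard its dataset across ranks.
    x = np.random.random((1000, 20))
    y = np.random.randint(2, size=(1000, 1))

    model = keras.models.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

    # Scale the learning rate with the number of processes and wrap the optimizer.
    opt = keras.optimizers.Adam(0.001 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss="binary_crossentropy")

    # Broadcast the initial weights from rank 0 so all processes start identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=32, epochs=5, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
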

Examples for both steps can be found at the data-analytics webpage. Once the modifications are done, the training of a neural network can be sped up significantly, as we show in Figure 3:

Figure 3: Strong scaling of the training of the VGG-16 network on several nodes of the HPC system Cobra.

4. Apache Spark

In addition to providing a generic framework for data analytics, Apache Spark also supports machine learning via native libraries (in the form of MLlib), and techniques exist to deploy distributed machine-learning and deep-learning solutions (including Tensorflow models) via Spark.
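As a minimal illustration of the native library, the following sketch trains a logistic-regression model with pyspark.ml on a small in-memory toy dataset; the data and parameters are arbitrary:

    # Minimal pyspark.ml sketch: logistic regression on toy data.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("mllib-example").getOrCreate()

    # Toy training data: (label, feature vector) pairs.
    train = spark.createDataFrame(
        [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
         (0.0, Vectors.dense([2.0, 1.0, -1.0])),
         (0.0, Vectors.dense([2.0, 1.3, 1.0])),
         (1.0, Vectors.dense([0.0, 1.2, -0.5]))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(train)
    model.transform(train).select("label", "prediction").show()

    spark.stop()
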