Bits & Bytes online Edition




Python on HPC systems

Klaus Reuter

Introduction

In addition to traditional HPC applications compiled from C/C++ and Fortran code, we are witnessing an increasing amount of Python-based code running on the HPC systems. Being interpreted and dynamically typed, plain Python is per se not suited to achieving high performance. Nevertheless, with the appropriate packages, tools, and techniques the Python programming language can be used to perform numerical computation very efficiently, in terms of both the program's efficiency and the programmer's efficiency. This article aims to provide advice and orientation on using Python correctly on the HPC systems and on taking first steps towards basic Python code optimization.

Performance

The key to achieving good performance with Python is to move expensive computation from the interpreted layer down to a compiled layer, which may consist of precompiled libraries, code written and compiled by the user, or just-in-time compiled code. Three packages covering these use cases are discussed below.

NumPy

NumPy is the Python module that provides arrays of native datatypes (float32, float64, int64, etc.) together with mathematical operations and functions on them. Typically, mathematical equations (in particular, vector and matrix arithmetic) can be written as NumPy expressions in a very readable and elegant way, which has several advantages: NumPy expressions avoid explicit, slow loops in Python. In addition, NumPy internally uses compiled code and optimized mathematical libraries, e.g. Intel MKL on MPCDF systems, which enables vectorization and other optimizations. Parts of these libraries are efficiently thread-parallelized by default, e.g. to perform matrix multiplications. In summary, NumPy is the de-facto standard for numerical array-based computations in Python and serves as the basis for a multitude of additional packages.
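
As a simple illustration, consider the following sketch which contrasts an explicit Python loop with equivalent vectorized NumPy expressions (the array size and variable names are arbitrary choices for this example):

    import numpy as np

    x = np.random.rand(10_000_000)

    # slow: explicit loop executed by the Python interpreter
    s = 0.0
    for value in x:
        s += value * value

    # fast: vectorized NumPy expression evaluated by compiled code
    s = np.sum(x * x)

    # even faster: np.dot calls an optimized (threaded) BLAS routine
    s = np.dot(x, x)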

Cython

Cython is a Python language extension that makes it relatively easy to create compiled Python modules written in Cython, C or C++. It integrates well with NumPy arrays and can be used to implement time-critical parts of an algorithm. Moreover, Cython is very useful to create interfaces to C or C++ code, such as legacy libraries or native CUDA code. Technically, the Cython source code is translated by the Cython compiler to intermediate C code which is then compiled to machine code by a regular C compiler like GCC or ICC.
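
To give a flavor of the syntax, the following minimal Cython sketch (a hypothetical file "fast_sum.pyx") uses static type declarations and a typed memoryview so that the loops run at C speed; it could be compiled in place, e.g., via "cythonize -i fast_sum.pyx":

    # fast_sum.pyx -- minimal Cython sketch (hypothetical example)
    import numpy as np

    def row_sums(double[:, :] a):
        """Sum the rows of a 2D float64 array using compiled loops."""
        cdef Py_ssize_t i, j
        cdef double s
        out = np.empty(a.shape[0])
        cdef double[:] out_view = out
        for i in range(a.shape[0]):
            s = 0.0
            for j in range(a.shape[1]):
                s += a[i, j]
            out_view[i] = s
        return out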

Numba

Numba is a just-in-time compiler based on the LLVM framework. It compiles Python functions at runtime, specialized for the datatypes these functions are called with. Moreover, Numba implements a subset of NumPy's functions, i.e. it is able to compile NumPy expressions. Functions are marked for JIT compilation via a simple decorator syntax; hence, Numba is minimally intrusive.
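
A minimal sketch of the decorator syntax is shown below (function and variable names are chosen freely for this example):

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code on the first call
    def pairwise_sum(a, b):
        # explicit loops are acceptable here, Numba compiles them
        result = np.empty_like(a)
        for i in range(a.shape[0]):
            result[i] = a[i] + b[i]
        return result

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = pairwise_sum(x, y)  # the first call triggers the compilation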

Parallelization

While Python does implement threads as part of the standard library, these cannot be used to accelerate computation on more than one core in parallel because the standard CPython implementation serializes the execution of Python bytecode: the global interpreter lock (GIL) ensures that at any time only one instruction is executed among all the threads belonging to a Python process. Nevertheless, Python is well suited for parallel computation. In the following, two important packages for intra-node and inter-node parallelism are addressed.

multiprocessing

The multiprocessing package is part of the Python standard library. It implements building blocks such as pools of workers and communication queues that can be used to parallelize data-parallel workloads. Technically, multiprocessing forks subprocesses from the main Python process which can run in parallel on multiple cores of a shared-memory machine. Note that some overhead is associated with the necessary inter-process communication. It is, however, possible to access shared memory from several processes simultaneously, a typical use case being large NumPy arrays.
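
The following minimal sketch distributes a toy CPU-bound function over a pool of worker processes (the function, the data, and the pool size are arbitrary choices for this example):

    import multiprocessing as mp

    def work(n):
        """Toy CPU-bound task applied independently to each item."""
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        data = [100_000 + i for i in range(64)]
        with mp.Pool(processes=8) as pool:
            results = pool.map(work, data)  # distributed over the workers
        print(results[:4])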

mpi4py

Access to the Message Passing Interface (MPI) is available via the module mpi4py. It enables parallel computation on distributed-memory computers where the processes communicate via messages. In particular, the mpi4py package supports the communication of NumPy arrays without additional overhead. On MPCDF systems, the environment module mpi4py provides an optimized build based on the default Intel MPI library.
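
As a minimal sketch, the following script sums a NumPy array over all MPI processes using the buffer-based (uppercase) methods of mpi4py, which communicate NumPy arrays directly; it could be launched, e.g., via "mpiexec -n 4 python allreduce_demo.py" (the file name is hypothetical):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # each process contributes an array filled with its own rank
    local = np.full(4, rank, dtype=np.float64)
    total = np.empty_like(local)

    # uppercase methods pass NumPy arrays without pickling overhead
    comm.Allreduce(local, total, op=MPI.SUM)

    if rank == 0:
        print(total)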

I/O

NumPy implements efficient binary I/O for array data, which is useful, e.g., for temporary files. A better choice with respect to portability and long-term compatibility is the HDF5 format. HDF5 is accessible via the h5py Python package, which offers an easy-to-use dictionary-style interface. For parallel codes, a special build of h5py with support for MPI-parallel I/O is provided via the environment module h5py-mpi.
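
A minimal serial sketch of the dictionary-style h5py interface is given below (the file name and dataset labels are arbitrary choices):

    import h5py
    import numpy as np

    data = np.random.rand(1000, 3)

    # writing: an open HDF5 file behaves much like a dictionary
    with h5py.File("results.h5", "w") as f:
        f["coordinates"] = data
        f["coordinates"].attrs["units"] = "nm"

    # reading: load the dataset back into a NumPy array
    with h5py.File("results.h5", "r") as f:
        coords = f["coordinates"][()]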

The Python software ecosystem

In addition to the packages discussed above, a plethora of well-established packages for scientific computing and data science is available, covering, e.g., numerical algorithms (SciPy), visualization (matplotlib, seaborn), data analysis (pandas), and machine learning (TensorFlow, PyTorch), to name only a few.

Software installation

Often, users need to install additional Python packages for their scientific domain. In most cases, the easiest and quickest way is to create an installation local to the user's home directory. After loading the Anaconda environment module, the command "pip install --user PACKAGE_NAME" downloads and installs a package from the Python Package Index (PyPI); similarly, the command "python setup.py install --user" installs a package from an unpacked source tarball. In both cases, the resulting installation is located below "~/.local", where Python will find it by default.

Summary

The software recommended in this article is available via the Anaconda Python Distribution (environment module "anaconda/3") on MPCDF systems. Note that for some packages (mpi4py, h5py-mpi) the hierarchy of the environment modules matters, i.e., it is necessary to load a compiler module (gcc, intel) and an MPI module (impi) in addition to Anaconda in order to get access to these dependent environment modules. Some examples of SLURM job scripts are available at https://www.mpcdf.mpg.de/services/computing/software/languages-1/python.

The application group at the MPCDF has developed an in-depth two-day course on "Python for HPC" which covers all the topics touched upon in this article in more detail. It is taught once or twice per year and announced via the MPCDF web page.

Finally, it should be pointed out that Python 2 reaches its official end-of-life on January 1, 2020. Consequently, new Python modules and updates to existing ones will not take Python 2 compatibility into account in the future. Users who are still running legacy code are strongly encouraged to migrate to Python 3.