Python is an interpreted programming language with a design philosophy stressing the readability of code. Its syntax is simple and expressive. Python has an extensive, easy-to-use library of standard modules.
The capabilities of Python can be extended with modules developed by third parties. In general, to simplify operations, it is left up to individual users and groups to install these third-party modules in their own directories. However, most systems offer several versions of Python as well as tools to help you install the third-party modules that you need.
The following sections discuss the Python interpreter, and how to install and use modules.
Loading an interpreter
To discover the versions of Python available:
[name@server ~]$ module avail python
You can then load the version of your choice using module load. For example, to load Python 2.7 you can use the command
[name@server ~]$ module load python/2.7
If you want to use any of these Python modules, load a Python version of your choice and then run module load scipy-stack.
For more details, including version numbers of the contained packages, visit Scipy.org.
Creating and using a virtual environment
With each version of Python, we provide the tool virtualenv. This tool allows you to create virtual environments within which you can easily install Python modules. These environments let you install several versions of the same module, for example, or compartmentalize a Python installation according to the needs of a specific project. We recommend that you create your Python virtual environment(s) in your home directory.
To create a virtual environment, enter the following command, where ENV is the name of the directory that will contain your environment:
[name@server ~]$ virtualenv --no-download ~/ENV
Once the virtual environment has been created, it must be activated:
[name@server ~]$ source ~/ENV/bin/activate
You should also upgrade pip in the environment:
[name@server ~]$ pip install --upgrade pip
To exit the virtual environment, simply enter the command deactivate:
(ENV) [name@server ~]$ deactivate
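To confirm from within Python that a virtual environment is active, one can compare the interpreter's prefixes. The sketch below is only an illustration and is not part of the tools provided on the clusters; the helper name in_virtual_environment is hypothetical. It assumes that older virtualenv releases set sys.real_prefix and that Python 3 environments make sys.prefix differ from sys.base_prefix.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # In Python 3 environments, sys.prefix points at the environment
    # while sys.base_prefix points at the underlying installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(sys.prefix)
print(in_virtual_environment())
```

If the second line printed is False after you ran the activate script, the python found on your PATH is probably not the one from the environment.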
Once you have activated a virtual environment, you can run the pip command. This command takes care of compiling and installing most Python modules and their dependencies. A comprehensive index of Python packages can be found at PyPI.
We first load the Python interpreter:
[name@server ~]$ module load python/2.7
We then activate the virtual environment, previously created using the virtualenv command:
[name@server ~]$ source ~/ENV/bin/activate
Finally, we install the latest stable version of Numpy:
(ENV) [name@server ~]$ pip install numpy --no-index
The --no-index option could be omitted, but using it guarantees that you will be using a version of numpy that was compiled by the Compute Canada team. In some cases it also avoids issues with missing or conflicting dependencies that may arise when pip installs a Python package downloaded from the Internet.
If we wanted to install the development version of Numpy, we could instead give pip a link to its Git repository:
(ENV) [name@server ~]$ pip install git+https://github.com/numpy/numpy.git
When pip is invoked without the --no-index option, it is not obvious where the numpy module is being installed from. One might assume it is being installed from PyPI, but in the particular case of numpy it may actually be installed from a distribution package offered by Compute Canada. Such a distribution package is called a Python wheel. pip will install from Compute Canada's local wheel provided that the version Compute Canada provides is current. If PyPI has a newer version, that version will be installed instead. To disable this default behaviour and use the older Compute Canada wheel, use the --no-index option.
To see where the pip command is installing a Python module from, you can make it more verbose with the -vvv option. Compute Canada provides Python wheels for many common Python modules; these wheels are configured to make the best use of the hardware and installed libraries on our clusters.
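After an installation, the version that ended up in the environment can also be checked from Python. The following is a minimal sketch, assuming Python 3.8 or newer (where importlib.metadata is in the standard library); the helper name installed_version is hypothetical. The requirements listing elsewhere on this page shows a +computecanada suffix on one locally built wheel; whether every local wheel carries that suffix is an assumption, so treat the version string only as a hint.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package_name):
    """Return the installed version of package_name, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed in the currently active environment.
for name in ("numpy", "pip"):
    print(name, installed_version(name))
```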
Installing dependent packages
In some cases, such as TensorFlow or PyTorch, Compute Canada provides wheels for a specific host (CPU or GPU), suffixed with _cpu or _gpu. Packages that depend on torch will then fail to install. If my_package depends on numpy and torch, the following will allow us to install it:
(ENV) [name@server ~]$ pip install numpy torch_gpu --no-index
(ENV) [name@server ~]$ pip install my_package --no-deps
The --no-deps option tells pip to ignore dependencies.
Creating virtual environments inside of your jobs
Parallel filesystems such as the ones used on our clusters are very good at reading or writing large chunks of data, but perform poorly with many small reads and writes. Launching software and loading libraries, such as starting Python and loading a virtual environment, is precisely this kind of operation. For this reason, we recommend that you create your virtual environment inside your job, using the compute node's local disk. It may seem counter-intuitive to recreate your environment for every job, but it is very often much faster than running from the parallel filesystem, and it gives you some protection against filesystem performance issues. This can be achieved with a submission script such as the following:
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --mem-per-cpu=1.5G      # increase as needed
#SBATCH --time=1:00:00

module load python/3.6
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --upgrade pip

pip install --no-index -r requirements.txt
python ...
where the requirements.txt file will have been created from a test environment. For example, if you want to create an environment for TensorFlow, you would do the following on a login node:
[name@server ~]$ module load python/3.6
[name@server ~]$ ENVDIR=/tmp/$RANDOM
[name@server ~]$ virtualenv --no-download $ENVDIR
[name@server ~]$ source $ENVDIR/bin/activate
[name@server ~]$ pip install --upgrade pip
[name@server ~]$ pip install --no-index tensorflow_gpu
[name@server ~]$ pip freeze > requirements.txt
[name@server ~]$ deactivate
[name@server ~]$ rm -rf $ENVDIR
This will yield a file called requirements.txt, with content such as the following:
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
absl-py==0.5.0
astor==0.7.1
gast==0.2.0
grpcio==1.17.1
h5py==2.8.0
Keras-Applications==1.0.6
Keras-Preprocessing==1.0.5
Markdown==2.6.11
numpy==1.16.0
protobuf==3.6.1
six==1.12.0
tensorboard==1.12.2
tensorflow-gpu==1.12.0+computecanada
termcolor==1.1.0
Werkzeug==0.14.1
This file will ensure that your environment is reproducible between jobs.
Note that the above instructions require all of the packages you need to be available among the Python wheels that we provide (see below). If this is not the case, please contact Technical support to request that the wheels you need be added to our repository.
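To review what a requirements.txt file pins before submitting a job, the file can be read with a few lines of Python. This is a rough sketch that handles only the two kinds of lines shown in the example above (-f options and name==version pins); real requirements files allow more syntax, and the function name parse_requirements is hypothetical.

```python
def parse_requirements(text):
    """Split pip-freeze style text into wheel directories and pinned versions."""
    find_links = []
    pinned = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith("-f "):
            # Directory that pip searches for wheels.
            find_links.append(line[3:].strip())
        elif "==" in line:
            name, version = line.split("==", 1)
            pinned[name.strip()] = version.strip()
    return find_links, pinned


# Sample lines copied from the requirements.txt listing above.
sample = """\
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
numpy==1.16.0
tensorflow-gpu==1.12.0+computecanada
"""

links, versions = parse_requirements(sample)
print(links)
print(versions["numpy"])
```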
Listing available wheels
Currently available wheels are listed on the Available Python wheels page. You can also run the command avail_wheels on the cluster. By default, it will:
- only show you the latest version of a specific package (unless versions are given);
- only show you versions that are compatible with the loaded Python module (if one is loaded), otherwise all Python versions will be shown;
- only show you versions that are compatible with the CPU architecture that you are currently running on.
To list wheels containing "cdf" (case insensitive) in their names:
[name@server ~]$ avail_wheels --name "*cdf*"
name     version    build    python    arch
-------  ---------  -------  --------  ------
netCDF4  1.4.0               cp27      avx2
Or to list all available versions:
[name@server ~]$ avail_wheels --name "*cdf*" --all_versions
name     version    build    python    arch
-------  ---------  -------  --------  ------
netCDF4  1.4.0               cp27      avx2
netCDF4  1.3.1               cp36      avx2
netCDF4  1.3.1               cp35      avx2
netCDF4  1.3.1               cp27      avx2
netCDF4  1.2.8               cp27      avx2
Or to list a specific version:
[name@server ~]$ avail_wheels --name "*cdf*" --version 1.3
name     version    build    python    arch
-------  ---------  -------  --------  ------
netCDF4  1.3.1               cp36      avx2
netCDF4  1.3.1               cp35      avx2
netCDF4  1.3.1               cp27      avx2
Or to list the wheels for a specific version of Python:
[name@server ~]$ avail_wheels --name "*cdf*" --python 3.6
name     version    build    python    arch
-------  ---------  -------  --------  ------
netCDF4  1.3.1               cp36      avx2
The python column tells us which Python version the wheel was built for; for example, cp36 stands for CPython 3.6.
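To work out which value of the python column matches the interpreter you are running, the tag can be built from the version number. This is a small sketch; the function name cpython_tag is hypothetical, and the result is only meaningful when the interpreter is CPython.

```python
import sys


def cpython_tag(major, minor):
    """Build a tag such as 'cp36' from a major and minor version number."""
    return "cp{}{}".format(major, minor)


# Tag for the interpreter running this script (meaningful for CPython only).
current_tag = cpython_tag(sys.version_info[0], sys.version_info[1])
print(current_tag)
```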
A few other examples
- List multiple packages and multiple versions: avail_wheels numpy biopython --version 1.15.0 1.7
- List the wheels for specific architectures: avail_wheels --arch avx avx2
- List the wheels specifically for GPU and display only the name, version and python columns: avail_wheels --column name version python --all_versions --name "*gpu"
- Display usage and help: avail_wheels --help
Parallel programming with Python multiprocessing module
Doing parallel programming with Python can be an easy way to get results faster. A common way of doing so is to use the multiprocessing module. Of particular interest is the Pool class of this module, since it allows one to control the number of processes started in parallel and to apply the same calculation to multiple data items. As an example, suppose we want to calculate the cubes of a list of numbers. The serial code could be written with a list comprehension:
def cube(x):
    return x**3

data = [1, 2, 3, 4, 5, 6]
cubes = [cube(x) for x in data]
print(cubes)
or with the built-in map function:
def cube(x):
    return x**3

data = [1, 2, 3, 4, 5, 6]
cubes = list(map(cube, data))
print(cubes)
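Using the Pool class to run in parallel, the first version becomes, with apply_async:
import multiprocessing as mp

def cube(x):
    return x**3

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    data = [1, 2, 3, 4, 5, 6]
    # Submit one task per item, then collect the results in order
    results = [pool.apply_async(cube, args=(x,)) for x in data]
    cubes = [p.get() for p in results]
    pool.close()
    pool.join()
    print(cubes)
and the second version becomes, with Pool.map:
import multiprocessing as mp

def cube(x):
    return x**3

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    data = [1, 2, 3, 4, 5, 6]
    cubes = pool.map(cube, data)
    pool.close()
    pool.join()
    print(cubes)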
The above examples are, however, limited to 4 processes. On a cluster, it is very important to use exactly the cores that are allocated to your job. Launching more processes than the number of cores you requested will slow down your calculation and possibly overload the compute node. Launching fewer processes than you have cores will leave cores idle and waste resources. The correct number of processes to start is determined by the resources you requested from the scheduler. For example, if you have the same computation to perform on several dozen data items or more, it would make sense to use all of the cores of a node. In this case, you can write your job submission script with the following header:
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32

python cubes_parallel.py
and your code would then become the following, with apply_async:
import multiprocessing as mp
import os

def cube(x):
    return x**3

if __name__ == '__main__':
    # Number of cores granted by the scheduler; 1 outside a job
    ncpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))
    pool = mp.Pool(processes=ncpus)
    data = [1, 2, 3, 4, 5, 6]
    results = [pool.apply_async(cube, args=(x,)) for x in data]
    cubes = [p.get() for p in results]
    pool.close()
    pool.join()
    print(cubes)
or, with Pool.map:
import multiprocessing as mp
import os

def cube(x):
    return x**3

if __name__ == '__main__':
    # Number of cores granted by the scheduler; 1 outside a job
    ncpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))
    pool = mp.Pool(processes=ncpus)
    data = [1, 2, 3, 4, 5, 6]
    cubes = pool.map(cube, data)
    pool.close()
    pool.join()
    print(cubes)
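When the same script is also run outside a Slurm job, for instance during testing, a slightly more defensive way to choose the process count can help. The sketch below is one possible approach under stated assumptions, not a recommended standard: the helper name allocated_cpus is hypothetical, and the fallback to os.sched_getaffinity assumes a Linux system where that function exists.

```python
import os


def allocated_cpus(environ=None):
    """Best-effort guess of the number of cores this process may use."""
    env = os.environ if environ is None else environ
    value = env.get("SLURM_CPUS_PER_TASK")
    if value is not None:
        return int(value)
    # Outside a Slurm job on Linux, fall back to the CPU affinity mask.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Last resort: stay serial rather than oversubscribe a shared machine.
    return 1


# With the variable set as it would be by --cpus-per-task=32:
print(allocated_cpus({"SLURM_CPUS_PER_TASK": "32"}))
# With whatever the current environment provides:
print(allocated_cpus())
```

The returned value can then be passed as the processes argument of mp.Pool.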
Note that in the above example, the function cube itself is sequential. If you call an external library such as numpy, the functions your code calls may themselves be parallel. Before distributing processes with the technique above, verify whether the functions you call are multithreaded and, if they are, control how many threads each of them uses. If, for example, each call uses all the cores available (32 in the above example) and you also start 32 processes, the node will be oversubscribed, which will slow down your code and possibly overload the node.
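One common way to cap the threads used by numerical libraries is through environment variables set before the library is imported. The sketch below assumes the library is backed by OpenMP, OpenBLAS or MKL, which read OMP_NUM_THREADS, OPENBLAS_NUM_THREADS and MKL_NUM_THREADS respectively; whether a given numpy build honours any of them depends on how it was compiled. The helper names threads_per_process and limit_library_threads are hypothetical.

```python
import os


def threads_per_process(total_cores, n_processes):
    """Split cores evenly among processes, giving each at least one thread."""
    return max(1, total_cores // n_processes)


def limit_library_threads(n_threads):
    """Set common thread-count variables; call this before importing numpy."""
    for variable in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[variable] = str(n_threads)


# Example: 32 allocated cores shared by 8 worker processes.
limit_library_threads(threads_per_process(32, 8))
print(os.environ["OMP_NUM_THREADS"])
```

Worker processes started afterwards inherit these variables, so each of them is limited in the same way.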
Note that the multiprocessing module is restricted to a single compute node, so the speedup achievable by your program is usually limited by the total number of CPU cores in that node. To go beyond this limit and use multiple nodes, consider using mpi4py or PySpark. Other methods of parallelizing Python (not all of them necessarily supported on Compute Canada clusters) are listed here. Also note that you can greatly improve the performance of your Python program by ensuring it is written efficiently, so this should be done before parallelizing. If you are not sure whether your Python code is efficient, please contact technical support and have them look at your code.
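One way to start checking whether serial code is efficient is to profile it with the standard library's cProfile module before parallelizing. The sketch below is only an illustration; the function slow_sum_of_cubes is a made-up stand-in for your own code.

```python
import cProfile
import io
import pstats


def slow_sum_of_cubes(n):
    """Deliberately naive loop used as something to profile."""
    total = 0
    for x in range(n):
        total += x**3
    return total


# Record every function call made between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_cubes(100000)
profiler.disable()

# Print the five entries with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

The functions at the top of the report are the ones worth optimizing first.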
Please see Anaconda.
Please see Jupyter.