Setting up and Optimizing Python for Data Science on Intel, AMD, and ARM (including Apple) Computers

Setting up a Python environment is often a pain. Even worse, the environment, once it is finally built, is sometimes surprisingly slow because the underlying numeric libraries (BLAS, OneAPI, and so on) are improperly configured. This guide aims to be a fairly definitive reference for fixing those problems as of early 2022. How much speedup is possible? Usually around 20% over defaults overall, but routinely as high as 100x if the starting environment is broken or has poor defaults.

But before we get into all that optimization, let’s begin with the basic environment installation.

Part 1: Basic Environment Setup

Before installation, you first need to know three big pieces of information about the computer or virtual machine (VM) you are using: the operating system, the CPU architecture, and the CPU brand.

The OS: should be one of Windows, macOS, or Linux (Ubuntu, Fedora, etc.).
The architecture: either X64 (AMD and Intel), or ARM (Raspberry Pi, Apple Silicon, Nvidia Jetson, etc.). There are also some rare ones: POWER/PowerPC (IBM servers) and RISC-V (not common yet).
Almost everything is 64-bit these days, but if using a Raspberry Pi or other small single board computer (SBC) with less than 4 GB of memory, keep an eye out for 32-bit installations.
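If you are not sure what a machine is, Python itself can report most of it. This is only a minimal sketch using the standard library; the function name describe_machine is my own invention, and note that the CPU brand is not reliably reported this way (and an emulated X64 Python on Apple Silicon will report the emulated architecture, not the real one).

```python
import platform
import struct
import sys


def describe_machine():
    """Collect the OS, CPU architecture, and pointer width of this interpreter."""
    return {
        "os": platform.system(),      # 'Linux', 'Darwin' (macOS), or 'Windows'
        "arch": platform.machine(),   # e.g. 'x86_64', 'AMD64', 'arm64', 'aarch64'
        "python_bits": struct.calcsize("P") * 8,  # 32 or 64 for this Python build
        "python_version": sys.version.split()[0],
    }


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

The python_bits value is the one to watch on a Raspberry Pi: a 32-bit Python on a 64-bit capable board will report 32.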

Most cloud VMs are Intel X64 CPUs running Linux.

There are three major choices for setting up a Python environment: venv, Anaconda, and Mambaforge. Which you choose to install and how you install it will depend on the above OS and CPU information.

Mambaforge: a newer, faster, and open-source variant of conda that can be downloaded from miniforge. I strongly recommend this as the default choice for Python environments built using conda-forge packages. It uses the ‘conda’ commands by default; if conda is already installed, it is accessed via the interchangeable ‘mamba’ command instead. This is the best option for ARM and Apple Silicon, and the documentation details how to use emulated X64 environments if native packages are not yet available for those systems.
Python venv: the built-in virtual environment creator is small and ‘just works’ on all systems. It is part of the Python standard library and is often preinstalled on Linux; where it is missing, it can be added from an OS package distribution system (e.g. `sudo apt install`), and the similar virtualenv tool is available from pip. The only disadvantages are that it lacks many of the convenient environment management commands of mamba/conda and it lacks the ability to specify different Python versions on creation.
Anaconda: the advantage of this is the nice graphical UI and complete download package it comes with. However, an enterprise license is required for large organizations, and the conda tool is currently much slower than mamba. A smaller installation called Miniconda is available if you don’t need all the extras.

If you are using a Raspberry Pi, use a python venv and use the piwheels channel for pip. This is the default on Raspberry Pi OS, so no configuration should be required.

Setup examples

Environment creation on Mamba/Conda after running the installation script. Environments automatically install to ~/.conda/envs/ or ~/.miniconda/envs/ or ~/mambaforge/envs/.

conda create -n my_environ python=3.9
conda activate my_environ

For venv, where ./my_environ will be a new folder in the current directory:

python -m venv ./my_environ
source ./my_environ/bin/activate

Now, there is a Python environment with no additional packages installed.
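It is easy to think an environment is active when it is not. A quick check from inside Python, as a rough sketch (active_environment is just a name I made up; CONDA_DEFAULT_ENV is the variable that conda activate sets):

```python
import os
import sys


def active_environment():
    """Report which kind of environment the running interpreter belongs to."""
    in_venv = sys.prefix != sys.base_prefix          # True inside a venv
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")  # set by 'conda activate'
    return {"prefix": sys.prefix, "in_venv": in_venv, "conda_env": conda_env}


if __name__ == "__main__":
    print(active_environment())
```

If prefix points at the system Python rather than at my_environ, packages are about to be installed in the wrong place.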

The first tip is to install all the packages you think you will need at once. The advantage of installing as many things at once as possible is that the tools attempt to find compatible versions for everything in one pass. Otherwise, with each new install, they may have to uninstall and reinstall dependencies. For example:

conda install scikit-learn pandas statsmodels prophet numexpr bottleneck tqdm holidays lightgbm matplotlib -c conda-forge

Or, in a venv, use pip for everything. Conda and conda-forge won’t have every package, and sometimes will have an outdated version of a package. For version information on installed packages, use conda list example_package or pip show example_package. Some examples that aren’t always available through conda for all architectures and may need to be installed with pip: mxnet, tensorflow, prophet. Some packages also specify unnecessarily narrow dependency ranges. To bypass those, for example:

pip install --no-deps mxnet

Although this may then require manually installing any additional dependencies, it often helps. How do you know what the additional dependencies are? There are several ways, but I often check out the setup.py file in the project’s GitHub repository online while also checking out any documentation there.
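Another way to see what a package asks for, once it has been installed with --no-deps, is to read its own metadata with the standard library. This is a small sketch rather than a complete tool, and declared_dependencies is a hypothetical helper name:

```python
from importlib import metadata


def declared_dependencies(package_name):
    """Return the requirement strings a package declares, or None if not installed."""
    try:
        return metadata.requires(package_name) or []
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # 'pip' is only used as an example of a package that is usually present.
    print(declared_dependencies("pip"))
```

The strings come back in the same form the package author wrote them, version ranges and optional-extra markers included, so you can decide which ones you actually need.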

Performance Tip: note that bottleneck and numexpr are installed above. These are part of the optional dependencies of pandas and automatically speed up pandas operations. There is also some numba acceleration, which requires additional configuration and in my experience is rarely useful. See the documentation for more info.
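Since pandas picks these up silently, it is worth confirming they are actually importable in the environment you ended up with. A tiny sketch (missing_accelerators is a made-up name):

```python
from importlib.util import find_spec


def missing_accelerators(names=("bottleneck", "numexpr")):
    """Return the optional accelerator packages that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_accelerators()
    print("missing:", ", ".join(missing) if missing else "none")
```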

As a last measure, most packages can be ‘Built from Source’ if other installation methods fail. Building from source can also be used for performance gains, as the build can be optimized specifically for an environment. Doing so with packages written only in Python is simple: all that is needed is Python setuptools and a setup.py file. However, many numeric libraries are more complex and exceedingly tedious to build from source.

Part 2: CPU Performance Optimization

Here, it is important to know how many physical CPU cores are present in the system. This might seem easy, but hyperthreading usually presents twice as many ‘logical’ CPUs as there are ‘physical’ cores. Most if not all cloud VMs use ‘vCPUs’, and the reported vCPUs are equal to hyperthreads (so physical cores = vCPU / 2). The ‘average’ desktop and higher-end laptops have 8 physical cores with 16 threads right now. To further confuse the issue, the latest CPUs have two or more kinds of cores: efficiency cores and high-performance cores.

The reason this is important is that these numeric/scientific workloads tend to be very demanding and can really only be run on 1 thread per core. Programs sometimes default to running 16 processes on a computer with 8 physical cores, which can often lead to crashes and will definitely lead to much slower performance (“thrashing”). I have also seen it work best to run even fewer workers than the number of physical cores, because bottlenecks tend to be single-core: n_cpu = n - 1 or 0.75 * n, where n is the number of physical cores, is the safest number to start with.
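Here is one way to turn that rule of thumb into numbers. Treat it as a sketch: psutil is an optional third-party package, and when it is absent the fallback simply assumes two hardware threads per core, which will be wrong on machines without hyperthreading or with mixed core types. Both function names are my own.

```python
import os


def physical_core_count():
    """Best-effort physical core count; falls back to logical CPUs halved."""
    try:
        import psutil  # third-party and optional
        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    logical = os.cpu_count() or 1
    return max(1, logical // 2)  # assumption: two hardware threads per core


def suggested_workers(physical_cores):
    """Return the two conservative starting points: n - 1 and 0.75 * n."""
    return max(1, physical_cores - 1), max(1, int(physical_cores * 0.75))


if __name__ == "__main__":
    n = physical_core_count()
    print(f"physical cores: {n}, try n_jobs in {suggested_workers(n)}")
```

For an 8-core machine this suggests starting with 7 or 6 workers.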

Benchmarking

There seem to be many tradeoffs in choosing an environment configuration. What helps MXNet neural networks often appears to slow down Tensorflow, and vice versa. Common CPU benchmarks such as Cinebench, Geekbench or countless others don’t help much with assessing a Python environment as they all run using unrelated code.

Personally, I use the AutoTS Benchmark which tests a number of common models seeking a balanced overall environment because running these models is what my production environment does. A couple of other third-party benchmark options are listed in the appendix, and there is the Python built-in timeit package for throwing together custom tests.
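For a quick custom test with timeit, something like the following is usually enough. It is only a sketch: sample_workload is a stand-in I made up (an SVD if numpy is present, plain Python arithmetic otherwise), and you should replace it with whatever your own code spends its time on.

```python
import timeit


def best_runtime(func, repeat=5, number=1):
    """Run func several times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


def sample_workload():
    """Stand-in workload; replace with whatever your production code does."""
    try:
        import numpy as np
    except ImportError:
        return sum(i * i for i in range(200_000))
    rng = np.random.default_rng(0)
    a = rng.standard_normal((500, 500))
    return np.linalg.svd(a @ a.T, compute_uv=False)


if __name__ == "__main__":
    print(f"best of 5: {best_runtime(sample_workload):.3f} s")
```

Taking the minimum of several repeats filters out most of the noise from other processes, which matters when the differences between configurations are only a few percent.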

BLAS Configuration

Basic Linear Algebra Subprograms (BLAS) is the interface behind the libraries that do most of the work beneath numeric calculations in almost any programming language. There are several major implementations: OpenBLAS, MKL (Intel), BLIS (AMD), and ATLAS.

To be completely honest, I may not have configured things properly as I don’t see much difference between OpenBLAS and MKL on my tests. MKL seems to respond more to environmental variable configurations. Regardless, conda-forge uses OpenBLAS as the default and in most situations sticking with that is the best choice.

Since BLAS libraries use the same API, you can directly swap out libblas files in your environment directory. However, I have found doing so with conda to be easier, although I have found about 5 different variations on the instructions, so take your pick. It is best to do this in a new environment and not modify an existing one. Some examples:

mamba install "blas=*=openblas"

conda install scipy "blas=*=*mkl"

conda install "conda-forge::blas=*=openblas"

conda install nomkl numpy scipy scikit-learn numexpr

conda remove nomkl && conda install mkl

It is worth checking the numpy.show_config() function output. If all or most of the listings there are “NOT AVAILABLE” then you should consider installing from a different source or otherwise Googling a solution as it means numpy will be much slower, lacking any BLAS acceleration.
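A more compact view than numpy.show_config() comes from threadpoolctl, which lists the BLAS and OpenMP libraries actually loaded into the process. A rough sketch, assuming threadpoolctl is installed (it comes along with scikit-learn); blas_summary is a name I picked:

```python
def blas_summary():
    """Return threadpoolctl's view of loaded BLAS/OpenMP libraries, or [] if unavailable."""
    try:
        import numpy  # noqa: F401  (importing numpy loads its BLAS library)
        from threadpoolctl import threadpool_info
    except ImportError:
        return []
    return [
        (lib.get("user_api"), lib.get("internal_api"), lib.get("num_threads"))
        for lib in threadpool_info()
    ]


if __name__ == "__main__":
    for user_api, internal_api, num_threads in blas_summary():
        print(f"{user_api}: {internal_api} with {num_threads} threads")
```

This is a handy way to confirm that a BLAS swap actually took effect: after switching, the internal_api entry should read openblas or mkl accordingly.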

Turn Off Hyperthreading on Linux

In my testing, the simplest way to improve numeric Python performance on Ubuntu 21.10 was to turn off hyperthreading (SMT) in the BIOS (this is only an option on computers you have direct access to, not cloud computers). This led to the best performance on both Intel and AMD chips (hyperthreading is not an option to my knowledge on any ARM chips). The logic behind this is that each thread then receives more resources, and that auto detection of cpu core counts by programs is less likely to overestimate physical CPU count.

There were some exceptions: MXNet was actually much slower without hyperthreading. With properly optimized programs and environmental variables, I believe hyperthreading would be faster overall, but for minimal effort, turning off hyperthreading leads to the most gain. Performance gains on Windows by turning off hyperthreading were less than 5% and did not appear to be significant overall.

Possible In Code Improvements for All Systems

If your code utilizes Tensorflow or Pytorch, inter_ and intra_op_parallelism_threads can be configured, where intra is number of cores in the CPU and inter is number of sockets (for servers).
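For PyTorch, the equivalent calls are torch.set_num_threads (intra-op) and torch.set_num_interop_threads (inter-op). I have not benchmarked these myself, so take this only as a sketch of where the knobs are; the wrapper function and its defaults of 8 and 1 are assumptions for an 8-core, single-socket machine.

```python
def configure_torch_threads(intra_op=8, inter_op=1):
    """Apply thread settings to PyTorch; returns False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    torch.set_num_threads(intra_op)  # threads used inside a single operation
    try:
        # Must run before any parallel work has started, otherwise PyTorch raises.
        torch.set_num_interop_threads(inter_op)
    except RuntimeError:
        pass
    return True


if __name__ == "__main__":
    print("configured:", configure_torch_threads())
```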

Joblib seems to be the best multiprocessing package in Python, and is used by default in scikit-learn. I won’t discuss multiprocessing here, but will mention that configuring a parallel_backend with inner_max_num_threads=1 and n_jobs can sometimes help in hyperthreading environments where joblib is used. This also sets the n_jobs for sklearn models if not already set. Also see threadpoolctl.

Example for system with 8 physical cores:

import tensorflow as tf

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(8)

# if scikit-learn-intelex installed and using Intel CPU
from sklearnex import patch_sklearn
patch_sklearn()

# conda install mkl-service (and use MKL as BLAS)
import mkl
mkl.set_num_threads(8)

# joblib context appears to help the most on Linux, slower elsewhere
from joblib import parallel_backend
from autots.evaluator.benchmark import Benchmark

with parallel_backend("loky", inner_max_num_threads=1, n_jobs=8):
    test = Benchmark()
    test.run(n_jobs=8, times=1)
    print(test.results)
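The threadpoolctl package mentioned above can do a similar job for a single block of code without touching joblib. A minimal sketch, assuming threadpoolctl is installed; limited_blas_call is a hypothetical wrapper name, and the limit of 8 again assumes 8 physical cores:

```python
def limited_blas_call(func, max_threads=8):
    """Run func with BLAS thread pools capped at max_threads, when threadpoolctl exists."""
    try:
        from threadpoolctl import threadpool_limits
    except ImportError:
        return func()  # no limiter available; run unrestricted
    with threadpool_limits(limits=max_threads, user_api="blas"):
        return func()


if __name__ == "__main__":
    print(limited_blas_call(lambda: sum(range(10)), max_threads=1))
```

The limit only applies inside the with block, so it is a low-risk way to test whether fewer BLAS threads help one particular step.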

Intel Optimized Distributions

There is an entire OneAPI AI Toolkit out there if you want to download the optimized environment from Intel with one click. The documentation is much better than it was a year or two ago, but is still lacking in some regards – especially for Intel GPU use which I don’t think I ever got working (or it was just slower, like happens with CUDA in some workloads). I tested it and found it rather clunky.

However, I found installing the two primary individual packages to be easier: intel-tensorflow and Intel Extension for scikit-learn. I have not tested the PyTorch extension.

The most valuable seems to be intel-tensorflow which can be installed from pip, and intel-tensorflow-avx512 which can only be used on Linux. I noticed performance gains with both over regular Tensorflow. Use is easy: install and use Tensorflow as usual.

Use of pip install scikit-learn-intelex is also quite easy. It can be installed and used almost in place (see the documentation).

Performance benefit of both appeared to be relatively minor for my workload, but since using them and testing them is easy, it is definitely worth a try. Note this works best on newer CPUs, and many cloud VMs may use relatively old Intel CPUs.
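Before trying intel-tensorflow-avx512 on a cloud VM, it helps to know whether the CPU supports AVX-512 at all. On Linux the kernel lists CPU features in /proc/cpuinfo, and avx512f is the base AVX-512 flag. The sketch below only handles Linux and simply returns False elsewhere; the function names are mine.

```python
import platform


def cpu_flags(cpuinfo_text):
    """Extract the CPU feature flags from the text of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("flags", "Features"):  # x86 uses 'flags', ARM uses 'Features'
            return set(value.split())
    return set()


def supports_avx512(cpuinfo_path="/proc/cpuinfo"):
    """True if the Linux kernel reports the AVX-512 foundation flag (avx512f)."""
    if platform.system() != "Linux":
        return False  # this sketch only knows how to read Linux's /proc/cpuinfo
    try:
        with open(cpuinfo_path) as handle:
            return "avx512f" in cpu_flags(handle.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("AVX-512 available:", supports_avx512())
```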

Setting Environmental Variables

Most of the underlying libraries that run numeric acceleration utilize environmental variables to configure multithreading. Setting these can boost performance quite easily. The following configuration is simple and achieves most or all of the potential in most systems. Generally, it is best to apply variables on the command line using the export command (Linux) or set command (Windows) as it applies to all user processes, but I have included the rest in Python for simplicity.

Here 8 threads are used, one per physical core on an 8-core machine (which is 16 cloud vCPUs). Adjust according to your machine.

export OMP_NUM_THREADS=8

export TF_DISABLE_MKL=1

Tuning OMP_NUM_THREADS

OMP_NUM_THREADS is the one environmental variable that has impact across all environments that I have seen. The best default number is equal to the number of physical cores, however fewer or more threads may be more efficient. I especially expect that to be true as CPUs with a mix of big and little cores become more common. With those, I noticed that setting num threads equal to the number of performance cores was sometimes more effective, as was using 8 even if more are available.
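The only reliable way I know to find the best value is to try several. Because the variable is read when the numeric libraries start up, each trial needs a fresh process. The sketch below does that with subprocess; WORKLOAD is a hypothetical numpy one-liner standing in for your real code, and the measured time includes interpreter startup and imports, so use a workload that runs for at least a few seconds.

```python
import os
import subprocess
import sys
import time


def run_with_threads(n_threads, code):
    """Run a Python snippet in a fresh process with OMP_NUM_THREADS set to n_threads."""
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return time.perf_counter() - start, result.stdout.strip()


# Hypothetical workload; swap in the code you actually care about.
WORKLOAD = "import numpy as np; a = np.ones((1500, 1500)); (a @ a).sum()"

if __name__ == "__main__":
    for n in (2, 4, 8):
        try:
            seconds, _ = run_with_threads(n, WORKLOAD)
            print(f"OMP_NUM_THREADS={n}: {seconds:.2f} s")
        except subprocess.CalledProcessError:
            print("workload failed (is numpy installed?)")
            break
```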

Setting Environmental Variables on Intel on Linux

Not all of what follows is Intel specific. Anything starting with “KMP” or with “MKL” in the name is Intel specific (or disables something Intel specific); everything else may be applicable to more systems. This doesn’t seem to have as much effect on Windows or macOS, where the OS seems to override more direct control over threads.

If you love parameter tuning, then Intel CPUs are for you. The MKL, OneAPI, and OneDNN implementations can give good performance, but to get that performance there often needs to be a bit of environment tuning – which if not done, or not done well, can lead to impressive slowness instead. Make sure you test the performance of any configuration.

This recipe seems to be one reasonable starting point for an Intel machine with 8 cores:

import os

os.environ["OMP_NUM_THREADS"] = "8"
os.environ["TF_DISABLE_MKL"] = "0"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"
os.environ["KMP_BLOCKTIME"] = "1"

If you have hyperthreading on (you probably do), a little more tuning seems to be helpful as well:

os.environ["OMP_DYNAMIC"] = "true"
os.environ["KMP_AFFINITY"] = "noverbose,granularity=fine,balanced,1,0"

MXNet seems to prefer hyperthreading on, KMP_BLOCKTIME=0, and the OMP_PROC_BIND options.

See appendix for more common variables for tuning.
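One catch when setting these from Python: as far as I can tell, most of these libraries read the variables once, when they are first loaded, so the os.environ lines need to run before numpy, Tensorflow and friends are imported. The sketch below sets the variables and reports anything that was already imported; apply_thread_env and the HEAVY_MODULES list are my own inventions, not part of any library.

```python
import os
import sys

# Libraries that typically read threading variables when they are first loaded.
HEAVY_MODULES = ("numpy", "scipy", "tensorflow", "torch", "mxnet")


def apply_thread_env(settings):
    """Set environment variables and return any heavy modules imported too early."""
    already_loaded = [name for name in HEAVY_MODULES if name in sys.modules]
    for key, value in settings.items():
        os.environ[key] = str(value)
    return already_loaded


if __name__ == "__main__":
    too_late = apply_thread_env({"OMP_NUM_THREADS": 8, "KMP_BLOCKTIME": 1})
    if too_late:
        print("Warning: set these before importing:", ", ".join(too_late))
```

Call it at the very top of a script or notebook, before any other imports.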

Set System CPU Performance Options

This is simplest on Windows, where Power Plans are available (run powercfg -setactive SCHEME_MIN). For all systems, the computer BIOS will often have performance adjustment optimizations. On Linux, thermald may help, as well as utilizing the Clear Linux distribution.

Part 3: GPU Performance Optimization

GPUs in my benchmark are actually much slower than CPU operations. CUDA on an RTX 3060 was 4x slower than intel-tensorflow on the i7 11800H CPU on relatively small neural networks. Reddit has agreed with me on this, so it must be right. Likely this result is explained by the large overhead of copying and moving the data to and from the graphics card.
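You can get a feel for this overhead on your own machine by timing the same small operation on each device. This is a crude sketch and not the benchmark I used: a bare matrix multiply is not a neural network, the first run on a device includes one-time setup cost, and time_matmul is just an illustrative name.

```python
import time


def time_matmul(device, size=512, repeats=20):
    """Time repeated square matmuls on a TensorFlow device; None if TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    with tf.device(device):
        a = tf.random.normal((size, size))
        start = time.perf_counter()
        for _ in range(repeats):
            result = tf.matmul(a, a)
        result.numpy()  # force execution and copy the result back to host memory
        return time.perf_counter() - start


if __name__ == "__main__":
    cpu = time_matmul("/CPU:0")
    if cpu is None:
        print("TensorFlow is not installed")
    else:
        import tensorflow as tf
        print(f"CPU: {cpu:.3f} s")
        if tf.config.list_physical_devices("GPU"):
            print(f"GPU: {time_matmul('/GPU:0'):.3f} s")
```

Try a few values of size: the GPU usually only pulls ahead once the matrices are large enough to hide the transfer cost.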

Large neural networks, where GPUs would show their strength, are rare in business and student contexts in my experience, and Google Colab is very convenient to use when needed with no software configuration required.

I do have one tip for NVIDIA GPUs: research the CUDA and CuDNN versions needed for PyTorch, MXNet and Tensorflow, and plan which version of each you will download carefully if you want one environment that can run them all. Otherwise, you’ll be spending a lot of time downloading and reinstalling CUDA and packages.
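Once something is installed, each framework can tell you which CUDA and cuDNN versions it was built against, which makes mismatches easier to spot. A sketch covering PyTorch and Tensorflow only (I am not sure of the equivalent call for MXNet, so it is left out); cuda_build_versions is a made-up helper name:

```python
def cuda_build_versions():
    """Collect the CUDA/cuDNN versions each installed framework was built against."""
    versions = {}
    try:
        import torch
        versions["torch"] = {
            "cuda": torch.version.cuda,               # None on CPU-only builds
            "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        }
    except ImportError:
        pass
    try:
        import tensorflow as tf
        build = tf.sysconfig.get_build_info()
        versions["tensorflow"] = {
            "cuda": build.get("cuda_version"),    # key is absent on CPU-only builds
            "cudnn": build.get("cudnn_version"),
        }
    except ImportError:
        pass
    return versions


if __name__ == "__main__":
    for framework, info in cuda_build_versions().items():
        print(framework, info)
```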

AMD has ROCm which appears to have the “slow but steady” development approach and Intel seems to be throwing a bunch of development work towards their own new GPUs, but it still feels like a work in progress as of writing (my testing being on a mini pc with Intel 1135g7 with Xe 84 execution units iGPU).

My personal feeling is that you might as well configure and test a GPU if you have one, but a proper CPU setup is more important.

Appendix: More Environmental Variable Options

Tuning these is more likely to slow things down than speed things up, but you might find a minimum if you are bored. It sometimes seems to require a reboot to properly reset variables during testing as sometimes other variables are set based on upstream variables (I’m guessing, that’s my explanation for strange behavior otherwise).

os.environ["OMP_PROC_BIND"] = "close"  # also 'false', 'spread'
os.environ["OMP_PLACES"] = "cores"  # 'threads'
os.environ["OMP_MAX_ACTIVE_LEVELS"] = "5"  # default is 4
os.environ["MKL_DYNAMIC"] = "TRUE"
os.environ["KMP_AFFINITY"] = "verbose,granularity=core,scatter,1,0"
os.environ["KMP_BLOCKTIME"] = "1"  # also try 0, 30, etc
os.environ["KMP_TEAMS_THREAD_LIMIT"] = "8"

# More thread parameters
os.environ["MKL_NUM_THREADS"] = "8"
os.environ["MKL_DOMAIN_NUM_THREADS"] = "MKL_BLAS=1"
os.environ["OPENBLAS_NUM_THREADS"] = "8"
os.environ["VECLIB_MAXIMUM_THREADS"] = "8"
os.environ["NUMEXPR_MAX_THREADS"] = "8"

# On/off options
os.environ["TF_DISABLE_MKL"] = "1"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"  # (this may overlap the above)
os.environ["USE_DAAL4PY_SKLEARN"] = "1"

# Verbose params:
os.environ["KMP_SETTINGS"] = "1"  # (same as =TRUE)
os.environ["DNNL_VERBOSE"] = "1"
os.environ["IDP_SKLEARN_VERBOSE"] = "INFO"

# Probably not useful but maybe for GPU:
os.environ["DNNL_CPU_RUNTIME"] = "DPCPP"  # (or OMP)
os.environ["LIBOMPTARGET_DEVICETYPE"] = "GPU"  # (or CPU)

Appendix: Best AutoTS Benchmark Comparisons for 0.3.13a9

Here, a smaller number is better as these are runtimes in seconds for the same operations. NP/SK/JL is numpy, scikit-learn, and joblib representing the “total_runtime” from the test and is the most representative of day-to-day data science performance. Results may not be comparable to future versions of this benchmark. Let me know if you manage to beat these performances, and how!

The AMD Ryzen 5800X takes the win in my opinion, although I also recommend Intel to most people. While slightly losing out to the Intel 12900K, the 5800X is in Eco-Mode, using much less power. However, as they are using different operating systems, it is not an entirely fair comparison, either. The Apple M1 Silicon worked well with Mambaforge, but that did not include Prophet, MXNet or Tensorflow. Tensorflow was installed by emulation, but as you can see, performance was poor and attempts to use the Apple ‘Metal’ acceleration failed. The Nvidia Graphics (also tested elsewhere with a GTX 1650 Super and 3060 Ti) were, as discussed above, also slower for these workloads.

OS	CPU	NP/SK/JL	TensorflowRNN	TensorflowCNN	GluonTS	Prophet
Ubuntu 21.10	AMD Ryzen 5950X	15.6	34.7	25.2	18.0	2.7
Ubuntu 21.10	AMD Ryzen 5800X	16.3	30.9	23.3	26.3	2.6
Windows 11	Intel 12900K	16.0	34.2	17.0	18.0	3.5
Windows 11	Intel 11800H	20.7	37.2	18.1	20.5	3.3
Windows 11	Intel 11800H + RTX 3060	21.3	124.9	75.2	21.7	3.5
MacOS 12	Apple M1	20.9	888.9	537.0
Ubuntu 21.10	Intel 1135g7	28.1	41.0	28.0	49.2	6.3
Pi OS Bullseye	Raspberry Pi 4 ARM	183.7
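To put a number on the GPU slowdown, compare the two Intel 11800H rows of the table, with and without the RTX 3060. The figures below are copied straight from those rows; the variable names are just for illustration.

```python
# Runtimes in seconds, copied from the table above (Windows 11, Intel 11800H rows).
cpu_only = {"TensorflowRNN": 37.2, "TensorflowCNN": 18.1}
with_gpu = {"TensorflowRNN": 124.9, "TensorflowCNN": 75.2}

# Ratio above 1 means the GPU run took longer than the CPU-only run.
slowdown = {name: round(with_gpu[name] / cpu_only[name], 1) for name in cpu_only}
print(slowdown)
```

That works out to roughly 3.4x slower for the RNN test and 4.2x slower for the CNN test when the GPU is used.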