When developing data science code, I spend a lot of time waiting for my current sample of code to finish running. In particular, my package AutoTS tries hundreds or thousands of models, and even with small data that can take a while to run. Another big factor for me is the environmental cost of development: if I leave my workstation running all night at full power, it is churning through kilowatt-hours of energy. That much energy, used routinely, starts to become significant. Finally, as a matter of cost, when spending hundreds or thousands of dollars on a new computer or server instance, I want to know what features of the CPU are actually worth paying for.
But in order to save you the trouble of reading this long article, here are a few things I found:
- AVX-512 Instruction sets (“Deep Learning Boost”) allowed an ultrabook CPU to outperform a powerful desktop AVX-2 CPU
- Intel MKL does offer a significant performance boost over OpenBLAS, so an Intel chip will likely outperform a more powerful AMD chip (in data science)
- Having a large number of CPU cores is very helpful for some models, but is not as helpful as I was expecting for most models
- CPU clock speed/frequency is quite significant and benefits all types of models – thus the Intel Xeon was slower despite its high core count.
My recommendation, if you are buying a new CPU, is to keep an eye on the instruction set that exact model supports, and/or look for so-called “Deep Learning Boost”, as it really does make a difference. As of writing, it is available in most newer Xeons and 11th Gen Core i5/i7/i9. When configuring a cloud VM, look at the clock speeds offered by different instances, as for most workloads a higher clock speed will be more noticeable than adding more cores. In general, paying for more than 16 cores in a VM is not going to be worth the cost and energy consumption unless you already know that you have a highly parallelized big data workload.
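To see which of these instruction sets a Linux machine reports, one option is to read the `flags` line of `/proc/cpuinfo`. The following is a minimal sketch, not part of the original benchmark: the function names are hypothetical, and it assumes the usual Linux flag spellings (`sse4_2`, `avx2`, `avx512f`, and `avx512_vnni` for the VNNI instructions marketed as Deep Learning Boost). On Windows or macOS a different mechanism would be needed.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The flags line is a whitespace-separated list of feature names.
            return set(value.split())
    return set()


def summarize_instruction_sets(flags):
    """Map human-readable instruction set names to True/False."""
    return {
        "SSE4.2": "sse4_2" in flags,
        "AVX2": "avx2" in flags,
        "AVX-512": "avx512f" in flags,
        "AVX-512 VNNI": "avx512_vnni" in flags,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only present on Linux
        flags = parse_cpu_flags(cpuinfo.read_text())
        for name, supported in summarize_instruction_sets(flags).items():
            print(f"{name}: {'yes' if supported else 'no'}")
```

This only reports what the CPU advertises; whether a given library actually uses those instructions depends on how that library was built.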
I should note that I had to do a little work to get the full performance out of the newest CPUs in Anaconda. You can read about that here.
From a power-conscious viewpoint, the server CPUs are terribly inefficient. The environmentally minded may prefer to train models on a laptop when they can. With high-speed, high-core-count, AVX512-supporting laptop CPUs available, that is not as much of a limitation as it once was.
Intel vs AMD and ARM
If you listen to much of the news about CPUs, Intel is mere moments from death, hounded by AMD from one side and ARM from the other. I am personally excited to see this increased competition, and if I were a gamer would probably have already run off and bought myself a Ryzen Zen 3 CPU from AMD. It is hard to deny that AMD is offering more, fast cores for a cheaper price…
Yet I am not a gamer but a data scientist (most of the time). One of the ways Intel has responded to the increased competition, it seems, is by pushing their HPC (i.e. supercomputer math) expertise down into lower-level consumer chips. That doesn’t help the gamer much, but it does help the data scientist. Here is why Intel still holds this one small market segment that we data scientists live in:
- Intel’s MKL (Math Kernel Library) is well known as the fastest and most popular toolkit that empowers computations in MATLAB, R, Python, and so on, and it doesn’t really work as well on ARM or AMD.
- Intel is releasing AVX-512 instructions in many of their newer consumer CPUs, bringing supercomputer instructions to the masses, something AMD has no stated intention to do. Deep Learning Boost, to my knowledge, is a subset of AVX-512 instructions that is particularly useful for data science.
- Intel is also releasing their own GPUs, and with the OpenVINO toolkit seems to be building an ecosystem that allows automatic use of both CPU and GPU without the somewhat troublesome mess that is switching between CPU and Nvidia’s CUDA right now. I am really excited to see this develop more.
- Intel’s CPUs are still powerful and fast, if no longer in the completely dominating way they once were.
Benchmarking on Small Data
The initial benchmark looked at small data: 1028 rows with 9 series. It used a fixed selection of 618 models in AutoTS 0.2.7 alpha release. In this experiment, the average total runtime was just under 30 minutes for 618 models. The fastest CPUs were over 40% faster than this, finishing in just over 15 minutes. Since this data has 9 series, it is an awkward number for an 8 core CPU, and likely doesn’t showcase the full parallel advantage over 4 cores for some models.
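The point about 9 series being an awkward fit for 8 cores can be made concrete with a small idealized calculation. This is only a sketch under stated assumptions: one process per series, every series taking the same time, and no overhead. `parallel_batches` is a hypothetical helper, not something from AutoTS.

```python
import math


def parallel_batches(n_series, n_cores):
    """Rounds of work needed if each core handles one series at a time."""
    return math.ceil(n_series / n_cores)


batches_4_cores = parallel_batches(9, 4)  # 9 series on 4 cores
batches_8_cores = parallel_batches(9, 8)  # 9 series on 8 cores

# Idealized speedup from doubling the core count under these assumptions.
ideal_speedup = batches_4_cores / batches_8_cores
print(batches_4_cores, batches_8_cores, ideal_speedup)
```

Under these assumptions, going from 4 to 8 cores cuts the work from three rounds to two, a 1.5x gain rather than 2x, because the ninth series forces a second round on the 8 core machine.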
In general the benchmark timings below are shown relative to the slowest-running CPU, so 1.00 marks the slowest machine for that model. Environment installation on most computers was done within 24 hours with the same Anaconda + pip install instructions and accordingly should have nearly identical package versions. Laptops tend to be fickle between runs as they go on and off of turboboost, although with good cooling/airflow they can usually maintain their turboboost for an extended period. Controlling all variables is difficult. There is the ‘silicon lottery’, which refers to the fact that by chance some CPUs are slower or faster than others of the same model.
What is really interesting about this is that I had not yet fixed the LINPACK issue with the 1165G7 and 10700, which means their calculations should have been much slower. The likely explanation for their high performance regardless is that these CPUs have the fastest memory, largest cache sizes, and generally fastest IO, and that the calculations themselves were not the bottleneck of these operations.
The fanless Pentium J5005 stands out for its energy efficiency. This is the main reason I have chosen it as my personal server. However, it cannot support recent versions of Tensorflow or MXNet, I believe because it lacks AVX2 instructions.
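The relative timings reported in this section are each model's runtime divided by the slowest CPU's runtime for that model. A minimal sketch of that scaling follows; the function name, CPU labels, and runtimes are made up for illustration and are not measurements from this benchmark.

```python
def scale_by_slowest(runtimes):
    """Divide each runtime by the slowest one, so the slowest scores 1.0."""
    slowest = max(runtimes.values())
    return {cpu: round(seconds / slowest, 2) for cpu, seconds in runtimes.items()}


# Hypothetical per-CPU runtimes in seconds for a single model.
example_runtimes = {"cpu_a": 10.0, "cpu_b": 4.0, "cpu_c": 5.0}
print(scale_by_slowest(example_runtimes))
```

Smaller values are better: a score of 0.4 means that CPU took 40% of the time the slowest CPU needed.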
| CPU | Pentium J5005 | i5-8265U | i7-1165G7 | i7-7700HQ | i7-10700 | Xeon (Probably 8160) | |||
| Class | Desktop | Laptop | Ultrabook | Gaming Laptop | Desktop | Server | |||
| Cores | 4C/4T | 4C/8T | 4C/8T | 4C/8T | 8C/16T | 24C/96T | |||
| Base Frequency GHz | 1.5 | 1.6 | 2.8 | 2.8 | 2.9 | 2 | |||
| Max Boost GHz | 2.8 | 3.9 | 4.7 | 3.8 | 4.8 | 3.5 | |||
| Lithography (nm) | 14 | 14 | 10 | 14 | 14 | 14 | |||
| Watts, Approx | 5 | 20 | 20 | 40 | 65 | 150 | |||
| Instruction Set | SSE4.2 | AVX2 | AVX512 | AVX2 | AVX2 | AVX512 | |||
| Cache (MB) | 4 | 6 | 12 | 6 | 16 | 33 | |||
| Other | Fanless | CUDA-enabled GPU | GPU, not used | Cloud VM | |||||
| OS | Ubuntu 20.04 | Windows 10 Pro | Windows 10 | Windows 10 | Windows 10 Pro | Ubuntu 18.04 | |||
| RAM (GB) | 12 | 16 | 16 | 32 | 32 | 120 | |||
| Model Failure | 29.3% | 27.6% | 22.2% | 29.3% | 22.3% | 28.0% | |||
| GluonTS Failure | 100% | 100% | 38% | 100% | 38% | 84% | |||
| Model (relative runtime: smaller is better, includes only models which succeeded on all machines, scaled by slowest) | Pentium J5005 | i5-8265U | i7-1165G7 | i7-7700HQ | i7-10700 | Xeon | Slowest (s) | Model Count | Parallelized |
| AverageValueNaive | 1.00 | 0.35 | 0.36 | 0.43 | 0.42 | 0.42 | 0.112 | 45 | 0 | 
| DatepartRegression | 1.00 | 0.34 | 0.38 | 0.45 | 0.40 | 0.42 | 0.359 | 40 | some | 
| ETS | 1.00 | 0.59 | 0.36 | 0.51 | 0.43 | 0.36 | 0.795 | 19 | all | 
| FBProphet | 1.00 | 0.78 | 0.41 | 0.68 | 0.27 | 0.38 | 16.059 | 35 | all | 
| GLM | 0.81 | 1.00 | 0.49 | 0.58 | 0.59 | 0.53 | 0.858 | 48 | all | 
| GLS | 1.00 | 0.33 | 0.36 | 0.42 | 0.44 | 0.40 | 0.297 | 50 | 0 | 
| LastValueNaive | 1.00 | 0.38 | 0.34 | 0.48 | 0.41 | 0.49 | 0.152 | 51 | 0 | 
| RollingRegression | 1.00 | 0.58 | 0.35 | 0.62 | 0.41 | 0.58 | 34.137 | 34 | some | 
| SeasonalNaive | 1.00 | 0.33 | 0.33 | 0.39 | 0.42 | 0.39 | 0.336 | 78 | 0 | 
| UnobservedComponents | 1.00 | 0.48 | 0.36 | 0.50 | 0.33 | 0.51 | 6.277 | 54 | 0 | 
| VAR | 1.00 | 0.70 | 0.59 | 0.95 | 0.62 | 0.86 | 0.321 | 38 | 0 | 
| VECM | 1.00 | 0.37 | 0.47 | 0.41 | 0.50 | 0.69 | 0.172 | 65 | 0 | 
| WindowRegression | 1.00 | 0.56 | 0.31 | 0.55 | 0.32 | 0.47 | 25.006 | 21 | some | 
| ZeroesNaive | 1.00 | 0.41 | 0.34 | 0.57 | 0.43 | 0.50 | 0.144 | 40 | 0 | 
| Average | 0.99 | 0.52 | 0.39 | 0.54 | 0.43 | 0.50 | |||
| Variability | 0.05 | 0.19 | 0.07 | 0.14 | 0.09 | 0.13 | |||
| Total Runtime (seconds) | 2732.6 | 1666.9 | 977.5 | 1632.8 | 979.9 | 1376.3 | 618 | ||
| Watt Hours Used | 3.795 | 9.261 | 5.431 | 18.142 | 17.692 | 57.347 | 
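The "Watt Hours Used" row appears to follow from the total runtime multiplied by the approximate wattage listed for each CPU. Here is a minimal sketch of that arithmetic, assuming a constant power draw at the listed wattage for the whole run (real draw will vary); `watt_hours` is a hypothetical helper name.

```python
def watt_hours(runtime_seconds, approx_watts):
    """Energy in watt-hours: power in watts times runtime in hours."""
    return approx_watts * runtime_seconds / 3600.0


# (total runtime in seconds, approximate watts), both taken from the table above.
benchmarks = {
    "Pentium J5005": (2732.6, 5),
    "i5-8265U": (1666.9, 20),
    "i7-1165G7": (977.5, 20),
    "i7-7700HQ": (1632.8, 40),
    "i7-10700": (979.9, 65),
    "Xeon": (1376.3, 150),
}

for cpu, (seconds, watts) in benchmarks.items():
    print(f"{cpu}: {watt_hours(seconds, watts):.2f} Wh")
```

The results agree with the table to within rounding of the reported runtimes. They also show why the slowest machine is the cheapest to run: the J5005 takes nearly twice as long as the Xeon but draws roughly one thirtieth of the power.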
One issue I have not really looked into is Ubuntu (Linux) vs Windows. It is my belief that Ubuntu should normally be a bit faster as software tends to be easier to optimize and parallelize on Linux.
Failure rates seem to be correlated with processor age. The baseline failures occur because of model parameters not suiting the data. For example, numerous ETS models failed here because a variation of them can only take positive data, and negative data is present here.
MXNet models are not shown above because too many of them failed. Some MXNet+GluonTS models only work on the Xeon and 7700HQ, and more only work on the 1165G7 and 10700, with no overlap between the models that can run on the two groups. My best guess is a new Xeon chip, circa 2020, running Linux would be able to run all of the models that could run here. With GluonTS Version 0.4.0 or so, I was able to run it on more computers, but the current iteration seems very picky. Indeed, it notably has started to fail on my CUDA-enabled GPU as well. Unsurprisingly, MXNet generally went fastest on faster CPUs with more cores.
The results between the only two directly comparable GluonTS results are as follows:
1165G7    124.04 seconds
10700        93.69 seconds
and
Xeon         25.53 seconds
7700HQ    32.27 seconds
Benchmarking on Larger Data
I ran a number of experiments using environment variables to set MKL_NUM_THREADS and OMP_NUM_THREADS. Interestingly, setting these values usually made the runtime just a bit slower. Whatever the defaults are, they work best.
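For anyone wanting to repeat that experiment, the thread-count variables can be set from Python as well as from the shell. This is a minimal sketch, not the exact setup used here: `set_thread_limits` is a hypothetical helper, and it assumes the standard variable names MKL_NUM_THREADS (Intel MKL) and OMP_NUM_THREADS (OpenMP). These variables are generally read when the numerical library is first loaded, so they need to be set before importing NumPy or anything built on it.

```python
import os


def set_thread_limits(n_threads):
    """Set the MKL and OpenMP thread-count environment variables."""
    value = str(n_threads)
    for name in ("MKL_NUM_THREADS", "OMP_NUM_THREADS"):
        os.environ[name] = value


# Assumption: this runs at the very top of the script, before numpy,
# scipy, pandas, or similar libraries are imported.
set_thread_limits(4)
print({name: os.environ[name] for name in ("MKL_NUM_THREADS", "OMP_NUM_THREADS")})
```

Leaving both variables unset lets the libraries choose their own defaults, which is what worked best in the runs described above.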
This next dataset is much larger: 100 series of 1941 records.
| Model | 8265U | 7700HQ | Xeon | 10700_noblas | 10700_intelmkl | 1165G7_intelmkl | 1165G7_openblas | 1165G7_noblas | 
| AverageValueNaive | 0.570597 | 0.675388 | 0.623755 | 1 | 0.459512 | 0.411459 | 0.361436 | 0.809619 | 
| DatepartRegression | 0.432211 | 0.576583 | 0.674464 | 1 | 0.37934 | 0.34398 | 0.326993 | 0.736249 | 
| GLM | 1 | 0.68471 | 0.69342 | 0.819854 | 0.402433 | 0.412218 | 0.418916 | 0.674156 | 
| GLS | 0.410792 | 0.448268 | 0.492879 | 1 | 0.287141 | 0.248585 | 0.255251 | 0.859893 | 
| LastValueNaive | 0.655863 | 0.715732 | 1 | 0.94812 | 0.473435 | 0.442501 | 0.470626 | 0.715015 | 
| SeasonalNaive | 0.463668 | 0.533757 | 0.574586 | 1 | 0.346478 | 0.305724 | 0.292223 | 0.757318 | 
| VAR | 0.436312 | 0.489989 | 0.25075 | 1 | 0.091756 | 0.101739 | 0.303972 | 0.916992 | 
| VECM | 0.099544 | 0.119688 | 0.117934 | 1 | 0.065491 | 0.066359 | 0.099372 | 0.395243 | 
| ZeroesNaive | 0.493798 | 0.620933 | 0.725027 | 1 | 0.365719 | 0.319094 | 0.438986 | 0.859505 | 
| Average | 0.506976 | 0.540561 | 0.572535 | 0.974219 | 0.319034 | 0.294629 | 0.329753 | 0.74711 | 
| Variability | 0.225382 | 0.171806 | 0.247243 | 0.056923 | 0.139066 | 0.126462 | 0.106164 | 0.144724 | 
| Total Runtime (s) | 434.1653 | 453.0578 | 350.2862 | 1061.042 | 172.8733 | 171.7618 | 284.8567 | 787.6348 | 
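The total runtimes in the last row give a rough sense of how much the math library matters on the same CPU. A short sketch of the ratios follows, using only the totals from the table above; `speedup` is a hypothetical helper, and the column labels are copied as they appear in the table.

```python
def speedup(baseline_seconds, faster_seconds):
    """How many times faster the second configuration ran than the first."""
    return baseline_seconds / faster_seconds


# Total runtime in seconds, copied from the table above.
total_runtime = {
    "10700_noblas": 1061.042,
    "10700_intelmkl": 172.8733,
    "1165G7_intelmkl": 171.7618,
    "1165G7_openblas": 284.8567,
    "1165G7_noblas": 787.6348,
}

mkl_vs_openblas_1165g7 = speedup(
    total_runtime["1165G7_openblas"], total_runtime["1165G7_intelmkl"]
)
mkl_vs_noblas_10700 = speedup(
    total_runtime["10700_noblas"], total_runtime["10700_intelmkl"]
)
print(f"1165G7, MKL vs OpenBLAS: {mkl_vs_openblas_1165g7:.2f}x")
print(f"10700, MKL vs noblas: {mkl_vs_noblas_10700:.2f}x")
```

These are ratios of single total runtimes, so they inherit the run-to-run noise discussed earlier and should be read as rough indications only.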
The Intel Xeon wins hands down on the FBProphet data, not shown. That is the most highly parallelized model, with one process per series. Yet the Xeon is much slower on most others. Since it has AVX512, the logical explanation is its slow (2.8 GHz max) clock speed. The 1165G7 vs 10700 comparison is perhaps the most interesting. The 10700 has a higher clock speed and twice as many cores. The 1165G7 has only one clear advantage: AVX512 with Deep Learning Boost. Another interesting comparison is the 7700HQ vs the Xeon. Here, sometimes the chip with the higher clock speed is faster, and sometimes the one with more cores and AVX512 is faster. Overall, it is impossible to say for certain which single feature is most important, but my guess is: clock speed, instruction set, and more cores, in that order.
