AVX2 vs AVX512 + that iGPU for Data Science speed

If you have no idea what AVX2 and AVX-512 are: they are instruction set extensions, a combination of CPU design and low-level language features that help execute, in this case, math. AVX-512 is the newer extension, operating on 512-bit registers (twice the width of AVX2's 256-bit registers) and promising faster performance. Since data science can sometimes be rather slow, faster math sounds very promising.
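To get a feel for why these instructions matter here, consider a minimal sketch (not part of the benchmark itself): a pure-Python dot product versus NumPy's np.dot, which dispatches to a BLAS backend that uses AVX2/AVX-512 kernels where the CPU supports them.

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Pure-Python loop: one multiply-add per iteration, no SIMD.
t0 = time.perf_counter()
loop_dot = sum(x * y for x, y in zip(a, b))
t_loop = time.perf_counter() - t0

# np.dot hands the work to the BLAS backend (OpenBLAS, MKL, etc.),
# which uses vectorized AVX kernels when available.
t0 = time.perf_counter()
blas_dot = np.dot(a, b)
t_blas = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  blas: {t_blas:.4f}s")
```

The exact speedup depends on the CPU and backend, but the gap is typically orders of magnitude; this is the layer where AVX2 vs AVX-512 differences live.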

Overall, the AVX-512 instructions were 17% faster in this particular test, with individual models seeing a 5-20% boost. Some of this was due to slightly improved single-core clock speeds (say 2.5% faster), and perhaps other changes in the underlying architecture and execution units. What made this surprisingly difficult is that the fastest of all was a laptop CPU (the i5-1135G7), which was offloading instructions to its Xe iGPU – and there is a suggestion that the desktop CPU was trying to do the same, but with a paltry internal GPU, it was actually slowing itself down as a result.

Normally we – that is, anyone outside the lab of a major CPU designer – wouldn't really have a chance to compare these in isolation, but enter 11th gen Intel desktop CPUs. On paper these have much the same specs as their 10th gen predecessors, but they feature an entirely new architecture that adds AVX-512 instructions. I had an i7-10700 CPU, and with a simple CPU swap, leaving all other components the same, I could test the new i7-11700 and see what performance differences appeared.

I develop AutoTS, and it happens to provide me a simple way of benchmarking a bunch of different models. I used 200 time series pulled from the M5 competition dataset – 200 series being large enough to run many models quickly while still giving some indication of how performance looks at scale. I ultimately ran 470 models, of which 312 ran successfully in all environments.

I set up two virtual environments on the test machines: 1) OpenBLAS (achieved by a basic pip install) and 2) an Intel-optimized environment installed from the Intel Anaconda channel (instructions in the extended_tutorial). These determine the backends of the computations, primarily the behind-the-scenes work of NumPy and scikit-learn. MXNet and GluonTS are an exception here: they used Intel MKL in both environments. OpenBLAS is the only option that works properly on AMD, so it was the sole environment on my sole AMD machine. These backends matter quite a lot, as we shall see.
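If you want to confirm which backend a given environment actually ended up with, one hedged sketch (my own snippet, not something AutoTS provides) is to scan NumPy's build configuration for a known BLAS name:

```python
import contextlib
import io

import numpy as np


def blas_backend() -> str:
    """Best-effort guess of NumPy's BLAS backend from its build config."""
    buf = io.StringIO()
    # np.show_config() prints the build configuration to stdout.
    with contextlib.redirect_stdout(buf):
        np.show_config()
    text = buf.getvalue().lower()
    for name in ("mkl", "openblas", "blis", "accelerate"):
        if name in text:
            return name
    return "unknown"


print(blas_backend())
```

On a plain pip install this typically reports openblas; in the Intel channel environment it should report mkl.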

In total, I had the same desktop, used twice: once with a 10700 CPU and the second time with an 11700 CPU. I also had an AMD 4500U-powered mini PC and an Intel 1135G7 mini PC – both mobile/laptop chips which aren't expected to compete with the more powerful desktop 11700. Because the mini PCs have only mediocre cooling, I also ran the 1135G7 outside at 45°F (~7°C) for an indication of how it might do with better cooling; it's suffixed _cold in the results. For the record, the 11700/10700 desktop has an Nvidia GPU, but I had it completely disabled: no CUDA, and its utilization hovered near 1%, confirming it wasn't in play.

See code here:
https://github.com/winedarksea/autots_benchmark

Abbreviated Results Table:

| Model | 11700intel | 11700openblas | amd4500Uopenblas | 10700intel | 10700openblas | 1135g7openblas | 1135g7intel | FASTEST |
|---|---|---|---|---|---|---|---|---|
| Number of Cores | 8 | 8 | 6 | 8 | 8 | 4 | 4 | |
| Boost Clock (GHz) | 4.9 | 4.9 | 4.0 | 4.8 | 4.8 | 4.2 | 4.2 | |
| Geekbench 5 Single | 1732 | 1732 | 1183 | 1325 | 1325 | 1501 | 1501 | |
| Geekbench 5 Multi | 9851 | 9851 | 5294 | 9367 | 9367 | 4554 | 4554 | |
| Total Runtime (s) | 4386.04 | 4151.23 | 5211.82 | 5067.05 | 5003.67 | 4980.50 | 3779.92 | 1135g7intel |
| GLM | 35.93 | 50.57 | 60.08 | 40.80 | 56.62 | 61.66 | 40.36 | 11700intel |
| KNN | 359.99 | 98.67 | 50.96 | 394.99 | 127.80 | 53.67 | 48.56 | 1135g7intel |
| VAR | 83.37 | 79.05 | 37.35 | 86.41 | 81.27 | 39.67 | 22.11 | 1135g7intel |
| AverageValueNaive | 38.13 | 113.65 | 88.21 | 38.73 | 119.94 | 118.38 | 36.44 | 1135g7intel |
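For what it's worth, the FASTEST column is just the per-row minimum across environments. A quick pandas sketch of that bookkeeping, using a three-environment subset of the runtimes above:

```python
import pandas as pd

# Per-model runtimes in seconds by environment (subset of the table above).
runtimes = pd.DataFrame(
    {
        "11700intel": [35.93, 359.99, 83.37],
        "11700openblas": [50.57, 98.67, 79.05],
        "1135g7intel": [40.36, 48.56, 22.11],
    },
    index=["GLM", "KNN", "VAR"],
)

# idxmin(axis=1) returns the column name of the minimum in each row.
runtimes["FASTEST"] = runtimes.idxmin(axis=1)
print(runtimes["FASTEST"])
```

This reproduces the winners shown above: 11700intel for GLM, 1135g7intel for KNN and VAR.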

Really, it is just confusing.

Sometimes one thing is faster, sometimes another... it is genuinely hard to tell. It even varies a bit run-to-run on the same machine (especially for the fastest models). The results for GLM most closely mirror my initial hypothesis. But on the 11700, the Intel Conda channel is sometimes much slower than OpenBLAS – for example on KNN, exactly where the 1135G7 is doing very well. This suggests a bug where the Intel channel offloads to the detected GPU even when that GPU is the 11700's iGPU, which is far too tiny and actually slows things down. This rather ruins my AVX2 vs AVX-512 comparison, as it's also an iGPU comparison...

One thing I will say: the 1135G7 computer draws a lot less power than the 11700 for its results. I wish I had measured it, but I will guesstimate a difference of at least 100 watts (one has a 90 W power supply, the other a 550 W power supply...). I also noticed differences in CPU utilization and clock speed between the 11700 and 10700: the 11700 often sustained a lower clock speed of 4.45 GHz, which I believe was occurring under AVX-512 instructions, while also topping out at 5.0 GHz.
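To put that guesstimate in perspective, here is a back-of-envelope energy calculation. The runtimes come from the results table, but the sustained wattages are purely assumed for illustration, not measured:

```python
# Total runtimes (s) from the results table.
runtime_11700_s = 4386.04    # 11700, Intel channel
runtime_1135g7_s = 3779.92   # 1135g7, Intel channel

# ASSUMED sustained package power draws (not measured) reflecting the
# guesstimated ~100 W gap between the desktop and the mini PC.
watts_11700 = 130
watts_1135g7 = 28

# Energy in watt-hours = watts * seconds / 3600.
wh_11700 = watts_11700 * runtime_11700_s / 3600
wh_1135g7 = watts_1135g7 * runtime_1135g7_s / 3600
print(f"desktop: {wh_11700:.0f} Wh, mini PC: {wh_1135g7:.0f} Wh")
```

Under those assumptions the desktop burns roughly five times the energy for a slower overall result, which is the point: efficiency, not just speed.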

Complete Results Table:

| Model | 11700intel | 11700openblas | amd4500Uopenblas | 10700intel | 10700openblas | 1135g7openblas | 1135g7intel_cold | 1135g7intel | FASTEST | Model Count |
|---|---|---|---|---|---|---|---|---|---|---|
| AverageValueNaive | 38.13 | 113.65 | 88.21 | 38.73 | 119.94 | 118.38 | 35.57 | 36.44 | 1135g7intel_cold | 17 |
| DatepartRegression | 25.02 | 56.36 | 54.40 | 26.51 | 59.90 | 56.22 | 17.83 | 18.09 | 1135g7intel_cold | 12 |
| Ensemble | 360.10 | 346.94 | 661.28 | 441.63 | 421.08 | 478.05 | 538.87 | 563.92 | 11700openblas | 3 |
| GLM | 35.93 | 50.57 | 60.08 | 40.80 | 56.62 | 61.66 | 38.44 | 40.36 | 11700intel | 18 |
| GLS | 7.17 | 2.04 | 3.00 | 9.49 | 2.62 | 3.22 | 2.65 | 2.61 | 11700openblas | 12 |
| GluonTS | 912.40 | 904.62 | 1385.88 | 1096.02 | 1088.86 | 1050.25 | 1182.54 | 1203.21 | 11700openblas | 21 |
| LastValueNaive | 29.65 | 83.82 | 66.12 | 28.91 | 89.04 | 75.90 | 25.76 | 26.21 | 1135g7intel_cold | 16 |
| RollingRegression | 1452.90 | 688.60 | 1018.60 | 1617.19 | 874.20 | 1107.96 | 542.98 | 540.51 | 1135g7intel | 6 |
| SeasonalNaive | 157.38 | 197.47 | 169.16 | 179.47 | 225.42 | 166.25 | 141.76 | 143.46 | 1135g7intel_cold | 26 |
| VAR | 83.37 | 79.05 | 37.35 | 86.41 | 81.27 | 39.67 | 18.91 | 22.11 | 1135g7intel_cold | 12 |
| VECM | 15.24 | 13.40 | 8.90 | 16.36 | 8.61 | 15.51 | 6.82 | 7.06 | 1135g7intel_cold | 21 |
| WindowRegression | 1237.76 | 1527.80 | 1573.76 | 1454.73 | 1883.57 | 1720.25 | 1123.28 | 1147.04 | 1135g7intel_cold | 6 |
| ZeroesNaive | 30.97 | 86.91 | 85.06 | 30.79 | 92.54 | 87.17 | 27.97 | 28.90 | 1135g7intel_cold | 18 |

Breakdown of Datepart/Rolling Regression Models (average runtime, s):

| Model | 11700intel | 11700openblas | amd4500Uopenblas | 10700intel | 10700openblas | 1135g7openblas | 1135g7intel_cold | 1135g7intel | FASTEST |
|---|---|---|---|---|---|---|---|---|---|
| Adaboost | 1.04 | 0.28 | 0.38 | 1.36 | 0.36 | 0.45 | 0.80 | 0.74 | 11700openblas |
| BayesianRidge | 1093.58 | 590.06 | 967.83 | 1223.06 | 746.58 | 1054.49 | 493.89 | 492.18 | 1135g7intel |
| DecisionTree | 2.82 | 0.39 | 0.45 | 3.73 | 0.51 | 0.42 | 0.49 | 0.51 | 11700openblas |
| ElasticNet | 0.61 | 0.07 | 0.08 | 0.81 | 0.09 | 0.07 | 0.10 | 0.09 | 11700openblas |
| KNN | 359.99 | 98.67 | 50.96 | 394.99 | 127.80 | 53.67 | 49.35 | 48.56 | 1135g7intel |
| MLP | 19.25 | 55.25 | 53.09 | 18.93 | 58.46 | 54.94 | 15.95 | 16.32 | 1135g7intel_cold |
| SVM | 0.63 | 0.24 | 0.20 | 0.82 | 0.31 | 0.15 | 0.23 | 0.20 | 1135g7openblas |
| Total Runtime (s) | 4386.04 | 4151.23 | 5211.82 | 5067.05 | 5003.67 | 4980.50 | 3703.40 | 3779.92 | 1135g7intel_cold |

If there are any takeaways, they are these:

  1. Make sure you configure your environment properly: use the Intel Conda channel on Intel CPUs, OpenBLAS on AMD, and, if in doubt, OpenBLAS (a plain pip install) anywhere.
  2. I am excited for Intel discrete GPUs, as they promise to automatically accelerate NumPy/scikit-learn without the vendor-locked-in code you need for Nvidia cuDF and related libraries.
  3. AVX-512 does help, if not as much as the GPU does.
