Characterizing 'vectorized' and 'guvectorized' for different amounts of data and compiler targets
On Numba's JIT Vectorization Capabilities
The Python package Numba is a JIT compiler that translates a subset of Python and NumPy code into machine code. Among other features, it can generate NumPy ufuncs via numba.vectorize and generalized ufuncs via numba.guvectorize. Both types of functions can be compiled for different targets, i.e. for the CPU, both single- and multi-threaded (parallel), as well as for CUDA on Nvidia GPUs. I am analyzing the performance of a simple compiled demo workload across different sizes of input data and different compiler targets. TLDR:
numba.vectorize and numba.guvectorize show near-identical scaling behavior. An array of more than 10^3 elements is required to saturate 24 CPU cores. CUDA shows its strengths north of 10^4 elements.
Poliastro already relies heavily on Numba, so a deeper analysis of Numba's features, capabilities and performance was required before making any specific design decisions around how to do array computations.
numba.vectorize and numba.guvectorize proved to be interesting early on. They not only offer broadcasting semantics, but also allow specifying compiler targets:
cpu (single threaded),
parallel (multiple threads on CPU) and
cuda (for Nvidia GPUs). This is where I got interested in the scaling behavior of code compiled for different targets via both decorators.
The following tests were performed on an AMD EPYC 7443P in performance mode, running basically at full boost clock speed, and an Nvidia RTX A5000. On the software side, CPython 3.10.5, NumPy 1.23.1, Numba 0.56, llvmlite 0.39.0 and CUDA 11.3 were used on top of Ubuntu 20.04 LTS.
The following imports are relevant for running the benchmark.
Note that the constant COMPLEXITY makes the artificial workload run longer. Its value of 2^11 is roughly equivalent to the complexity encountered in a variety of algorithms found in Poliastro.
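A minimal sketch of this setup follows, assuming the imports needed for the code sketches further below; the exact imports and names in the original code may differ:

```python
# Assumed setup: imports used throughout the benchmark plus the workload
# constant described above.
import gc
import math
import timeit

import numpy as np
from numba import guvectorize, vectorize

COMPLEXITY = 2 ** 11  # number of iterations of the artificial inner loop
```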
The following piece of code serves to verify the results of the compiled code later on. It is a mix of pure Python and numpy, intentionally using an iterative approach. The "dummy" function performs the actual work on a single number at a time. The "base" function serves as a dispatcher for an array.
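A minimal sketch of such a reference implementation, assuming a simple trigonometric expression as the inner workload (the actual arithmetic in the original code is not reproduced here; only the iterative structure matters):

```python
def dummy(x):
    # Artificial workload on a single number: iterate a cheap trigonometric
    # expression COMPLEXITY times.
    res = x
    for _ in range(COMPLEXITY):
        res = math.sin(math.cos(res)) ** 2
    return res


def base(data):
    # Plain-Python dispatcher: applies dummy element-wise to a 1D array.
    res = np.empty_like(data)
    for idx in range(data.shape[0]):
        res[idx] = dummy(data[idx])
    return res
```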
Test implementations compiled with Numba
The following code snippets use numba.vectorize and numba.guvectorize, each for the targets cpu, parallel and cuda. They are expected to yield results identical to those of the base implementation above.
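Sketched with the names from above, the six compiled variants could look roughly like this; the signatures, the (n)->(n) layout and the variable names are assumptions rather than the post's exact code:

```python
# numba.vectorize: compile the scalar workload once per target.
vectorized_cpu = vectorize(["float64(float64)"], target="cpu")(dummy)
vectorized_parallel = vectorize(["float64(float64)"], target="parallel")(dummy)
vectorized_cuda = vectorize(["float64(float64)"], target="cuda")(dummy)


def _dummy_slice(data, res):
    # Same workload applied to a whole 1D slice; res is the output array
    # provided by the gufunc machinery.
    for idx in range(data.shape[0]):
        tmp = data[idx]
        for _ in range(COMPLEXITY):
            tmp = math.sin(math.cos(tmp)) ** 2
        res[idx] = tmp


# numba.guvectorize: compile the slice-wise workload once per target.
guvectorized_cpu = guvectorize(
    ["void(float64[:], float64[:])"], "(n)->(n)", target="cpu"
)(_dummy_slice)
guvectorized_parallel = guvectorize(
    ["void(float64[:], float64[:])"], "(n)->(n)", target="parallel"
)(_dummy_slice)
guvectorized_cuda = guvectorize(
    ["void(float64[:], float64[:])"], "(n)->(n)", target="cuda"
)(_dummy_slice)
```

Note that with the (n)->(n) layout a single 1D input is handled as one slice; the original code may use a different layout or input shape so that the gufunc broadcasts, and thereby parallelizes, over an outer dimension.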
Verification of results of all functions against base implementation
Just to make sure, the results of all functions are verified against the base implementation.
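Using the names from the sketches above, the check might look like this:

```python
# Verify all compiled variants against the pure-Python reference.
data = np.random.uniform(-10.0, 10.0, 1000)
expected = base(data)

for func in (
    vectorized_cpu,
    vectorized_parallel,
    vectorized_cuda,
    guvectorized_cpu,
    guvectorized_parallel,
    guvectorized_cuda,
):
    assert np.allclose(expected, func(data))
```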
The actual benchmark looks as follows. It steps through arrays of various sizes and repeats the measurement for each array size a certain number of times. Notice that the garbage collector is deactivated for this benchmark so it cannot interfere.
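A simplified sketch of such a driver follows; the array sizes, the number of repetitions and the timing strategy are placeholders, not necessarily those used for the measurements discussed below:

```python
def benchmark(funcs, sizes=tuple(2 ** exp for exp in range(1, 21)), repeats=10):
    # Time every function for every array size; the garbage collector is
    # switched off so it cannot interfere with the measurements.
    results = {name: {} for name in funcs.keys()}
    gc.disable()
    try:
        for size in sizes:
            data = np.random.uniform(-10.0, 10.0, size)
            for name, func in funcs.items():
                func(data)  # warm-up call so compilation time is not measured
                results[name][size] = min(
                    timeit.repeat(lambda: func(data), number=1, repeat=repeats)
                )
    finally:
        gc.enable()
    return results


durations = benchmark({
    "vectorized_cpu": vectorized_cpu,
    "vectorized_parallel": vectorized_parallel,
    "vectorized_cuda": vectorized_cuda,
    "guvectorized_cpu": guvectorized_cpu,
    "guvectorized_parallel": guvectorized_parallel,
    "guvectorized_cuda": guvectorized_cuda,
})
```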
Results and Analysis
For input arrays longer than 10^3 elements, all CPU cores can basically be saturated with target parallel, and performance scales accordingly. For input arrays longer than 10^4 elements, cuda is a little faster still at the upper end than all 24 CPU cores combined. While cuda suffers heavily at the low end when used with small arrays, which is to be expected, code compiled for target parallel does not suffer as much. For arrays with fewer than 8 elements, it is usually only slower by a factor of about 2 compared to a single-threaded solution compiled for target cpu.
On the CPU side, an almost complete and stable saturation of the available 24 cores can be observed. CUDA, in contrast, appears to suffer from the relatively simple workload combined with constant transfers of data across the PCIe bus. It still manages to become faster than the CPU, though.