Characterizing 'vectorized' and 'guvectorized' for different amounts of data and compiler targets
On Numba's JIT Vectorization Capabilities
The Python package Numba is a JIT compiler that translates a subset of Python and NumPy code into machine code. Among other features, it allows to generate NumPy ufuncs via numba.vectorize and generalized ufuncs via numba.guvectorize. The mentioned types of functions can be compiled for different targets, i.e. for CPU, both single- and multi-threaded (parallel), as well as for CUDA on Nvidia GPUs. I am analyzing the performance of a simple compiled demo workload across different sizes of input data and different compiler targets. TLDR: numba.vectorize and numba.guvectorize show near-identical scaling behavior. An array of size more than 10^3 is required to saturate 24 CPU cores. CUDA shows its strengths north of 10^4.
Background
This analysis is part of a project to introduce array types into the Python package Poliastro. It is funded via a NumFOCUS Small Development Grant.
Poliastro already heavily relies in Numba, so a deeper analysis of Numba's features, capabilities and performance was required before making any specific design decisions around how to do array computations. numba.vectorize and numba.guvectorize proved to be interesting early on. They not only offer broadcasting semantics, but they also allow to specify compiler targets: cpu (single threaded), parallel (multiple threads on CPU) and cuda (for Nvidia GPUs). This is where I got interested in the scaling behavior of code compiled for different targets via both decorators.
The following tests were performed on an AMD Epyc 7443p in performance mode at basically full boost clock speed and an Nvidia RTX A5000. On the software side, CPython 3.10.5, Numpy 1.23.1, Numba 0.56, llvmlite 0.39.0 and CUDA 11.3 were used on top of Ubuntu 20.04 LTS.
Relevant imports
The following imports are relevant for running the benchmark.
import gc
import math
from time import time_ns
import numpy as np
import numba as nb
from numba import cuda # must be explicitly imported
from tqdm import tqdm
if False: # numba unhappy when used with target=cuda
sin = np.sin
cos = np.cos
else:
sin = math.sin
cos = math.cos
COMPLEXITY = 2 ** 11
Note the the constant COMPLEXITY allows to make the artificial workload run longer. 2^11 is roughly equivalent to the complexity encountered in a variety of algorithms found in Poliastro.
Base implementation
The following piece of code serves to verify the results of compiled code later on. It is a mix of pure Python plus numpy, intentionally using an iterative approach. The "dummy" function performs the actual work on one single number at a time. The "base" function serves as a dispatcher for an array.
def base_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
def base(d: np.ndarray) -> np.ndarray:
r = np.zeros(d.shape, dtype = d.dtype)
for idx in range(d.shape[0]):
r[idx] = base_dummy(d[idx])
return r
Test implementations compiled with Numba
The following code snippets use numba.vectorized and numba.guvectorize, each for targets cpu, parallel and cuda. They are expected to yield results identical to those of the base implementation above.
@nb.njit
def vectorized_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.vectorize
def vectorized(d: float) -> float:
return vectorized_dummy(d)
numba.vectorized, target cpu@nb.njit('f8(f8)')
def guvectorized_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.guvectorize('void(f8[:],f8[:])', '()->()')
def guvectorized(d: float, r: float):
r[0] = guvectorized_dummy(d[0])
numba.guvectorized, target cpu@nb.njit('f8(f8)')
def vectorized_parallel_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.vectorize('f8(f8)', target = 'parallel')
def vectorized_parallel(d: float) -> float:
return vectorized_parallel_dummy(d)
numba.vectorized, target parallel@nb.njit('f8(f8)')
def guvectorized_parallel_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.guvectorize('void(f8[:],f8[:])', '()->()', target = 'parallel')
def guvectorized_parallel(d: float, r: float):
r[0] = guvectorized_parallel_dummy(d[0])
numba.guvectorized, target parallel@cuda.jit('f8(f8)', device = True, inline = True)
def vectorized_cuda_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.vectorize('f8(f8)', target = 'cuda')
def vectorized_cuda(d: float) -> float:
return vectorized_cuda_dummy(d)
numba.vectorized, target cuda@cuda.jit('f8(f8)', device = True, inline = True)
def guvectorized_cuda_dummy(scalar: float) -> float:
res: float = 0.0
for idx in range(COMPLEXITY):
if idx % 2 == round(scalar) % 2:
res += sin(idx)
else:
res -= cos(idx)
return res
@nb.guvectorize('void(f8[:],f8[:])', '()->()', target = 'cuda')
def guvectorized_cuda(d: float, r: float):
r[0] = guvectorized_cuda_dummy(d[0])
numba.guvectorized, target cudaVerification of results of all functions against base implementation
Just to make sure, the results of all functions are verified against the base implementation.
funcs = [
base,
vectorized,
vectorized_parallel,
vectorized_cuda,
guvectorized,
guvectorized_parallel,
guvectorized_cuda,
]
def test_all(size = 100):
data = (np.random.random(size) * 128).astype('f8')
res_base = funcs[0](data)
for func in funcs[1:]:
res = func(data)
assert np.allclose(res_base, res)
Benchmark
The actual benchmark looks as follows. It steps through arrays of various sizes and repeats the measurement for each array size a certain number of times. Notice that the garbage collector is deactivated for this benchmark so it can not interfere.
def _name(func):
try:
return func.__name__
except AttributeError: # numba#8272, fixed
try:
return list(func.functions.values())[0][1].py_func.__name__[13:]
except AttributeError: # another numba bug, fixed
return list(func.kernelmap.values())[0][1].py_func.__name__[9:]
def benchmark(start = 1, stop = 17, reps = 2):
sizes = (2 ** np.arange(start, stop + 0.5, 0.5)).astype('i8')
results = {_name(func): [] for func in funcs[1:]}
gc.disable()
for size in tqdm(sizes):
data = np.arange(0, size, 1, dtype = 'f8')
for func in funcs[1:]: # exclude pure python
runtimes = []
for rep in range(reps):
gc.collect()
start = time_ns()
_ = func(data)
runtime = time_ns() - start
runtimes.append(runtime)
results[_name(func)].append(min(runtimes) * 1e-9)
gc.enable()
return sizes, results
Results and Analysis
For input arrays longer than 10^3, all CPU cores can basically be saturated on target parallel. Performance scales accordingly. For input arrays longer than 10^4, cuda is yet a little faster at the upper end than all 24 CPU cores combined. While cuda suffers heavily on the low end when used with small arrays, which is to be expected, code compiled with target parallel does not suffer as much. For arrays with less than 8 elements, it is usually only by a factor of 2 slower than a single-threaded solution compiled for target cpu.
On the CPU side, an almost complete and stable saturation of the available 24 cores can be observed. CUDA in contrast appears to suffer from the relatively simple workload combined with constant transfers of data across the PCIe bus. It still manages to become faster than the CPU though.