
On Numba's JIT Vectorization Capabilities

Characterizing 'vectorize' and 'guvectorize' for different amounts of data and compiler targets

The Python package Numba is a JIT compiler that translates a subset of Python and NumPy code into machine code. Among other features, it can generate NumPy ufuncs via numba.vectorize and generalized ufuncs via numba.guvectorize. Both kinds of functions can be compiled for different targets: the CPU, both single-threaded (cpu) and multi-threaded (parallel), as well as CUDA on Nvidia GPUs (cuda). I analyze the performance of a simple compiled demo workload across different input array sizes and compiler targets. TLDR: numba.vectorize and numba.guvectorize show near-identical scaling behavior. Arrays of more than 10^3 elements are required to saturate 24 CPU cores. CUDA shows its strengths north of 10^4 elements.

  • CUDA
  • HPC
  • JIT
  • Python
  • benchmark
  • generalized universal function
  • numba
  • ufunc
  • vectorization

Background

This analysis is part of a project to introduce array types into the Python package Poliastro. It is funded via a NumFOCUS Small Development Grant.

Poliastro already relies heavily on Numba, so a deeper analysis of Numba's features, capabilities and performance was required before making any specific design decisions on how to do array computations. numba.vectorize and numba.guvectorize proved to be interesting early on. They not only offer broadcasting semantics, they also allow specifying compiler targets: cpu (single-threaded), parallel (multiple threads on the CPU) and cuda (for Nvidia GPUs). This is where I got interested in the scaling behavior of code compiled for different targets via both decorators. A minimal sketch of the broadcasting semantics follows below.
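
To illustrate those broadcasting semantics, here is a minimal sketch that is not part of the benchmark; the function name and values are made up for illustration:

import numpy as np
import numba as nb

@nb.vectorize('f8(f8,f8)')
def scaled_sum(a, b):
    # per-element kernel; Numba wraps it into a NumPy ufunc
    return 2.0 * a + b

scaled_sum(1.0, 2.0)                 # scalar inputs yield a scalar
scaled_sum(np.arange(3.0), 1.0)      # array and scalar broadcast
scaled_sum(np.arange(6.0).reshape(2, 3), np.arange(3.0))  # (2, 3) with (3,)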

The following tests were performed on an AMD Epyc 7443p in performance mode, running at basically full boost clock speed, and an Nvidia RTX A5000. On the software side, CPython 3.10.5, NumPy 1.23.1, Numba 0.56, llvmlite 0.39.0 and CUDA 11.3 were used on top of Ubuntu 20.04 LTS.

Relevant imports

The following imports are relevant for running the benchmark.

import gc
import math
from time import time_ns

import numpy as np
import numba as nb
from numba import cuda  # must be explicitly imported
from tqdm import tqdm

# math.sin/math.cos are used instead of np.sin/np.cos because
# Numba is unhappy with the NumPy variants when used with target=cuda
if False:
    sin = np.sin
    cos = np.cos
else:
    sin = math.sin
    cos = math.cos

COMPLEXITY = 2 ** 11
Source code 1: Relevant imports

Note that the constant COMPLEXITY allows making the artificial workload run longer: 2^11 iterations per element is roughly equivalent to the complexity encountered in a variety of algorithms found in Poliastro.

Base implementation

The following piece of code serves to verify the results of the compiled code later on. It is a mix of pure Python and NumPy, intentionally using an iterative approach. The "dummy" function performs the actual work on a single number at a time, while the "base" function serves as a dispatcher for an array.

def base_dummy(scalar: float) -> float:
    # artificial per-element workload: COMPLEXITY loop iterations
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

def base(d: np.ndarray) -> np.ndarray:
    # dispatcher: apply base_dummy to every element of the input array
    r = np.zeros(d.shape, dtype = d.dtype)
    for idx in range(d.shape[0]):
        r[idx] = base_dummy(d[idx])
    return r
Source code 2: Base implementation

Test implementations compiled with Numba

The following code snippets use numba.vectorize and numba.guvectorize, each for the targets cpu, parallel and cuda. They are expected to yield results identical to those of the base implementation above.

@nb.njit
def vectorized_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.vectorize
def vectorized(d: float) -> float:
    return vectorized_dummy(d)
Source code 3: Using numba.vectorize, target cpu

@nb.njit('f8(f8)')
def guvectorized_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.guvectorize('void(f8[:],f8[:])', '()->()')
def guvectorized(d: np.ndarray, r: np.ndarray):
    r[0] = guvectorized_dummy(d[0])
Source code 4: Using numba.guvectorize, target cpu

@nb.njit('f8(f8)')
def vectorized_parallel_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.vectorize('f8(f8)', target = 'parallel')
def vectorized_parallel(d: float) -> float:
    return vectorized_parallel_dummy(d)
Source code 5: Using numba.vectorize, target parallel

@nb.njit('f8(f8)')
def guvectorized_parallel_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.guvectorize('void(f8[:],f8[:])', '()->()', target = 'parallel')
def guvectorized_parallel(d: np.ndarray, r: np.ndarray):
    r[0] = guvectorized_parallel_dummy(d[0])
Source code 6: Using numba.guvectorize, target parallel

@cuda.jit('f8(f8)', device = True, inline = True)
def vectorized_cuda_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.vectorize('f8(f8)', target = 'cuda')
def vectorized_cuda(d: float) -> float:
    return vectorized_cuda_dummy(d)
Source code 7: Using numba.vectorize, target cuda

@cuda.jit('f8(f8)', device = True, inline = True)
def guvectorized_cuda_dummy(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@nb.guvectorize('void(f8[:],f8[:])', '()->()', target = 'cuda')
def guvectorized_cuda(d: np.ndarray, r: np.ndarray):
    r[0] = guvectorized_cuda_dummy(d[0])
Source code 8: Using numba.guvectorize, target cuda
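
The layout string '()->()' used above declares a scalar core: each input element maps to exactly one output element, which makes the guvectorize variants behave just like the vectorize ones. The decorator becomes more interesting with non-scalar core dimensions. As a minimal sketch, not part of the benchmark and with a made-up function name, a layout of '(n)->()' reduces each length-n row to a single scalar:

@nb.guvectorize('void(f8[:],f8[:])', '(n)->()')
def row_mean(row, out):
    # accumulate over the core dimension of length n
    acc = 0.0
    for idx in range(row.shape[0]):
        acc += row[idx]
    out[0] = acc / row.shape[0]

# applied to an array of shape (4, 3), row_mean returns 4 row means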

Verification of results of all functions against base implementation

Just to make sure, the results of all functions are verified against the base implementation.

funcs = [
    base,
    vectorized,
    vectorized_parallel,
    vectorized_cuda,
    guvectorized,
    guvectorized_parallel,
    guvectorized_cuda,
]

def test_all(size = 100):

    data = (np.random.random(size) * 128).astype('f8')
    res_base = funcs[0](data)

    for func in funcs[1:]:
        res = func(data)
        assert np.allclose(res_base, res)
Source code 9: Verifying results
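
Invoking the check is a single call. Since some of the variants above are compiled lazily on first use, this also serves as a JIT warm-up if run before the benchmark:

test_all()  # raises an AssertionError if any compiled variant diverges from base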

Benchmark

The actual benchmark looks as follows. It steps through arrays of various sizes and repeats the measurement for each array size a certain number of times. Note that the garbage collector is deactivated during the benchmark so it cannot interfere with the measurements.

def _name(func):
    try:
        return func.__name__
    except AttributeError:  # numba#8272, fixed
        try:
            return list(func.functions.values())[0][1].py_func.__name__[13:]
        except AttributeError:  # another numba bug, fixed
            return list(func.kernelmap.values())[0][1].py_func.__name__[9:]

def benchmark(start = 1, stop = 17, reps = 2):
    sizes = (2 ** np.arange(start, stop + 0.5, 0.5)).astype('i8')
    results = {_name(func): [] for func in funcs[1:]}
    gc.disable()
    for size in tqdm(sizes):
        data = np.arange(0, size, 1, dtype = 'f8')
        for func in funcs[1:]:  # exclude pure python
            runtimes = []
            for rep in range(reps):
                gc.collect()
                t0 = time_ns()  # avoid shadowing the "start" parameter
                _ = func(data)
                runtime = time_ns() - t0
                runtimes.append(runtime)
            results[_name(func)].append(min(runtimes) * 1e-9)  # best of reps, in seconds
    gc.enable()
    return sizes, results
Source code 10: Benchmark
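
The benchmark returns the array sizes plus a dictionary of runtimes per function, which can be visualized directly. A minimal non-interactive sketch using matplotlib, which is an assumption here since the original plots below are interactive and may have been built with other tooling:

import matplotlib.pyplot as plt

sizes, results = benchmark()

fig, ax = plt.subplots()
for name, runtimes in results.items():
    ax.plot(sizes, runtimes, label = name)
ax.set_xscale('log', base = 2)  # array sizes grow in powers of two
ax.set_yscale('log')
ax.set_xlabel('array size')
ax.set_ylabel('runtime [s]')
ax.legend()
fig.savefig('runtime_over_size.png')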

Results and Analysis

Interactive plot 1: Runtime over size of array

For input arrays longer than 10^3 elements, all CPU cores can essentially be saturated with target parallel, and performance scales accordingly. For input arrays longer than 10^4 elements, cuda becomes a little faster at the upper end than all 24 CPU cores combined. While cuda suffers heavily on the low end with small arrays, which is to be expected, code compiled for target parallel does not suffer as much: for arrays with fewer than 8 elements, it is usually only about a factor of 2 slower than the single-threaded code compiled for target cpu.

Interactive plot 2: Relative speedup of different configurations over size of array

On the CPU side, an almost complete and stable saturation of the available 24 cores can be observed. CUDA, in contrast, appears to suffer from the relatively simple workload combined with the constant transfer of data across the PCIe bus. It still manages to become faster than the CPU, though.
