Fastest way to find both the minimum and maximum of an BGR array per cell in Numba?

I have an array of 3 dimensional with shape (height, width, 3) representing a BGR image with values being floats between 0 and 1 (inclusive).

I need to find both the minimum and maximum channel of each pixel in one loop. If img is the array, then the effect would be the same as np.dstack([img.max(axis=-1), img.min(axis=-1)]), and I wish to do it in one loop.

I have already tried to implement it myself, and I have beaten NumPy builtin methods by far, but the efficiency is only because Numba. I think perhaps there can be more efficient ways to do this.

My idea is simple, with three numbers a, b, c, there are 6 possible relationships:

a <= b <= c
a <= c <= b
b <= a <= c
b < c < a
c < a < b
c < b < a

With 3 conditional variable swaps I have managed to find minimum and maximum in one function without calling min and max:

def extrema(a, b, c):
    if b > c:
        b, c = c, b

    if a > b:
        a, b = b, a

    if b > c:
        b, c = c, b

    return a, c

My method is faster than min, max calls:

In [9]: %timeit extrema(29, 500, 12)
235 ns ± 14.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [10]: %timeit max((29, 500, 12))
186 ns ± 7.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: %timeit min((29, 500, 12))
186 ns ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

So I have implemented the following functions for benchmarking:

import numba as nb
import numpy as np

def extrema_img(img: np.ndarray) -> np.ndarray:
    ext = img.copy()
    for a, b in ((1, 2), (0, 1), (1, 2)):
        temp = ext[..., a][mask := ext[..., a] > ext[..., b]]
        ext[mask, a] = ext[mask, b]
        ext[mask, b] = temp
    return ext


def extrema_img_1(img: np.ndarray) -> np.ndarray:
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            a, b, c = img[y, x]
            if b > c:
                b, c = c, b

            if a > b:
                a, b = b, a

            if b > c:
                b, c = c, b

            ext[y, x] = (a, c)
    return ext


@nb.njit(cache=True, fastmath=True)
def extrema_img_nb1(img: np.ndarray) -> np.ndarray:
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            a, b, c = img[y, x]
            if b > c:
                b, c = c, b

            if a > b:
                a, b = b, a

            if b > c:
                b, c = c, b

            ext[y, x] = (a, c)
    return ext


@nb.njit(cache=True, fastmath=True)
def extrema_img_nb3(img: np.ndarray) -> np.ndarray:
    (h, w) = img.shape[:2]
    ext = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            ext[y, x] = np.sort(img[y, x])[:3:2]
    return ext


def extrema_img_7(img: np.ndarray) -> np.ndarray:
    return np.sort(img, axis=-1)


def extrema_img_8(img: np.ndarray) -> tuple:
    return img.min(axis=-1), img.max(axis=-1)

With img = np.random.random((1080, 1920, 3)), the benchmark results are below:

In [2]: %timeit extrema_img(img)
338 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit extrema_img_1(img)
3.88 s ± 55.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit extrema_img_nb1(img)
18.6 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit extrema_img_nb1(img)
19.6 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit extrema_img_nb3(img)
1.67 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit extrema_img_7(img)
89.3 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit extrema_img_8(img)
184 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

extrema_img_nb1 is the fastest, but extrema_img_1 is much slower than the boolean masking based solution, I have tried to compile the masking based solution but it doesn’t work, I guess I should post a separate question.

How can it be faster?

  • I’ve removed the python-3.x tag. Please stop using that tag on your posts that aren’t explicitly about python-3.x.

    – 




  • @jared The Walrus operator (:=) is a feature exclusive to Python 3.x.

    – 

  • That doesn’t make this a python-3.x question. Just because a question uses range instead of xrange doesn’t mean it should be tagged python-3.x.

    – 




  • 1

    Besides, please note that I added a comment in the answer of your previous question explaining why Numpy is not efficient with such input layout and why the Numba code can be far faster using a better layout. It is worth reading since SIMD can outperform all micro-optimizations you can think about by a far margin.

    – 

  • 1

    @jared An x2 speed up is actually quite disappointing for a GPU code (ie. parallel) compared to a sequential CPU code that could be parallelized and that does not currently benefit from SIMD instructions. In fact, discrite GPUs cannot be faster than CPU for most basic image operation which are memory-bound. This is because the PCIe link used for transfering data is significantly slower than the host RAM. Not to mention allocating GPU data, starting a kernel and doing synchronizations is quite slow. Such a computation is known to be memory-bound with SIMD instructions on most machines.

    – 

Leave a Comment