How can I do math on my GPU with NumPy in Python?

Wassup! I’m building a rendering engine in Python, and it’s been a slow but fun process. I’ve been using NumPy, and in this previous question, I got some help to speed up the NumPy math to solve for vertex positions in perspective, but it’s still running on the CPU. As I integrate more features like texture mapping, I know there will be exponentially more math to be solved, so I need to figure out how to do it in parallel on my GPU.

I have no experience with this. I’ve tried using GPU functions from Numba (like @cuda.jit and @vectorize) and a bit with CuPy, but they either don’t work at all for my code, or run slower than their CPU equivalent functions. For a beginner, the documentation all feels pretty cryptic, so I’m struggling.

Which approach might be best for my situation, and what am I missing? I want to do a lot of math on the GPU, and I’m flexible about how to accomplish that. Are there specific features that make a math problem a good or bad fit for the GPU?

My GPU is an NVIDIA GeForce RTX 3050 Ti Laptop GPU. I’m running Python 3.11 with the most recent versions of NumPy, Numba, and CuPy already installed, but I’m open to new options. I also installed the CUDA Toolkit through Anaconda, though I don’t know how to use it.

While this is mostly a general question about how to use my GPU, here’s the code I’m currently trying to run on it. I would appreciate answers about using the GPU that aren’t limited to just the code below:

import time
import numpy as np
from numba import njit  # Unused imports: vectorize, cuda

@njit
def render_all_verts_NJIT(vert_array, unit_vec, shift, focus):
    data = (vert_array - shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T

Here’s an equivalent function without @njit, in case that would cause issues for the GPU (@njit takes 1.2 extra seconds to compile upon its first call, but after that, this plain version is about 60% slower on my CPU):

def render_all_verts(vert_array, unit_vec, shift, focus):
    data = (vert_array - shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T

This is just the code I’ve been using to test and time the functions:

# TESTING AND TIMING THE FUNCTIONS ---------------------------------------
# The next few lines make an array, similar to the 3D points I'd be rendering.
# It contains n vertices with random float coordinate values from -m to m
n = 1000
m = 50
original_vertices = (np.random.sample(size=(n, 3)).astype(np.float32)-0.5)*2 * m
print('Original vertices:\n', original_vertices, '\n')
# The next few variables are camera attributes that will be used to render vertices
camera_vector = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
camera_shift = np.array([0, 0, 10], dtype=np.float32)
camera_focus = np.single(5)
# This empty array is the same shape as original_vertices. The rendered output will be saved here.
rendered_vertices = np.empty(original_vertices.shape)

#This repeatedly renders the given points using the given function and times it.
def render_example(function, example_array, loop_times):
    start_time = time.time()
    for i in range(loop_times):
        output = function(example_array, camera_vector, camera_shift, camera_focus)
    print(f'Time for function {str(function)} rendering test array of shape {np.shape(example_array)} {loop_times} times...')
    print(f'--- {time.time() - start_time} seconds ---')
    return output

render_times = 10000
rendered_vertices = render_example(render_all_verts_NJIT, original_vertices, render_times)
rendered_vertices = render_example(render_all_verts, original_vertices, render_times)

print('\nLast calculated render of vertices:\n', rendered_vertices)

  • Have you seen that post?

  • The standard way to port NumPy code to an NVIDIA GPU is to use CuPy. That being said, CuPy is often sub-optimal because it runs many kernels, and they tend to be rather memory-bound. This problem comes from NumPy in the first place (which is also sub-optimal on the CPU, partly for the same reason). When performance needs to be improved, a CUDA kernel needs to be written. CuPy supports high-level kernels like element-wise and reduction kernels, as well as low-level raw kernels (written in C/CUDA). You cannot use NumPy operations inside those kernels (because they are in C/CUDA). A CuPy sketch follows these comments.

  • With Numba, the kernel is implemented in Python, but there are several restrictions coming from the underlying CUDA back-end. AFAIK, you cannot dynamically allocate a runtime-defined amount of memory in CUDA, for example, and the amount of memory that can be allocated is also limited. Besides, NumPy has not been ported, because porting it to the GPU would be a huge amount of work and the model just does not fit well (it would not be efficient anyway). Thus, you cannot call most NumPy functions in Numba CUDA kernels (only trivial ones). A kernel sketch is also shown below.

  • So, to put it shortly: yes, you need to do that yourself.

  • Note that np.dot calls BLAS libraries like OpenBLAS, which are generally parallel. CUDA has an alternative for that: cuBLAS. I strongly encourage you to read the CUDA programming guide and CUDA tutorials before trying to write a GPU implementation of this code.

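To make the CuPy route from the comments concrete, here is a minimal sketch (untested, and reusing original_vertices, camera_vector, camera_shift and camera_focus from the test code above) of the same function on CuPy arrays. The render_all_verts_cupy name is just illustrative; the matrix multiply is backed by cuBLAS.

import cupy as cp

def render_all_verts_cupy(vert_array, unit_vec, shift, focus):
    # Same math as the NumPy version, but every array lives in GPU memory.
    data = (vert_array - shift).T
    data = cp.matmul(unit_vec, data)   # dispatches to cuBLAS
    data[:2] *= focus / cp.abs(data[2:3])
    return data.T

# Copy the inputs to the GPU once and reuse them; host<->device transfers
# usually dominate for small vertex counts like n = 1000.
verts_gpu = cp.asarray(original_vertices)
unit_vec_gpu = cp.asarray(camera_vector)
shift_gpu = cp.asarray(camera_shift)

rendered_gpu = render_all_verts_cupy(verts_gpu, unit_vec_gpu, shift_gpu, camera_focus)
cp.cuda.Stream.null.synchronize()        # GPU calls are asynchronous; wait before timing
rendered = cp.asnumpy(rendered_gpu)      # one explicit device-to-host copy at the end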

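Similarly, here is a rough sketch of what a hand-written Numba CUDA kernel for this transform could look like, staying inside the restrictions mentioned in the comments: one thread per vertex, only scalar arithmetic in the kernel (no NumPy calls), and a preallocated output buffer. The kernel name and launch configuration are illustrative, not tuned.

from numba import cuda

@cuda.jit
def render_kernel(verts, unit_vec, shift, focus, out):
    # One thread handles one vertex.
    i = cuda.grid(1)
    if i < verts.shape[0]:
        x = verts[i, 0] - shift[0]
        y = verts[i, 1] - shift[1]
        z = verts[i, 2] - shift[2]
        rx = unit_vec[0, 0] * x + unit_vec[0, 1] * y + unit_vec[0, 2] * z
        ry = unit_vec[1, 0] * x + unit_vec[1, 1] * y + unit_vec[1, 2] * z
        rz = unit_vec[2, 0] * x + unit_vec[2, 1] * y + unit_vec[2, 2] * z
        scale = focus / abs(rz)
        out[i, 0] = rx * scale
        out[i, 1] = ry * scale
        out[i, 2] = rz

# Move the data to the device once and reuse the buffers every frame.
d_verts = cuda.to_device(original_vertices)
d_unit_vec = cuda.to_device(camera_vector)
d_shift = cuda.to_device(camera_shift)
d_out = cuda.device_array_like(original_vertices)

threads_per_block = 128
blocks = (original_vertices.shape[0] + threads_per_block - 1) // threads_per_block
render_kernel[blocks, threads_per_block](d_verts, d_unit_vec, d_shift, camera_focus, d_out)
rendered = d_out.copy_to_host()

For only 1000 vertices per call, launch overhead and transfers will likely still make both of these slower than the @njit CPU version; this kind of kernel tends to pay off once far more work is batched per launch.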


I found that the break-even point was ~20,000 vertices using this code; the CPU was faster for 1,000 vertices.

You can use PyTorch with device="cuda" like this. It might be worth checking the performance of different dtypes, depending on the accuracy you need (torch.float16 or torch.bfloat16 instead of torch.float32).


import torch

camera_vector = torch.tensor([[1, 0, 0],
                              [0, 1, 0],
                              [0, 0, 1]], dtype=torch.float32, device="cuda")
camera_shift = torch.tensor([0, 0, 10], dtype=torch.float32, device="cuda")
camera_focus = 5
torch_vertices = torch.tensor(original_vertices, device="cuda")

def render_all_verts_torch(vert_tensor, camera_vector, camera_shift, camera_focus):
    # Subtract the shift (as in the NumPy version) into a fresh tensor,
    # so the input is not modified in place across calls.
    data = vert_tensor.t().to(torch.float32) - camera_shift.view(3, 1)
    data = torch.mm(camera_vector, data)
    data[:2] *= camera_focus / torch.abs(data[2:3])
    return data.t()

You can also get rid of the transposes on the GPU:

def render_all_verts_torch(vert_tensor, camera_vector, camera_shift, camera_focus):
    # Row-vector form: (v - shift) @ U.T is the transpose of U @ (v - shift).T
    data = vert_tensor.to(torch.float32) - camera_shift
    data = torch.mm(data, camera_vector.t())
    data[:, :2] *= camera_focus / torch.abs(data[:, 2:3])
    return data

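For timing on the GPU, a rough sketch (the time_torch_render helper is just illustrative, reusing the tensors defined above): warm up once, call torch.cuda.synchronize() before reading the clock because GPU launches return before the work finishes, and copy back to NumPy only once at the end.

import time
import torch

def time_torch_render(fn, verts, loops=10000):
    out = fn(verts, camera_vector, camera_shift, camera_focus)   # warm-up call
    torch.cuda.synchronize()     # wait for queued GPU work before starting the clock
    start = time.time()
    for _ in range(loops):
        out = fn(verts, camera_vector, camera_shift, camera_focus)
    torch.cuda.synchronize()     # without this, the loop only measures launch time
    print(f'--- {time.time() - start} seconds for {loops} GPU renders ---')
    return out.cpu().numpy()     # single device-to-host copy at the end

rendered = time_torch_render(render_all_verts_torch, torch_vertices)

To actually try torch.float16 or torch.bfloat16, the .to(torch.float32) cast inside render_all_verts_torch would have to change as well, not just the input tensor.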