How can I do math on my GPU with NumPy in Python?

Wassup! I’m building a rendering engine in Python, and it’s been a slow but fun process. I’ve been using NumPy, and in this previous question, I got some help to speed up the NumPy math to solve for vertex positions in perspective, but it’s still running on the CPU. As I integrate more features like texture mapping, I know there will be exponentially more math to be solved, so I need to figure out how to do it in parallel on my GPU.

I have no experience with this. I’ve tried using GPU functions from Numba (like @cuda.jit and @vectorize) and a bit with CuPy, but they either don’t work at all for my code, or run slower than their CPU equivalent functions. For a beginner, the documentation all feels pretty cryptic, so I’m struggling.

Which approach might be best for my situation, and what am I missing? I want to do a lot of math on the GPU, and I’m flexible about how to accomplish that. Are there specific features that make a math problem a good or bad fit for the GPU?

My GPU is an NVIDIA GeForce RTX 3050 Ti Laptop GPU. I’m running Python 3.11 with the most recent versions of NumPy, Numba, and CuPy already installed, but I’m open to new options. I also installed the CUDA Toolkit through Anaconda, though I don’t know how to use it.

While this is mostly a general question about how to use my GPU, here’s the code I’m currently trying to run on it. I would appreciate answers about using the GPU that aren’t limited to just the code below:

import time
import numpy as np
from numba import njit  # Unused imports: vectorize, cuda

@njit
def render_all_verts_NJIT(vert_array, unit_vec, shift, focus):
    data = (vert_array - shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T

Here’s an equivalent function without @njit, in case that would cause issues for the GPU (@njit takes 1.2 extra seconds to compile upon its first call, but after that, this plain version is about 60% slower on my CPU):

def render_all_verts(vert_array, unit_vec, shift, focus):
    data = (vert_array - shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T

This is just the code I’ve been using to test and time the functions:

# TESTING AND TIMING THE FUNCTIONS ---------------------------------------
# The next few lines make an array, similar to the 3D points I'd be rendering.
# It contains n vertices with random float coordinate values from -m to m
n = 1000
m = 50
original_vertices = (np.random.sample(size=(n, 3)).astype(np.float32)-0.5)*2 * m
print('Original vertices:\n', original_vertices, '\n')
# The next few variables are camera attributes that will be used to render vertices
camera_vector = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
camera_shift = np.array([0, 0, 10], dtype=np.float32)
camera_focus = np.single(5)
# This empty array is the same shape as original_vertices. The rendered output will be saved here.
rendered_vertices = np.empty(original_vertices.shape)

#This repeatedly renders the given points using the given function and times it.
def render_example(function, example_array, loop_times):
    start_time = time.time()
    for i in range(loop_times):
        output = function(example_array, camera_vector, camera_shift, camera_focus)
    print(f'Time for function {str(function)} rendering test array of shape {np.shape(example_array)} {loop_times} times...')
    print(f'--- {time.time() - start_time} seconds ---')
    return output

render_times = 10000
rendered_vertices = render_example(render_all_verts_NJIT, original_vertices, render_times)
rendered_vertices = render_example(render_all_verts, original_vertices, render_times)

print('\nLast calculated render of vertices:\n', rendered_vertices)

  • Have you seen that post?

  • The standard way to port NumPy code to an NVIDIA GPU is to use CuPy. That being said, CuPy is often sub-optimal because it runs many kernels, and they tend to be rather memory-bound. This problem comes from NumPy in the first place (which is also sub-optimal on the CPU, partly for the same reason). When performance needs to be improved, a CUDA kernel needs to be written. CuPy supports high-level kernels like element-wise and reduction kernels, as well as low-level raw kernels (written in C/CUDA). You cannot use NumPy operations inside those kernels (because they are in C/CUDA). A CuPy sketch follows these comments.

  • With Numba, the kernel is implemented in Python, but there are several restrictions coming from the underlying CUDA back-end. AFAIK, you cannot dynamically allocate a runtime-defined amount of memory in CUDA, for example, and the amount of memory that can be allocated is also limited. Besides, NumPy has not been ported, because porting it to the GPU would be a huge amount of work and the model just does not fit well (it would not be efficient anyway). Thus, you cannot call most NumPy functions in Numba CUDA kernels (only trivial ones). A kernel sketch is also shown below.

  • So, to put it shortly: yes, you need to do that yourself.

  • Note that np.dot calls BLAS libraries like OpenBLAS, which are generally parallel. CUDA has an alternative for that: cuBLAS. I strongly encourage you to read the CUDA programming guide and CUDA tutorials before trying to write a GPU implementation of this code.

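To make the CuPy route from the comments concrete, here is a minimal sketch (untested, and reusing original_vertices, camera_vector, camera_shift and camera_focus from the test code above) of the same function on CuPy arrays. The render_all_verts_cupy name is just illustrative; the matrix multiply is backed by cuBLAS.

import cupy as cp

def render_all_verts_cupy(vert_array, unit_vec, shift, focus):
    # Same math as the NumPy version, but every array lives in GPU memory.
    data = (vert_array - shift).T
    data = cp.matmul(unit_vec, data)   # dispatches to cuBLAS
    data[:2] *= focus / cp.abs(data[2:3])
    return data.T

# Copy the inputs to the GPU once and reuse them; host<->device transfers
# usually dominate for small vertex counts like n = 1000.
verts_gpu = cp.asarray(original_vertices)
unit_vec_gpu = cp.asarray(camera_vector)
shift_gpu = cp.asarray(camera_shift)

rendered_gpu = render_all_verts_cupy(verts_gpu, unit_vec_gpu, shift_gpu, camera_focus)
cp.cuda.Stream.null.synchronize()        # GPU calls are asynchronous; wait before timing
rendered = cp.asnumpy(rendered_gpu)      # one explicit device-to-host copy at the end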

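Similarly, here is a rough sketch of what a hand-written Numba CUDA kernel for this transform could look like, staying inside the restrictions mentioned in the comments: one thread per vertex, only scalar arithmetic in the kernel (no NumPy calls), and a preallocated output buffer. The kernel name and launch configuration are illustrative, not tuned.

from numba import cuda

@cuda.jit
def render_kernel(verts, unit_vec, shift, focus, out):
    # One thread handles one vertex.
    i = cuda.grid(1)
    if i < verts.shape[0]:
        x = verts[i, 0] - shift[0]
        y = verts[i, 1] - shift[1]
        z = verts[i, 2] - shift[2]
        rx = unit_vec[0, 0] * x + unit_vec[0, 1] * y + unit_vec[0, 2] * z
        ry = unit_vec[1, 0] * x + unit_vec[1, 1] * y + unit_vec[1, 2] * z
        rz = unit_vec[2, 0] * x + unit_vec[2, 1] * y + unit_vec[2, 2] * z
        scale = focus / abs(rz)
        out[i, 0] = rx * scale
        out[i, 1] = ry * scale
        out[i, 2] = rz

# Move the data to the device once and reuse the buffers every frame.
d_verts = cuda.to_device(original_vertices)
d_unit_vec = cuda.to_device(camera_vector)
d_shift = cuda.to_device(camera_shift)
d_out = cuda.device_array_like(original_vertices)

threads_per_block = 128
blocks = (original_vertices.shape[0] + threads_per_block - 1) // threads_per_block
render_kernel[blocks, threads_per_block](d_verts, d_unit_vec, d_shift, camera_focus, d_out)
rendered = d_out.copy_to_host()

For only 1000 vertices per call, launch overhead and transfers will likely still make both of these slower than the @njit CPU version; this kind of kernel tends to pay off once far more work is batched per launch.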


I found that the break-even point was ~20,000 vertices using this code; the CPU was faster for 1,000 vertices.

You can use PyTorch with device="cuda" like this. It might be worth checking the performance of different dtypes, depending on the accuracy you need (torch.float16 or torch.bfloat16 instead of torch.float32).


import torch

camera_vector = torch.tensor([[1, 0, 0],
                              [0, 1, 0],
                              [0, 0, 1]], dtype=torch.float32, device="cuda")
camera_shift = torch.tensor([0, 0, 10], dtype=torch.float32, device="cuda")
camera_focus = 5
torch_vertices = torch.tensor(original_vertices, device="cuda")

def render_all_verts_torch(vert_tensor, camera_vector, camera_shift, camera_focus):
    # Subtract the shift (as in the NumPy version) into a fresh tensor,
    # so the input is not modified in place across calls.
    data = vert_tensor.t().to(torch.float32) - camera_shift.view(3, 1)
    data = torch.mm(camera_vector, data)
    data[:2] *= camera_focus / torch.abs(data[2:3])
    return data.t()

You can also get rid of the transposes on the GPU:

def render_all_verts_torch(vert_tensor, camera_vector, camera_shift, camera_focus):
    # Row-vector form: (v - shift) @ U.T is the transpose of U @ (v - shift).T
    data = vert_tensor.to(torch.float32) - camera_shift
    data = torch.mm(data, camera_vector.t())
    data[:, :2] *= camera_focus / torch.abs(data[:, 2:3])
    return data

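For timing on the GPU, a rough sketch (the time_torch_render helper is just illustrative, reusing the tensors defined above): warm up once, call torch.cuda.synchronize() before reading the clock because GPU launches return before the work finishes, and copy back to NumPy only once at the end.

import time
import torch

def time_torch_render(fn, verts, loops=10000):
    out = fn(verts, camera_vector, camera_shift, camera_focus)   # warm-up call
    torch.cuda.synchronize()     # wait for queued GPU work before starting the clock
    start = time.time()
    for _ in range(loops):
        out = fn(verts, camera_vector, camera_shift, camera_focus)
    torch.cuda.synchronize()     # without this, the loop only measures launch time
    print(f'--- {time.time() - start} seconds for {loops} GPU renders ---')
    return out.cpu().numpy()     # single device-to-host copy at the end

rendered = time_torch_render(render_all_verts_torch, torch_vertices)

To actually try torch.float16 or torch.bfloat16, the .to(torch.float32) cast inside render_all_verts_torch would have to change as well, not just the input tensor.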