Using Python, I am sending 50 images to a cuda routine that calculates the location of 50 beads in each image. Below is a drawing explaining (I just have drawn 4 beads):
The cuda routine is expecting a flattened array and so I have to extract from my image region of interests (say 64 x 64) that I flatten and then concatenate.
Here is what I have done :
import numpy as np
import time
from numpy.lib.stride_tricks import sliding_window_view
# Create a numpy array of 1024 x 1280 (the image)
image = np.random.randint(0, 255, (1024, 1280), dtype=np.uint8)
# Assuming x_values and y_values are lists of x and y coordinates of those 64 x 64 region of interests.
x_values = np.random.randint(100, 900, 50)
y_values = np.random.randint(100, 900, 50)
# create 2 arrays for the 2 methods. Such array is used by cuda
main_array_1 = np.zeros((50*50*64*64))
main_array_2 = np.zeros((50*50*64*64))
print("#########################")
# First method
print("#########################")
start_time = time.time()
sub_images = np.array([image[x:x+64, y:y+64] for x, y in zip(x_values, y_values)])
flattened_sub_arrays = sub_images.ravel()
main_array_1 [:len(flattened_sub_arrays)] = flattened_sub_arrays
print("--- %s seconds ---" % ((time.time() - start_time)))
print(flattened_sub_arrays.shape)
print("#########################")
# second method (Thanks to hpaulj, see comments).
print("#########################")
# Create an array of indices for x and y
x_indices = np.array([np.arange(x, x+64) for x in x_values])
y_indices = np.array([np.arange(y, y+64) for y in y_values])
start_time = time.time()
# Create a sliding window view of the image
window_view = np.lib.stride_tricks.sliding_window_view(image, (64, 64))
# Extract the sub-images using the x and y values
sub_images = window_view[x_values, y_values]
# Flatten the sub-images
flattened_sub_arrays_2 = sub_images.ravel()
main_array_2 [:len(flattened_sub_arrays_2)] = flattened_sub_arrays_2
print("--- %s seconds ---" % ((time.time() - start_time)))
print(flattened_sub_arrays_2.shape)
print("#########################")
print("#########################")
# compare the two methods
print(np.array_equal(main_array_1, main_array_2))
Is there a way to make this much faster ?
One possible solution how to speed the function is to use numba + parallelize it:
from numba import njit, prange
@njit(parallel=True)
def store_array_numba(image, x_values, y_values, main_array):
for i in prange(len(x_values)):
x = x_values[i]
y = y_values[i]
idx = 64 * 64 * i
for j in range(64):
main_array[idx + 64 * j : idx + 64 * (j + 1)] = image[x + j, y : y + 64]
Benchmark on my machine (AMD 5700x):
from timeit import timeit
import numpy as np
from numba import njit, prange
# Create a numpy array of 1024 x 1280
image = np.random.randint(0, 255, (1024, 1280), dtype=np.uint8)
N = 50
# Assuming x_values and y_values are lists of x and y coordinates (that define the top-left corner of each sub-image)
x_values = np.random.randint(100, 900, N)
y_values = np.random.randint(100, 900, N)
# This is the array that will store the sub-images and which will be used by cuda
main_array_1 = np.zeros((N * 64 * 64))
main_array_2 = np.zeros((N * 64 * 64))
def store_array(image, x_values, y_values, main_array):
sub_images = np.array(
[image[x : x + 64, y : y + 64] for x, y in zip(x_values, y_values)]
)
flattened_sub_arrays = sub_images.ravel()
main_array[: len(flattened_sub_arrays)] = flattened_sub_arrays
return main_array
@njit(parallel=True)
def store_array_numba(image, x_values, y_values, main_array):
for i in prange(len(x_values)):
x = x_values[i]
y = y_values[i]
idx = 64 * 64 * i
for j in range(64):
main_array[idx + 64 * j : idx + 64 * (j + 1)] = image[x + j, y : y + 64]
main_array_1 = store_array(image, x_values, y_values, main_array_1)
store_array_numba(image, x_values, y_values, main_array_2)
assert np.allclose(main_array_1, main_array_2)
t1 = timeit(
"store_array(image, x_values, y_values, main_array_1)",
number=50,
globals=globals(),
)
t2 = timeit(
"store_array_numba(image, x_values, y_values, main_array_2)",
number=50,
globals=globals(),
)
print(f"time normal = {t1}")
print(f"time numba = {t2}")
Prints:
time normal = 0.0057696549920365214
time numba = 0.001736361999064684
Have you used a profiler to figure out which part of this code takes the longest to run, and where there might be room for improvement?
Well. I just calculated the time for those 3 lines of code. I thought there may alternatives.
‘sliding windows view’ might be useful here. Use it make all (64,64) blocks, and then select 50 of them for further processing.
That
sub_images
is most likely the big time consumer. It iterates on 50. It is possible to construct advanced indexing arrays that do thaat indexing once.np.linspace(x_values, x_values+64,64)
would be a start.