I want to measure the execution time of a loop that enqueues OpenCL kernels and then copies a sub-buffer to another location on the device. I use the following structure to profile the kernel:
for (int i = 0; i < N; ++i) {
    cl_event test;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);
    clWaitForEvents(1, &test);
    clFinish(queue);
    clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    time_passed_kernel += (time_end - time_start) / 1e6;
    clReleaseEvent(test);

    cl_event buffer_copy_event;
    clEnqueueCopyBuffer(queue, device_buffer, device_data, 0,
                        data_offset * i * sizeof(cl_float), device_buffer_size * sizeof(cl_float),
                        0, NULL, &buffer_copy_event);
    clWaitForEvents(1, &buffer_copy_event);
    clReleaseEvent(buffer_copy_event);
}
When I measure the runtime of the whole loop with:
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < N; ++i) {
    cl_event test;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);
    clWaitForEvents(1, &test);
    clReleaseEvent(test);

    cl_event buffer_copy_event;
    clEnqueueCopyBuffer(queue, device_buffer, device_data, 0,
                        data_offset * i * sizeof(cl_float), device_buffer_size * sizeof(cl_float),
                        0, NULL, &buffer_copy_event);
    clWaitForEvents(1, &buffer_copy_event);
    clReleaseEvent(buffer_copy_event);
}
auto end = std::chrono::steady_clock::now();
time_passed_kernel = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / 1e6;
I get runtimes that are about a factor of 5 higher than the event-based measurement above. Adding up the runtimes of the kernel execution and the buffer copy gives around 70 ms, but timing the whole loop yields around 350 ms on average. Since clWaitForEvents() is a blocking function, I don't understand where the extra time comes from.
I'd guess it's simply the overhead of creating objects and invoking the kernel. Please show a minimal reproducible example and include the actual times you're measuring.
@AlanBirtles Thanks for your answer, that was also my first thought, but a slowdown of a factor of 5 just from invocation overhead seems pretty wild. I've added the other content that's in the loop.