Correctly measure OpenCL kernel latency

I want to measure the execution time of a loop that enqueues OpenCL kernels and then copies a sub-buffer to another location on the device. I use the following structure to profile the kernel:

for (int i = 0; i < N; ++i) {

   cl_event test;

   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);

   clWaitForEvents(1 , &test);
   clFinish(queue);
   clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
   clGetEventProfilingInfo(test, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
   time_passed_kernel += (time_end-time_start)/1e6;
   clReleaseEvent(test);

   cl_event buffer_copy_event;
   clEnqueueCopyBuffer(queue, device_buffer, device_data, 0,
                               data_offset * i * sizeof(cl_float), device_buffer_size * sizeof(cl_float),
                              0, NULL, &buffer_copy_event);
   clWaitForEvents(1, &buffer_copy_event);
   clReleaseEvent(buffer_copy_event);

}

When I measure the runtime of the whole loop with:

auto start = std::chrono::steady_clock::now();

for (int i = 0; i < N; ++i) {

   cl_event test;

   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &test);

   clWaitForEvents(1 , &test);
   clReleaseEvent(test);

   cl_event buffer_copy_event;
   clEnqueueCopyBuffer(queue, device_buffer, device_data, 0,
                               data_offset * i * sizeof(cl_float), device_buffer_size * sizeof(cl_float),
                              0, NULL, &buffer_copy_event);
   clWaitForEvents(1, &buffer_copy_event);
   clReleaseEvent(buffer_copy_event); 
}

auto end = std::chrono::steady_clock::now();
time_passed_kernel = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() / 1e+6;

I receive runtimes which are factor 5 higher than compared to the event measurement before. When I add up the runtimes of the kernel execution and the buffer-copy, I receive around 70ms in runtime, but as soon as I time the whole loop, I measure around 350ms on average. Since clWaitForEvents() is a blocking function, I don’t understand the offset in runtime.

  • I’d guess it’s simply the overhead of creating objects and invoking the kernel. Please show a minimal reproducible example and include the actual times you’re measuring

    – 

  • @AlanBirtles Thanks for your answer, that was also my first thought, but a slowdown of factor 5 just by the invocation overhead seems pretty wild. I just added the other content that’s in the loop.

    – 

  • Please show a minimal reproducible example

    – 

Leave a Comment