How to prevent false sharing when writing to a 1D array with one element per thread in OpenMP?

Question

I have a 1D array whose size is set by the number of OpenMP threads. In my case the code is Fortran, but I believe the problem applies equally to C or C++. Say the number of OpenMP threads is given by the integer numThreads (this might be read as input, making the arrays allocatable), and we have an array called exArray of size numThreads. Each thread calls a subroutine on its own element of exArray. For convenience, call this subroutine exSubroutine(). Now I have the following sort of code:

    integer :: numThreads
    integer :: exArray(numThreads)
    integer :: iThread

    subroutine exSubroutine(a)
        integer, intent(in out) :: a
    end subroutine


    !$OMP parallel private(iThread)
        iThread = omp_get_thread_num() + 1
        call exSubroutine(exArray(iThread))
        ! more code here that uses/writes to exArray(iThread), maybe several more times
    !$OMP end parallel

Now due to ‘false sharing’ (I’m not an expert in computing; this is all based on Google searches), it seems each CPU core loads the whole cache line containing its element, which for an array this small can be most or all of the array. That means the cache line has to hop from one core’s cache to another just to write to a single element. My solution is to create a private variable for each thread, copy in the element at that thread’s index, and copy it back after use. So it may look like this:

    integer :: numThreads
    integer :: exArray(numThreads)
    integer :: iThread, privVar

    subroutine exSubroutine(a)
        integer, intent(in out) :: a
    end subroutine


    !$OMP parallel private(iThread, privVar)
        iThread = omp_get_thread_num() + 1
        privVar = exArray(iThread)
        call exSubroutine(privVar)
        ! more code here that uses/writes to privVar, maybe several more times
        exArray(iThread) = privVar
    !$OMP end parallel

This certainly makes it faster, but it still ultimately requires at least one write to exArray(iThread). Let’s assume I have my reasons for keeping the state in exArray with numThreads elements. I understand there are other tricks such as array padding. However, I’m surprised, and feel (maybe naively, given my lack of understanding) that there must be a way (an OpenMP directive or something) to make the compiler understand that each CPU only writes to a SINGLE element of a shared array and therefore only needs to load that single value into its cache, without resorting to tricks that make the code harder to understand. If something like this exists, I’d greatly appreciate it if someone could tell me what it is, or the best way to handle this!
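
For reference, the padding trick I have in mind would look roughly like the sketch below. It assumes 64-byte cache lines and 4-byte default integers (so 16 integers per line); both are assumptions I have not verified for every machine. The program name paddedExample, the constant pad, and the body of exSubroutine are placeholders made up for illustration.

```fortran
program paddedExample
    !$ use omp_lib
    implicit none
    ! Assumption: 64-byte cache lines and 4-byte default integers,
    ! so 16 integers fill one line. Check both on the target machine.
    integer, parameter :: pad = 16
    integer :: numThreads, iThread
    integer, allocatable :: exArray(:,:)

    ! Lines starting with "!$ " are compiled only when OpenMP is enabled.
    numThreads = 1
    !$ numThreads = omp_get_max_threads()

    ! Only exArray(1, iThread) holds data; rows 2..pad are padding that
    ! keeps neighbouring threads' elements pad*4 = 64 bytes apart.
    allocate(exArray(pad, numThreads))
    exArray = 0

    !$OMP parallel private(iThread)
    iThread = 1
    !$ iThread = omp_get_thread_num() + 1
    call exSubroutine(exArray(1, iThread))
    !$OMP end parallel

    print *, exArray(1, :)

contains

    ! Hypothetical stand-in for the real exSubroutine.
    subroutine exSubroutine(a)
        integer, intent(in out) :: a
        a = a + 1
    end subroutine exSubroutine

end program paddedExample
```

Because Fortran arrays are column-major, each column of this padded array is contiguous, so the elements the threads actually write are a full (assumed) cache line apart and cannot land on the same line. This is exactly the kind of extra bookkeeping I was hoping to avoid.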

I have tried what I described above. For the 2D arrays used in the same program, this is not an issue. For example, if exArray has dimension(1000, numThreads), then I guess that because each thread’s column doesn’t fit into a single cache line anyway, it doesn’t lead to this ‘false sharing’ issue.
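
To make that guess concrete, here is a back-of-the-envelope check of how far apart the data written by neighbouring threads sits in memory in each layout. The 64-byte cache line is an assumption, and the program name strideCheck is made up for illustration.

```fortran
program strideCheck
    implicit none
    ! Assumption: 64-byte cache lines (common, but check the target CPU).
    integer, parameter :: lineBytes = 64
    integer, parameter :: nRows = 1000
    integer :: elemBytes

    ! storage_size returns the size in bits of a default integer.
    elemBytes = storage_size(0) / 8

    print *, 'default integers per cache line:       ', lineBytes / elemBytes
    print *, '1D: bytes between neighbouring threads:', elemBytes
    print *, '2D: bytes between neighbouring columns:', nRows * elemBytes
end program strideCheck
```

If the column stride comes out much larger than a cache line, as I expect it does here, then two threads could only touch the same line where one column ends and the next begins, which would explain why I see no slowdown in the 2D case.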
