How to prevent false sharing when writing to elements of a 1D array indexed by the number of threads in OpenMP?

I have a 1D array whose size is set by the number of OpenMP threads. In my case it is in Fortran, but I believe this problem would also apply to C or C++. Let’s say the number of OpenMP threads is given by the integer numThreads (this might be given as an input, making the other arrays allocatable). We have an array called exArray of size numThreads. Each thread will call a subroutine on its own element of exArray. For convenience, this subroutine is called exSubroutine(). Now I have the following sort of code:

    integer :: numThreads
    integer :: exArray(numThreads)
    integer :: iThread

    subroutine exSubroutine(a)
        integer, intent(in out) :: a
    end subroutine


    !$OMP parallel private(iThread)
        iThread = omp_get_thread_num() + 1
        call exSubroutine(exArray(iThread))
        ! more code here that uses/writes to exArray(iThread) maybe several more times
    !$OMP end parallel

Now due to ‘false sharing’ (I’m not an expert in computation, this is all based on Google searches), it seems each CPU will automatically load the entire array into its cache, which means the array has to hop from one CPU cache to another just to write to a single element. My solution is to create a private variable for each thread, which copies the element value at the thread index and then copies it back after being used. So it may look like this:

    integer :: numThreads
    integer :: exArray(numThreads)
    integer :: iThread, privVar

    subroutine exSubroutine(a)
        integer, intent(in out) :: a
    end subroutine


    !$OMP parallel private(iThread, privVar)
        iThread = omp_get_thread_num() + 1
        privVar = exArray(iThread)
        call exSubroutine(privVar)
        ! more code here that uses/writes to privVar maybe several more times
        exArray(iThread) = privVar
    !$OMP end parallel

This certainly makes it faster, but it still ultimately requires at least one write operation to exArray(iThread). Let’s assume that I have my reasons for saving the state in exArray with numThreads elements. I understand there are other tricks such as array padding and whatnot. However, I’m surprised and feel (maybe naively, given my lack of understanding) that there must be a way (an OpenMP directive or something) to make the compiler understand that each CPU only needs to write to a SINGLE element of a shared array and therefore only needs to load that single value into its cache, without resorting to any tricks that make the code harder to understand. If something like this exists, I’d greatly appreciate it if someone could let me know what it is, or the best way to handle this!

I have tried what I mentioned above in the ‘details’ section. For the 2D arrays that I use in the same program, it is not an issue. An example would be an exArray of dimension(1000, numThreads); I guess that because it doesn’t fit into a single cache line anyway, it doesn’t lead to this ‘false sharing’ issue.
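The 2D observation above is essentially the array padding trick in disguise: Fortran stores arrays column by column, so with dimension(1000, numThreads) each thread's column is far away in memory from its neighbours' columns. A minimal sketch of using that deliberately for the 1D case follows. It assumes 64-byte cache lines and 4-byte default integers (so padLen = 16 is an assumption to adjust for other hardware), and the body of exSubroutine is a hypothetical stand-in, since the question does not show it.

```fortran
program padded_example
    use omp_lib
    implicit none

    ! Assumption: 64-byte cache lines and 4-byte default integers,
    ! so 16 integers span one cache line.
    integer, parameter :: padLen = 16
    integer :: numThreads, iThread
    integer, allocatable :: exArray(:,:)

    numThreads = omp_get_max_threads()

    ! Column-major storage: exArray(1, i) and exArray(1, i+1) are
    ! padLen integers (64 bytes here) apart, so they cannot lie in
    ! the same 64-byte cache line.
    allocate(exArray(padLen, numThreads))
    exArray = 0

    !$OMP parallel private(iThread)
    iThread = omp_get_thread_num() + 1
    ! Only the first element of each column is used; the rest is padding.
    call exSubroutine(exArray(1, iThread))
    !$OMP end parallel

    print *, exArray(1, :)

contains

    ! Hypothetical stand-in for the exSubroutine in the question.
    subroutine exSubroutine(a)
        integer, intent(in out) :: a
        integer :: i
        do i = 1, 1000
            a = a + 1
        end do
    end subroutine exSubroutine

end program padded_example
```

With gfortran this should build with `gfortran -fopenmp`. The cost is memory: each thread's single value now occupies a full cache line, which is usually negligible for an array of size numThreads.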

  • “My solution is to create a private variable for each thread which copies the element value of the thread index, then copies it back after being used.”: this is the right approach, and I don’t really think there are other ways. Compilers do not have any control over the cache strategy AFAIK.


  • Welcome, I suggest taking the tour. You have some stray pieces of text in the code blocks; please check and edit.


  • “only needs to load that single value to its cache” In general, sorry, but no: that’s not how the CPU and memory subsystem work, and you can’t change it in software. You would need to get your (very small) soldering iron out if you want to change it.


  • @IanBush The code I generated is faster because I first copy the elements of the array to an individual private variable. According to what I’ve read, this is faster because each CPU only loads that single value into its cache line rather than sharing the entire array between CPUs. So it seems that, using a trick in software, the actual data passed in hardware to a CPU was changed. If that’s the case, is there no way to have an individual private reference (or pointer) to an element of an array for each thread? Or are arrays of data types always bundled together in any writing operation?

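One way to probe the claim in the comments above is to time both variants side by side. As far as I understand it, the private copy does not change what the hardware transfers (memory still moves in whole cache lines); it moves the repeated writes to thread-local storage, so the shared line is touched only once on the way in and once on the way out. The sketch below is only an illustration under stated assumptions: nIter is an arbitrary count, the update `mod(i, 3)` is a made-up workload, and the `volatile` attribute is added purely to discourage the compiler from keeping the shared element in a register, which would hide the effect.

```fortran
program private_copy_timing
    use omp_lib
    implicit none

    ! Hypothetical iteration count, chosen only to make the loops measurable.
    integer, parameter :: nIter = 10000000
    integer :: iThread, privVar, i, firstResult
    ! volatile is intended to make each update below go through memory.
    integer, allocatable, volatile :: exArray(:)
    double precision :: t0, tShared, tPrivate

    allocate(exArray(omp_get_max_threads()))

    ! Variant 1: each thread repeatedly updates its element of the
    ! shared array, so neighbouring threads keep writing to the same
    ! cache line.
    exArray = 0
    t0 = omp_get_wtime()
    !$OMP parallel private(iThread, i)
    iThread = omp_get_thread_num() + 1
    do i = 1, nIter
        exArray(iThread) = exArray(iThread) + mod(i, 3)
    end do
    !$OMP end parallel
    tShared = omp_get_wtime() - t0
    firstResult = exArray(1)

    ! Variant 2: copy in, work on a private variable, copy out once.
    exArray = 0
    t0 = omp_get_wtime()
    !$OMP parallel private(iThread, privVar, i)
    iThread = omp_get_thread_num() + 1
    privVar = exArray(iThread)
    do i = 1, nIter
        privVar = privVar + mod(i, 3)
    end do
    exArray(iThread) = privVar
    !$OMP end parallel
    tPrivate = omp_get_wtime() - t0

    print *, 'results match:       ', firstResult == exArray(1)
    print *, 'shared-element time: ', tShared
    print *, 'private-copy time:   ', tPrivate
end program private_copy_timing
```

The actual timings depend on the compiler, optimization flags, thread count, and hardware, so none are quoted here. A pointer or reference to exArray(iThread) would presumably behave like variant 1, since every write through it still lands in the shared cache line.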
