I’ve been trying to sort out a concurrency problem since one of the devs working in this area left a couple of months ago, but I’m lost on an appropriate way to solve it.
For context, we load a customer’s data into a structure like:
[ Key ] -> { Value }
[customer-specific-hash] -> {Slice of data points/files}
Example – really badly formatted sorry:
[a60d849ad97bfb833e1096941]
->
{
{ StartDate: '01-02-2022', EndDate: '28-02-2022', DataFrames: [1598,921578,12981,21749,192578...]},
{ StartDate: '01-03-2022', EndDate: '28-03-2022', DataFrames: [1234,1567,6781,126978...]},
}
The above is because we have hundreds of thousands of customers, and a process kicks off every night that consolidates their data based on the hashes (or a bucket, really) per customer. Before processing the data frames, we go through the slice and “merge” the data frames into one big DataFrame, with lots of legal/accounting rules around it.
This runs within goroutines to index all the datapoints as fast as possible.
So the implementation is essentially a sync.Map[string, []DataFrame], but I noticed that whilst the map operations are guarded, appending to the DataFrame slices is not. There could be about 20-30 file references in that slice every night for each hash.
There is every chance that over the past 2 years customer data has been merged incorrectly, and I’ve been tasked with fixing it. Before sync.Map they had used an RWMutex with a map, which again guarded the map but not the slice, and the code points to this article as a guide.
Firstly, is the idea of a Map that contains a slice the appropriate data structure?
I’ve tried to create an RWMutex-based slice handler, but wondered whether the map could hold a chan DataFrame instead, to throw into when indexing a customer’s files. Then, once that is done, would the second step of consolidating it into an array be easier, as the len(chanx) would be known?
I am coming primarily from Java, so I may have some terms confused, for which I apologise.
You have two separate problems:
1. Concurrency issues while updating the map
2. Concurrency issues while updating an entry of the map
sync.Map will protect against 1, but not 2.
One way to deal with this problem is to have:
sync.Map[string, *DFrame]
where
type DFrame struct {
	sync.RWMutex
	Data []DataFrame
}
Once you get an entry from the map, you should Lock or RLock it, and then work with the data. This is not limited to appending to the slice: you have to RLock the struct even if you are only reading from the data frames.
So if you are appending a new dataframe:
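df := &DFrame{}
// LoadOrStore returns the existing entry for key, or stores df and returns it.
entry, _ := m.LoadOrStore(key, df)
dfEntry := entry.(*DFrame)
// Hold the entry's lock while mutating its slice.
dfEntry.Lock()
dfEntry.Data = append(dfEntry.Data, newDataFrame)
dfEntry.Unlock()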
Looks like Burak Sendar has graced you with a great answer. He’s one of the pros around here. If you want to up your Go concurrency game, read go.dev/blog/codelab-share: “don’t communicate by sharing state; share state by communicating” (the channel-based approach). I try to avoid locks whenever possible; they’re tricky to get right, and omissions like this are all too easy.
Thank you @erik258, that’s a great share and so succinct in its explanation 😀 I did wonder whether a channel approach would be easier.