Finding overlapping rows with 2 or more identical values

EDIT: hopefully clarified the problem and corrected the first dataframe to match the result dataframe

example dataframe.

df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],
'vegetable':['carrot','carrot','beets','carrot','artichoke'],
'fruit':['banana','apple','banana','banana','banana'],
'protein':['beef','chicken','beef','fish','fish'],
'calories':[10, 50, 100, 150, 200]
})

Assuming it’s ordered (here by calories ASC) I’m trying to add a new column named ‘master meal’ to the DataFrame.
This column will contain the name of the first recipe that shares a significant overlap in ingredients with the current recipe. A significant overlap is defined as sharing at least two ingredients.

If a recipe has already been used as a ‘master meal’ or has a ‘master meal’ assigned to it, it should not be considered for subsequent rows.

in this example, the result would be:

df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],
'vegetable':['carrot','carrot','beets','carrot','artichoke'],
'fruit':['banana','apple','banana','banana', 'banana'],
'protein':['beef','chicken','beef','fish', 'fish'],
'calories':[10, 50, 100, 150, 200],
'master meal': ['meal 1',None,'meal 1','meal 1', None]
})

(ie. ‘meal 5’ won’t get the master meal value set to ‘meal 4’ because ‘meal 4’ has been tagged already)

I was able to build something with apply() where I compared each row to the rest of the data frame, but as you can imagine, it didn’t work too well when applied to a bigger dataset.
I scratched my head all day to find a vectorized approach without success.

Maybe you have a better idea? I don’t know how I can avoid looping through the dataframe or if so, doing it efficiently.

  • 1

    The logic is not clear.

    – 

  • @PandaKim I tried to reformulate. Hopefully, it’s more precise. I wanted to add I’m trying to use duplicated() with 2 or more columns, and it actually made me think of a solution. maybe. if I use the sum of duplicated() on subset=[‘vegetable’, ‘fruit’], subset=[‘vegetable’, ‘protein’] and subset=[‘fruit’, ‘protein’] it would give me the actual number of overlapping ingredients. I still have to run it for each row and of course, it becomes tedious when you add columns, but I don’t have anything else at the moment…

    – 




  • why is meal 1 tagged as meal 1 yet meal 2 not tagged as meal 2?

    – 

  • @Onyambu Meal 1 has duplicates and meal 2 does not (nor does meal 5) but you’re right it would make more sense to tag them as their own group if no duplicates are found.

    – 

You could use the following:

Using for loop:

dat = df.iloc[:,1:4].to_numpy()
ln = dat.shape[0]
results = [None]* ln
for i in range(ln):
    if results[i]: continue
    for j in range(i + 1, ln):
        if (dat[i] == dat[j]).sum() >= 2:
            results[j] = df['recipe'].iloc[i]
            
df['master meal'] = results
df
   recipe  vegetable   fruit  protein  calories master meal
0  meal 1     carrot  banana     beef        10        None 
1  meal 2     carrot   apple  chicken        50        None
2  meal 3      beets  banana     beef       100      meal 1
3  meal 4     carrot  banana     fish       150      meal 1
4  meal 5  artichoke  banana     fish       200        None

Using numpy

from scipy.spatial.distance import pdist

dat = df.iloc[:,1:4].to_numpy()
ln = dat.shape[0] 
results = np.array([None]*ln)
dist = pdist(dat, lambda x,y: (x==y).sum())

i1,i2 = (ln - np.fliplr(np.array(np.tril_indices(ln,-1))) - 1)[:, dist >= 2]
i1, i2 = np.c_[i1,i2][~np.in1d(i1,i2)].T



results[i2] = df['recipe'].iloc[i1]
df['master meal'] = results
df
   recipe  vegetable   fruit  protein  calories master meal
0  meal 1     carrot  banana     beef        10        None 
1  meal 2     carrot   apple  chicken        50        None
2  meal 3      beets  banana     beef       100      meal 1
3  meal 4     carrot  banana     fish       150      meal 1
4  meal 5  artichoke  banana     fish       200        None

Leave a Comment