EDIT: hopefully clarified the problem and corrected the first dataframe to match the result dataframe
example dataframe.
df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],
'vegetable':['carrot','carrot','beets','carrot','artichoke'],
'fruit':['banana','apple','banana','banana','banana'],
'protein':['beef','chicken','beef','fish','fish'],
'calories':[10, 50, 100, 150, 200]
})
Assuming it’s ordered (here by calories ASC) I’m trying to add a new column named ‘master meal’ to the DataFrame.
This column will contain the name of the first recipe that shares a significant overlap in ingredients with the current recipe. A significant overlap is defined as sharing at least two ingredients.
If a recipe has already been used as a ‘master meal’ or has a ‘master meal’ assigned to it, it should not be considered for subsequent rows.
in this example, the result would be:
df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],
'vegetable':['carrot','carrot','beets','carrot','artichoke'],
'fruit':['banana','apple','banana','banana', 'banana'],
'protein':['beef','chicken','beef','fish', 'fish'],
'calories':[10, 50, 100, 150, 200],
'master meal': ['meal 1',None,'meal 1','meal 1', None]
})
(ie. ‘meal 5’ won’t get the master meal value set to ‘meal 4’ because ‘meal 4’ has been tagged already)
I was able to build something with apply() where I compared each row to the rest of the data frame, but as you can imagine, it didn’t work too well when applied to a bigger dataset.
I scratched my head all day to find a vectorized approach without success.
Maybe you have a better idea? I don’t know how I can avoid looping through the dataframe or if so, doing it efficiently.
You could use the following:
Using for loop:
dat = df.iloc[:,1:4].to_numpy()
ln = dat.shape[0]
results = [None]* ln
for i in range(ln):
if results[i]: continue
for j in range(i + 1, ln):
if (dat[i] == dat[j]).sum() >= 2:
results[j] = df['recipe'].iloc[i]
df['master meal'] = results
df
recipe vegetable fruit protein calories master meal
0 meal 1 carrot banana beef 10 None
1 meal 2 carrot apple chicken 50 None
2 meal 3 beets banana beef 100 meal 1
3 meal 4 carrot banana fish 150 meal 1
4 meal 5 artichoke banana fish 200 None
Using numpy
from scipy.spatial.distance import pdist
dat = df.iloc[:,1:4].to_numpy()
ln = dat.shape[0]
results = np.array([None]*ln)
dist = pdist(dat, lambda x,y: (x==y).sum())
i1,i2 = (ln - np.fliplr(np.array(np.tril_indices(ln,-1))) - 1)[:, dist >= 2]
i1, i2 = np.c_[i1,i2][~np.in1d(i1,i2)].T
results[i2] = df['recipe'].iloc[i1]
df['master meal'] = results
df
recipe vegetable fruit protein calories master meal
0 meal 1 carrot banana beef 10 None
1 meal 2 carrot apple chicken 50 None
2 meal 3 beets banana beef 100 meal 1
3 meal 4 carrot banana fish 150 meal 1
4 meal 5 artichoke banana fish 200 None
The logic is not clear.
@PandaKim I tried to reformulate. Hopefully, it’s more precise. I wanted to add I’m trying to use duplicated() with 2 or more columns, and it actually made me think of a solution. maybe. if I use the sum of duplicated() on subset=[‘vegetable’, ‘fruit’], subset=[‘vegetable’, ‘protein’] and subset=[‘fruit’, ‘protein’] it would give me the actual number of overlapping ingredients. I still have to run it for each row and of course, it becomes tedious when you add columns, but I don’t have anything else at the moment…
why is meal 1 tagged as meal 1 yet meal 2 not tagged as meal 2?
@Onyambu Meal 1 has duplicates and meal 2 does not (nor does meal 5) but you’re right it would make more sense to tag them as their own group if no duplicates are found.