Hi I have data frame in which I want to replace/cap the outlier with the 3*standard deviation value for all column with group by for each column. For example:
df = pd.DataFrame({"A":["A", "A", "A", "A", "B","B","B","B","B","B","B","B","B","B","B","B"],
"B":[7, 2, 54, 3, 5,23,5,7,7,7,7,7,7,7,6,7],
"C":[20, 16, 11, 3, 8,5,5,20,6,6,6,6,6,5,6,6],
"D":[14, 3, 32, 2, 6,5,6,20,4,5,4,5,4,5,5,5],
})
feature=['B','C','D']
mean = df.groupby('A')[feature].mean()
std = df.groupby('A')[feature].std()
now I want to replace outlier for each column in feature with appropriate standard deviation for that group.
Something like below but for each group and each column
for col in feature:
for each in df['A'].unique():
m=mean.loc[each,col]
s=std.loc[each,col]
df.loc[each,df[col]< m-3*s,]=m-3*s
Expected output:
I have many column and loop is time consuming. Is there any better way or can it be done with one loop?
You can do this in a vectorial way using a groupby.transform
:
cols = ['B', 'C', 'D']
g = df.groupby('A')[cols].transform
mean = g('mean')
std = g('std')
tmp = mean-3*std
df[df[cols]< tmp] = tmp
Output:
A B C D
0 A 7 20 14
1 B 2 16 3
2 A 54 11 35
3 A 3 3 2
4 B 5 8 6
5 B 23 5 5
Intermediate tmp
:
B C D
0 -63.74898 -14.181368 -33.109879
1 -24.07345 -7.392055 0.084091
2 -63.74898 -14.181368 -33.109879
3 -63.74898 -14.181368 -33.109879
4 -24.07345 -7.392055 0.084091
5 -24.07345 -7.392055 0.084091
Can you provide the expected output for clarity? Your code doesn’t run and it seems that your condition is always False
I have added few more row to have clarity. Earlier, output used to be same as input due to limited observation.