How to replace the outlier with 3 standard deviation value for all column based on group by column?

Hi I have data frame in which I want to replace/cap the outlier with the 3*standard deviation value for all column with group by for each column. For example:

df = pd.DataFrame({"A":["A", "A", "A", "A", "B","B","B","B","B","B","B","B","B","B","B","B"], 
                   "B":[7, 2, 54, 3, 5,23,5,7,7,7,7,7,7,7,6,7], 
                   "C":[20, 16, 11, 3, 8,5,5,20,6,6,6,6,6,5,6,6], 
                   "D":[14, 3, 32, 2, 6,5,6,20,4,5,4,5,4,5,5,5],
                }) 

feature=['B','C','D']

mean = df.groupby('A')[feature].mean()
std = df.groupby('A')[feature].std()

now I want to replace outlier for each column in feature with appropriate standard deviation for that group.

Something like below but for each group and each column

for col in feature:
 for each in df['A'].unique():
  m=mean.loc[each,col]
  s=std.loc[each,col]
  df.loc[each,df[col]< m-3*s,]=m-3*s

Expected output:

enter image description here

I have many column and loop is time consuming. Is there any better way or can it be done with one loop?

  • Can you provide the expected output for clarity? Your code doesn’t run and it seems that your condition is always False

    – 

  • I have added few more row to have clarity. Earlier, output used to be same as input due to limited observation.

    – 

You can do this in a vectorial way using a groupby.transform:

cols = ['B', 'C', 'D']

g = df.groupby('A')[cols].transform
mean = g('mean')
std = g('std')

tmp = mean-3*std
df[df[cols]< tmp] = tmp

Output:

   A   B   C   D
0  A   7  20  14
1  B   2  16   3
2  A  54  11  35
3  A   3   3   2
4  B   5   8   6
5  B  23   5   5

Intermediate tmp:

          B          C          D
0 -63.74898 -14.181368 -33.109879
1 -24.07345  -7.392055   0.084091
2 -63.74898 -14.181368 -33.109879
3 -63.74898 -14.181368 -33.109879
4 -24.07345  -7.392055   0.084091
5 -24.07345  -7.392055   0.084091

Leave a Comment