Pandas: Replace missing values in testing set by the mean of each group from the training set

Question 1

I want to replace the missing values in the “X” column of the testing set with the mean of “X” for each category of the “Class” column, but these means must be computed from the training set, not from the testing set.

train:

| Class | X   |
| ---   | --- |
| A     | 10  |
| A     | NaN |
| A     | 20  |
| B     | 15  |
| B     | 17  |
| B     | NaN |

test:

| Class | X   |
| ---   | --- |
| A     | 11  |
| A     | NaN |
| B     | 25  |
| B     | NaN |

The idea is to use the averages of each group in the training set to replace the corresponding missing values in the testing set. In this case, the mean values of the column X for each category in the training set are:

Mean of X for Class A: 15
Mean of X for Class B: 16
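For reference, these per-class means can be reproduced with a short sketch like the one below. The frame name `df_train` and the variable `train_means` are assumptions chosen for illustration; the data is the training table shown above.

```python
import numpy as np
import pandas as pd

# Training set from the table above (df_train is an assumed name)
df_train = pd.DataFrame({
    "Class": ["A", "A", "A", "B", "B", "B"],
    "X": [10, np.nan, 20, 15, 17, np.nan],
})

# mean() skips NaN by default, so each group averages only its observed values
train_means = df_train.groupby("Class")["X"].mean()
print(train_means)
```

This gives 15.0 for Class A and 16.0 for Class B, because the NaN rows are ignored when averaging.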

So, the final testing set should be transformed like this:

final_test:

| Class | X   |
| ---   | --- |
| A     | 11  |
| A     | 15  |
| B     | 25  |
| B     | 16  |

I used the groupby() function, but I don't know how to take the grouped values from the training set and use them to replace the missing values in the testing set.

Thanks a lot.

Answer

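Try computing the per-class means on the training set, then filling the testing set with them, aligned on "Class":

# Per-class means from the training set (NaN values are skipped by mean())
train_means = df_train.groupby("Class").mean()

# Index the test set by Class so fillna() aligns each row with its class mean
df_test = df_test.set_index("Class").fillna(train_means).reset_index()
print(df_test)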

Prints:

  Class     X
0     A  11.0
1     A  15.0
2     B  25.0
3     B  16.0
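If you would rather not move "Class" into the index, one possible alternative is to map each test row's class to its training mean and fill from that. This is only a sketch under the assumption that the frames are named `df_train` and `df_test` and hold the tables from the question; `train_means` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the tables in the question (names are assumed)
df_train = pd.DataFrame({
    "Class": ["A", "A", "A", "B", "B", "B"],
    "X": [10, np.nan, 20, 15, 17, np.nan],
})
df_test = pd.DataFrame({
    "Class": ["A", "A", "B", "B"],
    "X": [11, np.nan, 25, np.nan],
})

# Series of per-class means, indexed by Class, computed on the training set only
train_means = df_train.groupby("Class")["X"].mean()

# Look up the training mean for each test row, then fill only the missing X values
df_test["X"] = df_test["X"].fillna(df_test["Class"].map(train_means))
print(df_test)
```

Note that a class present in the testing set but absent from the training set maps to NaN, so its missing values would stay missing with this approach.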
