I want to fill the missing values in the “X” column of the testing set with the mean of “X” for each category of the “Class” column, but these means must come from the training set.
train:
| Class | X |
| --- | --- |
| A | 10 |
| A | NaN |
| A | 20 |
| B | 15 |
| B | 17 |
| B | NaN |
test:
| Class | X |
| --- | --- |
| A | 11 |
| A | NaN |
| B | 25 |
| B | NaN |
The idea is to use the averages of each group in the training set to replace the corresponding missing values in the testing set. In this case, the mean values of the column X for each category in the training set are:
- Mean of X for Class A: 15
- Mean of X for Class B: 16
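For reference, these two means can be reproduced with a short sketch like the one below. The name `df_train` matches the answer further down; `train_means` is just an illustrative name, and the data is copied from the train table above.

```python
import numpy as np
import pandas as pd

# Training set copied from the "train" table above
df_train = pd.DataFrame(
    {
        "Class": ["A", "A", "A", "B", "B", "B"],
        "X": [10, np.nan, 20, 15, 17, np.nan],
    }
)

# mean() skips NaN by default, so each class mean uses only the observed values
train_means = df_train.groupby("Class")["X"].mean()
print(train_means)
```

The result is a Series indexed by Class, with 15.0 for A and 16.0 for B.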
So, the final testing set should be transformed like this:
final_test:
| Class | X |
| --- | --- |
| A | 11 |
| A | 15 |
| B | 25 |
| B | 16 |
I used the groupby() function, but I don't know how to use the grouped values from the training set to replace the missing values in the testing set.
Thanks a lot.
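Try:

    # Per-class means of X, computed from the training set only
    class_means = df_train.groupby("Class").mean()

    # Index the test set by Class so that fillna aligns on the class labels
    df_test = df_test.set_index("Class").fillna(class_means).reset_index()
    print(df_test)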
Prints:

      Class     X
    0     A  11.0
    1     A  15.0
    2     B  25.0
    3     B  16.0
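If you would rather not touch the index, a possible alternative is to map each test row's class to its training mean and use that as the fill value. This is only a sketch built from the sample data in the question; `train_means` is an illustrative name, not something from the original post.

```python
import numpy as np
import pandas as pd

# Sample data copied from the tables in the question
df_train = pd.DataFrame(
    {
        "Class": ["A", "A", "A", "B", "B", "B"],
        "X": [10, np.nan, 20, 15, 17, np.nan],
    }
)
df_test = pd.DataFrame(
    {
        "Class": ["A", "A", "B", "B"],
        "X": [11, np.nan, 25, np.nan],
    }
)

# Per-class means from the training set only (NaN is skipped by default)
train_means = df_train.groupby("Class")["X"].mean()

# Look up the training mean for each test row's class,
# then use it only where X is missing
df_test["X"] = df_test["X"].fillna(df_test["Class"].map(train_means))
print(df_test)
```

Note that a class which appears in the test set but not in the training set has no mean to map to, so its missing values would stay NaN with this approach.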