Given a dataframe that contains a combination of null and numeric values, in which each series of numeric values is contiguous and never interspersed with nulls, such as:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2, 3, np.nan, np.nan],
    'B': [np.nan, np.nan, 1, 2, 3],
    'C': [np.nan, 1, 2, 3, np.nan]
})
A B C
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 3.0 1.0 2.0
3 NaN 2.0 3.0
4 NaN 3.0 NaN
Desired Output
I'd like to create a second dataframe with identical index and columns, in which a defined series is inserted into each column so that it ends at the index of that column's last non-null value in df1.
Note that the length of the defined series will generally differ from the number of non-null values in each column, e.g.
new_data = ['A','B']
A B C
0 NaN NaN NaN
1 A NaN NaN
2 B NaN A
3 NaN A B
4 NaN B NaN
Current Approach
My current approach achieves this by creating an empty dataframe, looping through each column, defining the index range and assigning the new data:
new_data = ['A', 'B']
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
for col in df2:
    end = df1[col].last_valid_index() + 1
    start = end - len(new_data)
    df2.iloc[start:end, df2.columns.get_loc(col)] = new_data
A B C
0 NaN NaN NaN
1 A NaN NaN
2 B NaN A
3 NaN A B
4 NaN B NaN
While this works, it feels somewhat brute-force, and I hoped to find a more elegant solution.
Assuming you have only one stretch of non-NA per column (if not, see the alternative below), one option uses a reverse cumsum, map, and where:
m = df1.notna()
out = (m[::-1].cumsum()[::-1]
.apply(lambda s: s.map(dict(enumerate(new_data[::-1], start=1))))
.where(m)
)
Output:
A B C
0 NaN NaN NaN
1 A NaN NaN
2 B NaN A
3 NaN A B
4 NaN B NaN
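Put together, here is a self-contained, runnable version of the above (only the imports and the question's sample data are added):

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    'A': [1, 2, 3, np.nan, np.nan],
    'B': [np.nan, np.nan, 1, 2, 3],
    'C': [np.nan, 1, 2, 3, np.nan],
})
new_data = ['A', 'B']

m = df1.notna()
# Count non-null cells from the bottom up, map the counts 1, 2, ...
# onto new_data reversed, then blank out the originally-null cells.
out = (m[::-1].cumsum()[::-1]
        .apply(lambda s: s.map(dict(enumerate(new_data[::-1], start=1))))
        .where(m)
       )
print(out)
```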
Intermediates:
# m[::-1].cumsum()[::-1]
A B C
0 3 3 3
1 2 3 3
2 1 3 2
3 0 2 1
4 0 1 0
# ….apply(lambda s: s.map(dict(enumerate(new_data[::-1], start=1))))
A B C
0 NaN NaN NaN
1 A NaN NaN
2 B NaN A
3 NaN A B
4 NaN B NaN
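For clarity, the mapping dict built inline above pairs each reverse count with an element of the reversed new_data, so the last non-null cell of a column (count 1) receives the last element:

```python
new_data = ['A', 'B']

# Keys start at 1 because the reverse cumsum equals 1 at the last
# non-null cell; counts beyond len(new_data) map to NaN.
dic = dict(enumerate(new_data[::-1], start=1))
print(dic)  # {1: 'B', 2: 'A'}
```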
alternative
If you can have multiple stretches of non-NaN, replace cumsum by groupby.cumcount:
dic = dict(enumerate(new_data[::-1]))
out = (df1.apply(lambda s: s.groupby(s.isna().cumsum()).cumcount(ascending=False).map(dic))
.where(df1.notna())
)
Example output:
A B C
0 NaN NaN NaN
1 A NaN NaN
2 B NaN A
3 NaN A B
4 NaN B NaN
5 A NaN NaN
6 B NaN NaN
Alternative input:
A B C
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 3.0 1.0 2.0
3 NaN 2.0 3.0
4 1.0 3.0 NaN
5 2.0 NaN NaN
6 3.0 NaN NaN
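As a self-contained sketch, the alternative run end-to-end (the dataframe below reproduces the alternative input shown above):

```python
import numpy as np
import pandas as pd

# Alternative input: column A has two stretches of non-NaN values.
df1 = pd.DataFrame({
    'A': [1, 2, 3, np.nan, 1, 2, 3],
    'B': [np.nan, np.nan, 1, 2, 3, np.nan, np.nan],
    'C': [np.nan, 1, 2, 3, np.nan, np.nan, np.nan],
})
new_data = ['A', 'B']

# s.isna().cumsum() gives one group label per stretch of non-null
# values; cumcount(ascending=False) then counts down to 0 within
# each stretch, so 0 marks the last cell of every stretch.
dic = dict(enumerate(new_data[::-1]))  # {0: 'B', 1: 'A'}
out = (df1.apply(lambda s: s.groupby(s.isna().cumsum())
                            .cumcount(ascending=False)
                            .map(dic))
          .where(df1.notna())
       )
print(out)
```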
Can there be multiple series in each column?