Insert series ending at last valid index in dataframe column. Pandas

Question 1

Given a dataframe that contains a combination of null and numeric values in which each series of numeric values is always located together and is never interspersed with nulls. Such as:

df1 = pd.DataFrame({
    'A': [1, 2, 3, np.nan, np.nan],
    'B': [np.nan, np.nan, 1, 2, 3],
    'C': [np.nan, 1, 2, 3, np.nan]
})

    A   B   C
0   1.0 NaN NaN
1   2.0 NaN 1.0
2   3.0 1.0 2.0
3   NaN 2.0 3.0
4   NaN 3.0 NaN

Desired Output
I’d like to create a second data frame with identical index and columns in which a defined series is inserted so that it ends at the index of the last non-null value for each column in df1.

Note that the length of the defined series will differ to the length of non-null values in each column. i.e.

new_data = ['A','B']

    A   B   C
0   NaN NaN NaN
1   A   NaN NaN
2   B   NaN A
3   NaN A   B
4   NaN B   NaN

Current Approach
My current approach achieves this by creating an empty dataframe, looping through each column, defining the index range and assigning the new data:

new_data = ['A','B']
df2 = pd.DataFrame(columns = df1.columns, index = df1.index)

for col in df2:
    end = df1[col].last_valid_index()+1
    start = end - len(data)
    df2[col][start:end] = new_data

    A   B   C
0   NaN NaN NaN
1   A   NaN NaN
2   B   NaN A
3   NaN A   B
4   NaN B   NaN

While this works, it feels somewhat brute force and I hoped to find a more elegant solution please.

Question 2

Assuming you have only one stretch of non-NA per column (if not see alternative below).

One option using a reverse cumsum, map, and where:

m = df1.notna()

out = (m[::-1].cumsum()[::-1]
       .apply(lambda s: s.map(dict(enumerate(new_data[::-1], start=1))))
       .where(m)
      )

Output:

     A    B    C
0  NaN  NaN  NaN
1    A  NaN  NaN
2    B  NaN    A
3  NaN    A    B
4  NaN    B  NaN

Intermediates:

# m[::-1].cumsum()[::-1]
   A  B  C
0  3  3  3
1  2  3  3
2  1  3  2
3  0  2  1
4  0  1  0

# ….apply(lambda s: s.map(dict(enumerate(new_data[::-1], start=1))))
     A    B    C
0  NaN  NaN  NaN
1    A  NaN  NaN
2    B  NaN    A
3  NaN    A    B
4  NaN    B  NaN

alternative

If you can have multiple stretches of non-NAN, replace cumsum bu groupby.cumcount:

dic = dict(enumerate(new_data[::-1]))

out = (df1.apply(lambda s: s.groupby(s.isna().cumsum()).cumcount(ascending=False).map(dic))
          .where(df1.notna())
      )

Example output:

     A    B    C
0  NaN  NaN  NaN
1    A  NaN  NaN
2    B  NaN    A
3  NaN    A    B
4  NaN    B  NaN
5    A  NaN  NaN
6    B  NaN  NaN

Alternative input:

     A    B    C
0  1.0  NaN  NaN
1  2.0  NaN  1.0
2  3.0  1.0  2.0
3  NaN  2.0  3.0
4  1.0  3.0  NaN
5  2.0  NaN  NaN
6  3.0  NaN  NaN

alternative

Leave a Comment Cancel reply