I have two data frames:
df1 = pd.DataFrame({'ID':['A', 'A', 'B', 'B'],
'Date':['31.08.2023', '12.09.2023', '13.09.2023', '20.08.2023']})
df2 = pd.DataFrame({'ID':['A', 'A', 'A', 'B', 'B'],
'Date':['30.08.2023', '14.09.2023', '10.09.2023', '28.09.2023', '19.08.2023']})
I want to add two new columns to df1 with the nearest higher and lower date in every ID, so the desire output in this example will look like this:
ID | DATE | DATE_UP | DATE_DOWN |
---|---|---|---|
A | 31.08.2023 | 10.09.2023 | 30.08.2023 |
A | 12.09.2023 | 14.09.2023 | 10.09.2023 |
B | 13.09.2023 | 28.09.2023 | 19.08.2023 |
B | 20.08.2023 | 28.09.2023 | 19.08.2023 |
I know there is a pandas function merge_asof(), but it work only for the nearest merge.
I will be greatfull for any ideas how to do it efficiently.
Code
you can solve it by merge_asof
. merge_asof doesnt work only for the nearest merge. Even nearest is not the default, the default is backward.
at first make datetime column named ‘Date1’ and sort
df1['Date1'] = pd.to_datetime(df1['Date'], dayfirst=True)
df1 = df1.sort_values('Date1')
df2['Date1'] = pd.to_datetime(df2['Date'], dayfirst=True)
df2 = df2.sort_values('Date1')
df1:
ID Date Date1
3 B 20.08.2023 2023-08-20
0 A 31.08.2023 2023-08-31
1 A 12.09.2023 2023-09-12
2 B 13.09.2023 2023-09-13
df2:
ID Date Date1
4 B 19.08.2023 2023-08-19
0 A 30.08.2023 2023-08-30
2 A 10.09.2023 2023-09-10
1 A 14.09.2023 2023-09-14
3 B 28.09.2023 2023-09-28
next use merge_asof
forward & backward(default)
tmp = pd.merge_asof(df1.reset_index(), df2, on='Date1', by='ID',
direction='forward', suffixes=['', '_up'])
out = pd.merge_asof(tmp, df2, on='Date1', by='ID', suffixes=['', '_down'])\
.drop('Date1', axis=1).set_index('index').sort_index().rename_axis('')
out:
ID Date Date_up Date_down
0 A 31.08.2023 10.09.2023 30.08.2023
1 A 12.09.2023 14.09.2023 10.09.2023
2 B 13.09.2023 28.09.2023 19.08.2023
3 B 20.08.2023 28.09.2023 19.08.2023
Create the dataframes
import pandas as pd
df1 = pd.DataFrame({
'ID': ['A', 'A', 'B', 'B'],
'Date': ['31.08.2023', '12.09.2023', '13.09.2023', '20.08.2023']
})
df2 = pd.DataFrame({
'ID': ['A', 'A', 'A', 'B', 'B'],
'Date': ['30.08.2023', '14.09.2023', '10.09.2023', '28.09.2023', '19.08.2023']
})
Convert the ‘Date’ columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
Sort the dataframes by ‘ID’ and ‘Date’
df1 = df1.sort_values(by=['ID', 'Date'])
df2 = df2.sort_values(by=['ID', 'Date'])
Find the nearest higher date (DATE_UP)
df1['DATE_UP'] = df1.apply(lambda row: df2[(df2['ID'] == row['ID']) & (df2['Date'] > row['Date'])]['Date'].min(), axis=1)
Find the nearest lower date (DATE_DOWN)
df1['DATE_DOWN'] = df1.apply(lambda row: df2[(df2['ID'] == row['ID']) & (df2['Date'] < row['Date'])]['Date'].max(), axis=1)
Convert ‘Date’, ‘DATE_UP’, ‘DATE_DOWN’ back to the original string format
df1['Date'] = df1['Date'].dt.strftime('%d.%m.%Y')
df1['DATE_UP'] = df1['DATE_UP'].dt.strftime('%d.%m.%Y')
df1['DATE_DOWN'] = df1['DATE_DOWN'].dt.strftime('%d.%m.%Y')