I have the following dummy df:
import pandas as pd
data = {
'address': [1234, 24389, 4384, 4484, 1234, 24389, 4384, 188],
'old_account': [200, 200, 200, 300, 200, 494, 400, 100],
'new_account': [300, 100, 494, 200, 400, 200, 200, 200]
}
df = pd.DataFrame(data)
print(df)
address old_account new_account
0 1234 200 300
1 24389 200 100
2 4384 200 494
3 4484 300 200
4 1234 200 400
5 24389 494 200
6 4384 400 200
7 188 100 200
A) I want to sort it such that I have 200
at old_account
and directly in the next row at new_account
again:
200 xxx
xxx 200
B) I further want to sort the non-200s such that I start somewhere let’s say with 300
and browse through the whole df looking for 300
s and do the switches:
200 300
300 200
200 300
...
Only once there are no 300
s anymore I would go to the next, let’s say 400
..
200 300
300 200
200 300
...
200 400
400 200
200 400
...
df
above should look like this:
address old_account new_account
0 1234 200 300
1 4484 300 200
2 24389 200 100
3 188 100 200
4 4384 200 494
5 24389 494 200
6 1234 200 400
7 4384 400 200
As you can see, the 200s are diagonal to each other and so are the non-200s.
The following code works only for A). I did not manage to also make it consider B)
I have the following code:
import pandas as pd
# Create the initial DataFrame
df= pd.read_csv('dummy_data.csv', sep=';')
# Initiate sorted df
sorted_df = pd.DataFrame(columns=df.columns)
while not df.empty:
# Find the first row where '200' is in 'old_account'
idx_old = df.index[df['old_account'] == 200].min()
if pd.notna(idx_old):
# Add the corresponding row to the sorted result
sorted_df = pd.concat([sorted_df, df.loc[[idx_old]]], ignore_index=True)
# Remove the row from the original DataFrame
df = df.drop(index=idx_old)
# Find the matching row where '200' is in 'new_account'
idx_new = df.index[df['new_account'] == 200].min()
if pd.notna(idx_new):
# Add the corresponding row to the sorted result
sorted_df = pd.concat([sorted_df, df.loc[[idx_new]]], ignore_index=True)
# Remove the row from the original DataFrame
df = df.drop(index=idx_new)
else:
break # If no matching row is found, exit the loop
else:
break # If no more '200' in 'old_account' is found, exit the loop
# Reset the index of the sorted DataFrame
sorted_df.reset_index(drop=True, inplace=True)
print(sorted_df)
It looks like you’re trying to search a Eulerian path in a directed graph.
You might want to use networkx
‘s eulerian_path
:
import networkx as nx
start = 200
G = nx.from_pandas_edgelist(df, source="old_account", target="new_account",
create_using=nx.MultiDiGraph)
tmp = pd.DataFrame(nx.eulerian_path(G, source=start, keys=True),
columns=['old_account', 'new_account', 'n'])
out = (tmp.merge(df.assign(n=df.groupby(['old_account', 'new_account']).cumcount()))
[df.columns]
)
Output:
address old_account new_account
0 1234 200 400
1 4384 400 200
2 4384 200 494
3 24389 494 200
4 24389 200 100
5 188 100 200
6 1234 200 300
7 4484 300 200
Why getting tied to hard-coded
200
?I need to start somewhere and 200 is the one that will be in every row. What would you suggest instead?
I suggest to find all cross-pairs, having them at the top, and leave all unmatched at the end