Sorting algorithm on dataframe with swapping rows

Question 1

I have the following dummy df:

import pandas as pd
data = {
    'address': [1234, 24389, 4384, 4484, 1234, 24389, 4384, 188],
    'old_account': [200, 200, 200, 300, 200, 494, 400, 100],
    'new_account': [300, 100, 494, 200, 400, 200, 200, 200]
}

df = pd.DataFrame(data)
print(df)

   address  old_account  new_account
0     1234          200          300
1    24389          200          100
2     4384          200          494
3     4484          300          200
4     1234          200          400
5    24389          494          200
6     4384          400          200
7      188          100          200

A) I want to sort it such that I have 200 at old_account and directly in the next row at new_account again:

200 xxx
xxx 200

B) I further want to sort the non-200s such that I start somewhere let’s say with 300 and browse through the whole df looking for 300s and do the switches:

Only once there are no 300s anymore I would go to the next, let’s say 400..

df above should look like this:

   address  old_account  new_account
0     1234          200          300
1     4484          300          200
2    24389          200          100
3      188          100          200
4     4384          200          494
5    24389          494          200
6     1234          200          400
7     4384          400          200

As you can see, the 200s are diagonal to each other and so are the non-200s.

The following code works only for A). I did not manage to also make it consider B)
I have the following code:

import pandas as pd

# Create the initial DataFrame
df= pd.read_csv('dummy_data.csv', sep=';')

# Initiate sorted df
sorted_df = pd.DataFrame(columns=df.columns)

while not df.empty:
    # Find the first row where '200' is in 'old_account'
    idx_old = df.index[df['old_account'] == 200].min()
    
    if pd.notna(idx_old):
        # Add the corresponding row to the sorted result
        sorted_df = pd.concat([sorted_df, df.loc[[idx_old]]], ignore_index=True)
        
        # Remove the row from the original DataFrame
        df = df.drop(index=idx_old)
        
        # Find the matching row where '200' is in 'new_account'
        idx_new = df.index[df['new_account'] == 200].min()
        
        if pd.notna(idx_new):
            # Add the corresponding row to the sorted result
            sorted_df = pd.concat([sorted_df, df.loc[[idx_new]]], ignore_index=True)
            
            # Remove the row from the original DataFrame
            df = df.drop(index=idx_new)
        else:
            break  # If no matching row is found, exit the loop
    else:
        break  # If no more '200' in 'old_account' is found, exit the loop

# Reset the index of the sorted DataFrame
sorted_df.reset_index(drop=True, inplace=True)

print(sorted_df)

Question 2

It looks like you’re trying to search a Eulerian path in a directed graph.

You might want to use networkx‘s eulerian_path:

import networkx as nx

start = 200

G = nx.from_pandas_edgelist(df, source="old_account", target="new_account",
                            create_using=nx.MultiDiGraph)

tmp = pd.DataFrame(nx.eulerian_path(G, source=start, keys=True),
                   columns=['old_account', 'new_account', 'n'])


out = (tmp.merge(df.assign(n=df.groupby(['old_account', 'new_account']).cumcount()))
       [df.columns]
       )

Output:

   address  old_account  new_account
0     1234          200          400
1     4384          400          200
2     4384          200          494
3    24389          494          200
4    24389          200          100
5      188          100          200
6     1234          200          300
7     4484          300          200

Leave a Comment Cancel reply