I have a DataFrame like this:
ID | MS | DS |
---|---|---|
654 | 1500,10000,20000,30000 | 60,365,730 |
131 | 1500,10000,20000 | 60,365,730 |
598 | 1500,10000,20000,30000 | 60,365,730 |
The desired output looks like this:
ID | MS | DS |
---|---|---|
654 | 1500 | 60 |
654 | 10000 | 365 |
654 | 20000 | 730 |
654 | 30000 | Nan |
131 | 1500 | 60 |
131 | 10000 | 365 |
131 | 20000 | 730 |
598 | 1500 | 60 |
I have tried the following code:
from itertools import zip_longest
df.apply(lambda row: list(zip_longest(row['MS'],
row['DS'],
fillvalue="Nan")),axis = 1)
This did not work as expected, as it creates a Series with list values like this: [('1500,10000,20000,30000', '60,365,730')], [('1500,10000,20000', '60,365,730')]
.
Here’s one approach:
cols = ['MS', 'DS']
dfs = [df[c].explode().str.split(',', expand=True)
.astype('Int64')
.stack()
.rename(c)
for c in cols]
out = (pd.concat(dfs, axis=1)
.droplevel(1)
.join(df['ID'])
.loc[:, df.columns]
)
out
ID MS DS
0 654 1500 60
0 654 10000 365
0 654 20000 730
0 654 30000 <NA>
1 131 1500 60
1 131 10000 365
1 131 20000 730
2 598 1500 60
2 598 10000 365
2 598 20000 730
2 598 30000 <NA>
Explanation
- Inside a list comprehension, for each column of
['MS', 'DS']
:- use
Series.explode
to get list elements into multiple rows; - use
Series.str.split
to split on','
and setexpand
toTrue
to get the result in multiple columns; - chain
Series.astype
withInt64
to turn what will be string values into proper integers, while allowing forNaN
values (the shorter sequences will have these for columns not “reached”); - apply
df.stack
to get apd.Series
result (with column values now as a second level index); - add
Series.rename
to restore the original column name.
- use
- Join both
Series
, add non-exploded column, and re-order:- use
pd.concat
alongaxis=1
(columns) - apply
df.droplevel
to drop the added index level; - use
df.join
to re-add the non-exploded columnID
; - pass
df.columns
todf.loc
to reset the column order.
- use
Data used
import pandas as pd
data = {
'ID': [654, 131, 598],
'MS': [['1500,10000,20000,30000'], ['1500,10000,20000'],
['1500,10000,20000,30000']],
'DS': [['60,365,730'], ['60,365,730'],
['60,365,730']]
}
df = pd.DataFrame(data)
df
ID MS DS
0 654 [1500,10000,20000,30000] [60,365,730]
1 131 [1500,10000,20000] [60,365,730]
2 598 [1500,10000,20000,30000] [60,365,730]
You can use str.split
and explode
:
df[["MS","DS"]] = df[["MS","DS"]].apply(lambda x: x.str.split(","),axis=1)
df["final"] = df[["MS","DS"]].apply(lambda x: list(zip_longest(x["MS"], x["DS"])),axis=1)
df = df.explode("final")
df[["MS","DS"]] = df["final"].values.tolist()
Out:
ID MS DS
0 654 1500 60
0 654 10000 365
0 654 20000 730
0 654 30000 None
1 131 1500 60
1 131 10000 365
1 131 20000 730
2 598 1500 60
2 598 10000 365
2 598 20000 730
2 598 30000 None
The values are in list format in both the columns like this [1500,10000,20000,30000], i had to remove the list as stackoverflow was not letting me post the question saying Your post appears to contain code that is not properly formatted as code. Please indent all code by 4 spaces using the code toolbar button or the CTRL+K keyboard shortcut. For more editing help, click the [?] toolbar icon.