I am trying a very simple join on two dataframes: df1 and df2.
I’ve read them in from CSV files, specifying the dtype of the joining column:
import pandas as pd
df1 = pd.read_csv("df1.csv", dtype={"code": str})
df2 = pd.read_csv("df2.csv", dtype={"code": str})
Content types are as follows:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   local        6 non-null      object
 1   name         6 non-null      object
 2   type         6 non-null      int64
 3   second_name  6 non-null      object
 4   code         6 non-null      object
 5   item_name    6 non-null      object
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   item_id        1 non-null      int64
 1   item_name2     1 non-null      object
 2   code           1 non-null      object
 3   size           1 non-null      float64
 4   category_id    1 non-null      object
 5   quality        1 non-null      object
 6   quality_id     1 non-null      int64
 7   brand          1 non-null      object
 8   brand_subtype  1 non-null      object
 9   score          1 non-null      int64
 10  size.1         1 non-null      object
 11  country        1 non-null      object
 12  city           1 non-null      object
 13  level          1 non-null      int64
dtypes: float64(1), int64(4), object(9)
memory usage: 240.0+ bytes
The actual contents:
df1
   local  name  type  second_name  code  item_name
0    yes   bob     1        jenga     1     triple
1    yes   bob     1        jenga     1     triple
2    yes   bob     1        jenga     1     triple
3    yes   bob     1        jenga     1     triple
4    yes   bob     1        jenga     1     triple
5    yes   bob     1        jenga     1     triple
df2
item_id item_name2 code size ... size.1 country city level
0 4500 triple 1 0.25 ... small china shanghai 3
Just to be sure of the data type of the key column “code”, I cast it to string explicitly:
df1.code = df1.code.astype(str)
df2.code = df2.code.astype(str)
The problem is that when I try joining (either left or right):
df1.join(df2, how='left', on='code')
I get the following error:
ValueError: You are trying to merge on object and int64 columns
Since I read the code columns explicitly as strings and also cast them afterwards (rest assured, I get the same problem if I don’t repeat the cast), I don’t see how this can be a problem.
I could use pd.merge instead, but that neither explains nor solves the problem.
Working with Python 3.10.
Any ideas?
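merge matches the string values in the two code columns, but join with on='code' matches df1's code column against the index of df2. Here that index is the default integer RangeIndex, which is where the object vs. int64 mismatch comes from.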
Try:
df1.merge(df2, how='left', on='code')
df1.join(df2) always joins against the index of df2, whereas df1.merge(df2) merges on the column.
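To make the difference concrete, here is a minimal sketch. The two small frames are hypothetical stand-ins for df1 and df2 (only the code column mirrors the question), and the set_index variant is just one way to make join behave like the merge call; it is not taken from the original post.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2; only "code" mirrors the question.
df1 = pd.DataFrame({"name": ["bob", "bob", "amy"], "code": ["1", "1", "2"]})
df2 = pd.DataFrame({"code": ["1"], "size": [0.25]})

# join(on="code") compares df1["code"] (strings) with df2's index (integers).
try:
    df1.join(df2, how="left", on="code")
    join_failed = False
except ValueError as err:
    join_failed = True
    print("join failed:", err)

# Moving "code" into df2's index lets join compare strings with strings.
joined = df1.join(df2.set_index("code"), how="left", on="code")

# merge compares the two "code" columns directly.
merged = df1.merge(df2, how="left", on="code")

print(joined)
print(merged)
```

Both joined and merged should contain the same rows, with a missing size for the code that has no match in df2.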
Edit:
Found the reason explained in this answer:
What is the difference between join and merge in Pandas?
I can think of two possible reasons:
- Null values in the dataset
- Stray whitespace in the code values
So try this:
df1.dropna(inplace=True)
df2.dropna(inplace=True)
# Remove leading and trailing whitespace from the code column
df1['code'] = df1['code'].str.strip()
df2['code'] = df2['code'].str.strip()
# now merge
merged_df = df1.merge(df2.astype({'code': 'str'}), how='left', on='code')
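If you want to check whether whitespace is really affecting the match before cleaning anything, the indicator parameter of merge can help. The sketch below uses made-up data (the names left and right and their values are assumptions for illustration) where one key has a trailing space.

```python
import pandas as pd

# Hypothetical data: the first left key carries a trailing space.
left = pd.DataFrame({"code": ["1 ", "2"], "name": ["bob", "amy"]})
right = pd.DataFrame({"code": ["1", "2"], "size": [0.25, 0.5]})

# indicator=True adds a "_merge" column telling where each row was found.
before = left.merge(right, how="left", on="code", indicator=True)

# Strip whitespace from the key and merge again.
left["code"] = left["code"].str.strip()
after = left.merge(right, how="left", on="code", indicator=True)

print(before[["code", "_merge"]])
print(after[["code", "_merge"]])
```

Rows marked left_only in the first result are the ones whose key found no partner; after stripping, they should show up as both.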
Hope it helps!
Does this answer your question? What is the difference between join and merge in Pandas?