Using dateutil I created a function which checks whether each value in a CSV column is a date and, if not, removes that row from the dataframe. The problem is speed: a similar check for strings or ints takes 1-2 seconds for hundreds of thousands of rows, but this date check takes more than a minute. I need some tips on how to speed it up; maybe there is a function in the pandas library for this.
import pandas as pd
from dateutil.parser import parse, ParserError

df = pd.read_csv(filename, delimiter="|", dtype=str)
dataframeIndexToDelete = []
for i, ColumnValueDate in enumerate(df[column]):
    try:
        # normalise parseable values to YYYY-MM-DD
        df.loc[i, column] = parse(str(ColumnValueDate)).strftime("%Y-%m-%d")
    except ParserError as ex:
        # remember rows that could not be parsed as dates
        dataframeIndexToDelete.append(i)
        print(ex)
df = df.drop(dataframeIndexToDelete)
You could just use to_datetime on the column, with errors="coerce", and then remove all values which are NaT (not a time):
df[column] = pd.to_datetime(df[column], errors="coerce")
df = df[~df[column].isna()]
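For example, a minimal sketch with made-up data (the column name "Date" and the values are assumptions):

import pandas as pd

# hypothetical sample: two parseable dates and one bad value
df = pd.DataFrame({"Date": ["2021-03-01", "not a date", "2021-04-15"]})

df["Date"] = pd.to_datetime(df["Date"], errors="coerce")  # "not a date" becomes NaT
df = df[~df["Date"].isna()]                               # drop the NaT row
print(df)

This runs as a single vectorised call instead of a Python-level loop with one dateutil parse per row, which is where most of your minute is going.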
If you don't want to overwrite the column values, use a temporary Series instead:
dates = pd.to_datetime(df[column], errors="coerce")
df = df[~dates.isna()]
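If you also want the kept values rewritten as "%Y-%m-%d" strings like your loop does, one way (a sketch, assuming the column should stay string-typed) is to reuse the same Series:

dates = pd.to_datetime(df[column], errors="coerce")
valid = ~dates.isna()
df = df[valid].copy()
# rewrite the kept values as normalised YYYY-MM-DD strings
df[column] = dates[valid].dt.strftime("%Y-%m-%d")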
Update

Based on comments, df[column] could have NaN or empty ('') values which should not be excluded. This requires making a second mask to ensure those rows stay in the output even though the date conversion will fail for them:
m1 = (df['Date'].isna()) | (df['Date'] == '')
m2 = ~pd.to_datetime(df['Date'], errors="coerce").isna()
df = df[m1 | m2]
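For example, with some made-up data (the column name and values are assumptions), only the unparseable string is dropped while the NaN and empty rows are kept:

import pandas as pd
import numpy as np

# hypothetical sample: two real dates, one invalid string, one empty string, one NaN
df = pd.DataFrame({"Date": ["2021-03-01", "garbage", "", np.nan, "2021-04-15"]})

m1 = (df['Date'].isna()) | (df['Date'] == '')             # keep missing/empty rows
m2 = ~pd.to_datetime(df['Date'], errors="coerce").isna()  # keep parseable dates
print(df[m1 | m2])                                        # only the "garbage" row is dropped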