Using dateutil I created a function which checks whether each value in a CSV column is a date and, if not, removes that row from the dataframe. The problem is speed: a similar check for strings or ints takes 1-2 seconds for hundreds of thousands of rows, but this date check takes more than a minute. I need some tips on how to speed it up; maybe there is a function in the pandas library for this.
import pandas as pd
from dateutil.parser import parse, ParserError

df = pd.read_csv(filename, delimiter="|", dtype=str)
dataframeIndexToDelete = []
for i, ColumnValueDate in enumerate(df[column]):
    try:
        # normalise parseable values to YYYY-MM-DD
        df.loc[i, column] = parse(str(ColumnValueDate)).strftime("%Y-%m-%d")
    except ParserError as ex:
        # remember rows that could not be parsed as dates
        dataframeIndexToDelete.append(i)
        print(ex)
df = df.drop(dataframeIndexToDelete)
You could just use to_datetime on the column, with errors="coerce", and then remove all values which are NaT (not a time):
df[column] = pd.to_datetime(df[column], errors="coerce")
df = df[~df[column].isna()]
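For example, a minimal sketch with made-up data (the column name "Date" and the values are assumptions):

import pandas as pd

# hypothetical sample: two parseable dates and one bad value
df = pd.DataFrame({"Date": ["2021-03-01", "not a date", "2021-04-15"]})

df["Date"] = pd.to_datetime(df["Date"], errors="coerce")  # "not a date" becomes NaT
df = df[~df["Date"].isna()]                               # drop the NaT row
print(df)

This runs as a single vectorised call instead of a Python-level loop with one dateutil parse per row, which is where most of your minute is going.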
If you don't want to overwrite the column values, use a temporary Series instead:
dates = pd.to_datetime(df[column], errors="coerce")
df = df[~dates.isna()]
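If you also want the kept values rewritten as "%Y-%m-%d" strings like your loop does, one way (a sketch, assuming the column should stay string-typed) is to reuse the same Series:

dates = pd.to_datetime(df[column], errors="coerce")
valid = ~dates.isna()
df = df[valid].copy()
# rewrite the kept values as normalised YYYY-MM-DD strings
df[column] = dates[valid].dt.strftime("%Y-%m-%d")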
Update

Based on comments, df[column] could have NaN or empty ('') values which should not be excluded. This requires making a second mask to ensure those rows stay in the output even though the date conversion will fail for them:
m1 = (df['Date'].isna()) | (df['Date'] == '')
m2 = ~pd.to_datetime(df['Date'], errors="coerce").isna()
df = df[m1 | m2]
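For example, with some made-up data (the column name and values are assumptions), only the unparseable string is dropped while the NaN and empty rows are kept:

import pandas as pd
import numpy as np

# hypothetical sample: two real dates, one invalid string, one empty string, one NaN
df = pd.DataFrame({"Date": ["2021-03-01", "garbage", "", np.nan, "2021-04-15"]})

m1 = (df['Date'].isna()) | (df['Date'] == '')             # keep missing/empty rows
m2 = ~pd.to_datetime(df['Date'], errors="coerce").isna()  # keep parseable dates
print(df[m1 | m2])                                        # only the "garbage" row is dropped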