I’m trying to use tabula and Pandas concat
in Python to PDF bank statements into a CSV/Xlsx file, so I can automate the task of manually entering them into Excel, however after intense experimentation and testing, it refuses extract more than 1 table (only their column headers).
All the tables in the PDF have the same columns (Date, Balance, etc.), where I just want to stack them vertically and sort them by date.
Here’s how I’m trying to extract and concatenate the tables from the PDF as efficiently as possible:
df = pd.concat(tabula.read_pdf(filename, pages="all", multiple_tables=True))
tabula.convert_into(filename, "Converted Document.csv", output_format="csv", pages="all")
The columns for the other months are being shown as expected, however the transactions are missing for all months except the first.
I’ve looked at other solutions here, but I haven’t found a way to get all transactions showing in the converted Excel file.
A Minimal, Reproducible Example (MRE) would be useful here.
The 2 lines of code I’ve shown are the only ones I’m using to extract and convert from PDF to Excel. I’m not sure what else to do
Do you have an example PDF that can be used to demonstrate the issue?