I am trying to append data from a sheet in Excel file1 (in folder1) to a sheet in Excel file2 (in folder2). Both have large volumes of data. Currently, my code takes 9 seconds. I am looking for a more efficient method that does this in 2 or 3 seconds.
Here’s my current code:
import pandas as pd
import time
from openpyxl import load_workbook

# Define the paths to your input and output Excel files
input_file_path = "path_to_input_folder/input_file.xlsx"
output_file_path = "path_to_output_folder/output_file.xlsx"

n = 100  # Replace with the desired number of records to append

# Record the start time
start_time = time.time()

# Read the input Excel file
with pd.ExcelFile(input_file_path) as input_excel:
    input_data = pd.read_excel(input_excel)

# Load the existing output Excel file
output_excel = load_workbook(output_file_path)

# Select the "Sheet1" worksheet
output_sheet = output_excel['Sheet1']

# Append the first 'n' records from the input to the existing "Sheet1"
for row in input_data.head(n).values:
    output_sheet.append(row.tolist())

# Save the modified output Excel file
output_excel.save(output_file_path)

# Calculate and display the execution time
execution_time = time.time() - start_time
print(f"Execution time: {execution_time:.2f} seconds")
In short, the best gain (for both clean code and speed) would be to use pandas ONLY, in place of openpyxl. This would be my default starting point.
The for loop is a massive red flag (unless the process is complicated). Here is template code that easily demonstrates this:
import pandas as pd
# Read in Excel file
df = pd.read_excel('data.xlsx')
# Perform operations on DataFrame
df['Sales'] = df['Quantity'] * df['Unit Price']
# Write DataFrame to new Excel file
df.to_excel('updated_data.xlsx', index=False)
This is just a few lines of code, and the advantage is that pandas enables vectorised calculations, which perform far better on large data. The gain (in simplicity and speed) will be huge.
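For the task in the question, a minimal sketch of this pandas-only approach, assuming both workbooks hold a single Sheet1 with the same columns, and that rewriting the output file (rather than editing it in place) is acceptable:

import pandas as pd

# Hypothetical paths mirroring the question's setup
n = 100
new_rows = pd.read_excel("path_to_input_folder/input_file.xlsx", nrows=n)
existing = pd.read_excel("path_to_output_folder/output_file.xlsx")

# One vectorised concatenation instead of a per-row append loop
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_excel("path_to_output_folder/output_file.xlsx", index=False)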
Avoid Excel:
The second part would be to avoid using Excel altogether. Whilst Excel is an excellent tool, big data belongs in CSV files, databases (sqlite3 is good), or other efficient file formats (JSON, pickle).
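As an illustration, a minimal sketch of a one-off move into SQLite (the table and file names are made up):

import sqlite3
import pandas as pd

# One-off migration: read the Excel data once, then keep it in SQLite
df = pd.read_excel("input_file.xlsx")

conn = sqlite3.connect("data.db")
# Append the rows to a table, creating it on the first run
df.to_sql("records", conn, if_exists="append", index=False)
# From here on, query the database instead of re-reading Excel
result = pd.read_sql("SELECT * FROM records", conn)
conn.close()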
As of pandas 1.4.0, you can use pd.ExcelWriter for this.
Suppose we have input_file.xlsx and output_file.xlsx, each with a table in Sheet1. The goal is to append the first two rows of data from the input sheet to the existing data in the output sheet.
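For concreteness, a minimal sketch that creates two such files with made-up data (the column names and values are placeholders):

import pandas as pd

# Hypothetical sample data; any columns work as long as both sheets share the layout
pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).to_excel(
    "input_file.xlsx", sheet_name="Sheet1", index=False)
pd.DataFrame({"A": [7, 8], "B": [9, 10]}).to_excel(
    "output_file.xlsx", sheet_name="Sheet1", index=False)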
Code
import pandas as pd

sheet_name = "Sheet1"
nrows = 2

df = pd.read_excel('input_file.xlsx', sheet_name=sheet_name, nrows=nrows)

with pd.ExcelWriter('output_file.xlsx', mode="a", if_sheet_exists="overlay") as writer:
    max_row = writer.sheets[sheet_name].max_row
    df.to_excel(writer, sheet_name=sheet_name, index=False, header=False,
                startrow=max_row)
Explanation
- First, use pd.read_excel to read from input_file.xlsx. If you are only looking to append the first n rows, simply use the nrows parameter already here. (Note that by default the first row is assumed to be the header.) So, in this example, we can use nrows = 2.
- Next, use pd.ExcelWriter inside a context manager (the with statement), which handles opening, saving and closing properly for us. Combine mode="a" with if_sheet_exists="overlay". From the docs:
  overlay: Write contents to the existing sheet without first removing, but possibly over top of, the existing contents.
- What they mean is that, if we do not specify a startrow inside df.to_excel, the data will simply be written at the start of the sheet, i.e. at A1, on top of the existing contents.
- To avoid this, we use max_row = writer.sheets[sheet_name].max_row. This gets us the last row "Excel-style" (so: counting from 1). We pass it as the startrow inside df.to_excel, which counts rows "Python-style" (so: from 0), so the new rows land directly below the existing data.
Result
The two appended rows now sit directly below the existing data in Sheet1 of output_file.xlsx.
Really, by far the biggest gain (>10x faster) would be to use a clean dataset, like a CSV file, read it into a pandas DataFrame, and process it from there. Alternatively, read a clean Excel table into a pandas DataFrame. Basically: avoid openpyxl.
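A sketch of what that CSV-based append could look like (the file names are placeholders):

import pandas as pd

# Read only the first n rows of the clean CSV input
n = 100
new_rows = pd.read_csv("input_file.csv", nrows=n)

# Append in one call: mode="a" adds to the end of the file,
# header=False avoids repeating the column names
new_rows.to_csv("output_file.csv", mode="a", header=False, index=False)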