Fastest way to append one excel file to another

I am trying to append data from a sheet in Excel file1 from folder1 to a sheet in Excel file2 in folder2. Both have large volumes of data. Currently, my code takes 9 secs. I am looking for a more efficient method, which does this in 2 or 3 seconds.

Here’s my current code:

import pandas as pd
import time
from openpyxl import load_workbook

# Define the paths to your input and output Excel files
input_file_path="path_to_input_folder/input_file.xlsx"
output_file_path="path_to_output_folder/output_file.xlsx"
n = 100  # Replace with the desired number of records to append

# Record the start time
start_time = time.time()

# Read the input Excel file
with pd.ExcelFile(input_file_path) as input_excel:
    input_data = pd.read_excel(input_excel)

# Load the existing output Excel file
output_excel = load_workbook(output_file_path)

# Select the "Sheet1" worksheet
output_sheet = output_excel['Sheet1']

# Append the first 'n' records from the input to the existing "Sheet1"
for row in input_data.head(n).values:
    output_sheet.append(row.tolist())

# Save the modified output Excel file
output_excel.save(output_file_path)

# Calculate and display the execution time
execution_time = time.time() - start_time
print(f"Execution time: {execution_time:.2f} seconds")

  • really, by far the biggest gain (>10x faster) would be to use a clean dataset, like a csv file and read this into a pandas dataframe and process from there. alternatively read a clean excel table into a pandas dataframe… basically, avoid openpyxl.

    – 

in short, the best gain (for both clean code and spead) would be to use pandas ONLY in place of openpyxl. This would be my default original starting point.

The for loop is a massive red flag (unless the process is complicated).

here is template code that easily demonstrates this:

import pandas as pd

# Read in Excel file
df = pd.read_excel('data.xlsx') 

# Perform operations on DataFrame
df['Sales'] = df['Quantity'] * df['Unit Price']

# Write DataFrame to new Excel file
df.to_excel('updated_data.xlsx', index=False)

This is just several lines of code and the advantage is that pandas enables vectorised calculations which will work better on large data.

The gain (in simplicity and speed) will be huge.

avoid excel:
The second part would be to avoid using excel altogether. Whilst excel is an excellent tool, big data belongs in csv files, databases (sqlite3 is good) or other efficient file types (json, pickle).

As of pandas 1.4.0, you can use pd.ExcelWriter for this.

Suppose we have input_file.xlsx and output_file.xlsx like below. The goal is to append the data in green in the first sheet to the data in the second sheet.

input_file output_file
input_file output_file

Code

import pandas as pd

sheet_name="Sheet1"
nrows = 2

df = pd.read_excel('input_file.xlsx', sheet_name=sheet_name, nrows=nrows)

with pd.ExcelWriter('output_file.xlsx', mode="a", if_sheet_exists="overlay") as writer:
    max_row = writer.sheets[sheet_name].max_row
    df.to_excel(writer, sheet_name=sheet_name, index=False, header=False,
                startrow = max_row)

Explanation

  • First use pd.read_excel to read from input_file.xlsx. If you are only looking to append the first n rows, simply use the nrows parameter already here. (Note that by default the first row is assumed to be the header.) So, in this example, we can use nrows = 2.
  • Next, use pd.ExcelWriter inside a context manager (the with statement), which manages opening, saving and closing properly for us. Combine mode="a" with if_sheet_exists="overlay". From the docs:

overlay: Write contents to the existing sheet without first removing, but possibly over top of, the existing contents.

  • What they mean is that, if we do not specify a startrow inside df.to_excel, then the data will simply be added at the start of the sheet, i.e., at A1, leading to:

output_file without specifying startrow

  • To avoid this, we use max_row = writer.sheets[sheet_name].max_row. This will get us the last row “Excel-style” (so: count starting from 1). We use this as the startrow inside df.to_excel, regular “Python-style” (so: starting from 0).

Result

output_file updated with rows input_file

Leave a Comment