I’m quite new to Stack Overflow and only recently learnt some basic Python. This is the first time I’m using openpyxl. Before this I used xlrd and xlsxwriter and managed to make some useful programs, but right now I need an .xlsx reader and writer.
There is a file which I need to read and then edit with data already stored in the code. Let’s suppose the .xlsx has five columns with data: A, B, C, D, E. In column A, I have over 1000 rows with data. In column D, I have 150 rows with data.
Basically, I want the program to find the last row with data in a given column (say D), then write the stored variable data to the next available row (last row + 1) in column D.
The problem is that I can’t use ws.get_highest_row() because it returns row 1000, the last row of column A.
Basically, so far this is all I’ve got:
    data = 'xxx'

    from openpyxl import load_workbook

    wb = load_workbook('book.xlsx', use_iterators=True)
    ws = wb.get_sheet_by_name('Sheet1')
    last_row = ws.get_highest_row()
Obviously this doesn’t work at all: last_row returns 1000.
Here’s how to do it using Pandas.
It’s easy to get the last non-null row in Pandas using last_valid_index().
There might be a better way to write the resulting DataFrame to your xlsx file but, according to the docs, this very dumb way is actually how it’s done in openpyxl.
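As a minimal sketch of the last_valid_index() idea on its own: the toy DataFrame below is made up for illustration (it is not the worksheet from the question), and it assumes the default consecutive integer index you would get from read_excel.

```python
import pandas as pd

# Hypothetical data: column C is only filled for the first two rows
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "C": ["x", "y", None, None, None],
})

# Index label of the last non-null entry in column C
last_valid = df["C"].last_valid_index()

# With a consecutive integer index, the next free slot is simply label + 1
next_free = last_valid + 1
df.loc[next_free, "C"] = "xxx"

# Worksheet row for that slot: +1 for zero-based indexing, +1 for the header row
excel_row = next_free + 2
```

Note that this only finds the slot after the last filled entry; gaps higher up in the column are left alone.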
Let’s say you’re starting with this simple worksheet:
Let’s say we want to put xxx into column C:
    import openpyxl as xl
    import pandas as pd

    wb = xl.load_workbook('deleteme.xlsx')
    ws = wb['Sheet1']
    df = pd.read_excel('deleteme.xlsx')


    def replace_first_null(df, col_name, value):
        """
        Replace the first null value in DataFrame df.`col_name`
        with `value`.
        """
        return_df = df.copy()
        idx = list(df.index)
        last_valid = df[col_name].last_valid_index()
        last_valid_row_number = idx.index(last_valid)
        # This next line has mixed number and string indexing
        # but it should be ok, since df is coming from an
        # Excel sheet and should have a consecutive index
        return_df.loc[last_valid_row_number + 1, col_name] = value
        return return_df


    def write_df_to_worksheet(ws, df):
        """
        Write the values in df to the worksheet ws in place
        """
        for i, col in enumerate(df):
            for j, val in enumerate(df[col]):
                if not pd.isnull(val):
                    # Python is zero indexed, so add one
                    # (plus an extra one to take account
                    # of the header row!)
                    ws.cell(row=j + 2, column=i + 1).value = val


    # Here's the actual replacing happening
    replaced = replace_first_null(df, 'C', 'xxx')
    # Write the modified copy, not the original df
    write_df_to_worksheet(ws, replaced)
    wb.save('changed.xlsx')
which results in:
The problem is that get_highest_row() itself uses RowDimension instances to determine the maximum row in the sheet. RowDimension has no information about the columns, which means we cannot use it to solve your problem and have to approach it differently.
Here is one somewhat “ugly” openpyxl-specific option, though it would not work if the workbook is loaded with use_iterators=True:
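    from openpyxl.utils import coordinate_from_string


    def get_maximum_row(ws, column):
        # Assumes the private ws._cells mapping is keyed by coordinate
        # strings such as 'D150'. Comparing the whole column letter
        # keeps "A" from also matching "AA", "AB", ...
        return max(row for col, row in map(coordinate_from_string, ws._cells)
                   if col == column)


    print(get_maximum_row(ws, "A"))
    print(get_maximum_row(ws, "B"))
    print(get_maximum_row(ws, "C"))
    print(get_maximum_row(ws, "D"))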
Aside from this, I would follow @LondonRob’s suggestion to parse the contents with pandas and let it do the job.
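Since the sheet-wide maximum carries no per-column information, another way to look at it is to scan just the one column through the public API. This is only a sketch, assuming a recent openpyxl where ws.max_row and iter_rows(min_col=..., max_col=...) are available; the helper name last_row_with_data and the in-memory workbook are made up for illustration.

```python
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string


def last_row_with_data(ws, column_letter):
    """Return the last row in the column that holds a value, or 0 if none does."""
    col = column_index_from_string(column_letter)
    rows = [
        cell.row
        for (cell,) in ws.iter_rows(min_col=col, max_col=col)
        if cell.value is not None
    ]
    return max(rows, default=0)


# Illustrative sheet: 10 filled rows in column A, only 3 in column D
wb = Workbook()
ws = wb.active
for r in range(1, 11):
    ws.cell(row=r, column=1, value=r)
for r in range(1, 4):
    ws.cell(row=r, column=4, value="d%d" % r)

sheet_max = ws.max_row                # sheet-wide maximum, driven by column A
last_d = last_row_with_data(ws, "D")  # per-column answer

# Write into the next free row of column D
ws.cell(row=last_d + 1, column=4, value="xxx")
```

Unlike reaching into ws._cells, this does not depend on how the private cell mapping is keyed, at the cost of visiting every row of the column.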
If this is a limitation of openpyxl then you might try one of the following approaches:
- convert the Excel file to CSV and process it with Python’s csv module
- uncompress the Excel file using zipfile and then navigate to the “xl/worksheets” subfolder of the uncompressed file, where you will find an XML file for each of your worksheets. From there you could parse and update them with an XML parser.
The xlsx Excel format is a compressed (zipped) folder tree of XML files. You can find the specification here.
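A rough, read-only sketch of the zipfile route, using only the standard library. It assumes each cell element carries an r="D150"-style reference (the usual case, though not guaranteed for every file), and it builds a toy archive containing just one sheet XML rather than a complete workbook. The function name last_row_in_column is made up for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}


def last_row_in_column(xlsx_file, sheet_path, column_letter):
    """Return the highest row number in the column that has a value, or 0."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read(sheet_path))
    last = 0
    for cell in root.iterfind(".//m:c", NS):
        match = re.fullmatch(r"([A-Z]+)(\d+)", cell.get("r", ""))
        # <v> holds a stored value, <is> an inline string
        has_value = (cell.find("m:v", NS) is not None
                     or cell.find("m:is", NS) is not None)
        if match and match.group(1) == column_letter and has_value:
            last = max(last, int(match.group(2)))
    return last


# Toy archive: column A filled for rows 1-5, column D only for rows 1-2
rows = []
for r in range(1, 6):
    cells = '<c r="A%d"><v>%d</v></c>' % (r, r)
    if r <= 2:
        cells += '<c r="D%d"><v>%d</v></c>' % (r, r * 10)
    rows.append('<row r="%d">%s</row>' % (r, cells))
sheet_xml = '<worksheet xmlns="%s"><sheetData>%s</sheetData></worksheet>' % (
    MAIN_NS, "".join(rows))

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", sheet_xml)

last_a = last_row_in_column(buffer, "xl/worksheets/sheet1.xml", "A")
last_d = last_row_in_column(buffer, "xl/worksheets/sheet1.xml", "D")
```

Writing a value back this way is more involved (shared strings, styles and the rest of the package all have to stay consistent), so this only covers the lookup half.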
Figured I’d start giving back to the Stack Overflow community. Alecxe’s solution didn’t work for me and I didn’t want to use Pandas etc., so I did this instead. It checks from the end of the spreadsheet and gives you the next available/empty row in column D.
    def unassigned_row_in_column_D():
        ws_max_row = int(ws.max_row)
        cell_coord = 'D' + str(ws_max_row)
        # Walk up from the bottom of the sheet until a filled cell is found;
        # stop at row 0 so an empty column D yields 'D1'
        while ws_max_row > 0 and ws[cell_coord].value is None:
            ws_max_row -= 1
            cell_coord = 'D' + str(ws_max_row)
        ws_max_row += 1
        return 'D' + str(ws_max_row)


    # then add variable data = 'xxx' to that cell
    ws[unassigned_row_in_column_D()].value = data
alecxe’s solution didn’t work for me. It’s probably a question of openpyxl version (I’m on 2.4.1). Here’s what worked after a small tweak:
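    def get_max_row_in_col(ws, column):
        # ws._cells is keyed by (row, column) tuples in this version,
        # so `column` is a 1-based column number (D is 4)
        return max([cell[0] for cell in ws._cells if cell[1] == column])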