
excel – Python Find highest row in a given column

Posted by: admin, March 9, 2020

Questions:

I’m quite new to Stack Overflow and only recently learnt some basic Python. This is the first time I’m using openpyxl. Before this I used xlrd and xlsxwriter and managed to write some useful programs, but now I need a library that can both read and write .xlsx files.

There is a file which I need to read and then edit using data already stored in the code. Let’s suppose the .xlsx has five columns with data: A, B, C, D, E. Column A has over 1000 rows of data; column D has 150 rows of data.

Basically, I want the program to find the last row with data on a given column (say D). Then, write the stored variable data in the next available row (last row + 1) in column D.

The problem is that I can’t use ws.get_highest_row() because it returns the row 1000 on column A.

Basically, so far this is all I’ve got:

data = 'xxx'
from openpyxl import load_workbook
wb = load_workbook('book.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name('Sheet1')
last_row = ws.get_highest_row()

Obviously this doesn’t work at all. last_row returns 1000.

Answers:

Here’s how to do it using Pandas.

It’s easy to get the last non-null row in Pandas using last_valid_index.
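As a rough illustration of what last_valid_index gives you, here is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from the workbook in the question). It assumes the sheet has one header row, so DataFrame position 0 corresponds to Excel row 2.

```python
import pandas as pd

# Hypothetical data: column "D" is shorter than column "A".
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "D": [10.0, 20.0, None, None, None],
})

# Index label of the last non-null value in column "D".
last_label = df["D"].last_valid_index()
# 0-based position of that label within the index.
last_pos = df.index.get_loc(last_label)
# +1 for Excel's 1-based rows, +1 for the assumed header row.
excel_row = last_pos + 2
next_free_row = excel_row + 1
```

Note that last_valid_index returns None when the whole column is empty, so real code would need to handle that case before calling get_loc.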

There might be a better way to write the resulting DataFrame to your xlsx file but, according to the docs, this very dumb way is actually how it’s done in openpyxl.

Let’s say you’re starting with this simple worksheet:

[Image: original worksheet]

Let’s say we want to put xxx into column C:

import openpyxl as xl
import pandas as pd

wb = xl.load_workbook('deleteme.xlsx')
ws = wb.get_sheet_by_name('Sheet1')
df = pd.read_excel('deleteme.xlsx')

def replace_first_null(df, col_name, value):
    """
    Replace the first null value in DataFrame df.`col_name`
    with `value`.
    """
    return_df = df.copy()
    idx = list(df.index)
    last_valid = df[col_name].last_valid_index()
    last_valid_row_number = idx.index(last_valid)
    # This next line has mixed number and string indexing
    # but it should be ok, since df is coming from an
    # Excel sheet and should have a consecutive index
    return_df.loc[last_valid_row_number + 1, col_name] = value
    return return_df

def write_df_to_worksheet(ws, df):
    """
    Write the values in df to the worksheet ws in place
    """
    for i, col in enumerate(df):
        for j, val in enumerate(df[col]):
            if not pd.isnull(val):
                # Python is zero indexed, so add one
                # (plus an extra one to take account
                #  of the header row!)
                ws.cell(row=j + 2, column=i + 1).value = val

# Here's the actual replacing happening
replaced = replace_first_null(df, 'C', 'xxx')
# Pass the modified DataFrame, not the original df
write_df_to_worksheet(ws, replaced)
wb.save('changed.xlsx')

which results in:

[Image: edited Excel file]

Answer:

The problem is that get_highest_row() itself uses RowDimension instances to determine the maximum row in the sheet. A RowDimension has no information about the columns, which means we cannot use it to solve your problem and have to approach it differently.

Here is one kind of “ugly” openpyxl-specific option, though it would not work if use_iterators=True:

from openpyxl.utils import coordinate_from_string

def get_maximum_row(ws, column):
    # Match the column letters exactly; startswith("A") would also
    # pick up cells in columns AA, AB, ...
    return max(row for col, row in map(coordinate_from_string, ws._cells)
               if col == column)

Usage:

print(get_maximum_row(ws, "A"))
print(get_maximum_row(ws, "B"))
print(get_maximum_row(ws, "C"))
print(get_maximum_row(ws, "D"))

Aside from this, I would follow @LondonRob’s suggestion to parse the contents with pandas and let it do the job.

Answer:

If this is a limitation of openpyxl then you might try one of the following approaches:

  • convert the Excel file to csv and use the Python csv module.
  • uncompress the Excel file using zipfile and then navigate to the “xl/worksheets” subfolder of the uncompressed file, and there you will find an XML for each of your worksheets. From there you could parse and update with BeautifulSoup or lxml.
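A minimal sketch of the csv route, assuming the sheet has already been exported to CSV by some other means. The function name last_filled_row_csv and the sample data are hypothetical; the column is given as a 0-based index (3 for column D), and the header line counts as row 1.

```python
import csv
import io

def last_filled_row_csv(csv_file, column_index):
    """Return the 1-based number of the last row with a non-empty value
    in the given 0-based column, or 0 if the column is empty."""
    last = 0
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        # Short rows simply have no value in this column.
        if column_index < len(row) and row[column_index].strip():
            last = row_number
    return last

# Made-up export: column D (index 3) stops two rows before column A does.
sample = io.StringIO("A,B,C,D\n1,,,x\n2,,,y\n3,,,\n4,,,\n")
last_row_in_d = last_filled_row_csv(sample, 3)
```

The next free row in that column is then last_row_in_d + 1. Writing the value back still needs an .xlsx writer, since CSV carries no formatting or formulas.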

The xlsx Excel format is a compressed (zipped) folder tree of XML files. You can find the specification here.
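A read-only sketch of the zipfile route, using only the standard library instead of BeautifulSoup or lxml. The function name last_filled_row_xlsx is hypothetical, and it assumes the target sheet lives at "xl/worksheets/sheet1.xml" and that each cell element carries an r="D12" style reference; a real workbook may map sheet names to other part names.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def last_filled_row_xlsx(path_or_file, column,
                         sheet_part="xl/worksheets/sheet1.xml"):
    """Return the highest row number holding a value in `column`
    (a letter such as "D"), or 0 if the column is empty."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read(sheet_part))
    last = 0
    for cell in root.iterfind(".//m:sheetData/m:row/m:c", NS):
        match = re.fullmatch(r"([A-Z]+)(\d+)", cell.get("r", ""))
        # A cell counts as filled if it has a <v> value or an inline string.
        has_value = (cell.find("m:v", NS) is not None
                     or cell.find("m:is", NS) is not None)
        if match and match.group(1) == column and has_value:
            last = max(last, int(match.group(2)))
    return last
```

This only locates the row. Editing the XML and re-zipping it by hand is considerably more fragile than letting a library write the file.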

Answer:

Figured I’d start giving back to the Stack Overflow community. alecxe’s solution didn’t work for me and I didn’t want to use pandas, so I did this instead. It checks from the end of the spreadsheet and gives you the next available (empty) row in column D.

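def unassigned_row_in_column_D():
    ws_max_row = int(ws.max_row)
    cell_coord = 'D' + str(ws_max_row)
    # Walk up from the bottom of the sheet until column D has a value;
    # stop at row 0 so an empty column does not loop forever.
    # ws[coord] is used because newer openpyxl versions no longer
    # accept a coordinate string in ws.cell()
    while ws_max_row > 0 and ws[cell_coord].value is None:
        ws_max_row -= 1
        cell_coord = 'D' + str(ws_max_row)
    ws_max_row += 1
    return 'D' + str(ws_max_row)

# then add variable data = 'xxx' to that cell

ws[unassigned_row_in_column_D()].value = data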
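The same bottom-up scan can be sketched for any column, taking the worksheet as a parameter instead of relying on a global. The helper name next_free_cell is hypothetical; it assumes a normal (not read-only) openpyxl worksheet, where ws["D"] returns the cells of column D from top to bottom and each cell exposes .row and .value.

```python
def next_free_cell(ws, column_letter):
    """Return the coordinate of the first empty cell below the last
    filled cell in `column_letter`, e.g. 'D151'."""
    # Scan the column in reverse to find the last cell holding a value.
    for cell in reversed(ws[column_letter]):
        if cell.value is not None:
            return "{}{}".format(column_letter, cell.row + 1)
    # Nothing in the column at all: start at row 1.
    return column_letter + "1"
```

With a workbook loaded as in the question, ws[next_free_cell(ws, "D")] = data would then write the value, followed by wb.save(...) to persist it.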

Answer:

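alecxe’s solution didn’t work for me. It’s probably a question of openpyxl version; I’m on 2.4.1. Here’s what worked after a small tweak:

def get_max_row_in_col(ws, column):
    # In this version ws._cells is keyed by (row, column) tuples,
    # so `column` is the 1-based column number, not a letter
    return max([cell[0] for cell in ws._cells if cell[1] == column])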