python – Possibility of Corruption: Reading Excel Files with Pandas

Posted by: admin April 23, 2020

Questions:

We are in the design phase for a product. The idea is that the code will read a list of values from an Excel workbook and write them into SQL.

The requirements are as follows:

  1. Workbook may be accessed by multiple users outside of our program

  2. Workbook must remain accessible (i.e. not be corrupted) should something bad occur while our program is running

  3. Program will be executed when no users are in the file

Right now we are considering using pandas in a simple manner as follows:

    import pandas as pd

    # sheet_name is the current keyword; recent pandas no longer accepts the older sheetname spelling
    df = pd.read_excel('File.xlsx', sheet_name='Sheet1')

    # Some code to write df into SQL

If this code goes offline with the Excel file still open, is there ANY possibility that the file will remain locked by my program or be corrupted?

To clarify, we envision something catastrophic like the server crashing or losing power.

I searched around but couldn’t find a similar question; please redirect me if necessary.
I also read through Pandas read_excel documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Answers:

With the code you provide, from my reading of the pandas and xlrd code, the given file will only be opened in read mode. That should mean, to the best of my knowledge, that there is no more risk in what you’re doing than in reading the file any other way – and you have to read it to use it, after all.
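As a small illustration of why read mode matters, the sketch below shows that a handle opened with 'rb' rejects writes, so the reader has no way to alter the bytes on disk through it. The temporary file stands in for the workbook, and the helper name `try_write_readonly` is made up for this example; it is not part of pandas or xlrd.

```python
import io
import os
import tempfile


def try_write_readonly(path):
    """Return True if writing through a read-only handle is rejected."""
    with open(path, "rb") as handle:
        try:
            handle.write(b"junk")
        except io.UnsupportedOperation:
            return True
    return False


# Stand-in for the workbook: any file on disk behaves the same way.
fd, demo_path = tempfile.mkstemp(suffix=".xlsx")
with os.fdopen(fd, "wb") as f:
    f.write(b"original bytes")

rejected = try_write_readonly(demo_path)

# Confirm the contents on disk are untouched after the attempted write.
with open(demo_path, "rb") as f:
    unchanged = f.read() == b"original bytes"

os.remove(demo_path)
print(rejected, unchanged)
```

This only demonstrates the behaviour of Python file handles; it says nothing about locks that Excel itself may place on a workbook while a user has it open.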

If this doesn’t sufficiently reassure you, you could minimize the time the file is open and, more importantly, not expose your file to external code, by handing pandas a BytesIO object instead of a path:

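    import io
    import pandas as pd

    # Read the whole workbook into memory; the file is closed as soon as the with block exits
    with open('File.xlsx', 'rb') as f:
        data = io.BytesIO(f.read())

    df = pd.read_excel(data, sheet_name='Sheet1')

    # etc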

This way your file will only be open for the time it takes to read it into memory, and pandas and xlrd will only be working with a copy of the data.
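Putting the pieces together, a minimal sketch of the whole read-then-write step might look like the following. The helper names `load_workbook_bytes` and `excel_to_sql`, the `table_name` argument, and the choice to replace the target table are assumptions for illustration; the question does not say which database or schema is in use. Reading .xlsx files also requires an Excel engine to be installed alongside pandas.

```python
import io


def load_workbook_bytes(path):
    """Read the whole file into memory; the handle closes when the block exits."""
    with open(path, "rb") as handle:
        return io.BytesIO(handle.read())


def excel_to_sql(path, sheet_name, connection, table_name):
    """Parse an in-memory copy of the workbook and write it to a SQL table."""
    # Imported here so load_workbook_bytes needs only the standard library.
    import pandas as pd

    buffer = load_workbook_bytes(path)  # the file on disk is already closed here
    df = pd.read_excel(buffer, sheet_name=sheet_name)
    # Assumption: overwrite the table on each run and skip the DataFrame index.
    df.to_sql(table_name, connection, if_exists="replace", index=False)
    return len(df)
```

With a `sqlite3` connection or an SQLAlchemy engine as `connection`, a call such as `excel_to_sql('File.xlsx', 'Sheet1', connection, 'excel_values')` would do the transfer. If the process dies during parsing or during the SQL write, the workbook is no longer open, so only the in-memory copy is lost.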