python – Trying to take multiple excel spreadsheets, extract specific data, add them all to one dataframe and save it as a csv file-Exceptionshub

Posted by: admin, February 24, 2020

Questions:

Very new to this, so please go easy on me 🙂

I am trying to take multiple Excel spreadsheets, extract data from specific cells, add it all to one dataframe and save it as a csv file.

The csv output only contains the data from the last excel file. Please could you help?

 import pandas as pd
 import os
 from pathlib import Path

 ip = "//NETWORKLOCATION/In"
 op = "//NETWORKLOCATION/Out"

 file_exist = False
 dir_list = os.listdir(ip)
 print(dir_list)

 for xlfile in dir_list:
     if xlfile.endswith('.xlsx') or xlfile.endswith('.xls'):
         file_exist = True
         str_file = os.path.join(ip, xlfile)
         df1 = pd.read_excel(str_file)

         columns1 = {*VARIOUSDATA -* 
                     }

         #creates an empty dataframe for the data to all sequentially be added into
         df1a = pd.DataFrame([])

         #appends the array to the new dataframe df1a
         df1a = df1a.append(pd.DataFrame(columns1, columns = ['*VARIOUS COLUMNS*]))

         if not file_exist:
                 print('cannot find any valid excel file in the folder ' + ip)

                 print(str_file)

 df1a.to_csv('//NETWORKLOCATION/Out/Test.csv')
 print(df1a)

Answer:

I think you should put:

#creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])

before the for xlfile in dir_list: loop, not inside it.
Otherwise df1a is recreated as an empty DataFrame on every iteration, so only the data from the last file survives.
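
The effect of where the accumulator is created can be seen without any Excel files. The following is a minimal sketch, not the asker's script: the three small frames in per_file are made-up stand-ins for what pd.read_excel would return, and a plain list is used as the accumulator because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-ins for the frames that pd.read_excel would return per file
per_file = [pd.DataFrame({"value": [n]}) for n in (1, 2, 3)]

# Buggy placement: the accumulator is recreated on every pass,
# so it only ever holds the frame from the current iteration
for frame in per_file:
    collected_inside = []
    collected_inside.append(frame)

# Fixed placement: the accumulator is created once, before the loop
collected_outside = []
for frame in per_file:
    collected_outside.append(frame)

last_only = pd.concat(collected_inside, ignore_index=True)
combined = pd.concat(collected_outside, ignore_index=True)

print(len(last_only), len(combined))  # prints: 1 3
```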

Answer:

A couple of things. First, you’ll never encounter:

if not file_exist:
                 print('cannot find any valid excel file in the folder ' + ip)

                 print(str_file)

as written, because it is nested inside the if statement that sets file_exist to True, so file_exist is always True by the time it is reached.
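
One way to make that check reachable is to do the filtering first and test the result after the loop. This is only a sketch: excel_names is a hypothetical helper, and dir_list here is a made-up listing standing in for os.listdir(ip).

```python
def excel_names(names):
    """Keep only names that end in .xlsx or .xls (case-insensitive)."""
    return [name for name in names if name.lower().endswith((".xlsx", ".xls"))]

# Made-up directory listing standing in for os.listdir(ip)
dir_list = ["jan.xlsx", "notes.txt", "feb.xls"]

matches = excel_names(dir_list)

# The check runs once, after all names have been examined
if not matches:
    print("cannot find any valid excel file in the folder")
else:
    print(matches)
```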

  1. You’re creating df1a inside of your for loop. So you’re always setting it back to empty.
  2. Why import Path, and then use os.path and os.listdir?
    Why not just use Path(ip).glob('*.xls*')?

This would look like:

import pandas as pd
from pathlib import Path

ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"

#creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])

for xlfile in Path(ip).glob('*.xls*'):
    df1 = pd.read_excel(xlfile)

    #placeholder from the question: the values extracted from df1
    columns1 = {"VARIOUSDATA"}

    #adds the new rows to df1a
    #(pd.concat is used because DataFrame.append was removed in pandas 2.0)
    df1a = pd.concat([df1a, pd.DataFrame(columns1, columns = ['VARIOUS_COLUMNS'])])

if df1a.empty:
    #str_file no longer exists in this version, so only the folder is reported
    print('cannot find any valid excel file in the folder ' + ip)
else:
    df1a.to_csv(op+'/Test.csv')
    print(df1a)
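
A common variation on the script above is to collect the per-file frames in a list and call pd.concat once at the end, which avoids rebuilding the combined frame on every pass. The sketch below assumes that approach; combine_folder, its reader parameter, and the source_file column are hypothetical names introduced here for illustration, not part of the question or of pandas.

```python
from pathlib import Path

import pandas as pd


def combine_folder(folder, pattern="*.xls*", reader=pd.read_excel):
    """Read every file in `folder` matching `pattern` and stack the results.

    `reader` is any callable that takes a path and returns a DataFrame;
    it defaults to pd.read_excel.
    """
    frames = []  # created once, before the loop
    for path in sorted(Path(folder).glob(pattern)):
        frame = reader(path)
        # Illustrative extra column recording which file each row came from
        frame["source_file"] = path.name
        frames.append(frame)

    if not frames:
        # Nothing matched: return an empty frame so callers can test .empty
        return pd.DataFrame()

    # A single concat at the end instead of one per file
    return pd.concat(frames, ignore_index=True)
```

With the paths from the question, this could be used as result = combine_folder(ip), followed by the same result.empty check before calling result.to_csv(op + '/Test.csv').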

Answer:

The csv output only contains the data from the last excel file.

You create the df1a DataFrame inside the for loop. Each time you read a new xlfile you create a new empty DataFrame.

You have to move df1a = pd.DataFrame([]) to around the 9th line of your script, before the loop, so that it is created only once.