I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out, though. Here is what I have so far:
import glob
import pandas as pd

# get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop???
If you have the same columns in all your CSV files, you can try the code below. I have added header=0 so that after reading the CSV, the first row can be assigned as the column names.
import glob
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'  # use your path
allFiles = glob.glob(path + "/*.csv")

list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)

frame = pd.concat(list_)
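Note that pd.concat(list_) keeps each file's original row index, so the combined frame will contain repeated index values. If you want a clean 0..n-1 index, pass ignore_index=True (standard pandas behavior, as in the question's own code):

frame = pd.concat(list_, ignore_index=True)  # re-number rows instead of repeating each file's index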
An alternative to darindaCoder’s answer:
import glob
import os
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
import glob
import os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
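One caveat worth noting: glob.glob returns file names in arbitrary order, so if the row order of the combined frame matters, wrap the result in sorted(). A small sketch, reusing the same hypothetical my_files*.csv pattern as above:

import glob
import os
import pandas as pd

# sorted() makes the concatenation order deterministic (alphabetical by file name)
df = pd.concat(map(pd.read_csv, sorted(glob.glob(os.path.join('', "my_files*.csv")))))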
Edit: I googled my way into https://stackoverflow.com/a/21232849/186078.
However, of late I am finding it faster to do any manipulation with numpy and then assign the result to a dataframe once, rather than manipulating the dataframe itself iteratively, and it seems to work in this solution too.
I sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and make it less readable.
You can leverage numpy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path, "*.csv"))

np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    np_array_list.append(df.to_numpy())  # originally df.as_matrix(), which was removed in pandas 1.0

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1", "col2", ...]  # placeholder: fill in your actual column names
total files: 192
avg lines per file: 8492

-- approach 1 without numpy -- 8.248656988143921 seconds ---
total records old: 1630571
-- approach 2 with numpy -- 2.289292573928833 seconds ---
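For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. The path and the exact loops are assumptions on my part, not the author's original script; the time.time() bracketing is just the common pattern for this kind of measurement:

import time
import glob
import numpy as np
import pandas as pd

allFiles = glob.glob("my_dir_full_path/*.csv")

start = time.time()
big_frame = pd.concat((pd.read_csv(f) for f in allFiles), ignore_index=True)  # approach 1: pure pandas
print("-- approach 1 without numpy -- %s seconds ---" % (time.time() - start))

start = time.time()
arrays = [pd.read_csv(f).to_numpy() for f in allFiles]  # approach 2: stack raw values with numpy
big_frame = pd.DataFrame(np.vstack(arrays))
print("-- approach 2 with numpy -- %s seconds ---" % (time.time() - start))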
If the multiple CSV files are zipped, you may use zipfile to read all of them and concatenate as below:
import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = None
for f in range(len(ziptrain.namelist())):
    if f == 0:
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        # stack the raw values and rebuild a DataFrame with the same columns
        train = pd.DataFrame(np.concatenate((train, my_df), axis=0),
                             columns=list(my_df.columns.values))
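The same result can be had more simply with a single pd.concat over the archive members; a minimal sketch, assuming the same hypothetical yourpath/yourfile.zip:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
# read every member of the archive and concatenate in one call
train = pd.concat((pd.read_csv(ziptrain.open(name)) for name in ziptrain.namelist()),
                  ignore_index=True)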
If you want to search recursively (Python 3.5 or above), you can do the following:
import glob
import os
import pandas as pd

path = r'C:\user\your\path\**'
all_rec = glob.iglob(os.path.join(path, "*.csv"), recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
You can find the documentation of ** in the Python glob module docs. Also, I used iglob, as it returns an iterator instead of a list.
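pathlib offers an equivalent recursive search; a sketch, assuming the same directory layout as above:

from pathlib import Path

import pandas as pd

# Path.rglob("*.csv") walks the tree recursively, like glob.iglob with recursive=True
path = Path(r'C:\user\your\path')
big_dataframe = pd.concat((pd.read_csv(f) for f in path.rglob("*.csv")), ignore_index=True)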