python – Comparing two Excel files with pandas

Posted by: admin, May 14, 2020

Questions:

I have two Excel files, A and B. A is the master copy, which holds the up-to-date employee name and organization name (Name and Org). File B contains Name and Org columns with somewhat older records, plus many other columns that we are not interested in.

   Name      Org
0   abc    ddc systems
1   sdc    ddc systems
2   csc    ddd systems
3   rdc    kbf org
4   rfc    kbf org

I want to do two operations on this:

1) Compare Excel B (columns Name and Org) with Excel A (columns Name and Org) and add to file B every Name that is missing from it, along with the corresponding Org.

2) For all existing entries in file B (columns Name and Org), compare with file A and update the Org column if an employee's organization has changed.

For operation 1), I tried the approach below to find the new entries (I am not sure it is correct). The output is a set of tuples, and I was not sure how to write it back to a DataFrame.

diff = set(zip(new_df.Name, new_df.Org)) - set(zip(old_df.Name, old_df.Org))

Any help will be appreciated. Thanks.

Answers:

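If names are unique, just concatenate A and B and drop duplicates on Name. Because A comes first and keep='first' is used, A's row wins whenever a Name appears in both files, so changed Org values are taken from A, while names found only in B are kept. Assuming A and B are your DataFrames,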

df = pd.concat([A, B]).drop_duplicates(subset=['Name'], keep='first')
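To make the behaviour concrete, here is a small sketch of the one-liner on made-up data. The rows of B (including the `old org` and `xyz` entries) are hypothetical and only loosely follow the sample table in the question. In practice A and B would presumably be loaded with `pd.read_excel`.

```python
import pandas as pd

# Hypothetical sample data: A is the master copy, B is the older file.
A = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "rdc", "rfc"],
    "Org": ["ddc systems", "ddc systems", "ddd systems", "kbf org", "kbf org"],
})
B = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "xyz"],
    "Org": ["ddc systems", "old org", "ddd systems", "zzz org"],
})

# A is listed first, so keep='first' keeps A's row whenever a Name is in both.
df = pd.concat([A, B]).drop_duplicates(subset=["Name"], keep="first")
df = df.reset_index(drop=True)
print(df)
```

In this sketch `sdc` ends up with the Org from A, `rdc` and `rfc` are carried over from A, and `xyz`, which exists only in B, is kept unchanged.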

Or,

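# Index both frames by Name so rows can be compared by label.
A = A.set_index('Name')
B = B.set_index('Name')

# Names present in B but not in A; A's rows are kept as they are.
idx = B.index.difference(A.index)
df = pd.concat([A, B.loc[idx]]).reset_index()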

Both should be approximately the same in terms of performance.
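Both snippets above take A's version of any row whose Name is in both files, so B's other columns are lost for those rows. If those columns need to survive, one possible variation (not part of the original answer) is to update Org in place and then append the missing names. This is a minimal sketch that assumes names are unique in A; the `Dept` column and all data values are hypothetical.

```python
import pandas as pd

# Hypothetical data: B carries an extra column ("Dept") that A lacks.
A = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "rdc"],
    "Org": ["ddc systems", "ddc systems", "ddd systems", "kbf org"],
})
B = pd.DataFrame({
    "Name": ["abc", "sdc", "xyz"],
    "Org": ["ddc systems", "old org", "zzz org"],
    "Dept": ["hr", "it", "ops"],
})

# Lookup from Name to the current Org in the master copy (names must be unique).
org_by_name = A.set_index("Name")["Org"]

# Operation 2: refresh Org where the name exists in A, else keep B's value.
updated = B.copy()
updated["Org"] = updated["Name"].map(org_by_name).fillna(updated["Org"])

# Operation 1: append the names that exist in A but not yet in B.
missing = A[~A["Name"].isin(B["Name"])]
result = pd.concat([updated, missing], ignore_index=True)
print(result)
```

Rows appended from A have NaN in `Dept`, since the master copy has no value for B's extra columns.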

Answer:

Solution:

diff = pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))), columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))

Example:

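import pandas as pd

aa = ['aa1', 'aa2', 'aa3', 'aa4', 'aa5']
bb = ['bb1', 'bb2', 'bb3', 'bb4', 'bb5']
nest = [aa, bb]

# Two identical frames with columns 'aa' and 'bb'.
df = pd.DataFrame(nest, ['aa', 'bb']).T
df2 = pd.DataFrame(nest, ['aa', 'bb']).T

# Shift df2['aa'] down two rows so that no (aa, bb) pair in df2 matches df.
df2['aa'] = df2['aa'].shift(2)

# Pairs that occur in df but not in df2, rebuilt as a DataFrame.
diff = pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))), columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))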

Output:

    aa   bb
0  aa1  bb1
1  aa2  bb2
2  aa3  bb3
3  aa4  bb4
4  aa5  bb5
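As an alternative to the set difference, the same "rows in the new frame but not in the old one" result can be obtained with a left merge and pandas' `indicator=True` option, which avoids the round trip through tuples. This is a sketch on hypothetical data. The names `new_df` and `old_df` follow the question, but their contents here are made up.

```python
import pandas as pd

# Hypothetical data: new_df is the master copy, old_df the older file.
new_df = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "rdc"],
    "Org": ["ddc systems", "ddc systems", "ddd systems", "kbf org"],
})
old_df = pd.DataFrame({
    "Name": ["abc", "sdc"],
    "Org": ["ddc systems", "old org"],
})

# Left merge on both columns; the _merge column records where each row matched.
merged = new_df.merge(
    old_df[["Name", "Org"]], on=["Name", "Org"], how="left", indicator=True
)

# 'left_only' marks (Name, Org) pairs present in new_df but absent from old_df.
diff = merged.loc[merged["_merge"] == "left_only", ["Name", "Org"]]
diff = diff.reset_index(drop=True)
print(diff)
```

Here `diff` contains both brand-new names (`csc`, `rdc`) and names whose Org changed (`sdc`), because the comparison is on the (Name, Org) pair, just as in the set-based version.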