I have two Excel files, A and B. A is the master copy, which holds the up-to-date record of each employee's Name and Organization Name (Org). File B contains Name and Org columns with a slightly older record, plus many other columns that we are not interested in.
      Name          Org
    0  abc  ddc systems
    1  sdc  ddc systems
    2  csc  ddd systems
    3  rdc      kbf org
    4  rfc      kbf org
I want to do two operations on this:
1) I want to compare Excel B (columns Name and Org) with Excel A (columns Name and Org) and update file B with all the missing entries of Name and the corresponding Org.
2) For all existing entries in file B (columns Name and Org), I would like to compare them with file A and update the Org column if any employee's organization has changed.
For 1), to find the new entries I tried the approach below (I am not sure it is correct, though). The output is a set of tuples, which I was not sure how to write back to a DataFrame:

    diff = set(zip(new_df.Name, new_df.Org)) - set(zip(old_df.Name, old_df.Org))
Any help will be appreciated. Thanks.
If names are unique, just concatenate A and B and drop duplicates. Assuming A and B are your DataFrames,

    df = pd.concat([A, B]).drop_duplicates(subset=['Name'], keep='first')
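To make the effect of keep='first' concrete, here is a minimal sketch on made-up data. The contents of A and B below are hypothetical (the question does not say which file the sample table comes from), and names are assumed to be unique within each file.

```python
import pandas as pd

# Hypothetical sample data: A is the master copy, B is the older file.
A = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "rdc", "rfc"],
    "Org": ["ddc systems", "ddc systems", "ddd systems", "kbf org", "kbf org"],
})
B = pd.DataFrame({
    "Name": ["abc", "sdc", "csc", "xyz"],
    "Org": ["ddc systems", "old org", "ddd systems", "kbf org"],
})

# A is listed first, so keep='first' keeps A's row whenever a Name
# appears in both frames; rows found only in B are kept as they are.
df = (
    pd.concat([A, B])
    .drop_duplicates(subset=["Name"], keep="first")
    .reset_index(drop=True)
)
print(df)
```

Because A comes first in the concat, the stale Org for sdc in B is discarded in favour of A's value, while the B-only name xyz survives.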
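Alternatively, index both frames by Name and append only the rows of B whose names are missing from A:

    A = A.set_index('Name')
    B = B.set_index('Name')
    # Names present in B but not in A
    idx = B.index.difference(A.index)
    df = pd.concat([A, B.loc[idx]]).reset_index()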
Both should be approximately the same in terms of performance.
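If the other columns of B need to survive, one possible variation is to update B in place instead of rebuilding it from A. This is only a sketch under assumptions that go beyond the answer above: names are unique in A, the Extra column is a hypothetical stand-in for B's other columns, and the sample values are invented.

```python
import pandas as pd

# Hypothetical data; "Extra" stands in for B's other columns.
A = pd.DataFrame({
    "Name": ["abc", "sdc", "new"],
    "Org": ["ddc systems", "ddd systems", "kbf org"],
})
B = pd.DataFrame({
    "Name": ["abc", "sdc"],
    "Org": ["ddc systems", "ddc systems"],
    "Extra": [1, 2],
})

# Operation 2: look up each name's current Org in A.
# Names that are absent from A keep their existing Org from B.
org_by_name = A.set_index("Name")["Org"]
B["Org"] = B["Name"].map(org_by_name).fillna(B["Org"])

# Operation 1: append the rows of A whose Name is not yet in B.
# Columns that exist only in B (here "Extra") are NaN for these rows.
missing = A[~A["Name"].isin(B["Name"])]
B_updated = pd.concat([B, missing], ignore_index=True)
print(B_updated)
```

Series.map with a Series as the lookup requires a unique index, which is why this relies on unique names in A.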
    diff = pd.DataFrame(
        list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),
        columns=df.columns,
    )
    print(diff.sort_values(by='aa').reset_index(drop=True))
    import pandas as pd

    aa = ['aa1', 'aa2', 'aa3', 'aa4', 'aa5']
    bb = ['bb1', 'bb2', 'bb3', 'bb4', 'bb5']
    nest = [aa, bb]

    # Two identical frames with columns 'aa' and 'bb'
    df = pd.DataFrame(nest, ['aa', 'bb']).T
    df2 = pd.DataFrame(nest, ['aa', 'bb']).T

    # Shift 'aa' in df2 so that none of its (aa, bb) pairs match df
    df2['aa'] = df2['aa'].shift(2)

    # Pairs present in df but not in df2, turned back into a DataFrame
    diff = pd.DataFrame(
        list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),
        columns=df.columns,
    )
    print(diff.sort_values(by='aa').reset_index(drop=True))
Output:

        aa   bb
    0  aa1  bb1
    1  aa2  bb2
    2  aa3  bb3
    3  aa4  bb4
    4  aa5  bb5
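The same "rows in one frame but not the other" result can probably also be obtained without going through sets of tuples, by using a left merge with indicator=True. The sketch below reuses the new_df and old_df names from the question; the data in it is invented for illustration.

```python
import pandas as pd

# Hypothetical data, using the frame names from the question.
new_df = pd.DataFrame({
    "Name": ["abc", "sdc", "csc"],
    "Org": ["ddc systems", "ddd systems", "ddd systems"],
})
old_df = pd.DataFrame({
    "Name": ["abc", "sdc"],
    "Org": ["ddc systems", "ddc systems"],
})

# indicator=True adds a "_merge" column that says where each row was found.
merged = new_df.merge(old_df, on=["Name", "Org"], how="left", indicator=True)

# "left_only" marks (Name, Org) pairs that exist in new_df but not in old_df,
# i.e. new employees and employees whose Org has changed.
diff = (
    merged.loc[merged["_merge"] == "left_only", ["Name", "Org"]]
    .reset_index(drop=True)
)
print(diff)
```

The result is already a DataFrame with the original column names, so no conversion from tuples is needed.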