
# excel – Creating new arrays based on unique keys and condition in Python

Question:

I’m planning to process a large amount of data in Python instead of Excel, but I’m stuck: I know the Excel formula, yet I’m having great difficulty replicating it in Python.

Essentially, I’d like to import a CSV file, identify the location of column C, and then, for every unique value in column A, sum all values in C whose corresponding value in B satisfies the condition `1990 < x < 2000`:

```
A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
```

I begin with:

```python
import csv
with open('TestCase.txt', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
```

Instead of writing multiple `if` statements, I’d like to create new arrays composed of 0s and 1s, and then sum the corresponding values in C.
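The 0/1-array idea maps directly onto NumPy boolean masks. A minimal sketch using the sample data above (`numpy` is an assumption here, not part of the question’s code):

```python
import numpy as np

# columns B and C from the sample CSV above
B = np.array([1952, 1994, 1973, 1992, 1994, 1994, 1992, 1984])
C = np.array([125, 69, 72, 85, 38, 95, 29, 94])

# boolean mask: True (1) where 1990 < B < 2000, False (0) elsewhere
mask = (B > 1990) & (B < 2000)

# multiplying by the 0/1 mask zeroes out non-matching rows before summing
total = (mask * C).sum()
print(total)  # 316
```

This is exactly the SUMIFS-style trick: the mask plays the role of the helper column of 0s and 1s.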

Given another condition, the result would look like this:

```
1980<x<1989 94
1990<x<2000 316
```

An extra bonus would be the total number of unique values in A that make up each total sum:

```
UniqueValues    Condition   TotalSum
1   1980<x<1989 94
4   1990<x<2000 316
```

Answers:

You can use this:

```python
l = list()
d = dict()
with open('TestCase.txt', 'r') as file:
    next(file)  # skip the header row
    for line in file:
        l.append(line.rstrip("\n").split(','))  # l = [['9', '1952', '125'], ['2', '1994', '69'], ...]

for a, b, c in l:
    if 1990 < int(b) < 2000:  # the desired condition on column B
        d[a] = d.get(a, 0) + int(c)
```

The dictionary `d` will then have the unique values of `A` as its keys and the sums of `C` as its values.
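The "bonus" figures then fall straight out of `d`. A sketch, with the dictionary hard-coded to the values the loop above produces for `1990 < x < 2000`, purely for illustration:

```python
# d as produced by the loop above for the condition 1990 < x < 2000
d = {'2': 69, '5': 85, '1': 133, '4': 29}

unique_values = len(d)        # number of distinct A values matching the condition
total_sum = sum(d.values())   # overall sum of C across those rows
print(unique_values, total_sum)  # 4 316
```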

If you are happy using a 3rd party library, this can be vectorised via `pandas`:

```python
import pandas as pd

df = pd.read_csv('TestCase.txt')

# filter column B (strictly between 1990 and 2000), group by A, sum C
res = df.loc[df['B'].between(1990, 2000, inclusive='neither')]\
        .groupby('A')['C'].sum()\
        .reset_index()
```

Result:

```
   A    C
0  1  133
1  2   69
2  4   29
3  5   85
```

Alternatively, reading the data inline and comparing different ways of treating duplicate values in `A`:

```python
from io import StringIO
import pandas as pd

txt = StringIO("""
A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
""")

df = pd.read_csv(txt)

# condition = (df["B"] > 1980) & (df["B"] < 1989)
condition = (df["B"] > 1990) & (df["B"] < 2000)
df_cond = df[condition]

# different ways of handling duplicate values in A
df_uniq = df_cond.drop_duplicates('A', keep=False)              # drop every duplicated A
df_uniq_keep_first = df_cond.drop_duplicates('A', keep="first")  # keep first occurrence
df_uniq_keep_last = df_cond.drop_duplicates('A', keep="last")    # keep last occurrence

sum_dupl = df_cond["C"].sum()
sum_uniq = df_uniq["C"].sum()
sum_uniq_keep_first = df_uniq_keep_first["C"].sum()
sum_uniq_keep_last = df_uniq_keep_last["C"].sum()

print("sum with duplicates  : " + str(sum_dupl))            # 316
print("sum pure unique      : " + str(sum_uniq))            # 183
print("sum unique keep first: " + str(sum_uniq_keep_first)) # 221
print("sum unique keep last : " + str(sum_uniq_keep_last))  # 278
```