
excel – Creating new arrays based on unique keys and condition in Python

Posted by: admin May 14, 2020

Questions:

I’m planning to process a large amount of data in Python instead of Excel, but I’m stuck: I know the Excel command, yet I have great difficulty replicating it in Python.

Essentially, I’d like to import a CSV file, identify the location of column C, and then, for all unique values in column A, sum all values in C whose value in B satisfies the condition 1990 < x < 2000.

A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94

I begin with:

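import csv
# Python 3: open in text mode with newline='' (binary mode 'rb' only worked on Python 2)
with open('TestCase.txt', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    row1 = next(reader)  # header row: ['A', 'B', 'C']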

Instead of writing multiple if statements, I’d like to create new arrays composed of 0s and 1s, and then use them to sum the matching values in C.

Given another condition, the result would look like this:

1980<x<1989 94
1990<x<2000 316

An extra bonus would be the number of unique values in A that contribute to each total sum:

UniqueValues    Condition      TotalSum
1               1980<x<1989    94
4               1990<x<2000    316

How to & Answers:

You can use this:

l = list()
d = dict()
with open('TestCase.txt', 'r') as file:
    next(file)  # skip the header row "A,B,C", which cannot be converted to int
    for line in file:
        l.append(line.rstrip("\n").split(','))  # l=[['9','1952','125'],['2','1994','69'],...]

for item in l:
    if 1990 < int(item[1]) < 2000:  # the desired condition for column B
        d[item[0]] = d.get(item[0], 0) + int(item[2])  # accumulate C per key in A

The dictionary d will then hold each unique value of A (as a string) as a key, and the sum of the matching C values as its value.
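
The same loop can be extended to several conditions and to the unique-key count asked for in the question. The following is only a sketch under stated assumptions: the helper name summarize and the inline SAMPLE string are made up for illustration, and the data is copied from the question so that the snippet runs without TestCase.txt.

```python
import csv
import io

# Sample data copied from the question. To read the real file instead,
# pass open('TestCase.txt', newline='') to csv.DictReader.
SAMPLE = """A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
"""

def summarize(rows, ranges):
    """Hypothetical helper: for each open interval (low, high) on column B,
    return (number of unique A values, sum of C)."""
    results = {}
    for low, high in ranges:
        keys = set()
        total = 0
        for row in rows:
            if low < int(row["B"]) < high:
                keys.add(row["A"])
                total += int(row["C"])
        results[(low, high)] = (len(keys), total)
    return results

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize(rows, [(1980, 1989), (1990, 2000)])

for (low, high), (n_unique, total) in summary.items():
    print(f"{n_unique}\t{low}<x<{high}\t{total}")
# 1	1980<x<1989	94
# 4	1990<x<2000	316
```

Collecting the keys in a set keeps the unique count separate from the running sum, so duplicated keys such as A = 1 are counted once but still contribute both of their C values.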

Answer:

If you are happy using a 3rd party library, this can be vectorised via pandas:

import pandas as pd

# read csv file
df = pd.read_csv('file.csv')

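# filter column B (strictly between 1990 and 2000), group by A, sum C
# note: Series.between() includes both endpoints by default, so explicit comparisons are used
res = df.loc[(df['B'] > 1990) & (df['B'] < 2000)]\
        .groupby('A')['C'].sum()\
        .reset_index()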

Result:

   A    C
0  1  133
1  2   69
2  4   29
3  5   85
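
To get the summary table from the question (unique keys, condition, total sum) rather than per-key sums, one option is to loop over the conditions and collect one row each. This is a minimal sketch, not the answer author's code: the names ranges, records and table are assumptions, and the data is inlined from the question.

```python
from io import StringIO
import pandas as pd

# Sample data copied from the question
df = pd.read_csv(StringIO("""A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
"""))

ranges = [(1980, 1989), (1990, 2000)]  # open intervals low < B < high

records = []
for low, high in ranges:
    subset = df[(df["B"] > low) & (df["B"] < high)]
    records.append({
        "UniqueValues": subset["A"].nunique(),  # distinct keys in A
        "Condition": f"{low}<x<{high}",
        "TotalSum": subset["C"].sum(),          # duplicates in A still count
    })

table = pd.DataFrame(records)
print(table.to_string(index=False))
```

For the sample data this gives 1 unique key with a total of 94 for the first range, and 4 unique keys with a total of 316 for the second.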

Answer:

from io import StringIO
import pandas as pd

txt = StringIO("""
A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
""")

df = pd.read_csv(txt)

# condition = (df["B"] > 1980) & (df["B"] < 1989)
condition = (df["B"] > 1990) & (df["B"] < 2000)
df_cond = df[condition]

df_uniq = df_cond.drop_duplicates('A', keep=False)
df_uniq_keep_first = df_cond.drop_duplicates('A', keep="first")
df_uniq_keep_last = df_cond.drop_duplicates('A', keep="last")

sum_dupl = df_cond["C"].sum()
sum_uniq = df_uniq["C"].sum()
sum_uniq_keep_first = df_uniq_keep_first["C"].sum()
sum_uniq_keep_last = df_uniq_keep_last["C"].sum()

print("sum with duplicates  : " + str(sum_dupl))            #316
print("sum pure unique      : " + str(sum_uniq))            #183
print("sum unique keep first: " + str(sum_uniq_keep_first)) #221 
print("sum unique keep last : " + str(sum_uniq_keep_last))  #278
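
The question's idea of building arrays of 0 and 1 corresponds to converting a boolean condition to integers and multiplying it with column C, much like a SUMPRODUCT in Excel. The sketch below assumes the sample data from the question; the names mask, masked_c and per_key are illustrative only.

```python
from io import StringIO
import pandas as pd

# Sample data copied from the question
df = pd.read_csv(StringIO("""A,B,C
9,1952,125
2,1994,69
3,1973,72
5,1992,85
1,1994,38
1,1994,95
4,1992,29
8,1984,94
"""))

# 1 where 1990 < B < 2000, otherwise 0
mask = ((df["B"] > 1990) & (df["B"] < 2000)).astype(int)

# Multiplying by the mask zeroes out every C value that fails the condition
masked_c = mask * df["C"]
total = int(masked_c.sum())
print(total)  # 316

# Per-key sums; keys with no matching row keep a total of 0
per_key = masked_c.groupby(df["A"]).sum()
print(per_key)
```

Unlike filtering first, this keeps every key of A in the result, so keys outside the range show up with a sum of 0 instead of disappearing.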