Home » excel » linux – Excel and awk disagree about CSV totals

linux – Excel and awk disagree about CSV totals

Posted by: admin April 23, 2020 Leave a comment


I have a CSV file that I’m totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:

1) 2640502474.00
2) 1272849386284.00
3) 36785.00
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00

And here’s my awk output:

$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'

Notice how 1 and 2 match exactly but 8 is off by many millions. I’m assuming Excel’s total is the correct one, so why is awk handling this file differently?

How to&Answers:

You likely have a comma formatted number contained in quotes. Excel will properly handle that number as a single field. Your regex for field separation in awk won’t – a comma internal to a number is a valid separator according to that regex. It is very hard (and mostly futile) to try and handle optional nested escaping like what is possible in csv with a regex.

Compare the following to see what is likely going on:

$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'

Note that the second regex above still has a problem with a trailing ” in the last field and only works at all if all field are consistently quoted – it is for illustration purposes only.