
python – The strange jump over when I use pandas to read specific columns in .csv

Posted by: admin April 23, 2020

Questions:

1. Background

The .csv file I uploaded here is an example file to explain my problem.

This file contains the air quality information for all cities in China (represented by codes) on a specific day.

For example, the column 1001A represents one city, and the values in this column are the air pollutant concentrations corresponding to the type column.


2. My problem

If I want to get the AQI value for city 1014A at 20160205-00:00,
I just need to use

 import pandas as pd

 df = pd.read_csv("./this file")
 aqi = df["1014A"].iloc[0]

The result is 42. But when I open the same file in LibreOffice, the result looks like this:

[image: screenshot of the file in LibreOffice]

It seems that pandas misreads column 1013A, and that is where the mistake comes from.

So, I want to figure out what happened in column 1013A:

[image: screenshot of column 1013A]

pandas reads this column (which contains finite values) as a column of NaN values. This happens many times in this file. It bothers me for the following reasons:

  • Some columns that contain data are treated as NaN columns in the pandas DataFrame.

  • The other columns are also indirectly affected by these erroneous NaN columns.

The column positions will be full of mistakes if this problem isn't solved.

Any advice would be appreciated!

Answers:

Your csv has two consecutive commas at that position:

...19,20,24,19,22,24,29,,42,39...

The empty field between them gets read as NaN by pandas.
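
To see this behaviour in isolation, here is a minimal sketch that does not need the original file. The two-line sample string is an assumption: it only reuses the five column names and values quoted in this answer, not the real data.

```python
import io

import pandas as pd

# Assumed sample: header and values mirror the slice of the file shown in this answer.
sample = "1011A,1012A,1013A,1014A,1015A\n24,29,,42,39\n"

df = pd.read_csv(io.StringIO(sample))

# The empty field between the two commas becomes NaN in column 1013A ...
print(df["1013A"].iloc[0])

# ... while 1014A keeps its own value and is not shifted.
print(df["1014A"].iloc[0])

# Count the missing values per column to spot the affected ones.
print(df.isna().sum())
```

So pandas keeps the columns aligned with the header; the NaN marks a value that is genuinely absent from the file.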

It looks like your version of LibreOffice skips the empty field and (incorrectly) uses the subsequent value.


In [11]: s = open("china_sites_20160205.csv").readlines()

In [12]: s[0].split(",")[13:18]
Out[12]: ['1011A', '1012A', '1013A', '1014A', '1015A']

In [13]: s[1].split(",")[13:18]
Out[13]: ['24', '29', '', '42', '39']
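
The same check can be run over every row instead of one slice. The following is only a sketch: `find_empty_fields` is a hypothetical helper name, and the sample string again just repeats the values shown above rather than the real file.

```python
import csv
import io


def find_empty_fields(csv_text):
    """Return (row_number, column_name) for every empty field (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    empty = []
    # Row numbers start at 1 for the first data row after the header.
    for row_number, row in enumerate(reader, start=1):
        for name, value in zip(header, row):
            if value == "":
                empty.append((row_number, name))
    return empty


# Assumed sample mirroring the slice shown in the session above.
sample = "1011A,1012A,1013A,1014A,1015A\n24,29,,42,39\n"
print(find_empty_fields(sample))
```

For a real file, the text could be read with `open(path).read()` and passed to the helper; each reported pair is a cell that pandas will load as NaN.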