
How to count the NaN values in a column of a pandas DataFrame

Posted by: admin November 1, 2017

Questions:

I have data in which I want to find the number of NaN values, so that if it is less than some threshold, I will drop the column. I looked, but wasn't able to find any function for this. There is value_counts(), but it would be slow for me, because most of the values are distinct and I want the count of NaN only.

Answers:

You can use the isnull() method and then sum() to count the NaN values. For one column:

In [1]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isnull().sum()
Out[4]: 2

For several columns, it also works:

In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [6]: df.isnull().sum()
Out[6]:
a    1
b    2
dtype: int64
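To connect this back to the question, here is a minimal sketch of dropping columns based on their NaN count. The threshold `max_nans` and the direction of the comparison are assumptions made for illustration; adjust both to your own rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],
    'b': [np.nan, 1, np.nan, np.nan],
    'c': [1, 2, 3, 4],
})

# Number of NaN values in each column.
nan_counts = df.isnull().sum()

# Hypothetical rule: drop every column with more than max_nans NaN values.
max_nans = 2
cols_to_drop = nan_counts[nan_counts > max_nans].index
cleaned = df.drop(columns=cols_to_drop)

print(nan_counts)
print(list(cleaned.columns))
```

Here only column 'b' exceeds the threshold, so 'a' and 'c' are kept.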

Answers:

You could subtract the count of non-NaN values from the total length:

count_nan = len(df) - df.count()

You should time it on your data. For a small Series I got a 3x speed-up compared with the isnull solution.
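If you want to check this on your own data, something like the following sketch could work. The Series size and the share of NaN values are made up for illustration, and the timings will vary with your data and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: 100,000 floats with roughly 10% replaced by NaN.
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.1] = np.nan
s = pd.Series(values)

# Both approaches should give the same NaN count.
via_isnull = s.isnull().sum()
via_count = len(s) - s.count()
assert via_isnull == via_count

# Time each approach over the same number of repetitions.
t_isnull = timeit.timeit(lambda: s.isnull().sum(), number=200)
t_count = timeit.timeit(lambda: len(s) - s.count(), number=200)
print(f"isnull().sum(): {t_isnull:.4f} s")
print(f"len - count():  {t_count:.4f} s")
```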

Answers:

Since pandas 0.14.1 my suggestion here to have a keyword argument in the value_counts method has been implemented:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
for col in df:
    # dropna=False keeps NaN as its own entry in the counts
    print(df[col].value_counts(dropna=False))

2     1
 1     1
NaN    1
dtype: int64
NaN    2
 1     1
dtype: int64

Answers:

If you are using Jupyter Notebook, how about:

 %%timeit
 df.isnull().any().any()

or

 %%timeit
 df.isnull().values.sum()

Or, to check whether there are NaNs anywhere in the data and, if so, in which columns:

 df.isnull().any()
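As a rough sketch of how these checks fit together outside a notebook, the following uses a small made-up DataFrame. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [np.nan, 1, np.nan],
    'c': [1, 2, 3],
})

# Boolean mask: True wherever a value is NaN.
mask = df.isnull()

# Is there any NaN at all?
has_any_nan = bool(mask.any().any())

# Total number of NaN values in the whole DataFrame.
total_nans = int(mask.values.sum())

# Which columns and which rows contain at least one NaN?
cols_with_nan = df.columns[mask.any().values].tolist()
rows_with_nan = df.index[mask.any(axis=1).values].tolist()

print(has_any_nan, total_nans, cols_with_nan, rows_with_nan)
```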

Answers:

Based on the second and most voted answer we can easily define a function that gives us a dataframe to preview the missing values and the % of missing values in each column:

def missing_values_table(df):
    # Number of missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)
    # Combine both into one table and give the columns descriptive names
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    return mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

Answers:

If it's just a matter of counting NaN values in one pandas column, here is a quick way:

import pandas as pd
## df1 as an example data frame 
## col1 name of column for which you want to calculate the nan values
sum(pd.isnull(df1['col1']))

Answers:

You can use the value_counts method and look up the count stored under np.nan:

s.value_counts(dropna=False)[np.nan]

Answers:

I used the solution proposed by @sushmit in my code.
A possible variation of the same is:

colNullCnt = []
for z in range(len(df1.columns)):
    colNullCnt.append([df1.columns[z], sum(pd.isnull(df1[df1.columns[z]]))])

The advantage of this is that it returns the count for each of the columns in the DataFrame.

Answers:

Based on the answer that was given, plus some improvements, this is my approach:

def PercentageMissin(Dataset):
    """Return the percentage of missing values in each column of a DataFrame."""
    if isinstance(Dataset, pd.DataFrame):
        adict = {}  # maps each column name to the percentage of missing values in that column
        for col in Dataset.columns:
            adict[col] = (np.count_nonzero(Dataset[col].isnull()) * 100.0) / len(Dataset[col])
        return pd.DataFrame(adict, index=['% of missing'], columns=adict.keys())
    else:
        raise TypeError("can only be used with a pandas DataFrame")
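A shorter, vectorized way to get the same percentages is sketched below. It relies on the fact that the mean of the boolean mask returned by isnull() is the fraction of missing values per column. The sample DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],
    'b': [np.nan, 1, np.nan, np.nan],
})

# isnull() yields booleans; their column-wise mean is the fraction of NaN values.
pct_missing = df.isnull().mean() * 100

print(pct_missing)
```

This returns a Series indexed by column name rather than a one-row DataFrame, which may or may not suit your use case.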