Saturday, 22 July 2017

Data Science & Machine Learning - 4.4 Pandas DataFrames Missing Values

Hi friends,

In the previous couple of posts under Data Science & Machine Learning, we learnt about Pandas DataFrames. In this post, we will discuss another very important topic in data analysis: how to deal with missing values in Pandas DataFrames.

DataFrames Missing Values

By “missing values”, we simply mean values which are NaN or are not present for whatever reason. An example of a dataset with missing values could be a collection of financial time series, where some of the series start on different dates. Values prior to a series' start date would then generally be marked as missing.

The reason we cannot ignore missing values is that some of the most popular and widely used machine learning libraries in Python, such as scikit-learn and TensorFlow, caution the user about missing values in their datasets. Many machine learning algorithms expect all the input values to be numerical and to hold meaning. A basic strategy that both libraries suggest is to remove the rows and/or columns that contain missing values.

So, now that we agree that we need to deal with missing values, let's see various ways to handle missing values in Pandas DataFrames.

Note: All the commands discussed below are run in the Jupyter Notebook environment. See this post on Jupyter Notebook to learn more about it.

Let's first create a sample DataFrame with some missing values:
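A minimal sketch of such a DataFrame could look like the following. The random values, the row labels 'a' to 'e' and the column names 'one' and 'three' are assumptions made for illustration; only column 'two' and the use of reindex() are mentioned in this post.

```python
import numpy as np
import pandas as pd

# Assumed sample data: 3 labelled rows and 3 columns of random numbers.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((3, 3)),
                  index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# Reindexing with labels that are not in the original index ('b' and 'd')
# adds those rows and fills every column in them with NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e'])
print(df)
```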


Notice that I have used the reindex() method to introduce some NaN values into the DataFrame: any row label that does not exist in the original index is added as a row of NaN values.

The easiest way to deal with missing values is to just drop the rows or columns that contain NaN values using the dropna() method:
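As a rough sketch, assuming a small DataFrame with illustrative labels and values (none of them taken from the original listing), dropping the rows looks like this:

```python
import numpy as np
import pandas as pd

# Assumed sample data; rows 'b' and 'd' are all NaN after reindexing.
df = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                  index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# dropna() returns a new DataFrame without the rows that contain NaN;
# df itself is left untouched.
rows_kept = df.dropna()
print(rows_kept)
```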


By default, the axis parameter is set to 0, which means dropna() drops every row of the DataFrame that contains at least one NaN value. We can drop columns instead of rows by setting the axis parameter to 1:
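A small sketch of the column-wise variant, again with assumed labels and values. The two NaN entries of column 'two' are overwritten with arbitrary numbers so that at least one column survives:

```python
import numpy as np
import pandas as pd

# Assumed sample data; rows 'b' and 'd' are all NaN after reindexing.
df = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                  index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# Fill the two missing entries of column 'two' with arbitrary values.
df.loc[['b', 'd'], 'two'] = [10.0, 20.0]

# axis=1 drops every column that still contains a NaN value.
cols_kept = df.dropna(axis=1)
print(cols_kept)
```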


Notice that I have modified the two NaN elements of column 'two' to prevent the result from being empty. Also, notice that the result contains only column 'two', since it is the only column in which all values are non-NaN.

However, dropping NaN values is usually not the preferred approach, since it also discards the valid data stored in the same rows or columns. A better way to deal with NaN values is often to fill them with a statistical measure such as the mean or median, or with some custom value, using the fillna() method.
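The three filling strategies could be sketched as follows. The sample data and variable names are assumptions for illustration; passing a Series such as df.mean() to fillna() fills each column with its own value.

```python
import numpy as np
import pandas as pd

# Assumed sample data; rows 'b' and 'd' are all NaN after reindexing.
df = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                  index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# Fill each column's NaN values with that column's mean
# (mean() and median() skip NaN values by default).
filled_mean = df.fillna(df.mean())

# Same idea with the column medians.
filled_median = df.fillna(df.median())

# Fill every NaN with one custom value.
filled_zero = df.fillna(0)

print(filled_mean)
```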



Remember that dropna() and fillna() return a new DataFrame by default. To make the changes permanent, either set the inplace parameter to True or assign the result back to a variable.
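A short sketch of the difference, using assumed sample data and a custom fill value of 0:

```python
import numpy as np
import pandas as pd

# Assumed sample data; rows 'b' and 'd' are all NaN after reindexing.
df = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                  index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# Without inplace=True the original DataFrame keeps its NaN values.
filled_copy = df.fillna(0)
still_has_nan = bool(df.isna().any().any())

# With inplace=True the DataFrame itself is modified.
df.fillna(0, inplace=True)
print(df)
```

Assigning the result back, as in df = df.fillna(0), has the same effect and is often easier to read.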

You can refer to this official link to learn more about other ways to deal with missing values in Python. In the next post, we will see the SQL-like functionality of groupby, join, etc. supported by Pandas DataFrames.