How to Handle Missing Data in Python? [Explained in 5 Easy Steps] (2024)

When we work in the data science industry, we need a good working knowledge of NumPy, Pandas, Sklearn, and similar libraries in order to build end-to-end machine learning models. One of the steps in the data science lifecycle is Data Cleaning, which is the process of finding and correcting inaccurate or incorrect data in the dataset. Part of this process is deciding what to do about the values that are missing. In real life, many datasets have many missing values, and this article will teach you how to handle missing data.

Learning Objectives

  • In this article, we will learn all about finding and handling missing data
  • We will also look at hands-on tutorials that teach beginners how to handle missing data using Python and pandas

Table of contents

  • Why Fill in the Missing Data?
  • How to Know If the Data Has Missing Values?
  • Different Methods of Dealing With Missing Data
    • 1. Deleting the column with missing data
    • 2. Deleting the row with missing data
    • 3. Filling the Missing Values – Imputation
    • 4. Other imputation methods
    • 5. Filling with a Regression Model
  • Conclusion
  • Frequently Asked Questions

Why Fill in the Missing Data?

It is necessary to handle missing values in datasets because most of the machine learning models that you want to use will raise an error if you pass NaN values into them. The easiest way is to just fill them with 0, but this can reduce your model accuracy significantly.

There are many methods available for filling missing values. To choose the best one, you need to understand the type of missing value and its significance before you start filling or deleting data.

First, let's look at the dataset:

In this article, I will be working with the Titanic Dataset from Kaggle.

For downloading the dataset, use the following link – https://www.kaggle.com/c/titanic

See that the data contains many columns like PassengerId, Name, Age, etc. We won’t be working with all the columns in the dataset, so I am going to be deleting the columns I don’t need.

Import the required libraries, numpy and pandas, using import numpy and import pandas.

We will then use the pandas read_csv function to read the dataset.
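As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the real file so that it runs anywhere. The three rows are made up for illustration, and the file name train.csv mentioned in the comment is an assumption about how the Kaggle download is saved locally.

```python
import io

import pandas as pd

# A few made-up rows standing in for the Kaggle file (illustrative values only)
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,,0,0,T3,7.92,,S\n"
)

# With the real download, this would be something like pd.read_csv("train.csv")
df = pd.read_csv(sample_csv)

print(df.shape)
print(df.head())
```

Empty fields in the CSV, such as the third passenger's Age, are read in as NaN automatically.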

# Drop the columns that will not be used as features
df.drop("Name", axis=1, inplace=True)
df.drop("Ticket", axis=1, inplace=True)
df.drop("PassengerId", axis=1, inplace=True)
df.drop("Cabin", axis=1, inplace=True)
df.drop("Embarked", axis=1, inplace=True)

The dataset also contains categorical values, which need to be converted to numbers using Label Encoding or One Hot Encoding.

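from sklearn.preprocessing import LabelEncoder

# Encode the Sex column as integers
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Keep an independent copy that still contains Survived
# (a plain newdf = df would only be a second name for the same DataFrame)
newdf = df.copy()

# Splitting the data into X and y
y = df['Survived']
df.drop("Survived", axis=1, inplace=True)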
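The code above uses label encoding. For comparison, one-hot encoding could be sketched with pandas' get_dummies as shown here. The toy DataFrame is made up, and dummy_na=True is an optional choice that gives missing categories their own indicator column.

```python
import pandas as pd

# Made-up rows; the None stands for a missing Embarked value
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# One indicator column per category, plus one column that flags missing values
encoded = pd.get_dummies(toy, columns=["Embarked"], dummy_na=True)

print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of adding more columns.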

How to Know If the Data Has Missing Values?

Missing values are usually represented as NaN, null, or None in the dataset.

The df.info() function gives information about the dataset and is one of the most used functions in data analysis. It provides the column names, the number of non-null values in each column, and the data type of each column. From this we can find out which columns contain null values and how many, and by looking at the data types we can get an idea of which value to replace the nulls with.

Sometimes, though, instead of np.nan, null values are present as empty strings or other placeholder values, so we must be careful and make sure that all the null values in our dataset are np.nan values.
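As a rough sketch of that clean-up, placeholder values can be converted to np.nan before any counting is done. The placeholder spellings used here ("", "N/A" and -1.0) are hypothetical; check what your own dataset actually uses.

```python
import numpy as np
import pandas as pd

# Made-up data in which missing values are hidden behind placeholders
raw = pd.DataFrame({
    "Age": ["22", "", "N/A", "35"],
    "Fare": [7.25, -1.0, 8.05, 53.1],
})

cleaned = raw.copy()
# Turn the text placeholders into NaN, then convert the column to numbers
cleaned["Age"] = pd.to_numeric(cleaned["Age"].replace(["", "N/A"], np.nan))
# Turn the numeric sentinel into NaN
cleaned["Fare"] = cleaned["Fare"].replace(-1.0, np.nan)

print(cleaned.isnull().sum())
```

After this step, functions such as info() and isnull() see every missing value.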

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Pclass  891 non-null    int64
 1   Sex     891 non-null    int64
 2   Age     714 non-null    float64
 3   SibSp   891 non-null    int64
 4   Parch   891 non-null    int64
 5   Fare    891 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 41.9 KB

See that there are null values in the column Age.

The second way of finding whether we have null values in the data is by using the isnull() function.

print(df.isnull().sum())

Pclass      0
Sex         0
Age       177
SibSp       0
Parch       0
Fare        0
dtype: int64

See that all the null values in the dataset are in the column – Age.
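Raw counts are easier to judge as a share of the rows. A small sketch on made-up data: taking the mean of isnull() gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up data: half of the ages are missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# The mean of a True/False column is the fraction of True values
missing_share = toy.isnull().mean().sort_values(ascending=False)

print((missing_share * 100).round(1))
```

In the Titanic data above, 177 missing ages out of 891 rows works out to roughly 20%.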

Let’s try fitting the data using logistic regression.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

---------------------------------------------------------------------------
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

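See that the logistic regression model does not work because we have NaN values in the dataset. Only some machine learning algorithms can work with missing data directly, such as KNN variants that ignore missing entries when computing distances, or scikit-learn's histogram-based gradient boosting models, which accept NaN natively.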

Different Methods of Dealing With Missing Data

Let’s now look at the different methods that you can use to deal with the missing data.

1. Deleting the column with missing data

In this case, let's delete the Age column, then fit the model and check its accuracy.

But this is an extreme case and should only be used when there are many null values in the column.

updated_df = df.dropna(axis=1)
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Pclass  891 non-null    int64
 1   Sex     891 non-null    int64
 2   SibSp   891 non-null    int64
 3   Parch   891 non-null    int64
 4   Fare    891 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 34.9 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y, test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7947761194029851

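See that we can achieve an accuracy of about 79.5%. Because train_test_split is called without a fixed random_state, the exact figure changes from run to run.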

The problem with this method is that we may lose valuable information from that feature, because we have deleted it completely on account of some null values. It should only be used if the column has too many null values to be useful.
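One way to make "too many" concrete is to drop only the columns whose share of missing values exceeds a cut-off. The sketch below uses made-up data, and the 50% threshold is an arbitrary assumption, not a recommendation from this article.

```python
import numpy as np
import pandas as pd

# Made-up data: Cabin is mostly missing, Age only partly
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

max_missing = 0.5  # hypothetical cut-off; tune it for your own data

# Keep the columns whose fraction of missing values is within the cut-off
missing_share = toy.isnull().mean()
kept = toy.loc[:, missing_share <= max_missing]

print(kept.columns.tolist())
```

Here Cabin is dropped, while Age stays and can be handled with one of the methods that follow.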

2. Deleting the row with missing data

If a row has missing data, you can delete the entire row, along with all the features in it.

In dropna, axis=1 drops the columns that contain NaN values, and axis=0 drops the rows that contain NaN values.
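Before applying this to the Titanic data, here is a small sketch of the two axis values on a made-up DataFrame. The subset argument, which limits the check to chosen columns, is included as an optional extra.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Age and one missing Fare
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
    "Pclass": [3, 1, 3],
})

rows_dropped = toy.dropna(axis=0)       # drops every row that has any NaN
cols_dropped = toy.dropna(axis=1)       # drops every column that has any NaN
age_only = toy.dropna(subset=["Age"])   # drops rows only where Age is NaN

print(rows_dropped.shape, cols_dropped.shape, age_only.shape)
```

Using subset keeps rows that are only missing values in columns you do not plan to use.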

updated_df = newdf.dropna(axis=0)
y1 = updated_df['Survived']
updated_df.drop("Survived", axis=1, inplace=True)
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Pclass  714 non-null    int64
 1   Sex     714 non-null    int64
 2   Age     714 non-null    float64
 3   SibSp   714 non-null    int64
 4   Parch   714 non-null    int64
 5   Fare    714 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 39.0 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.8232558139534883

In this case, the accuracy (about 82.3%) is higher than before, possibly because the Age column carries more useful information than we expected. The price is that 177 of the 891 rows were discarded.

3. Filling the Missing Values – Imputation

In this case, we will be filling the missing values with a certain number.

The possible ways to do this are:

  1. Filling the missing data with the mean or median value if it’s a numerical variable.
  2. Filling the missing data with mode if it’s a categorical value.
  3. Filling the numerical value with 0 or -999, or some other number that will not occur in the data. This can be done so that the machine can recognize that the data is not real or is different.
  4. Filling the categorical value with a new type for the missing values.

You can use the fillna() function to fill the null values in the dataset.
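The four options listed above can be sketched with fillna() on a small made-up DataFrame. The sentinel -999 and the label "Missing" are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing number and one missing category
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", None, "S", "C"],
})

# 1. Mean or median for a numerical column
mean_filled = toy["Age"].fillna(toy["Age"].mean())
median_filled = toy["Age"].fillna(toy["Age"].median())

# 2. Mode (most frequent value) for a categorical column
mode_filled = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# 3. A sentinel number that cannot occur in the real data
sentinel_filled = toy["Age"].fillna(-999)

# 4. A new category that stands for "missing"
new_category = toy["Embarked"].fillna("Missing")

print(median_filled.tolist())
print(mode_filled.tolist())
```

Each call returns a new Series, so the original column is left unchanged until you assign the result back.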

# Work on a copy that still contains Survived, so the original data keeps its missing ages
updated_df = newdf.copy()
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Sex       891 non-null    int64
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64
 5   Parch     891 non-null    int64
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 48.9 KB

y1 = updated_df['Survived']
updated_df.drop("Survived", axis=1, inplace=True)
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7798507462686567

The accuracy comes out to 77.98%, which is lower than in the previous case.

This will not happen in general. Here it suggests that the mean is not a good stand-in for the missing ages, although part of the difference may simply come from the different random train/test split used in each run.
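To compare imputation strategies without the noise of a single random split, one option is cross-validation with the imputer inside a pipeline. This is only a sketch: the data below is synthetic (not the Titanic dataset), and the column names and the rule that generates y are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the features (made-up data)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.integers(0, 2, n),
})
y = ((X["Sex"] == 1) | (X["Age"] < 12)).astype(int)

# Hide roughly 20% of the ages
X.loc[rng.random(n) < 0.2, "Age"] = np.nan

# Score each strategy on the same five folds
scores = {}
for strategy in ["mean", "median"]:
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        LogisticRegression(max_iter=1000),
    )
    scores[strategy] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```

Putting the imputer in the pipeline means the fill value is learned from the training folds only, so the held-out fold does not leak into it.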

4. Other imputation methods

Just like fillna, pandas has another function called interpolate. By default it uses linear interpolation, which estimates unknown values from the known data points on either side.

We can also use the bfill function, which backfills each unknown value with the next known value below it.
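A short sketch of both functions on a made-up Series shows the difference between them.

```python
import numpy as np
import pandas as pd

# Made-up series with a gap of two missing values
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0])

interpolated = s.interpolate()  # linear by default: fills the gap with 20.0 and 30.0
backfilled = s.bfill()          # copies the next known value (40.0) upwards

print(interpolated.tolist())
print(backfilled.tolist())
```

Both functions depend on the order of the rows, so they make most sense for ordered data such as time series.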

Imputation with an additional column

Use the SimpleImputer class from the sklearn.impute module to impute the values.

Pass the strategy as an argument when creating it. It can be mean, median, or most_frequent (the mode).

The problem with the previous model is that it does not know whether a value came from the original data or was imputed. To give the model this information, we add an Ageismissing column, which is True where Age was null and False where it was not.

# Work on a copy of the features; Age must still contain its missing values here
updated_df = df.copy()
# Record which ages were missing before imputing
updated_df['Ageismissing'] = updated_df['Age'].isnull()
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='median')
# fit_transform returns an array, so assign the result back to the Age column
updated_df[['Age']] = my_imputer.fit_transform(updated_df[['Age']])
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Pclass        891 non-null    int64
 1   Sex           891 non-null    int64
 2   Age           891 non-null    float64
 3   SibSp         891 non-null    int64
 4   Parch         891 non-null    int64
 5   Fare          891 non-null    float64
 6   Ageismissing  891 non-null    bool
dtypes: bool(1), float64(2), int64(4)
memory usage: 42.8 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7649253731343284
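SimpleImputer can also build the indicator column itself through its add_indicator parameter. The sketch below runs on made-up data, and the output column names are my own labels rather than anything produced by scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing age
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# Column names chosen here for readability (hypothetical labels)
result = pd.DataFrame(filled, columns=["Age", "Fare", "Age_was_missing"])

print(result)
```

Only Age gets an indicator column, because Fare had no missing values when the imputer was fitted.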

5. Filling with a Regression Model

In this case, the null values in one column are filled by fitting a regression model on the other columns in the dataset.

That is, the regression model uses all the columns except Age as X and the Age column as y.

After filling the values in the Age column, we will use logistic regression to calculate the accuracy.

from sklearn.linear_model import LinearRegression

# newdf is the encoded copy that still contains Survived and the missing ages
df = newdf.copy()

# Rows with a missing Age are the ones to fill; the remaining rows train the regression
testdf = df[df['Age'].isnull()].copy()
traindf = df[df['Age'].notnull()].copy()

# Fit Age on all the other columns
y = traindf['Age']
traindf.drop("Age", axis=1, inplace=True)
lr = LinearRegression()
lr.fit(traindf, y)

# Predict the missing ages and put the Age column back in both frames
testdf.drop("Age", axis=1, inplace=True)
pred = lr.predict(testdf)
testdf['Age'] = pred
traindf['Age'] = y

# Train logistic regression on the rows with known ages
y = traindf['Survived']
traindf.drop("Survived", axis=1, inplace=True)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(traindf, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

# Evaluate on the rows whose ages were imputed
y_test = testdf['Survived']
testdf.drop("Survived", axis=1, inplace=True)
pred = lr.predict(testdf)
print(metrics.accuracy_score(pred, y_test))

0.8361581920903954

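See that this model scores higher than the previous ones (about 83.6%), as we are using a dedicated regression model to fill in the missing values. Note, however, that the evaluation here is different: the classifier is trained on the rows with known ages and tested on the 177 rows whose ages were imputed, rather than on a random 70/30 split, and the regression also sees Survived when predicting Age. The figures are therefore not directly comparable.

We can also use models such as KNN for filling in the missing values. But sometimes, using models for imputation can result in overfitting the data.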
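As a hedged sketch of the KNN option, scikit-learn provides KNNImputer, which fills a missing value from the rows that are closest on the other features. The toy data and the choice of n_neighbors=2 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: the third passenger's age is missing
toy = pd.DataFrame({
    "Age": [20.0, 22.0, np.nan, 60.0, 62.0],
    "Fare": [7.0, 8.0, 7.5, 80.0, 82.0],
})

# The missing age becomes the average age of the 2 rows with the closest Fare
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Because the distances depend on the scale of each feature, it is usually worth scaling the columns before using this imputer on real data.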

Imputing missing values using the regression model allowed us to improve our model compared to dropping those columns.

But you have to understand that there is no perfect way of filling the missing values in a dataset.

Conclusion

Each of the methods may work well with different types of datasets. You have to experiment with different methods to check which one works best for your dataset. Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, the analysis may be biased.
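One rough way to look for systematic missingness is to compare the share of missing values across the groups of another column. The sketch below uses made-up data; a large difference between groups is a hint, not proof, that values are not missing completely at random.

```python
import numpy as np
import pandas as pd

# Made-up data in which ages are missing more often in third class
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, 54.0, 27.0, np.nan, np.nan, 22.0, np.nan, np.nan],
})

# Fraction of missing ages within each passenger class
missing_by_class = toy["Age"].isnull().groupby(toy["Pclass"]).mean()

print(missing_by_class)
```

If the shares differ a lot, dropping rows or filling with a single overall mean can skew the data towards some groups.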

Key Takeaways

  • This article taught us about the different ways of handling missing values in our dataset.
  • If there are way too many missing values in a column then you can drop that column. Otherwise we can impute missing values with mean, median and mode.
  • Some functions that can be used in pandas for handling missing values are the fillna, dropna, bfill and interpolate.

Frequently Asked Questions

Q1. Which is the best method to fill missing data in Python?

A. There is no "best" way to fill missing values in pandas per se. However, fillna() is the most widely used function for filling NaN values in a dataframe. With it, you can fill the values in each column with the mean, median or mode, as appropriate.

Q2. What is the general idea of handling missing values in Python?

A. Missing values can bias the results of your machine learning models and can result in decreased accuracy. That is why we must handle these values in the correct way, so that the data is imputed correctly.

Q3. How to use the pandas library to handle missing values in a dataset?

A. Pandas has many different functions that you can use to handle missing values. Some of these functions are the fillna function, the bfill function and the interpolate function.

Author: Msgr. Benton Quitzon
