How to Replace Missing Values (NA) in R: na.omit & na.rm (2024)

Missing values in data science arise when an observation is absent from a column of a data frame or contains a character value instead of a numeric value. Missing values must be dropped or replaced in order to draw correct conclusions from the data.

In this tutorial, we will learn how to deal with missing values using the dplyr library. dplyr is part of an ecosystem of packages for carrying out data analysis.

In this tutorial, you will learn

mutate()
Exclude Missing Values (NA)
Impute Missing Values (NA) with the Mean and Median

mutate()

The fourth verb in the dplyr library, mutate(), is used to create a new variable or change the values of an existing variable.

We will proceed in two parts. We will learn how to:

exclude missing values from a data frame
impute missing values with the mean and median

The verb mutate() is very easy to use. We can create a new variable following this syntax:

mutate(df, name_variable_1 = condition, ...)

arguments:
- df: Data frame used to create a new variable
- name_variable_1: Name and the formula to create the new variable
- ...: No limit constraint. Possibility to create more than one variable inside mutate()
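As a minimal sketch of this syntax, the example below creates one new variable and overwrites an existing one in a single mutate() call. The data frame and its column names (price, qty, total) are hypothetical and are not part of the tutorial's dataset.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the mutate() syntax
df <- data.frame(price = c(10, 20, 30), qty = c(1, 2, 3))

df_new <- mutate(df,
                 total = price * qty,  # new column built from two existing ones
                 qty   = qty * 10)     # existing column overwritten afterwards
df_new
```

Arguments are evaluated in order, so total is computed from the original qty before qty is overwritten.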

Exclude Missing Values (NA)

The na.omit() function, which ships with base R (the stats package) rather than dplyr, is a simple way to exclude missing observations. Dropping all the NA from the data is easy, but that does not make it the best solution. During analysis, it is wise to consider a variety of methods for dealing with missing values.

To tackle the problem of missing observations, we will use the titanic dataset. In this dataset, we have access to the information of the passengers on board during the tragedy. This dataset has many NA that need to be taken care of.

We will load the CSV file from the internet and then check which columns have NA. The columns with missing data can be listed with colnames(df)[apply(df, 2, anyNA)]. In this tutorial the data frame is called df_titanic, and the resulting column names (age and fare) are stored in an object called list_na.
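The tutorial's data file is not reproduced here, so the sketch below uses a small hypothetical data frame (df_toy) in place of the Titanic data. It shows how the columns with NA could be listed and how na.omit() drops the incomplete rows.

```r
# Hypothetical toy data standing in for the Titanic file
df_toy <- data.frame(
  name = c("A", "B", "C", "D"),
  age  = c(22, NA, 35, 41),
  fare = c(7.25, 71.28, NA, 8.05)
)

# Names of the columns that contain at least one NA
list_na <- colnames(df_toy)[apply(df_toy, 2, anyNA)]
list_na

# Drop every row that has an NA in any column
df_toy_drop <- na.omit(df_toy)
nrow(df_toy_drop)
```

Only the rows for "A" and "D" survive, because each of the other two rows is missing one value.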

Impute Missing data with the Mean and Median

We could also impute (populate) missing values with the median or the mean. A good practice is to create two separate variables for the mean and the median. Once created, we can replace the missing values with the newly formed variables.

We will use the apply() function to compute the mean of the columns with NA. Let’s see an example.

Step 1) Earlier in the tutorial, we stored the names of the columns with missing values in the object called list_na. We will use this object.

Step 2) Now we need to compute the mean with the argument na.rm = TRUE. This argument is required because the columns have missing data; it tells R to ignore them.
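To see why the argument matters, here is a short illustration on a hypothetical three-element vector (not taken from the dataset).

```r
# Hypothetical vector with one missing value
x <- c(10, NA, 30)

mean_with_na    <- mean(x)                # the NA propagates to the result
mean_without_na <- mean(x, na.rm = TRUE)  # the NA is dropped before averaging

mean_with_na
mean_without_na
```

The first call returns NA, while the second averages only the two observed values.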

# Create mean
average_missing <- apply(df_titanic[, colnames(df_titanic) %in% list_na], 2, mean, na.rm = TRUE)
average_missing

Code Explanation:

We pass 4 arguments to the apply() function.

df: df_titanic[,colnames(df_titanic) %in% list_na]. This code selects the columns whose names are in the list_na object (i.e. “age” and “fare”)
2: Compute the function on the columns
mean: Compute the mean
na.rm = TRUE: Ignore the missing values

Output:

##      age     fare
## 29.88113 33.29548

We successfully created the mean of the columns containing missing observations. These two values will be used to replace the missing observations.

Step 3) Replace the NA Values

The verb mutate() from the dplyr library is useful for creating a new variable. We don’t necessarily want to change the original column, so we can create a new variable without the NA. mutate() is easy to use: we just choose a variable name and define how to create this variable. Here is the complete code:

# Create new variables with the NA replaced by the mean
df_titanic_replace <- df_titanic %>%
  mutate(replace_mean_age = ifelse(is.na(age), average_missing[1], age),
         replace_mean_fare = ifelse(is.na(fare), average_missing[2], fare))

Code Explanation:

We create two variables, replace_mean_age and replace_mean_fare, as follows:

replace_mean_age = ifelse(is.na(age), average_missing[1], age)
replace_mean_fare = ifelse(is.na(fare), average_missing[2], fare)

Where a value in the column age is missing, it is replaced with the first element of average_missing (the mean of age); otherwise the original value is kept. The same logic applies to fare.

sum(is.na(df_titanic_replace$age))

Output:

## [1] 263

Check the new variable after the replacement:

sum(is.na(df_titanic_replace$replace_mean_age))

Output:

## [1] 0

The original column age has 263 missing values, while the newly created variable has replaced them with the mean of the variable age.
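Steps 2 and 3 can be reproduced end to end on a small hypothetical data frame (df_toy, not the Titanic data). As a design choice in this sketch, the mean is looked up by column name rather than by position, which avoids relying on the column order.

```r
library(dplyr)

# Hypothetical toy data with one NA in each column
df_toy <- data.frame(age  = c(22, NA, 35, 41),
                     fare = c(7.25, 71.28, NA, 8.05))

# Column means, ignoring the missing values
average_missing <- apply(df_toy, 2, mean, na.rm = TRUE)

# New variable: the mean where age is NA, the original value otherwise
df_toy_replace <- df_toy %>%
  mutate(replace_mean_age = ifelse(is.na(age), average_missing["age"], age))

sum(is.na(df_toy_replace$age))               # NAs left in the original column
sum(is.na(df_toy_replace$replace_mean_age))  # NAs left in the new column
```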

Step 4) We can replace the missing observations with the median as well.

median_missing <- apply(df_titanic[, colnames(df_titanic) %in% list_na], 2, median, na.rm = TRUE)
df_titanic_replace <- df_titanic %>%
  mutate(replace_median_age = ifelse(is.na(age), median_missing[1], age),
         replace_median_fare = ifelse(is.na(fare), median_missing[2], fare))
head(df_titanic_replace)

Step 5) A big data set could have lots of missing values, and the above method could be cumbersome. We can execute all the steps above in one line of code using the sapply() function, although we would not know the values of the mean and median.

sapply() does not create a data frame, so we can wrap the sapply() call within data.frame() to create a data frame object.

# Quick code to replace missing values with the mean
df_titanic_impute_mean <- data.frame(
  sapply(
    df_titanic,
    function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)))
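One caveat worth checking on your own data: sapply() simplifies its result to a matrix, so when the data frame also contains character columns, the numeric columns may come back as text. The sketch below is an alternative and not the tutorial's method. It uses lapply() with a hypothetical helper, impute_mean(), on a hypothetical toy data frame, and leaves non-numeric columns untouched.

```r
# Hypothetical toy data mixing a character and a numeric column
df_toy <- data.frame(
  name = c("A", "B", "C"),
  age  = c(20, NA, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: impute the mean only for numeric columns
impute_mean <- function(x) {
  if (!is.numeric(x)) return(x)
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

# Assigning into df[] keeps the data frame structure and column types
df_toy_imputed <- df_toy
df_toy_imputed[] <- lapply(df_toy, impute_mean)
str(df_toy_imputed)
```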

Summary

We have three methods to deal with missing values:

Exclude all of the missing observations
Impute with the mean
Impute with the median

The following table summarizes how to list and remove the missing observations:

Library	Objective	Code
base	List missing observations	colnames(df)[apply(df, 2, anyNA)]
stats (base R)	Remove all missing values	na.omit(df)

Imputation with the mean or median can be done in two ways:

Using apply
Using sapply

Method	Details	Advantages	Disadvantages
Step by step with apply	Check columns with missing values, compute mean/median, store the value, replace with mutate()	You know the values of the mean/median	More execution time. Can be slow with a big dataset
Quick way with sapply	Use sapply() and data.frame() to automatically search and replace missing values with mean/median	Short code and fast	You don’t know the imputation values

FAQs

How to Replace Missing Values (NA) in R: na.omit & na.rm?

Use na.omit() to remove entire rows with missing values from a data frame. Use na.rm = TRUE as an argument in functions like mean() to perform calculations while ignoring missing values.

How to replace all NA values in R?

Method 1: Using the is.na() Function

The is.na(data) call returns a logical vector that is TRUE for NAs and FALSE for non-missing values. This logical vector is then used to index the data vector, so that all the NAs can be replaced with 0s.
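A minimal sketch of that indexing approach, on a hypothetical vector named data:

```r
# Hypothetical vector with two missing values
data <- c(1, NA, 3, NA, 5)

# is.na(data) marks the NA positions; assigning 0 fills only those positions
data[is.na(data)] <- 0
data
```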

How do you exclude missing values in R?

In base R, use na.omit() to remove all observations with missing data on ANY variable in the dataset, or use subset() to filter out cases that are missing on a subset of variables.
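The difference between the two approaches can be sketched on a hypothetical data frame in which the age and cabin columns are missing in different rows:

```r
# Hypothetical toy data: NAs fall in different rows for age and cabin
df_toy <- data.frame(
  id    = 1:4,
  age   = c(22, NA, 35, 41),
  cabin = c(NA, "B5", NA, "C2")
)

# Drops a row if ANY column is missing
df_complete <- na.omit(df_toy)

# Drops a row only if age is missing; NAs in cabin are tolerated
df_age_known <- subset(df_toy, !is.na(age))

nrow(df_complete)
nrow(df_age_known)
```

Here na.omit() keeps a single row, while the subset() call keeps three.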

How do you replace NA by blank in R?

You can replace NA values with a blank space or an empty string in the columns of an R data frame (data.frame) by using the is.na() and replace() functions.
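A short sketch on a hypothetical character column named city:

```r
# Hypothetical data frame with one missing city
df_toy <- data.frame(city = c("Paris", NA, "Rome"), stringsAsFactors = FALSE)

# replace(x, positions, value): put "" wherever city is NA
df_toy$city <- replace(df_toy$city, is.na(df_toy$city), "")
df_toy$city
```

Note that doing the same on a numeric column would turn the whole column into character, since "" is not a number.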

What is the difference between na.rm and na.omit?

na.omit() is a function that returns a copy of the object with the NA values (or, for a data frame, the rows containing them) removed, whereas na.rm is an argument of functions such as sum() and mean() that tells them to skip NA values during the calculation. For example, if a vector has five values in total, one of them NA, then summing it with na.rm = TRUE adds up only the four non-missing values.
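Both routes give the same sum, as this sketch on a hypothetical five-element vector shows:

```r
# Hypothetical vector: five values, one of them NA
x <- c(1, 2, NA, 4, 5)

# na.rm is an argument: the NA is skipped inside sum()
sum_rm <- sum(x, na.rm = TRUE)

# na.omit() is a function: it returns x without the NA, which is then summed
x_clean  <- na.omit(x)
sum_omit <- sum(x_clean)

length(x_clean)
sum_rm
sum_omit
```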

How to replace missing values in dataset?

In pandas (Python), null values in a DataFrame can be filled with the fillna(), replace(), and interpolate() functions, which replace NaN values with a value you specify or compute.

How to replace NA values in R with mean?

To replace the NA values in a column of an R data frame with the mean of that column:

df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df$y[is.na(df$y)] <- mean(df$y, na.rm = TRUE)
df$z[is.na(df$z)] <- mean(df$z, na.rm = TRUE)
df

What does na.rm do in R?

na.rm is an argument that tells functions such as mean() to remove the missing values from the input vector before computing the result.

How to remove rows with NA in a specific column in R?

Use drop_na() from the tidyr package, which drops rows containing NA; passing a column name restricts the check to that column. The tidyr package must be installed before it can be used.

How to make empty rows NA in R?

Method 1: Using the nrow() function

The nrow() function in R returns the number of rows in a data frame. A new row can be inserted at the end of the data frame by indexing one position past the last row. Assigning NA to that new row inserts blank entries.
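A minimal sketch of that technique, on a hypothetical two-row data frame:

```r
# Hypothetical toy data with two rows
df_toy <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# Index one row past the end and assign NA to every column of that row
df_toy[nrow(df_toy) + 1, ] <- NA
df_toy
```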

When can we use na.omit() to remove missing data?

The name “na.omit” stands for “omit NAs.” This function is particularly useful when working with datasets that contain missing values and you want to exclude observations with missing data from your analysis. Parameter: object: a data frame, matrix, or vector.

What is NA in missing values in R?

In R, missing values are represented by the symbol NA (not available). Undefined numeric results (e.g., 0/0) are represented by the symbol NaN (not a number); dividing a nonzero number by zero gives Inf or -Inf instead.
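The distinction can be checked with base R's predicate functions. The vector below is a hypothetical example:

```r
# Hypothetical vector holding a regular value, NA, NaN and Inf
x <- c(1, NA, 0 / 0, 1 / 0)

na_flags  <- is.na(x)        # TRUE for both NA and NaN
nan_flags <- is.nan(x)       # TRUE only for NaN
inf_flags <- is.infinite(x)  # TRUE only for Inf

na_flags
nan_flags
inf_flags
```

Because is.na() is also TRUE for NaN, code that replaces NA values through is.na() replaces NaN values as well.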

How to replace NA values?

Replace NaN Values with Zeros using pandas replace()

In pandas (Python), DataFrame.replace() is a method used to replace a string, regex, list, dictionary, etc. in a DataFrame.

How to change missing values in R?

We have three methods to deal with missing values:

Exclude all of the missing observations.
Impute with the mean.
Impute with the median.

How do I remove NA values from a list in R?

To remove all rows having NA, we can use the na.omit() function. For example, if we have a data frame called df that contains some NA values, then we can remove all rows that contain at least one NA by using the command na.omit(df).

How to replace multiple values in a column in R?

Using dplyr, you can efficiently replace multiple values in a data frame using functions like case_when() or recode() within mutate(). Whether you prefer the flexibility of case_when() or the simplicity of recode(), dplyr provides intuitive tools for data manipulation tasks in R.