In pandas, a missing value (NA: not available) is mainly represented by nan (not a number). None is also considered a missing value.
Contents
- Missing values caused by reading files, etc.
- nan (not a number) is considered a missing value
- None is also considered a missing value
- String is not considered a missing value
- Infinity inf is not considered a missing value by default
- pd.NA is an experimental value (as of 2.0.3)
The sample code in this article uses pandas version 2.0.3. NumPy and math are also imported.
import math
import numpy as np
import pandas as pd
print(pd.__version__)
# 2.0.3
source: pandas_nan_none_na.py
Missing values caused by reading files, etc.
Reading a CSV file with missing values generates nan. When printed with print(), this missing value is represented as NaN.
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')[:3]
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      NaN   NaN   NaN    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
You can use methods like isnull(), dropna(), and fillna() to detect, remove, and replace missing values.
- pandas: Detect and count NaN (missing values) with isnull(), isna()
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
print(df.isnull())
#     name    age  state  point  other
# 0  False  False  False   True   True
# 1   True   True   True   True   True
# 2  False   True  False   True   True
print(df.dropna(how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
print(df.fillna(0))
#       name   age state  point  other
# 0    Alice  24.0    NY    0.0    0.0
# 1        0   0.0     0    0.0    0.0
# 2  Charlie   0.0    CA    0.0    0.0
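As a small follow-up to the methods above, the sketch below counts missing values per column and fills them with a different value for each column. The DataFrame is built inline as a stand-in for the CSV sample (an assumption, since the file itself is not reproduced here), and the fill values are arbitrary.

```python
import numpy as np
import pandas as pd

# Inline stand-in for part of the CSV sample (assumed data, not the real file).
df_demo = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie'],
    'age': [24.0, np.nan, np.nan],
    'state': ['NY', np.nan, 'CA'],
})

# isnull() returns a boolean DataFrame; sum() counts True values per column.
na_counts = df_demo.isnull().sum()
print(na_counts)

# fillna() also accepts a dict that maps column names to replacement values.
# Columns not listed in the dict ('state' here) are left unchanged.
df_filled = df_demo.fillna({'name': 'unknown', 'age': 0})
print(df_filled)
```

Passing a dict to fillna() is useful when a single replacement value such as 0 does not make sense for every column.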
nan in an object column is Python's built-in float type, while nan in a floatXX column is a NumPy numpy.floatXX type. Both are treated as missing values.
print(df.dtypes)
# name      object
# age      float64
# state     object
# point    float64
# other    float64
# dtype: object
print(df.at[1, 'name'])
# nan
print(type(df.at[1, 'name']))
# <class 'float'>
print(df.at[1, 'age'])
# nan
print(type(df.at[1, 'age']))
# <class 'numpy.float64'>
In addition to reading a file, nan is used to represent a missing value when an element does not exist in the result of methods like reindex(), merge(), and others.
- pandas.DataFrame.reindex — pandas 2.0.3 documentation
- pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)
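To illustrate how reindex() and merge() produce nan, here is a minimal sketch. The Series, DataFrames, and column names are made up for this example.

```python
import pandas as pd

s = pd.Series([10, 20], index=['a', 'b'])

# 'c' does not exist in the original index, so its value becomes NaN
# and the dtype changes from int64 to float64.
s_reindexed = s.reindex(['a', 'b', 'c'])
print(s_reindexed)

left = pd.DataFrame({'key': ['x', 'y'], 'l_val': [1, 2]})
right = pd.DataFrame({'key': ['x', 'z'], 'r_val': [3, 4]})

# An outer join keeps keys from both sides; cells without a match become NaN.
merged = left.merge(right, on='key', how='outer')
print(merged)
```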
nan (not a number) is considered a missing value
In Python, you can create nan with float('nan'), math.nan, or np.nan. nan is considered a missing value in pandas.
- nan (not a number) in Python
s_nan = pd.Series([float('nan'), math.nan, np.nan])
print(s_nan)
# 0   NaN
# 1   NaN
# 2   NaN
# dtype: float64
print(s_nan.isnull())
# 0    True
# 1    True
# 2    True
# dtype: bool
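Because nan is not equal to itself, comparing with == cannot be used to find it. The following sketch, with made-up sample values, contrasts == with isnull().

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Comparing with == never matches, because nan is not equal to itself.
eq_result = (s == np.nan)
print(eq_result.tolist())

# isnull() (an alias of isna()) is the reliable way to find nan.
isnull_result = s.isnull()
print(isnull_result.tolist())
```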
None is also considered a missing value
In pandas, None is also treated as a missing value. None is a built-in constant in Python.
- None in Python
print(None)
# None
print(type(None))
# <class 'NoneType'>
For numeric columns, None is converted to nan when a DataFrame or Series containing None is created, or when None is assigned to an element.
s_none_float = pd.Series([None, 0.1, 0.2])
s_none_float[2] = None
print(s_none_float)
# 0    NaN
# 1    0.1
# 2    NaN
# dtype: float64
print(s_none_float.isnull())
# 0     True
# 1    False
# 2     True
# dtype: bool
Since nan is a floating-point number (float), converting None to nan changes the data type (dtype) of the column to float, even if the other values are integers (int).
s_none_int = pd.Series([None, 1, 2])
print(s_none_int)
# 0    NaN
# 1    1.0
# 2    2.0
# dtype: float64
Although None in an object column remains as None, it is detected as a missing value by isnull(). Of course, it is also handled by methods such as dropna() and fillna().
s_none_object = pd.Series([None, 'abc', 'xyz'])
print(s_none_object)
# 0    None
# 1     abc
# 2     xyz
# dtype: object
print(s_none_object.isnull())
# 0     True
# 1    False
# 2    False
# dtype: bool
print(s_none_object.fillna(0))
# 0      0
# 1    abc
# 2    xyz
# dtype: object
String is not considered a missing value
Although they look the same as missing values when displayed, the strings 'NaN' and 'None' are not treated as missing values. The empty string '' is also not considered a missing value.
s_str = pd.Series(['NaN', 'None', ''])
print(s_str)
# 0     NaN
# 1    None
# 2
# dtype: object
print(s_str.isnull())
# 0    False
# 1    False
# 2    False
# dtype: bool
If you want to treat certain values as missing, you can use the replace() method to replace them with float('nan'), np.nan, or math.nan.
s_replace = s_str.replace(['NaN', 'None', ''], float('nan'))
print(s_replace)
# 0   NaN
# 1   NaN
# 2   NaN
# dtype: float64
print(s_replace.isnull())
# 0    True
# 1    True
# 2    True
# dtype: bool
Note that functions to read files such as read_csv() consider '', 'NaN', 'null', etc., as missing values by default and replace them with nan.
- pandas: Read CSV into DataFrame with read_csv()
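The default behavior of read_csv() can be checked with a small in-memory CSV. In the sketch below, the CSV text is hypothetical and io.StringIO stands in for a real file. keep_default_na=False is shown as one way to keep such strings as they are.

```python
import io
import pandas as pd

# Hypothetical CSV text: the second data row has an empty field,
# and the third contains the strings 'NaN' and 'null'.
csv_text = 'name,state\nAlice,NY\nBob,\nNaN,null\n'

# By default, '', 'NaN', and 'null' are all read as missing values.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.isnull())

# With keep_default_na=False (and no na_values), they stay as plain strings.
df_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_raw.isnull())
```

If only specific strings should be treated as missing, the na_values argument of read_csv() can be used to list them.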
Infinity inf is not considered a missing value by default
In Python, inf represents infinity in floating-point numbers (float).
- Infinity (inf) in Python
Infinity inf is not considered a missing value by default.
s_inf = pd.Series([float('inf'), -float('inf')])
print(s_inf)
# 0    inf
# 1   -inf
# dtype: float64
print(s_inf.isnull())
# 0    False
# 1    False
# dtype: bool
If pd.options.mode.use_inf_as_na is set to True, inf in a DataFrame or Series is converted to nan and treated as a missing value. Unlike None, inf in an object column is also converted to nan.
pd.options.mode.use_inf_as_na = True
print(s_inf)
# 0   NaN
# 1   NaN
# dtype: float64
print(s_inf.isnull())
# 0    True
# 1    True
# dtype: bool
s_inf_object = pd.Series([float('inf'), -float('inf'), 'abc'])
print(s_inf_object)
# 0    NaN
# 1    NaN
# 2    abc
# dtype: object
print(s_inf_object.isnull())
# 0     True
# 1     True
# 2    False
# dtype: bool
See the following article on how to set options in pandas.
- pandas: Get and set options for display, data behavior, etc.
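As an alternative that does not depend on a global option, inf can be replaced with nan explicitly. The sketch below uses replace() on made-up sample values. Note that pd.options.mode.use_inf_as_na was deprecated in pandas 2.1.0, a later release than the 2.0.3 used in this article, so the explicit approach may be the more portable one.

```python
import numpy as np
import pandas as pd

s = pd.Series([float('inf'), -float('inf'), 1.5])

# Replace both positive and negative infinity with nan explicitly.
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean)

# After the replacement, the former inf values are detected as missing.
print(s_clean.isnull().tolist())
```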
pd.NA is an experimental value (as of 2.0.3)
pd.NA was introduced as an experimental NA scalar in pandas 1.0.0.
print(pd.NA)
# <NA>
print(type(pd.NA))
# <class 'pandas._libs.missing.NAType'>
While nan == nan is False, pd.NA == pd.NA evaluates to pd.NA, as in the R language.
print(float('nan') == float('nan'))
# False
print(pd.NA == pd.NA)
# <NA>
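Since comparisons with pd.NA return pd.NA rather than True or False, it helps to see how it behaves in logical operations and conditions. The following is a minimal sketch of that behavior. As pd.NA is experimental, details may differ between versions.

```python
import pandas as pd

# Comparisons with pd.NA propagate: the result is pd.NA, not True or False.
cmp_result = (pd.NA == 1)
print(cmp_result)

# Logical operators follow three-valued logic: the result is a plain bool
# when it does not depend on the unknown value.
or_result = pd.NA | True
and_result = pd.NA & False
print(or_result, and_result)

# pd.NA has no truth value, so converting it to bool raises TypeError.
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
print(raised)

# Use pd.isna() to test for pd.NA instead of ==.
print(pd.isna(pd.NA))
```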
Of course, pd.NA is treated as a missing value.
s_na = pd.Series([None, 1, 2], dtype='Int64')
print(s_na)
# 0    <NA>
# 1       1
# 2       2
# dtype: Int64
print(s_na.isnull())
# 0     True
# 1    False
# 2    False
# dtype: bool
print(s_na.fillna(0))
# 0    0
# 1    1
# 2    2
# dtype: Int64
Int64 in the sample code above is the nullable integer data type. Even if a column contains missing values, the other integer values are not converted to floating-point numbers.
Note that as of 2.0.3 (June 2023), pd.NA is still "Experimental", and its behavior may change.
Warning
Experimental: the behaviour of pd.NA can still change without warning.
Working with missing data - Experimental NA scalar to denote missing values — pandas 2.0.3 documentation