Handling Missing Data in Pandas: A Guide to dropna() and fillna()

In the real world, your datasets will have holes. Users forget to fill out forms, sensors break, or data gets corrupted. In Pandas, these missing values show up as NaN (Not a Number), and handling them well is a crucial part of any analysis.

You cannot ignore them. If you try to do math with a NaN, the result is often just more NaNs. You have two main choices: drop them or fill them.
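
To see why NaNs spread, here is a small sketch. The variable names (result, doubled, total, strict_total) are made up for this illustration. It also shows a detail worth knowing: Pandas aggregations such as sum() skip NaN by default, so the NaN only spreads there if you ask for it with skipna=False.

```python
import numpy as np
import pandas as pd

# Plain arithmetic with NaN gives NaN back
result = np.nan + 1
print(result)

# Element-wise math on a Series keeps NaN wherever the input was NaN
s = pd.Series([1.0, np.nan, 3.0])
doubled = s * 2
print(doubled.isna().tolist())  # [False, True, False]

# Aggregations skip NaN by default (skipna=True)
total = s.sum()
# With skipna=False, a single NaN makes the whole result NaN
strict_total = s.sum(skipna=False)
print(total, strict_total)
```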

1. Finding Missing Data

Let’s first find out where the data has missing values.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
})

# Check for missing values (returns True/False for every cell)
print(df.isnull())

# Count missing values in each column (SUPER USEFUL!)
print(df.isnull().sum())
# Output:
# A    1
# B    2
# C    0
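
Counts are a good start, but it can also help to know the share of missing values per column and which rows are affected. The snippet below is one possible way to check both, using the same example data; missing_share and rows_with_gaps are just illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
})

# Fraction of missing values in each column
# (the mean of True/False flags is the share of True values)
missing_share = df.isnull().mean()
print(missing_share)
# A is 0.25, B is 0.50, C is 0.00

# Select only the rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps.index.tolist())  # [1, 2]
```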

2. Option A: Drop Them (dropna)

If a row has too much missing data to be useful, just get rid of it.

# Drop ANY row that has at least one missing value
clean_df = df.dropna()
print(clean_df)
# Only rows 0 and 3 remain.
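
Dropping every row with a single NaN can throw away a lot of data. As a rough sketch of gentler options, dropna() also accepts the thresh, subset, and how parameters. The variable names below are only for illustration, and the data is the same example DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
})

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)
print(at_least_two.index.tolist())  # [0, 1, 3]

# Only look at column 'A' when deciding which rows to drop
has_a = df.dropna(subset=['A'])
print(has_a.index.tolist())  # [0, 1, 3]

# Drop a row only if ALL of its values are missing
not_all_empty = df.dropna(how='all')
print(not_all_empty.index.tolist())  # [0, 1, 2, 3]
```

Here row 2 is the only one removed by thresh=2, because it has just one real value (in column 'C').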

You can also drop columns that contain missing values by passing axis=1.

# Drop columns with missing values
clean_cols = df.dropna(axis=1)
# Only column 'C' remains.

3. Option B: Fill Them (fillna)

Often, dropping data is too aggressive. It’s usually better to fill the holes with a reasonable guess, like zero or the average value of that column.

# Fill ALL missing values with 0
filled_zero = df.fillna(0)

# Fill with the average (mean) of each column
# (This is very common in Data Science)
filled_mean = df.fillna(df.mean())
print(filled_mean)
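
One fill value rarely suits every column. As a hedged sketch, fillna() also accepts a dictionary that maps column names to fill values, and ffill() carries the last known value forward. The choice of median for 'A' and 0 for 'B' is arbitrary here, purely to show the mechanics.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
})

# Use a different fill value for each column
filled_custom = df.fillna({'A': df['A'].median(), 'B': 0})
print(filled_custom)
# 'A' becomes 1, 2, 2, 4 and 'B' becomes 5, 0, 0, 8

# Forward fill: copy the last valid value down into the gap
filled_forward = df.ffill()
print(filled_forward)
# 'B' becomes 5, 5, 5, 8
```

Forward fill tends to suit ordered data such as time series, where the previous reading is a sensible stand-in.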

With the missing values dropped or filled, your dataset is clean and ready for analysis!
