Handling Missing Data in Polars (null, fill_null, drop_nulls)

3D visualization of drones repairing gaps in a track by filling or removing them, representing Polars null handling.

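Polars uses null to represent missing or empty data, much as Pandas typically uses NaN. Unlike Pandas, Polars treats NaN as an ordinary floating-point value rather than as missing data, so the two are handled separately. Before you can analyze a dataset, you need a strategy for dealing with null values.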

Step 1: Finding and Counting Nulls

First, let’s see how much data is missing in each column.

import polars as pl

df = pl.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [None, "x", "y", "z", None],
    "C": [100, 200, 300, 400, 500]
})

# Count nulls in every column
print(df.null_count())

Output:

shape: (1, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 0   │
└─────┴─────┴─────┘

Step 2: Option A – Dropping Nulls

If a row is useless without the data, just drop it.

# Drop any row that contains at least one null value
df_dropped = df.drop_nulls()
print(df_dropped)

Output:

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 2   ┆ x   ┆ 200 │
│ 4   ┆ z   ┆ 400 │
└─────┴─────┴─────┘

Step 3: Option B – Filling Nulls

Dropping data is often too extreme. It’s usually better to fill the missing values.

fill_null is available both as a DataFrame method, which is used first here, and as an expression on individual columns, which gives finer control.

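# 1. Fill with a static value (e.g., 0)
# A numeric fill value is only applied to numeric columns,
# so the nulls in string column B are left as they are.
df_filled_zero = df.fill_null(0)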
print(df_filled_zero)

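# 2. Fill with a "strategy" (e.g., the average)
# "mean" is only defined for numeric data, so apply it to column A
# instead of the whole DataFrame (string column B can raise an error).
df_filled_mean = df.with_columns(pl.col("A").fill_null(strategy="mean"))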
print(df_filled_mean)

# 3. Fill "forward" (use the last valid value)
df_filled_forward = df.fill_null(strategy="forward")
print(df_filled_forward)

Advanced: Chaining (The Polars Way)

For per-column control, use fill_null as an expression inside with_columns, so each column gets its own fill value.

# Fill column A with 0, but fill column B with the word "UNKNOWN"
df_selective_fill = df.with_columns([
    pl.col("A").fill_null(0),
    pl.col("B").fill_null(pl.lit("UNKNOWN"))
])
print(df_selective_fill)

Key Takeaways

  • Polars uses null to signify missing or empty data, much as Pandas typically uses NaN; in Polars, NaN is a separate floating-point value and is not counted as null.
  • The first step in handling Polars missing data involves finding and counting nulls in your dataset.
  • You can either drop rows with nulls if they’re unusable, or fill in missing values instead.
  • Filling nulls is often preferred, and Polars enables this via its fast and powerful Expression API.
  • For advanced control, you can chain operations using with_columns in Polars.
