How to Find and Remove Duplicate Rows in Polars (2026 Guide)

Duplicate rows silently distort analysis and machine learning: they inflate counts and averages, and identical records can end up in both training and test sets. Polars provides fast, easy-to-use methods for finding and removing duplicate rows.

The Setup

Let’s create a DataFrame with duplicate entries.

import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 1, 3],
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "data": [100, 200, 100, 300]
})
print(df)

shape: (4, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 1   ┆ Alice   ┆ 100  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘

Row 0 and Row 2 are identical.

1. Find Duplicates (`.is_duplicated()`)

This creates a new boolean column that is True for every row that occurs more than once, including its first occurrence. Calling .is_duplicated() on the DataFrame compares whole rows; pl.col("id").is_duplicated() would compare only the id column.

df.with_columns(
    df.is_duplicated().alias("is_duplicate")
)

Output:

shape: (4, 4)
┌─────┬─────────┬──────┬──────────────┐
│ id  ┆ name    ┆ data ┆ is_duplicate │
│ --- ┆ ---     ┆ ---  ┆ ---          │
│ i64 ┆ str     ┆ i64  ┆ bool         │
╞═════╪═════════╪══════╪══════════════╡
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 2   ┆ Bob     ┆ 200  ┆ false        │
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 3   ┆ Charlie ┆ 300  ┆ false        │
└─────┴─────────┴──────┴──────────────┘

2. Remove Duplicates (`.unique()`)

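This is the simple, one-step method to get a clean DataFrame. It is the Polars counterpart of SQL's SELECT DISTINCT. By default unique() does not guarantee the order of the returned rows, so pass maintain_order=True when the original order matters.

df.unique(maintain_order=True)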

Output:

shape: (3, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘

3. Advanced: `unique(subset=...)`

What if you only want to find duplicates based on one column (e.g., “Keep the first entry for each id”)?

# Keep the first row it finds for each unique 'id'
df.unique(subset=["id"], keep="first")

This is a core data cleaning operation that Polars performs incredibly fast.

Key Takeaways

Duplicate data hinders analysis and machine learning, but Polars offers efficient methods for identifying and removing them.
To find duplicates, use the .is_duplicated() method, which returns a boolean mask that is True for every row that appears more than once.
You can remove duplicates easily with the .unique() method, which cleans the DataFrame in one step.
The .unique(subset=…) method removes duplicates based on specific columns, such as ‘id’, and the keep argument controls which row of each group survives.
Polars excels at these data cleaning tasks, performing them with impressive speed.
