How to Find and Remove Duplicate Rows in Polars (2026 Guide)


Duplicate data is a silent killer for analysis and machine learning. Polars provides high-speed, easy-to-use methods for finding and removing duplicate rows.

The Setup

Let’s create a DataFrame with duplicate entries.

import polars as pl
df = pl.DataFrame({
    "id": [1, 2, 1, 3],
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "data": [100, 200, 100, 300]
})
print(df)
shape: (4, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 1   ┆ Alice   ┆ 100  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘

Row 0 and Row 2 are identical.

1. Find Duplicates (.is_duplicated())

This creates a new boolean column that is true for every row whose id appears more than once, including the first occurrence. Note that the expression below checks only the id column, not the whole row.

df.with_columns(
    pl.col("id").is_duplicated().alias("is_duplicate")
)

Output:

shape: (4, 4)
┌─────┬─────────┬──────┬──────────────┐
│ id  ┆ name    ┆ data ┆ is_duplicate │
│ --- ┆ ---     ┆ ---  ┆ ---          │
│ i64 ┆ str     ┆ i64  ┆ bool         │
╞═════╪═════════╪══════╪══════════════╡
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 2   ┆ Bob     ┆ 200  ┆ false        │
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 3   ┆ Charlie ┆ 300  ┆ false        │
└─────┴─────────┴──────┴──────────────┘

2. Remove Duplicates (.unique())

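This is the simple, one-step method to get a clean DataFrame. By default unique() does not guarantee the order of the returned rows; pass maintain_order=True to keep the original order, as in the output below.

df.unique(maintain_order=True)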

Output:

shape: (3, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘

3. Advanced: unique(subset=...)

What if you only want to remove duplicates based on one column? (e.g., "Keep the first entry for each id").

# Keep the first row it finds for each unique 'id'
df.unique(subset=["id"], keep="first")

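This is a core data-cleaning operation, and Polars performs it quickly. Besides "first", the keep parameter accepts "last", "none" (drop every row whose id repeats), and "any" (the default, which makes no guarantee about which row survives).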


Key Takeaways

  • Duplicate data hinders analysis and machine learning, but Polars offers efficient methods for identifying and removing them.
  • To find duplicates, use the .is_duplicated() method, which returns a boolean mask that is true for every value or row that appears more than once, including the first occurrence.
  • You can remove duplicates easily with the .unique() method, which returns a deduplicated DataFrame in one step.
  • The .unique(subset=…) method lets you deduplicate based on specific columns, such as 'id', and keep controls which row survives.
  • Polars excels at these data cleaning tasks, performing them with impressive speed.
