
Duplicate data is a silent killer for analysis and machine learning. Polars provides high-speed, easy-to-use methods for finding and removing duplicate rows.
The Setup
Let’s create a DataFrame with duplicate entries.
import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 1, 3],
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "data": [100, 200, 100, 300]
})
print(df)
shape: (4, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 1   ┆ Alice   ┆ 100  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘
Row 0 and Row 2 are identical.
1. Find Duplicates (.is_duplicated())
This creates a new boolean column that is True for every row whose id value occurs more than once, including the first occurrence. Because the expression is applied to pl.col("id"), only the id column is compared; df.is_duplicated() performs the same check across whole rows.
df.with_columns(
    pl.col("id").is_duplicated().alias("is_duplicate")
)
Output:
shape: (4, 4)
┌─────┬─────────┬──────┬──────────────┐
│ id  ┆ name    ┆ data ┆ is_duplicate │
│ --- ┆ ---     ┆ ---  ┆ ---          │
│ i64 ┆ str     ┆ i64  ┆ bool         │
╞═════╪═════════╪══════╪══════════════╡
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 2   ┆ Bob     ┆ 200  ┆ false        │
│ 1   ┆ Alice   ┆ 100  ┆ true         │
│ 3   ┆ Charlie ┆ 300  ┆ false        │
└─────┴─────────┴──────┴──────────────┘
2. Remove Duplicates (.unique())
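This is the simple, one-step method to get a clean DataFrame: it drops rows that are exact copies of another row. Current Polars releases expose this operation as unique(); DataFrame has no separate distinct() method. By default the order of the returned rows is not guaranteed, so pass maintain_order=True if the original order matters.
df.unique()
Output (row order may vary):
shape: (3, 3)
┌─────┬─────────┬──────┐
│ id  ┆ name    ┆ data │
│ --- ┆ ---     ┆ ---  │
│ i64 ┆ str     ┆ i64  │
╞═════╪═════════╪══════╡
│ 1   ┆ Alice   ┆ 100  │
│ 2   ┆ Bob     ┆ 200  │
│ 3   ┆ Charlie ┆ 300  │
└─────┴─────────┴──────┘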
3. Advanced: unique(subset=...)
What if you only want to treat rows as duplicates when one column matches (e.g., “keep the first entry for each id”)?
# Keep the first row found for each unique 'id'
df.unique(subset=["id"], keep="first")
This is a core data cleaning operation that Polars performs incredibly fast.
Key Takeaways
- Duplicate data hinders analysis and machine learning, but Polars offers efficient methods for identifying and removing it.
- To find duplicates, use the .is_duplicated() method, which returns a boolean mask that is True for every row whose value occurs more than once, including the first occurrence.
- You can remove duplicates easily with the .unique() method, which cleans the DataFrame in one step.
- The .unique(subset=…) method deduplicates based on specific columns, such as ‘id’, and its keep parameter controls which row in each group survives.
- Polars excels at these data cleaning tasks, performing them with impressive speed.





