Cleaning Text in Polars: The .str Expression Namespace

Text data is almost always messy. In Pandas, you clean it through the .str accessor. Polars offers a .str namespace too, but there it is part of the Expression API, so string operations use the same syntax as every other column operation and run inside Polars’ query engine, which typically makes them faster and more consistent.

All string expressions are available under pl.col("my_column").str, which keeps your text-cleaning code in one consistent place.

The Setup

import polars as pl
df = pl.DataFrame({
    "email": ["alice@gmail.com", "bob@yahoo.com", "carol@hotmail.com"],
    "notes": ["Item 1", "Item 2", "Item 1, Item 3"]
})

1. contains(): Filtering with Text

Find all rows where the “notes” column mentions “Item 1”.

df.filter(
    pl.col("notes").str.contains("Item 1")
)

2. replace(): Cleaning Data

Let’s change “gmail.com” to “google.com”. Note that replace() substitutes only the first match in each string; replace_all() handles every match.

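df.with_columns(
    # literal=True treats "gmail.com" as plain text; by default the pattern
    # is a regex, where "." would match any character
    pl.col("email").str.replace("gmail.com", "google.com", literal=True)
)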

3. extract(): Using Regex to Get Data

This is the most flexible of the three. Let’s extract just the domain name from the emails. The regex r"@(.+)" captures everything after the @, and the second argument, 1, selects that first capture group.

df.with_columns(
    pl.col("email").str.extract(r"@(.+)", 1).alias("domain")
)

Output:

shape: (3, 3)
┌───────────────────┬──────────────────┬─────────────┐
│ email             ┆ notes            ┆ domain      │
│ ---               ┆ ---              ┆ ---         │
│ str               ┆ str              ┆ str         │
╞═══════════════════╪══════════════════╪═════════════╡
│ alice@gmail.com   ┆ Item 1           ┆ gmail.com   │
│ bob@yahoo.com     ┆ Item 2           ┆ yahoo.com   │
│ carol@hotmail.com ┆ Item 1, Item 3   ┆ hotmail.com │
└───────────────────┴──────────────────┴─────────────┘

Key Takeaways

  • Text data is often messy, and Polars string manipulation offers an efficient way to clean it.
  • Polars exposes string methods under .str as part of the Expression API, much like Pandas’ .str accessor, which keeps them fast and consistent with other expressions.
  • Key functions include contains() for filtering, replace() for data cleaning, and extract() for regex data extraction.
