
Text data is almost always messy, and Polars string manipulation is an efficient way to clean it. In Pandas, you clean text through the .str accessor. Polars offers a .str namespace too, but there it is part of the Expression API, so string operations are fast and consistent with the rest of your Polars code.
All string expressions are available under pl.col("my_column").str, which keeps text cleaning in the same expression syntax as your other data wrangling tasks.
The Setup
import polars as pl

df = pl.DataFrame({
    "email": ["alice@gmail.com", "bob@yahoo.com", "carol@hotmail.com"],
    "notes": ["Item 1", "Item 2", "Item 1, Item 3"]
})

1. contains(): Filtering with Text
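Find all rows where the “notes” column mentions “Item 1”. By default, contains() treats the pattern as a regular expression; pass literal=True to match it as plain text.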
df.filter(
    pl.col("notes").str.contains("Item 1")
)

2. replace(): Cleaning Data
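Let’s change “gmail.com” to “google.com”. By default, str.replace() treats the pattern as a regular expression and replaces only the first match in each string, so pass literal=True for plain text and use str.replace_all() when every occurrence should change.
df.with_columns(
    # literal=True matches "gmail.com" as plain text, so "." is not a regex wildcard
    pl.col("email").str.replace("gmail.com", "google.com", literal=True)
)

3. extract(): Using Regex to Get Data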
extract() is the most flexible of the three. Let’s extract just the domain name from the emails. The regex r"@(.+)" captures everything after the @, and the second argument, 1, selects that first capture group.
df.with_columns(
    pl.col("email").str.extract(r"@(.+)", 1).alias("domain")
)

Output:
shape: (3, 3)
┌───────────────────┬────────────────┬─────────────┐
│ email             ┆ notes          ┆ domain      │
│ ---               ┆ ---            ┆ ---         │
│ str               ┆ str            ┆ str         │
╞═══════════════════╪════════════════╪═════════════╡
│ alice@gmail.com   ┆ Item 1         ┆ gmail.com   │
│ bob@yahoo.com     ┆ Item 2         ┆ yahoo.com   │
│ carol@hotmail.com ┆ Item 1, Item 3 ┆ hotmail.com │
└───────────────────┴────────────────┴─────────────┘
Key Takeaways
- Text data is often messy, and Polars string manipulation offers an efficient way to clean it.
- You use the Expression API in Polars for enhanced speed and consistency, similar to Pandas’ .str.
- Key functions include contains() for filtering, replace() for data cleaning, and extract() for regex data extraction.





