Polars Performance: String Caching (Categorical Type)

3D visualization of replacing heavy books with lightweight reference cards, representing Polars string caching and categorical types.

Let’s say you have a 10GB file with a “Country” column. The string “United States of America” might appear 50 million times, using a massive amount of RAM.

This is very inefficient. A Categorical (or “string cache” in Polars) is a smarter way to store this.

How it works: Polars creates a “lookup table” for all unique strings:

  • 0 = "United States of America"
  • 1 = "Germany"
  • 2 = "Japan"

Then, in your main DataFrame, it just stores the integer (0, 1, 2) instead of the long string. This uses 100x less memory and makes groupby and join operations on that column blazingly fast.

Step 1: Standard (Inefficient) Way

import polars as pl
# This will store "A" and "B" as full strings
df = pl.DataFrame({"category": ["A", "B", "A", "A", "B"]})
print(df.dtype)
# Output: Utf8 (a string type)

Step 2: The Categorical Way

You can convert an existing column or set the type during a read_csv.

# Convert an existing column
df_cat = df.with_columns(
    pl.col("category").cast(pl.Categorical)
)
print(df_cat.dtype)
# Output: Categorical

Step 3: Global String Caching (The Pro Move)

If you are loading multiple DataFrames that share the same categories (e.g., sales_jan.csv and sales_feb.csv), you can enable the Global String Cache.

# 1. Turn on the global cache
pl.enable_string_cache()

# 2. Now, read all your files
df1 = pl.read_csv("jan.csv", dtypes={"category": pl.Categorical})
df2 = pl.read_csv("feb.csv", dtypes={"category": pl.Categorical})

# 3. Combine them
df_all = pl.concat([df1, df2])

# 4. Turn off the cache
pl.disable_string_cache()

Now, Polars knows that the “A” in df1 is the exact same thing as the “A” in df2, making your concat and join operations lightning fast. This is a key skill for high-performance data work.


Key Takeaways

  • Large file sizes with repeated strings can waste RAM, but using Polars Categorical can optimise storage.
  • Polars creates a lookup table for unique strings, storing integers instead of long strings for efficiency.
  • Converting existing columns to Categorical or setting types during read_csv reduces memory usage significantly.
  • Global String Caching allows Polars to recognise identical categories across multiple DataFrames, enhancing performance.
  • Using Categorical in Polars speeds up groupby and join operations while conserving memory.

Similar Posts

Leave a Reply