How to Apply Custom Functions on Polars Groups (.group_by().apply())


We know that .map_elements() is slow because it calls a Python function once per row. We know that .group_by().agg() is very fast, but it is limited to what Polars expressions can express (sum, mean, and so on). In this article, we'll look at how to run a custom Python function on each group with .group_by().apply(), which current Polars versions call .group_by().map_groups().

Warning: This is slower than .agg() because it breaks out of the optimized Polars engine into pure Python. It is still usually much faster than .map_elements(), because the Python function is called once per group rather than once per row.

The Goal

Let’s find the sales value for the second transaction in each product group.

import polars as pl
df = pl.DataFrame({
    "product": ["A", "B", "A", "B", "A"],
    "sales": [10, 20, 30, 40, 50]
})

The .apply() Method

In current Polars the group-wise .apply() is called .map_groups(); it was renamed in version 0.19 and the old name was deprecated. The function you pass to it receives a full DataFrame (the sub-group) as its input and must return a DataFrame, not a bare value.

# 1. Define a function that takes a DataFrame and returns a DataFrame
def get_second_sale(group_df: pl.DataFrame) -> pl.DataFrame:
    # Keep only the 2nd row (index 1) of the group.
    # A group with fewer than two rows yields an empty frame,
    # so it simply drops out of the combined result.
    return group_df.slice(1, 1)

# 2. Use .group_by().map_groups() (.apply() in Polars versions before 0.19)
result = df.group_by("product").map_groups(get_second_sale)
print(result)

Output (row order can vary between runs, because group_by does not preserve group order by default):

shape: (2, 2)
┌─────────┬───────┐
│ product ┆ sales │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ B       ┆ 40    │
│ A       ┆ 30    │
└─────────┴───────┘

This is a powerful tool for when you need to run complex logic (like a mini-machine learning model or a statistical test) on each group of your data.


Key Takeaways

  • The .map_elements() method is slow because it calls Python once per row, while .group_by().agg() is fast but limited to what Polars expressions can express.
  • For custom Python logic on entire groups, use .group_by().map_groups(), the current name of .group_by().apply() (renamed in Polars 0.19).
  • The function is called once per group, receives the group as a DataFrame, and must return a DataFrame.
  • Running once per group usually makes it faster than .map_elements(), but it stays slower than .agg() due to Python overhead.
  • It allows for advanced logic, such as small machine-learning models or statistical tests, to be applied to each group.
