Pandas groupby.apply Deprecation Warning
Problem Statement
When using `groupby.apply()` in pandas 2.2.0+, you may encounter this warning:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping columns
will be excluded from the operation.
This occurs because pandas historically included the group-by columns in the group DataFrame passed to `apply()`. The new behavior (coming in pandas 3.0) will exclude these columns by default. The warning helps you update your code before the breaking change.
In your specific operation:
fprice = df.groupby(['StartDate', 'Commodity', 'DealType']).apply(
    lambda group: -(group['MTMValue'].sum() -
                    (group['FixedPriceStrike'] * group['Quantity']).sum()) /
                  group['Quantity'].sum()
).reset_index(name='FloatPrice')
The grouping columns (`StartDate`, `Commodity`, `DealType`) are included in `group` but aren't used in your calculation.
Solution
Add `include_groups=False` to your `apply()` call:
fprice = df.groupby(['StartDate', 'Commodity', 'DealType']).apply(
    lambda group: -(group['MTMValue'].sum() -
                    (group['FixedPriceStrike'] * group['Quantity']).sum()) /
                  group['Quantity'].sum(),
    include_groups=False  # ← Silences the warning
).reset_index(name='FloatPrice')
Why this works:
- Your lambda function only uses `MTMValue`, `FixedPriceStrike`, and `Quantity`
- `include_groups=False` excludes the group-by columns from `group`, matching pandas' future behavior
- This fixes the warning while maintaining identical results
Key Insight
You only need the grouping columns in the final result, not during the calculation. Pandas places the group keys in the index of the result returned by `apply()`, and `reset_index()` turns them back into regular columns.
Explanation
Behavior Change in Pandas 2.2+
include_groups | Default in 2.2 | Default in 3.0+ | Behavior |
---|---|---|---|
True | ✓ | ✗ | Group-by columns included in group |
False | ✗ | ✓ | Group-by columns excluded from group |
Why this matters
- Avoid bugs: Including group-by columns can distort calculations (e.g., if they're numeric and you call `mean()`)
- Efficiency: Excluding unused columns saves memory
- Consistency: Matches what developers intuitively expect
Incorrect Usage Example
This calculates incorrect means because it includes the numeric group-by column `a`:
# Bad: Includes group-by column 'a' in operations
df.groupby('a').apply(np.mean)
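Output with a=[1,1,2,2] and b=[1,2,4,5], when np.mean averages every cell of each group:
a
1    1.25  # Incorrect! (1+1+1+2)/4 mixes column a into the mean
2    3.25  # Incorrect! (2+2+4+5)/4 mixes column a into the mean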
Solution:
df.groupby('a').apply(np.mean, include_groups=False)
Gives correct:
b
a
1 1.5 # (1+2)/2 = 1.5
2 4.5 # (4+5)/2 = 4.5
Alternative Solutions
If you do need access to the group-by columns during `apply()`, use one of these approaches:
1. Explicitly include columns in the group operation
# Manually list ALL columns to use (including group-by columns)
group_cols = ['StartDate', 'Commodity', 'DealType']
calc_cols = group_cols + ['MTMValue', 'FixedPriceStrike', 'Quantity']
fprice = df.groupby(group_cols)[calc_cols].apply(
    lambda group: ...  # Your logic
).reset_index(name='FloatPrice')
2. Use group names via group.name
fprice = df.groupby('DealType').apply(
    lambda group: (
        # group.name is the key of the current group (e.g., 'A'),
        # not a 'DealType=A' string; `value` is a placeholder column
        f"{group.name}: {group['value'].sum()}"
    ),
    include_groups=False
)
Final Recommendation
For most users (especially if you don't use group-by columns in calculations):
- Add `include_groups=False` to `apply()` calls
- Test results with small datasets to confirm identical output
Your corrected code:
fprice = df.groupby(['StartDate', 'Commodity', 'DealType']).apply(
    lambda group: -(group['MTMValue'].sum() -
                    (group['FixedPriceStrike'] * group['Quantity']).sum()) /
                  group['Quantity'].sum(),
    include_groups=False  # Fixes warning + future-proofs code
).reset_index(name='FloatPrice')