Henry Harbeck committed
Commit 926e69f · 1 Parent(s): 3fb7b66

add window functions in Polars!

Files changed (1)
  1. polars/13_window_functions.py +538 -0
polars/13_window_functions.py ADDED
@@ -0,0 +1,538 @@
+ # /// script
+ # requires-python = ">=3.13"
+ # dependencies = [
+ #     "duckdb==1.2.2",
+ #     "marimo",
+ #     "polars==1.29.0",
+ #     "pyarrow==20.0.0",
+ #     "sqlglot==26.16.4",
+ # ]
+ # ///
+
+ import marimo
+
+ __generated_with = "0.12.9"
+ app = marimo.App(width="medium", app_title="Window Functions")
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     return (mo,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         # Window Functions
+         _By [Henry Harbeck](https://github.com/henryharbeck)._
+
+         In this notebook, you'll learn how to perform different types of window functions in Polars.
+         You'll work with partitions, ordering and Polars' available "mapping strategies".
+
+         We'll use a dataset with a few days of paid and organic digital revenue data.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     from datetime import date
+
+     import polars as pl
+
+     dates = pl.date_range(date(2025, 2, 1), date(2025, 2, 5), eager=True)
+
+     df = pl.DataFrame(
+         {
+             "date": pl.concat([dates, dates]).sort(),
+             "channel": ["Paid", "Organic"] * 5,
+             "revenue": [6000, 2000, 5200, 4500, 4200, 5900, 3500, 5000, 4800, 4800],
+         }
+     )
+
+     df
+     return date, dates, df, pl
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## What is a window function?
+
+         A window function performs a calculation across a set of rows that are related to the current row.
+         Window functions let you perform aggregations and other calculations within a group without collapsing
+         the number of rows (as opposed to a group-by aggregation, which does collapse them). Typically, the result
+         of a window function is assigned back to the rows within the group, but Polars also offers alternatives.
+
+         Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
+         method on an expression.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Partitions
+
+         Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition
+         column(s), to which the function will be applied.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Partitioning by a single column
+
+         Let's get the total revenue per date...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     daily_revenue = pl.col("revenue").sum().over("date")
+
+     df.with_columns(daily_revenue.alias("daily_revenue"))
+     return (daily_revenue,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""And then see what percentage of the daily total was Paid and what percentage was Organic.""")
+     return
+
+
+ @app.cell
+ def _(daily_revenue, df, pl):
+     df.with_columns(daily_revenue_pct=(pl.col("revenue") / daily_revenue))
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         Let's now calculate the maximum revenue, the cumulative revenue, the revenue rank and the day-on-day change,
+         all partitioned (split) by channel.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.with_columns(
+         maximum_revenue=pl.col("revenue").max().over("channel"),
+         cumulative_revenue=pl.col("revenue").cum_sum().over("channel"),
+         revenue_rank=pl.col("revenue").rank(descending=True).over("channel"),
+         day_on_day_change=pl.col("revenue").diff().over("channel"),
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the
+         partition (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different
+         values per row, but still only consider rows within their partition.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Partitioning by multiple columns
+
+         We can also partition by multiple columns.
+
+         Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
+         the channel.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     (
+         df.with_columns(
+             is_weekday=pl.col("date").dt.is_business_day(),
+         ).with_columns(
+             max_rev_by_channel_and_weekday=pl.col("revenue").max().over("is_weekday", "channel"),
+         )
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Partitioning by expressions
+
+         Polars also lets you partition by expressions without needing to create them as columns first.
+
+         So, we could re-write the previous window function as...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.with_columns(
+         max_rev_by_channel_and_weekday=pl.col("revenue")
+         .max()
+         .over(pl.col("date").dt.is_business_day(), "channel")
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
+         so they can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
+         and with methods that consider more than one row (e.g., `cum_sum`, `rank` and `diff`, as we just saw).
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Ordering
+
+         The `order_by` parameter controls how to order the data within the window. The function is applied to the
+         data in this order.
+
+         Up until this point, we have let Polars compute the window functions based on the order of the rows in the
+         DataFrame. There are times when we would like the order of the calculation and the order of the output
+         itself to differ.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         """
+         ### Ordering in a window function
+
+         Let's say we want the DataFrame ordered by day of week, but we still want the cumulative revenue and the
+         first revenue observation, both ordered by date and partitioned by channel...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     # Monday = 1, Sunday = 7
+     df_sorted = (
+         df.sort(pl.col("date").dt.weekday())
+         # Show the weekday for transparency
+         .with_columns(pl.col("date").dt.to_string("%a").alias("weekday"))
+     )
+
+     df_sorted.select(
+         "date",
+         "weekday",
+         "channel",
+         "revenue",
+         pl.col("revenue").cum_sum().over("channel", order_by="date").alias("cumulative_revenue"),
+         pl.col("revenue").first().over("channel", order_by="date").alias("first_revenue"),
+     )
+     return (df_sorted,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Note about window function ordering compared to SQL
+
+         It is worth noting that, traditionally, many more functions require an `ORDER BY` within `OVER` in SQL than
+         their Polars equivalents do.
+
+         For example, an SQL `RANK()` expression like...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, mo):
+     _df = mo.sql(
+         f"""
+         SELECT
+             date,
+             channel,
+             revenue,
+             RANK() OVER (PARTITION BY channel ORDER BY revenue DESC) AS revenue_rank
+         FROM df
+         -- re-sort the output back to the original order for ease of comparison
+         ORDER BY date, channel DESC
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ...does not require an `order_by` in Polars, as the column and the function are already bound (including
+         the `descending=True` argument).
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.select(
+         "date",
+         "channel",
+         "revenue",
+         revenue_rank=pl.col("revenue").rank(descending=True).over("channel"),
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Descending order
+
+         We can also order in descending order by passing `descending=True`...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df_sorted, pl):
+     (
+         df_sorted.select(
+             "date",
+             "weekday",
+             "channel",
+             "revenue",
+             pl.col("revenue").cum_sum().over("channel", order_by="date").alias("cumulative_revenue"),
+             pl.col("revenue").first().over("channel", order_by="date").alias("first_revenue"),
+             pl.col("revenue")
+             .first()
+             .over("channel", order_by="date", descending=True)
+             .alias("last_revenue"),
+             # Or, alternatively
+             pl.col("revenue").last().over("channel", order_by="date").alias("also_last_revenue"),
+         )
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         """
+         ## Mapping Strategies
+
+         Mapping strategies control how Polars maps the result of the window function back to the original DataFrame.
+
+         Generally (by default), the result of a window function is assigned back to the rows within the group.
+         Through Polars' mapping strategies, we will explore other possibilities.
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         """
+         ### Group to rows
+
+         "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the
+         rows in the window.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.with_columns(
+         cumulative_revenue=pl.col("revenue").cum_sum().over("channel", mapping_strategy="group_to_rows")
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         """
+         ### Join
+
+         The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in
+         the group.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.with_columns(
+         cumulative_revenue=pl.col("revenue").cum_sum().over("channel", mapping_strategy="join")
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Explode
+
+         The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve
+         the order of the rows. Because of this, the other columns (including those not in the window function) must
+         be selected through the same window for the result to make sense. It should also only be used in a `select`
+         context, not in `with_columns`.
+
+         The result of "explode" is similar to a `group_by`, followed by an `agg`, followed by an `explode`.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df, pl):
+     df.select(
+         pl.all().over("channel", order_by="date", mapping_strategy="explode"),
+         cumulative_revenue=pl.col("revenue")
+         .cum_sum()
+         .over("channel", order_by="date", mapping_strategy="explode"),
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Note the modified order of the rows in the output (the data itself is the same)...""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""## Other tips and tricks""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Reusing a window
+
+         SQL has a `WINDOW` keyword, which allows the same window specification to be re-used across expressions
+         without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments
+         to `over`.
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(df_sorted, pl):
+     window = {
+         "partition_by": "date",
+         "order_by": "date",
+         "mapping_strategy": "group_to_rows",
+     }
+
+     df_sorted.with_columns(
+         pct_daily_revenue=(pl.col("revenue") / pl.col("revenue").sum()).over(**window),
+         highest_revenue_channel=pl.col("channel").top_k_by("revenue", k=1).first().over(**window),
+         daily_revenue_rank=pl.col("revenue").rank().over(**window),
+         cumulative_daily_revenue=pl.col("revenue").cum_sum().over(**window),
+     )
+     return (window,)
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ### Rolling Windows
+
+         Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the
+         rolling calculation is also aware of temporal data, making it easy to express even if the data is not
+         contiguous (i.e., observations are missing).
+
+         Let's look at an example of that now by filtering out one day of our data and then calculating both a
+         3-day and a 3-row maximum revenue, split by channel...
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(date, df, pl):
+     (
+         df.filter(pl.col("date") != date(2025, 2, 2))
+         .with_columns(
+             # "3d" -> 3 days
+             rev_3_day_max=pl.col("revenue").rolling_max_by("date", "3d", min_samples=1).over("channel"),
+             rev_3_row_max=pl.col("revenue").rolling_max(3, min_samples=1).over("channel"),
+         )
+         # sort to make the output a little easier to analyze
+         .sort("channel", "date")
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""Notice the difference in the second-to-last row...""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""We hope you enjoyed this notebook demonstrating window functions in Polars!""")
+     return
+
+
+ @app.cell
+ def _(mo):
+     mo.md(
+         r"""
+         ## Additional References
+
+         - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
+         - [Polars `over` method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
+         - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
+         """
+     )
+     return
+
+
+ if __name__ == "__main__":
+     app.run()