--- title: "Getting started with slider" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with slider} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(slider) library(dplyr, warn.conflicts = FALSE) library(lubridate, warn.conflicts = FALSE) ``` This vignette is meant to serve as an introduction to {slider}. In it, you'll learn about the three core functions in the package: `slide()`, `slide_index()`, and `slide_period()`, along with their many variants. slider is a package for rolling analysis using window functions. "Window functions" is a term that I've borrowed from SQL that means that some function is repeatedly applied to different "windows" of your data as you step through it. Typical examples of applications of window functions include rolling averages, cumulative sums, and more complex things such as rolling regressions. ## slide() To better understand window functions, we'll turn to our first core function, `slide()`. `slide()` is a bit like `purrr::map()`. You supply a vector to slide over, `.x`, and a function to apply to each window, `.f`. With those two things alone, `slide()` is almost identical to `map()`. ```{r} slide(1:4, ~.x) ``` On top of this, you can control the size and placement of the window by using the additional arguments to `slide()`. For example, you can ask for a window of size 3 containing "the current element, as well as the 2 before it" like this: ```{r} slide(1:4, ~.x, .before = 2) ``` You'll notice that the first two elements of the list contain partial or "incomplete" windows. By default, `slide()` assumes that you want to compute on these windows anyways, but if you don't care about them, you can change the `.complete` argument. ```{r} slide(1:4, ~.x, .before = 2, .complete = TRUE) ``` `slide()` is _size stable_, so you always get an output that is the same size as your input. Because of that, the partial results have been replaced by the corresponding missing value. For a list, that is `NULL`. Sometimes, changing the placement of the window is a critical part of your calculation. For example, you might want a "center alignment" where you have an equal number of values before and after the current element. To accomplish this, you can combine the `.before` argument with `.after` to get a centered window. Here we ask for a window of size 3 containing "the current element, as well as 1 element before and 1 element after". It is "centered" because in position 2 we have a complete window of the current element (2), along with one element before (1) and one after (3). ```{r} slide(1:4, ~.x, .before = 1, .after = 1) ``` `slide()` can also perform _expanding_ windows. These are the type that allow _cumulative_ operations to work. In prose, an expanding window would be "the current element, along with every element before this one". To construct this kind of window, you can set `.before` to `Inf`. ```{r} slide(1:4, ~.x, .before = Inf) ``` `slide()` is _type-stable_, meaning that it always returns an object of the same type, and the base form of `slide()` always returns a list. So far, this is all that we have used to illustrate how it works, but practically you are more likely to use one of the suffixed forms like `slide_dbl()` or `slide_int()`. For example, you might have a vector of sales data that you want to compute a 3 value moving average on. ```{r} sales_vec <- c(2, 4, 6, 2) slide_dbl(sales_vec, mean, .before = 2) ``` ## slide_index() To make things a bit more interesting, let's assume that the sales vector from the example above is also tied to some "index", like a date vector of when the sale actually occurred. ```{r} index_vec <- as.Date("2019-08-29") + c(0, 1, 5, 6) wday_vec <- as.character(wday(index_vec, label = TRUE)) company <- tibble( sales = sales_vec, index = index_vec, wday = wday_vec ) company ``` This index is increasing but irregular, meaning that we "jumped" from Friday to Tuesday because there were no sales between those dates. For the purpose of this example, let's assume that this is an online company where it is perfectly reasonable that you _could_ have sales on both Saturday and Sunday (If your use case requires that you "skip over" weekends and even holidays, you might like [{almanac}](https://github.com/DavisVaughan/almanac)). A reasonable business question to ask would be to compute a _3 day_ moving average. Is this different from the 3 value moving average we computed before? Here is the expected result, side by side with the 3 value one computed using `slide_dbl()` from before. ```{r, echo=FALSE} mutate( company, roll_val = slide_dbl(sales, mean, .before = 2), roll_day = slide_index_dbl(sales, index, mean, .before = 2) ) ``` The difference shows up in the third row, when computing the 3 day moving average looking back from Tuesday. To understand why they are different, consider what `slide_dbl()` does. It uses the `sales` column and looks at the "current row, along with two rows before it" to compute the result. When you are on row 3, this would select rows 1-3 giving the date range of `[Thu, Tue]`, which isn't 3 days. The correct answer would have been to look back 2 days from Tuesday, not 2 rows from row 3. This would have given us the date window of `[Sun, Tue]`, and only values in that range should be included in the moving average calculation for row 3. The only row in that range is row 3, so we should just be averaging the single value of `6` to get our result. `slide_dbl()` doesn't give us what we want because it is _unaware of the index column_. It just looks back a set number of values. What we need is a function that "knows" about the `index` and can adjust accordingly. For that, you can use `slide_index(.x, .i, .f, ...)` which has a `.i` argument to pass an index vector through. To understand how `slide_index()` works, take a look at the following comparison to `slide()`. For illustration, the current window of the weekday vector is printed out. Notice that in position 3, `slide()` gives us the "wrong" result of Thursday, Friday and Tuesday, because it just looks back 2 values. ```{r} wday_vec slide(wday_vec, ~.x, .before = 2) ``` On the other hand, `slide_index()` can be "aware" of the irregular index vector. By passing it through as `.i`, and by swapping a look back period of 2 for the lubridate object of `days(2)`, the start of the range is computed as `.i - days(2)`, which correctly computes a date window of `[Sun, Tue]` for the third element, so that we only capture Tuesday in the window. ```{r} slide_index(wday_vec, index_vec, ~.x, .before = days(2)) ``` Knowing this, we can swap out `slide_dbl()` for `slide_index_dbl()` to see how to correctly compute our 3 day rolling average. ```{r} mutate( company, roll_val = slide_dbl(sales, mean, .before = 2), roll_day = slide_index_dbl(sales, index, mean, .before = days(2)) ) ``` ## slide_period() With `slide_index()`, we always returned a vector of the same size as `.x`, and the idea was to build indices to slice `.x` with using "the current element of `.i` + some number of elements before/after it". `slide_period()` works a bit differently. It first breaks `.i` up into "time blocks" by some period (like monthly), and then uses those blocks to define how to slide over `.x`. To see an example, let's expand out our `company` sales data frame. ```{r} big_index_vec <- c( as.Date("2019-08-30") + 0:4, as.Date("2019-11-30") + 0:4 ) big_sales_vec <- c(2, 4, 6, 2, 8, 10, 9, 3, 5, 2) big_company <- tibble( sales = big_sales_vec, index = big_index_vec ) big_company ``` Now say we want to compute the monthly sales, and just return 1 value per month. Since we have 4 months, we should get 4 values back. What we really want to do here is break the `index` up into "time blocks" of 1 month, and then slide over those. That's what `slide_period()` does. ```{r} slide_period(big_company, big_company$index, "month", ~.x) ``` Since this returns 4 values, and not the same number of values as there are in `.x`, it won't fit naturally in a `mutate()` or `summarise()` statement. I find the easiest way to do this is to create a helper function that takes a data frame and returns one with the summary result for one time block, and then call that with `slide_period_dfr()`. ```{r} monthly_summary <- function(data) { summarise(data, index = max(index), sales = sum(sales)) } slide_period_dfr( big_company, big_company$index, "month", monthly_summary ) ``` Now you might be thinking, "I can do that with dplyr and lubridate!", and you'd be right: ```{r} big_company %>% mutate(monthly = floor_date(index, "month")) %>% group_by(monthly) %>% summarise(sales = sum(sales)) ``` But here is where things get interesting! Now what if we want to compute those monthly sales, but we want the time blocks to be made of the "current month block, plus 1 month block before it". For example, for the month of `2019-09`, it would include `2019-08` and `2019-09` together in the rolling summary. There isn't an easy way to do this in dplyr alone. With slider, there are two ways to do this. The first is with `slide_period_dfr()`, and it is as easy as adding `.before = 1`, to select the current month block and 1 before it. ```{r} slide_period_dfr( big_company, big_company$index, "month", monthly_summary, .before = 1 ) ``` Depending on your use case, you might want to append these results as a new column in `big_company`. To do this, we can instead go back to using `floor_date()` to generate monthly groupings, and slide over them using `slide_index_dbl()` with a lookback period of 1 month. ```{r} big_company %>% mutate( monthly = floor_date(index, "month"), sales_summary = slide_index_dbl(sales, monthly, sum, .before = months(1)) ) ```