--- title: "Design behind sparsevctrs" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Design behind sparsevctrs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(sparsevctrs) ``` The sparsevctrs package produces 3 things; ALTREP classes, matrix/data.frame converting functions, helper functions. This document outlines the rationale behind each of these and the decisions behind them. The primary objective of this package is to provide tools to work with sparse data in data.frames/tibbles. The next highest priority is execution speed. This means that algorithms and methods in this package are written to minimize memory allocations whenever possible, once that is done, running the code as fast as we can. These choices are made because this package was written to deal with tasks that were otherwise not possible due to memory constraints. ## Altrep Functions The functions `sparse_double()` and its relatives are used to construct sparse vectors of the noted type. To work they all need 4 pieces of information: - `values` - `positions` - `length` - `default` (defaults to 0) The values need to match the type of the function name or be easily coerced into the type (double -> integer). The positions should be integers or doubles that can losslessly be turned into integers. The length should be a single non-negative integer-like value. Values and positions are paired, and will thus be expected to be the same length, furthermore, positions are expected to be sorted in increasing order with no duplicates. The ordering is done to let the various extraction methods work as efficiently as possible. These functions have quite strict input checking by choice, to allow the inner workings to be as efficient as possible. The input of these functions mirrors the values stored in the ALTREP class that they produce. ## Converting Functions 3 functions fall into this category: - `coerce_to_sparse_data_frame()` - `coerce_to_sparse_tibble()` - `coerce_to_sparse_matrix()` the first two take a sparse matrix from the Matrix package and produce a data.frame/tibble with sparse columns. The last one takes a data.frame/tibble with sparse columns and produces a sparse matrix using the Matrix package. These functions are expected to be inverse of each other, such that `coerce_to_sparse_matrix(coerce_to_sparse_data_frame(x))` returns `x` back. They are made to be highly performant both in terms of speed and memory consumption, Meaning that sparsity is applied when appropriate. These functions have quite strict input checking by choice, to allow the inner workings to be as efficient as possible. It is in part why data.frames with sparse vectors with different can't be used with `coerce_to_sparse_matrix()` yet. ## Helper Functions There are 3 types of helper functions. First, we have the `is_*` family of functions. The specific `is_sparse_double()` and more general `is_sparse_vector()` can be used as a way to determine whether a vector is an ALTREP sparse vector. This is otherwise hard to tell as `as.numeric()` can't tell the difference. Secondly, we have the extraction functions. They are `sparse_values()` and `sparse_positions()`. These extract the values and positions respectively, without materializing the whole dense vector. These functions are made to work with non-sparse vectors as well to make them more ergonomic for the user. Internally they call `is_sparse_vector()`, so the choice to return something useful as the alternative wasn't hard. There is no `sparse_length()` function as `length()` works with these types of The last type of helper function is less clearly defined and is expanded as needed. The functions provide alternatives to functions that don't have ALTREP support. Such as `mean()`. Calling `mean()` on a sparse vector will force materialization, and then calculate the mean. This is memory inefficient as it could have been calculated like so. ```r sum(sparse_values(x)) / length(x) ``` These functions, all starting with the name prefix `sparse_*`, are made to work with non-sparse vectors for the same reasons listed above regarding ergonomic use. ## FAQ > Why aren't the results returned as {vctrs} classes? As it stands right now, it is viewed to be beneficial to have the users not be alerted to these vectors as they are expected to be used internally in packages and rarely by the end user. Furthermore having these sparse vectors produce the same result as dense vectors with `class()` is a big plus. > Will this package try to replace the {Matrix} package? Not at all. The sparse vector types provided in this package mimic those created with `Matrix::sparseVector()`. They work with different types and allow for different defaults. None of the matrix operations will be reimplemented here.