Working Efficiently with Data

In this section we will focus on best practices for working efficiently with data, primarily tabular data.

Use appropriate data structures

To begin with, let’s have a look at differences in the performance of working with data stored in different data structures.

Matrix instead of data.frame

If all your tabular data is of a single data type (e.g. numeric, logical, character etc.), it can be more efficient to store it as a matrix than as a data.frame. That’s because many functions can operate on it with the confidence that all the data really is of a single data type, instead of having to perform data type coercions or checks on whether the operation is possible.

Let’s have a look at an example where we store the same data as a matrix and a data.frame. Our data is a large table with 10,000 rows and 150 columns.

set.seed(1)
rows <- 10000
cols <- 150

data_mat <- matrix(rnorm(rows * cols, mean = 10), ncol = cols)
data_df <- as.data.frame(data_mat)

Let’s now set up a bench::press() and test the performance of a number of functions on the two data structures.

bench::press(
    fn_name = c("rowSums", "rowMeans", "colSums", "colMeans", "sqrt"),
    {
        fn <- get(fn_name)
        bench::mark(
            df = fn(data_df),
            mat = fn(data_mat),
            check = FALSE
        )
    }
)
Running with:
  fn_name 
1 rowSums 
2 rowMeans
3 colSums 
4 colMeans
5 sqrt    
# A tibble: 10 × 7
   expression fn_name       min   median `itr/sec` mem_alloc `gc/sec`
   <bch:expr> <chr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
 1 df         rowSums    2.08ms   2.71ms      361.   11.53MB   356.  
 2 mat        rowSums  354.12µs 366.38µs     2724.   78.17KB     8.17
 3 df         rowMeans   2.07ms   2.87ms      366.   11.53MB   228.  
 4 mat        rowMeans 354.12µs  366.7µs     2723.   78.17KB     4.05
 5 df         colSums    3.09ms    3.6ms      270.   11.45MB   137.  
 6 mat        colSums    1.37ms   1.37ms      728.    1.22KB     0   
 7 df         colMeans   3.08ms    3.4ms      277.   11.45MB   142.  
 8 mat        colMeans   1.37ms   1.37ms      729.    1.22KB     0   
 9 df         sqrt       2.46ms   2.58ms      386.   11.74MB   252.  
10 mat        sqrt       1.56ms   1.66ms      571.   11.44MB   194.  

Matrices are clearly more performant, especially for row-wise operations. They are still faster for column-wise operations by a decent margin, as well as for vectorised mathematical operations like sqrt(), although less so.

Single vs double precision floating point

Base R numeric types are either “double”, which indicates double precision floating point, or “integer”.
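A quick way to check which of the two storage modes a value uses is typeof():

typeof(1)    # "double": the default for numeric values
typeof(1L)   # "integer": the L suffix creates an integer
typeof(1:3)  # "integer": whole-number sequences are stored as integers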

The float package enables storing numeric values as single precision floating points (aka floats). Floats have half the precision of double precision data, for a pretty obvious performance vs accuracy trade-off.

A matrix of floats should use about half as much memory as a matrix of doubles, which can be really useful if you’re dealing with data approaching memory limits:

library(float)
data_fl_mat <- fl(data_mat)

pryr::object_size(data_mat)
12.00 MB
pryr::object_size(data_fl_mat)
6.00 MB

In addition, many matrix routines will generally compute about twice as fast on floats as on doubles.

bench::press(
    fn_name = c("rowSums", "colSums", "sqrt"),
    {
        fn <- get(fn_name)
        bench::mark(
            mat = fn(data_mat),
            fl_mat = fn(data_fl_mat),
            check = FALSE
        )
    }
)
Running with:
  fn_name
1 rowSums
2 colSums
3 sqrt   
# A tibble: 6 × 7
  expression fn_name      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <chr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mat        rowSums 348.17µs  356.5µs     2794.   78.17KB     4.05
2 fl_mat     rowSums 155.72µs 160.23µs     6110.   41.35KB    21.2 
3 mat        colSums   1.37ms   1.37ms      727.    1.22KB     0   
4 fl_mat     colSums   1.41ms   1.41ms      708.      976B     4.04
5 mat        sqrt       1.6ms   1.96ms      539.   11.44MB   183.  
6 fl_mat     sqrt       487µs 518.69µs     1895.    5.72MB   190.  

By switching to single precision floating points, we see additional speed-ups for row-wise operations, although none for column-wise operations. We do, however, see a huge speed-up for the vectorised calculation of sqrt().

For more details on the routines available, consult the float package documentation.

Danger

A note of caution: The results of matrix routines on float data will not be as accurate, and are much more prone to roundoff error/mass cancellation issues. If your data is well-conditioned, then using floats is “probably” fine for many applications. They can be an invaluable approach to consider when working with data that would otherwise not fit into memory as doubles.
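As a small illustration of the precision trade-off (an illustrative example, not part of the benchmark data above), compare summing the same values stored as doubles and as floats:

library(float)
x <- rep(1/3, 1e6)      # one million values stored as doubles
x_fl <- fl(x)           # the same values stored as single precision floats
sum(x)                  # double precision result
dbl(sum(x_fl))          # single precision result: only ~7 significant digits, so less accurate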

Efficient Indexing

Before moving on to data munging and more complex manipulation of data, let’s briefly have a look at different approaches to indexing, i.e. the process of extracting specific elements of a data structure.

Indexing tabular data

Let’s set up a very simple test: say we want to extract rows 10 to 15 from the 55th column of our example data.

Let’s compare using a variety of base R approaches on both our matrix and data.frame. We’ll also include the dplyr version of extracting the same values.

bench::mark(
    data_mat[10:15, 55],
    data_df[[55]][10:15],
    data_df[["V55"]][10:15],
    data_df[10:15, 55],
    data_df[10:15, "V55"],
    dplyr = {
    select(data_df, V55) %>%
        slice(10:15) %>%
        pull()
    }
)
# A tibble: 6 × 6
  expression                   min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 data_mat[10:15, 55]     205.01ns 287.02ns  3139300.        0B      0  
2 data_df[[55]][10:15]      1.89µs   2.13µs   454729.        0B     45.5
3 data_df[["V55"]][10:15]   2.13µs   2.46µs   399046.        0B     39.9
4 data_df[10:15, 55]        3.94µs   4.39µs   220546.    1.22KB     22.1
5 data_df[10:15, "V55"]     4.18µs   4.76µs   202873.    1.22KB     20.3
6 dplyr                   964.52µs      1ms      955.    2.05MB     24.0
  • Indexing matrices is by far the fastest, almost by an order of magnitude!

  • It can be more efficient to extract vectors from a data.frame and then subset those instead of indexing rows and columns in the data.frame directly.

  • Base R is much faster than dplyr for simple indexing.

Indexing lists

Similarly, there are a variety of ways of indexing lists, some more efficient than others. Here I’ve just created a list where each element is an element of letters and each element’s name is the corresponding capital letter:

letter_list <- setNames(letters, LETTERS) |>
    as.list()

head(letter_list)
$A
[1] "a"

$B
[1] "b"

$C
[1] "c"

$D
[1] "d"

$E
[1] "e"

$F
[1] "f"

Say we want to access the value in the fourth element, element "D".

I’ve put together a number of approaches, ranging from using purrr::pluck() and piping the data (something I see quite often these days) to standard base R indexing using a numeric index or indexing by name.

bench::mark(
    magrittr_pipe = {
        letter_list %>%
        purrr::pluck("D")
    },
    base_r_pipe = {
        letter_list |>
        purrr::pluck("D")
    },
    purrr_no_pipe = {
        purrr::pluck(letter_list, "D") 
    },
    base_dollar_idx = {
        letter_list$D
    },
    base_chr_idx = {
        letter_list[["D"]]
    },
    base_int_idx = {
        letter_list[[4]]
    }
)
# A tibble: 6 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 magrittr_pipe     3.73µs   4.18µs   235530.     273KB     23.6
2 base_r_pipe       2.95µs    3.2µs   306067.        0B     30.6
3 purrr_no_pipe     2.95µs    3.2µs   307520.        0B     30.8
4 base_dollar_idx  41.04ns 123.05ns  7614980.        0B      0  
5 base_chr_idx           0  82.02ns 12174874.        0B      0  
6 base_int_idx           0  41.04ns 14392882.        0B   1439. 
  • The fastest way to index a list is using base R and a numeric index or character name.

  • Dollar sign indexing is slower (and not as safe) because R also looks for partial matches (see the short example after this list).

  • purrr::pluck() is orders of magnitude slower, especially when a pipe is thrown in there for good measure! If you’re doing simple indexing, I would avoid pluck() altogether.
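As a quick illustration of the partial-matching pitfall with $, here is a toy list (separate from letter_list above):

safety <- list(alpha = 1)
safety$al        # returns 1 via partial matching, so a typo can silently "work"
safety[["al"]]   # returns NULL: [[ ]] matches names exactly by default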

Base R can be extremely efficient at indexing into data structures, especially matrices.

Again, here we are in the realm of micro-optimisations, but if you’re going to be running your code 1000s of times, these micro-optimisations soon add up.

Efficient Data munging

In the real world, we’re often dealing with mixed types of data that cannot be stored in a matrix, and we need to perform more complicated operations than summing rows or indexing.

So let’s now turn to some of the most common data munging operations and compare and contrast the performance of a few of the most well-known packages for working with data.frames.

dplyr

dplyr is the flagship package of the tidyverse, providing a consistent set of verb functions to help address the most common data manipulation challenges in a user-friendly way.

PROs
  • well-integrated collection of functions for data munging.

  • easy to read and interpret code, even as a beginner.

  • reasoning about operations is made easier by the use of pipes, as well as by doing away with the need for intermediate objects.

  • In addition to data frames/tibbles, dplyr makes working with other computational backends like databases and arrow tables accessible and efficient.

CONs
  • quite verbose and code can end up running across many lines.

  • can be (comparatively) slow.

Example
mtcars %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
    group_by(cyl) %>% 
    summarise(l100k = mean(l100k))
# A tibble: 3 × 2
    cyl l100k
  <dbl> <dbl>
1     4  9.05
2     6 12.0 
3     8 14.9 

data.table

Provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. It has its own compact syntax that feels like a blend of base R and some dplyr concepts (e.g. the by argument for grouping operations).

At its most basic, data.table syntax can be summarised as

DT[i, j, by]

where i is used for filtering or reordering rows, j is used for manipulating and selecting columns and by is used for grouping operations.

Instead of piping, data.table uses the concept of chaining, where subsequent expressions are performed one after another to create a chain of operations through the construct DT[...][...][...] etc.
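As a minimal standalone sketch (using mtcars purely for illustration, and accessing data.table via :: since the package is only attached in the Example below):

dt <- data.table::as.data.table(mtcars)

# i filters rows, j computes, by groups:
dt[mpg > 20, .(mean_wt = mean(wt)), by = cyl]

# chaining: filter, then aggregate by group, then order the result:
dt[wt < 5][, .(mean_mpg = mean(mpg)), by = cyl][order(cyl)]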

PROs
  • syntax is very compact

  • generally faster for many operations, especially as the size of the dataset grows

  • operations that modify data in place improve memory efficiency and can also boost performance

  • extremely fast functionality for reading in data through the fread() function.

CONs
  • syntax can be confusing to understand and work with until you’re familiar with the package, especially when chaining multiple operations
Example

library(data.table)

Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':

    between, first, last
mtcars_dt <- as.data.table(mtcars)
mtcars_dt[wt < 5, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
    keyby = .(cyl)]
   cyl     l100k
1:   4  9.048898
2:   6 11.970180
3:   8        NA

Note that the mean for cyl == 8 comes back as NA here: the := call only fills l100k for rows where wt < 5, so the heavier 8-cylinder cars contribute NAs when the mean is then taken over all rows, unlike the dplyr example above, which dropped those rows before summarising.

dtplyr

dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code. The current implementation focuses on lazy evaluation triggered by use of lazy_dt(). This means that no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble().

PROs
  • provides ability to write dplyr code with often improved performance.

  • can be useful for learning how to translate dplyr code to data.table code.

CONs
  • does not always reach the performance of data.table and translations for some operations are yet to be available.
Example

First a lazy_dt needs to be created with lazy_dt(). You can then use most dplyr functions as normal. Executing the code shows the data.table translation in the Call: section at the top. This can be really useful when trying to learn data.table syntax.

library(dtplyr)
mtcars_dtp <- lazy_dt(mtcars)
mtcars_dtp %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
    group_by(cyl) %>% 
    summarise(l100k = mean(l100k))
Source: local data table [3 x 2]
Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
    keyby = .(cyl)]

    cyl l100k
  <dbl> <dbl>
1     4  9.05
2     6 12.0 
3     8 14.9 

# Use as.data.table()/as.data.frame()/as_tibble() to access results

The results of executing the above code cannot be accessed until one of as.data.table()/as.data.frame()/as_tibble() or even collect() is called at the end.

mtcars_dtp %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
    group_by(cyl) %>% 
    summarise(l100k = mean(l100k)) %>%
    as_tibble()
# A tibble: 3 × 2
    cyl l100k
  <dbl> <dbl>
1     4  9.05
2     6 12.0 
3     8 14.9 

Benchmarking data munging operations

Load data and create data structures for comparison

In this section we’ll use the synthetic datasets created for this course. I’ll be using the file with 10,000,000 rows, but if you prefer to use a smaller one, feel free to. Note though that, in general, the benefits of using data.table increase with the size of the dataset.

Let’s go ahead and load our parquet data using arrow::read_parquet() which loads the data as a tibble.

data_df <- arrow::read_parquet("data/synthpop_10000000.parquet")

We can have a look at the characteristics of our data using skimr::skim().

Note: this can take a lot of time to compute so feel free to skip this step.

skimr::skim(data_df)
Data summary
Name data_df
Number of rows 10000000
Number of columns 13
_______________________
Column type frequency:
character 6
logical 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1.00 4 6 0 2 0
agegr 8096 1.00 3 5 0 6 0
socprof 66184 0.99 6 27 0 9 0
marital 17069 1.00 6 18 0 6 0
edu 13038 1.00 9 24 0 4 0
location 0 1.00 3 26 0 2623 0

Variable type: logical

skim_variable n_missing complete_rate mean count
sport 0 1 0.66 TRU: 6565029, FAL: 3434971
smoke 0 1 0.26 FAL: 7420496, TRU: 2579504
alcabuse 0 1 0.06 FAL: 9370742, TRU: 629258

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 47.68 18.51 16.00 32.0 49.00 61.00 97.00 ▆▆▇▃▁
income 2573137 0.74 1639.42 1222.73 100.00 960.0 1330.00 2000.00 16000.00 ▇▁▁▁▁
nociga 7473351 0.25 15.41 7.91 1.00 10.0 15.00 20.00 60.00 ▇▇▁▁▁
bmi 121355 0.99 25.95 4.67 12.96 22.6 25.61 28.72 61.36 ▃▇▁▁▁

Let’s now create a data.table and lazy_dt from our data to run our benchmarks against.

data_dt <- as.data.table(data_df)
data_dtp <- lazy_dt(data_dt)

Basic Benchmarks

First, let’s compare some basic operations on our data and include base R approaches.

Sub-setting

Let’s start with some simple sub-setting.

Column sub-setting

First let’s look at column sub-setting for the columns "age", "marital", "income" and "sport" and compare base R, dplyr, dtplyr and data.table approaches:

bench::mark(
    "Base R" = {
        data_df[, c("age", "marital", "income", "sport")]
    },
    "dplyr" = {
        select(data_df, age, marital, income, sport)
    },
    "dtplyr" = {
        select(data_dtp, age, marital, income, sport) %>%
            as_tibble()
    },
    "data.table" = {
        data_dt[, .(age, marital, income, sport)]
    },
    check = FALSE)
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 Base R       9.43µs  10.04µs   98372.         0B     19.7
2 dplyr      990.11µs   1.03ms     968.     16.8KB     16.9
3 dtplyr      33.82ms  34.34ms      25.4   381.8MB     33.9
4 data.table  32.32ms  32.46ms      30.6   381.5MB     38.3

We can see that in the case of column sub-setting, base R is actually highly performant, both in terms of memory and speed; here it is roughly two orders of magnitude faster than dplyr, which is the next fastest.

Surprisingly, data.table and dtplyr are both comparatively slow for simple column sub-setting operations.

Row filtering

Let’s move on to comparing row filtering approaches. Let’s filter for rows where values of income are not NA and age is greater than 30:

bench::mark(
  "Base R" = {
    data_df[!is.na(data_df$income) & data_df$age > 30, ]
  },
  "dplyr" = {
    filter(data_df, 
                  !is.na(income) & age > 30) 
  },
  "dtplyr" = {
    filter(data_dtp, 
                  !is.na(income) & age > 30) %>%
          as_tibble()
  },
  "data.table" = {
    data_dt[!is.na(income) & age > 30, ]
  },
  check = FALSE)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 Base R        1.03s    1.03s     0.966     662MB     7.73
2 dplyr      972.55ms 972.55ms     1.03      777MB     9.25
3 dtplyr     287.76ms 303.52ms     3.29      700MB     1.65
4 data.table 282.24ms 312.47ms     3.20      700MB     1.60

In the case of row filtering, we find exactly the opposite! Base R and dplyr perform similarly, but both are significantly slower than data.table and dtplyr.

This is actually where data.table (and, by extension, dtplyr) really shines: filtering rows.

Tip

Filtering using data.table can be sped up even more using keys and secondary indices. So there’s a lot of potential for further optimisation if you need to perform repeated filtering or aggregating on specific columns. Consult the data.table documentation for more details.
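As a rough sketch of what that can look like (run here on a copy, so the data_dt object used in the benchmarks above is left untouched):

dt_keyed <- copy(data_dt)
setkey(dt_keyed, age)          # physically sorts by age and marks it as the key
dt_keyed[.(30)]                # keyed (binary search) lookup of rows where age == 30

setindex(dt_keyed, income)     # secondary index: no physical reordering of the data
dt_keyed[income == 2000]       # equality filters like this can make use of the index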

Combined column and row sub-setting

Lastly, let’s perform sub-setting involving both the row and column sub-setting we looked at previously.

bench::mark(
  "Base R" = {
    data_df[!is.na(data_df$income) & data_df$age > 30, 
            c("age", "marital", "income", "sport")]
  },
  "dplyr" = {
    filter(data_df, 
                  !is.na(income) & age > 30) %>%
          select(age, marital, income, sport)
  },
  "dtplyr" = {
    filter(data_dtp, 
                  !is.na(income) & age > 30) %>%
          select(age, marital, income, sport) %>%
          as_tibble()
  },
  "data.table" = {
    data_dt[!is.na(income) & age > 30, 
            .(age, marital, income, sport)]
  },
  check = FALSE)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 Base R        750ms    750ms      1.33     298MB    10.7 
2 dplyr         946ms    946ms      1.06     777MB     9.51
3 dtplyr        158ms    162ms      5.68     482MB     1.89
4 data.table    172ms    174ms      5.31     482MB     5.31

Overall, the computational cost of filtering rows overshadows that of sub-setting columns, so data.table emerges as the overall winner, and in most cases the performance boost will increase with the size of the dataset.

Ordering

Let’s now have a look at performance of ordering. Let’s order our data on the values of a single column, income.

In data.table there are a couple of approaches that can be used. The first is to use order() in i, which creates a vector of indices reflecting the order of the values in the column passed to order(), and effectively uses those indices to sub-set the rows in the correct order. This version orders NAs at the bottom, as do the other approaches tested here (apart from setorder(), below).

A more efficient approach is to use data.table’s setorder() function. This version orders NAs at the top.

Because setorder() modifies data_dt in place, in one test we’ll perform the ordering on a copy of data_dt, created with the copy() function, to better reflect the behaviour of the other expressions.

We’ll also test the speed of modifying in place, though. The modification-in-place behaviour of data.table poses challenges in a repeated testing environment because, once the object has been ordered the first time, subsequent runs no longer reflect any real ordering work. To address this, I’ve turned memory profiling off (memory profiling runs the code once to get the memory profile, regardless of the number of test iterations) and set the number of iterations to 1 so that each test is run only once.

bench::mark(
  "Base R" = {
    data_df[order(data_df$income),]
  },
  "dplyr" = {
    arrange(data_df, income)
  },
  
  "dtplyr" = {
    arrange(data_dtp, income) %>%
          as_tibble()
  },
  "data.table_order" = {
    data_dt[order(income)]
  },
  "data.table_setorder_copy" = {
    setorder(copy(data_dt), income)
  },
  "data.table_setorder" = {
      setorder(data_dt, income)
  },
  iterations = 1,
  memory = FALSE,
  check = FALSE)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 6 × 6
  expression                    min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 Base R                      1.24s    1.24s     0.806        NA    1.61 
2 dplyr                       1.12s    1.12s     0.892        NA    0.892
3 dtplyr                   977.45ms 977.45ms     1.02         NA    0    
4 data.table_order            1.08s    1.08s     0.922        NA    0.922
5 data.table_setorder_copy 526.98ms 526.98ms     1.90         NA    0    
6 data.table_setorder      455.79ms 455.79ms     2.19         NA    0    

Base R, dplyr and ordering a data.table using order() come back as the slowest approaches, with dtplyr marginally faster than those. setorder() is clearly faster, even when run on a copy of the object, and fastest of all when allowed to modify data_dt in place.

Before moving on, let’s reset data_dt, which we just modified in place.

data_dt <- as.data.table(data_df)

Mutating

For our final basic test, let’s have a look at mutating, i.e. creating a new column from a calculation performed on the values of another column in our dataset. For this example, we’ll calculate the relative income compared to the mean income across the whole population.

To compare similar behaviour and allow us to include a comparison to base R, we’ll write our tests so that the original object is actually modified. We’ll again turn off memory profiling and set the number of iterations to 1 to ensure we’re not re-modifying previously modified objects, which could affect our results. Also, to ensure our original data objects are not overwritten by the test, we perform the testing in a new, separate environment using bench::mark()’s env argument. This, however, does not work for data.tables, so we’ll again need to reset data_dt once we’re done.

bench::mark(
    "Base R" = {
        data_df$rel_income <- data_df$income/mean(data_df$income, na.rm = TRUE)
    },
    "dplyr" = {
        data_df <- mutate(data_df, rel_income = income/mean(income, na.rm = TRUE))
    },
    "dtplyr" = {
        data_dtp <- mutate(data_dtp, rel_income = income/mean(income, na.rm = TRUE)) %>%
            as_tibble()
    },
    "data.table" = {
        data_dt[, rel_income := income/mean(income, na.rm = TRUE)]
    },
  iterations = 1,
  memory = FALSE,
  check = FALSE,
  env = new.env())
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 Base R        1.39s    1.39s     0.719        NA    10.1 
2 dplyr         1.07s    1.07s     0.931        NA    14.0 
3 dtplyr     116.46ms 116.46ms     8.59         NA     0   
4 data.table 131.61ms 131.61ms     7.60         NA     7.60

Here data.table and dtplyr are the clear winners: both are over 10x faster than base R and around 8x faster than dplyr, with the two performing comparably to each other.

Because we’ve modified data_dt again, let’s go ahead and reset it before moving on.

data_dt <- as.data.table(data_df)

More complex examples

Now that we’ve looked at the isolated performance of different types of data munging, let’s explore performance of more complex computations on our data. This also gives us an opportunity to explore data.table syntax a bit more and compare to dplyr syntax.

Example 1

In this example we’ll combine filtering, selecting and arranging operations and this time we’ll perform arranging over a larger number of columns:

bench::mark(
    dplyr = {
        filter(data_df,
               age > 50L & age < 60L, 
               income < 300) %>%
            select(bmi, age, income, nociga, sex) %>%
            arrange(bmi, age, income, nociga, sex)
    },
    dtplyr = {
        filter(data_dtp,
               age > 50L & age < 60L, 
               income < 300) %>%
            select(bmi, age, income, nociga, sex) %>%
            arrange(bmi, age, income, nociga, sex) %>%
            as_tibble()
        },
    data.table = {
        data_dt[age > 50L & age < 60L & income < 300, 
                .(bmi, age, income, nociga, sex)][
                    order(bmi, age, income, nociga, sex)
                    ]
    },
    iterations = 5,
    check = FALSE
)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 dplyr       117.6ms  124.3ms      8.04     269MB     2.01
2 dtplyr       90.8ms   91.4ms     10.9      230MB     2.72
3 data.table   89.9ms     91ms     11.0      230MB     2.75

data.table is fastest with dtplyr close behind, yet the differences are not as big as some of the order of magnitude differences we’ve seen in other examples.

Example 2

Let’s explore some performance differences when aggregating across groups and calculating summary statistics. We’ll start with some complex filtering, then group our results by marital status and calculate the min, max and mean income for each group.

bench::mark(
        dplyr = {
            filter(data_df,
                   age > 65L,
                   sex == "MALE",
                   sport == TRUE,
                   !is.na(income),
                   !is.na(marital)) %>%
                group_by(marital) %>%
                summarise(min_income = min(income),
                          max_income = max(income),
                          mean_income = mean(income))
    },
    dtplyr = {
            filter(data_dtp,
                   age > 65L,
                   sex == "MALE",
                   sport == TRUE,
                   !is.na(income),
                   !is.na(marital)) %>%
                group_by(marital) %>%
                summarise(min_income = min(income),
                          max_income = max(income),
                          mean_income = mean(income)) %>%
            as_tibble()
    },
    data.table = {
        data_dt[ age > 65L &
                   sex == "MALE" &
                   sport == TRUE &
                   !is.na(income) &
                   !is.na(marital), 
                  .(min_income = min(income),
                          max_income = max(income),
                          mean_income = mean(income)),
                 keyby = .(marital)]
        },
    iterations = 5,
    memory = FALSE,
    check = FALSE
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 dplyr         895ms    899ms      1.05        NA    8.84 
2 dtplyr        215ms    221ms      4.41        NA    0.882
3 data.table    206ms    207ms      4.71        NA    0.941

Here we’re back to a substantial difference in performance (roughly four-fold) between data.table/dtplyr and dplyr, primarily due to the excellent performance of data.table on filtering.

Example 3

In our third example we’ll add some mutating, creating a new column, income_group, which splits income into income brackets. We’ll then calculate the mean bmi across each income group.

bench::mark(
    dplyr = {
        filter(data_df, !is.na(income)) %>%
            mutate(income_group = cut(income,
                                      breaks = seq(0, 16000, by = 1000),
                                      include.lowest = T,
                                      right = F)
                   
            ) %>%
            group_by(income_group) %>%
            summarise(bmi_mean = mean(bmi, na.rm = TRUE))
    },
    dtplyr = {
        filter(data_dtp, !is.na(income)) %>%
            mutate(income_group = cut(income,
                                      breaks = seq(0, 16000, by = 1000),
                                      include.lowest = T,
                                      right = F)
                   
            ) %>%
            group_by(income_group) %>%
            summarise(bmi_mean = mean(bmi, na.rm = TRUE)) %>%
            as_tibble()
    },
    data.table = {
        data_dt[!is.na(income)][, 
                                `:=`(income_group = cut(income, 
                                                        breaks = seq(0, 16000, by = 1000),
                                                        include.lowest = TRUE, 
                                                        right = FALSE)
                                )][, .(bmi_mean = mean(bmi, na.rm = TRUE)), 
                                   keyby = .(income_group)]
    },
    iterations = 5,
    check = FALSE
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 dplyr         1.23s    1.34s     0.739    1.38GB     6.36
2 dtplyr     537.83ms 659.77ms     1.49     1.18GB     1.49
3 data.table 587.47ms 635.62ms     1.50     1.18GB     1.81

In this example, data.table and dtplyr are comparable and still much faster than dplyr.

Example 4

For our final example, we’ll again perform some initial filtering but this time aggregate across location, which has a much higher number of groups than marital status. We’ll then calculate the mean number of cigarettes smoked (among smokers only) and the proportion of the population at a given location that are smokers.

bench::mark(
    dplyr = {
        filter(data_df, 
               age < 30) %>%
            group_by(location) %>%
            summarise(nociga_mean = mean(nociga, na.rm = TRUE),
                      prop_smoke = sum(smoke)/n())
    },
    dtplyr = {
        filter(data_dtp, 
               age < 30) %>%
            group_by(location) %>%
            summarise(nociga_mean = mean(nociga, na.rm = TRUE),
                      prop_smoke = sum(smoke)/n()) %>%
            as_tibble()
    },
    data.table = {
        data_dt[age < 30][, .(nociga_mean = mean(nociga, na.rm = TRUE), 
                              prop_smoke = sum(smoke)/.N), 
                          keyby = .(location)]
    },
    iterations = 5,
    check = FALSE
)
# A tibble: 3 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 dplyr         171ms    173ms      5.80     412MB     3.87
2 dtplyr        134ms    136ms      7.36     262MB     0   
3 data.table    129ms    131ms      7.55     262MB     1.89

Here, although data.table and dtplyr are again comparable and still faster than dplyr, the difference in performance is much smaller, indicating that when aggregating across many groups, dplyr’s relative performance catches up.

Take Aways
  • Overall data.table performs better across most data munging tasks, often significantly so, with dtplyr being comparable or slightly slower but generally faster than dplyr.

  • data.table/dtplyr are especially fast at filtering rows, less so at ordering data.

  • Base R is generally much slower for most data munging operations, apart from selecting columns, where it can actually be the fastest option by quite some margin.

A side note on copying

One of the features that makes data.table very efficient is that it modifies data in place: operators like := and the set*() functions work by reference, without creating copies of the data.table. This behaviour also extends to using dtplyr on a lazy_dt object.

Let’s have a look at a quick example. Let’s perform some data munging on a tibble, data.table and lazy_dt version of the mtcars data while using tracemem() to keep track of any copies being made during the operations.

Let’s filter for wt < 5 and convert miles per gallon (mpg) into liters per 100 km in a new column called l100k.

# Create data structures
mtcars_tbl <- as_tibble(mtcars)
mtcars_dt <- as.data.table(mtcars)
mtcars_dtp <- lazy_dt(mtcars)

# Tracemem objects
tracemem(mtcars_tbl)
[1] "<0x145ceefc8>"
tracemem(mtcars_dt)
[1] "<0x12493ee00>"
tracemem(mtcars_dtp)
[1] "<0x120bed068>"
tibble
# Munge tibble
mtcars_tbl %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg) 
tracemem[0x145ceefc8 -> 0x145c1b078]: initialize <Anonymous> filter_rows filter.data.frame filter mutate %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x145c1b078 -> 0x145c1b128]: names<-.tbl_df names<- initialize <Anonymous> filter_rows filter.data.frame filter mutate %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x145c1b128 -> 0x145c1b1d8]: initialize <Anonymous> filter_rows filter.data.frame filter mutate %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
# A tibble: 29 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb l100k
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 11.2 
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 11.2 
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 10.3 
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 11.0 
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 12.6 
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 13.0 
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 16.4 
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2  9.64
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 10.3 
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 12.3 
# … with 19 more rows

Performing the filtering and mutating on a tibble results internally in 3 copies being made! Surprisingly, this happens even when we are just filtering:

mtcars_tbl %>% 
    filter(wt < 5)
tracemem[0x145ceefc8 -> 0x127de6208]: initialize <Anonymous> filter_rows filter.data.frame filter %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x127de6208 -> 0x127de62b8]: names<-.tbl_df names<- initialize <Anonymous> filter_rows filter.data.frame filter %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x127de62b8 -> 0x127de6368]: initialize <Anonymous> filter_rows filter.data.frame filter %>% eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> execute .main 
# A tibble: 29 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 19 more rows
data.table

When performing the same operation on a data.table, no copies are made:

# Munge data.table
mtcars_dt[wt < 5, `:=`(l100k = 235.21/mpg)]

Indeed, we don’t even get the results printed out. That’s because the data.table was modified in place, without having to assign the result back to the original object. If we have a look at the object, we can see that it now has the new l100k column.

head(mtcars_dt)
    mpg cyl disp  hp drat    wt  qsec vs am gear carb    l100k
1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 11.20048
2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 11.20048
3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 10.31623
4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 10.99112
5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 12.57807
6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 12.99503

lazy_dt

Let’s have a look at what happens when using dtplyr on a lazy_dt object.

mtcars_dtp %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg) %>%
    as_tibble()
# A tibble: 29 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb l100k
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 11.2 
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 11.2 
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 10.3 
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 11.0 
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 12.6 
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 13.0 
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 16.4 
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2  9.64
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 10.3 
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 12.3 
# … with 19 more rows

Interestingly, we get similar behaviour to data.table with respect to copying, in that no copies are made during the operation. However, it does not modify the object in place either. If we inspect the original object, there is no l100k column.

mtcars_dtp
Source: local data table [32 x 11]
Call:   `_DT3`

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# … with 26 more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results

We would need to assign it back to mtcars_dtp if we wanted to replicate data.table behaviour.
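For example (a minimal sketch reusing the objects above):

mtcars_dtp <- mtcars_dtp %>% 
    filter(wt < 5) %>% 
    mutate(l100k = 235.21 / mpg)

# mtcars_dtp now carries the filter and the new l100k column in its lazy call chain;
# the result is still only computed when as_tibble()/as.data.table()/collect() is called.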