Data too big for memory

Working efficiently with Data

We’ve had a look at approaches for working with in-memory data more efficiently. But what if the data we want to work with just cannot fit into memory?

There are a number of approaches to deal with this situation, and which one is appropriate will depend on what exactly we need from the data to perform our operation.

For example:

Databases

Databases are an appropriate choice if you have large amounts of data that can’t fit into memory but only need subsets of it at any one time, which you can extract using queries.

There are many types of databases, most of which are beyond the scope of this workshop. What we will focus on here is simple relational databases that store tabular data in single flat files (a.k.a. embedded databases), as opposed to databases run through a server, like MySQL, Microsoft SQL Server or PostgreSQL, or databases that do not store tabular data, for example MongoDB, which stores data as documents.

We also focus on databases that can be queried using SQL. SQL (which stands for Structured Query Language) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them.

It’s good to have an idea of SQL basics when interacting with databases, but in R many of the dplyr verbs are inspired by SQL commands, and the dbplyr package can take dplyr operations and translate them to SQL, letting you query databases much as you would data.frames or tibbles.

As such we can build up our queries using a connection to a database and only collect our data explicitly when we are ready for R to execute the query.
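
As a rough sketch of this pattern (using a hypothetical connection con and table "mytable" with a column x, purely for illustration; we build a real example below):

library(dplyr)

query <- tbl(con, "mytable") %>%   # lazy reference to the remote table
    filter(x > 10) %>%             # translated to SQL, nothing executed yet
    summarise(n = n())

result <- collect(query)           # only now does the database run the query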

SQLite

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.

The SQLite file format is stable, cross-platform, and backwards compatible.

SQLite source code is in the public-domain and is free to everyone to use for any purpose.

Let’s start our experimentation by creating a simple SQLite database with a single table.

The data we will use is contained in data/synthpop_10000000.parquet and represents characteristics of 10,000,000 individuals from a synthetic population.

I know this section is about data too big for memory and this is not an approach you can use for data truly larger than memory. But because I want to benchmark against in-memory data and can fit it into memory on my machine, I will actually load it into memory and write it directly to the database. However, I also show a way to populate the database in batches if you find the file is overloading your memory. You can also choose to use one of the smaller synthpop parquet files (e.g. data/synthpop_1000000.parquet, which contains 1,000,000 rows).

So let’s load our data and have a look at it:

data <- arrow::read_parquet("data/synthpop_10000000.parquet")
head(data)
     sex age agegr                     socprof income  marital
1   MALE  47 45-59                      FARMER   2000  MARRIED
2 FEMALE  43 35-44 OTHER ECONOMICALLY INACTIVE     NA  MARRIED
3   MALE  26 25-34  EMPLOYED IN PRIVATE SECTOR   1400   SINGLE
4 FEMALE  51 45-59   EMPLOYED IN PUBLIC SECTOR     NA DIVORCED
5 FEMALE  67   65+     LONG-TERM SICK/DISABLED    750  WIDOWED
6   MALE  56 45-59                  UNEMPLOYED   1200  MARRIED
                       edu sport smoke nociga alcabuse      bmi   location
1                SECONDARY  TRUE FALSE     NA    FALSE 27.45982 Birmingham
2 POST-SECONDARY OR HIGHER FALSE FALSE     NA    FALSE 22.30815    Grimsby
3                SECONDARY FALSE  TRUE     15    FALSE 20.58967  Liverpool
4 POST-SECONDARY OR HIGHER FALSE FALSE     NA    FALSE 28.93407 Colchester
5       VOCATIONAL/GRAMMAR  TRUE  TRUE     20    FALSE 27.34375   Caerdydd
6 POST-SECONDARY OR HIGHER  TRUE FALSE     NA    FALSE 25.71166     Welwyn

This is quite a large dataframe.

pryr::object_size(data)
600.17 MB

Next let’s load some useful libraries, create our database and store the connection to said database in a variable.


library(DBI)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
con <- dbConnect(drv = RSQLite::SQLite(), "data/db.sqlite")

The above command creates an SQLite database at path data/db.sqlite.

con
<SQLiteConnection>
  Path: /Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite
  Extensions: TRUE

We can also see that the con object is tiny, only about 2.5 kB. That’s because it’s just a connection to the database and does not contain any data itself.

pryr::object_size(con)
2.50 kB

Now let’s go ahead and write our data to a table in our database. For this we can use DBI’s function dbWriteTable. This will both create the table in our database and also write the data to it. The arguments we need to provide are:

  • conn (the first argument) where we provide the connection to the database we want to write to.

  • name the name of the table we want to create.

  • value the object containing the data we want to write to the table. This must be a data.frame or an object coercible to a data.frame.

dbWriteTable(con, name = "synthpop", data)

Once writing the table is complete (this might take a little while), we can do some initial checks on our data using some more DBI functions:

  • dbListTables lists the names of all tables in the database

  • dbListFields lists all fields in a given table ("synthpop")

[1] "synthpop"
dbListFields(con, "synthpop")
 [1] "sex"      "age"      "agegr"    "socprof"  "income"   "marital" 
 [7] "edu"      "sport"    "smoke"    "nociga"   "alcabuse" "bmi"     

Chunked method to populate the database

If the data is too big to load into memory and then write to a database, an option is to populate the database in chunks. This involves using readr::read_csv_chunked to write to the database in batches, as detailed in the following blogpost by Michael Culshaw-Maurer.

Here’s some example code of how we could populate our database from the 1,000,000 row csv file we created in the Data Input/Output section:

readr::read_csv_chunked("data/write/synthpop_1000000.csv", 
                        callback = function(chunk, dummy){
                            dbWriteTable(con, "synthpop", chunk, append = T)
                        }, 
                        chunk_size = 10000)

Interacting with database through dplyr & dbplyr

dbplyr is the database backend for dplyr. It allows you to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL.

All dplyr calls are evaluated lazily, generating SQL that is only sent to the database when you request the data.

So let’s start using our connection to access some data. For that we can use the function tbl(). Just as dbConnect() opens a connection to a database, we can think of tbl() as opening a connection to a single table in the database, in this case "synthpop".

tbl <- tbl(con, "synthpop")

If we print tbl we can see all columns in the database and the first 10 rows, which looks a bit like printing a tibble, but if we look at the header of information above the data, we can see the database source as well as [?? x 12] in the dimensions summary. That’s because tbl does not contain the full table data, just a connection to it, and therefore is not aware of the number of rows of the complete table.

tbl
# Source:   table<synthpop> [?? x 12]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
   sex      age agegr socprof    income marital edu   sport smoke nociga alcab…¹
   <chr>  <dbl> <chr> <chr>       <dbl> <chr>   <chr> <int> <int>  <dbl>   <int>
 1 MALE      47 45-59 FARMER       2000 MARRIED SECO…     1     0     NA       0
 2 FEMALE    43 35-44 OTHER ECO…     NA MARRIED POST…     0     0     NA       0
 3 MALE      26 25-34 EMPLOYED …   1400 SINGLE  SECO…     0     1     15       0
 4 FEMALE    51 45-59 EMPLOYED …     NA DIVORC… POST…     0     0     NA       0
 5 FEMALE    67 65+   LONG-TERM…    750 WIDOWED VOCA…     1     1     20       0
 6 MALE      56 45-59 UNEMPLOYED   1200 MARRIED POST…     1     0     NA       0
 7 MALE      86 65+   RETIRED      1260 MARRIED SECO…     1     0     NA       0
 8 MALE      59 45-59 LONG-TERM…   1400 MARRIED POST…     1     0     NA       0
 9 FEMALE    52 45-59 EMPLOYED …   1500 MARRIED SECO…     1     1     30       1
10 FEMALE    22 16-24 PUPIL OR …    600 SINGLE  SECO…     0     0     NA       0
# … with more rows, 1 more variable: bmi <dbl>, and abbreviated variable name
#   ¹​alcabuse

Let’s have a look at the tbl class. The important class identifiers are "tbl_dbi" and "tbl_sql" which indicate any data manipulation on the tbl will be translated to SQL, will be lazy and will return another "tbl_dbi", not the actual result of the query.

class(tbl)
[1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
[4] "tbl_lazy"             "tbl"                 

Getting the number of rows of a database table

So what if we did want to know the number of rows of the "synthpop" table, to double check we have written it in fully?

We might try a familiar R function, nrow():

nrow(tbl)
[1] NA

But this doesn’t work! That’s because there is no translation of R function nrow() to SQL.

We’ll need to frame our request as something that can be translated into an SQL query by dbplyr.

tbl %>% 
    summarize(n = n())
# Source:   SQL [1 x 1]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
         n
     <int>
1 10000000

summarize(n = n()) gets translated to the SQL COUNT function, an aggregate function that returns a single value, hence what is returned to us is another tbl_dbi of 1 x 1 dimensions.

Note

To learn more about which R functions are translated by dbplyr to SQL have a look at the package’s vignettes on Verb and Function translation.
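
For a quick interactive look at how individual R expressions get translated, dbplyr’s translate_sql() helper can be used with a simulated connection (a small aside, not part of the original walkthrough):

dbplyr::translate_sql(mean(income, na.rm = TRUE), con = dbplyr::simulate_sqlite())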

We can inspect the SQL statement generated by dbplyr by piping the query to dplyr’s show_query() function.

tbl %>% 
    summarize(n = n()) %>%
    show_query()
<SQL>
SELECT COUNT(*) AS `n`
FROM `synthpop`

Remember that just running the query returns another tbl_dbi. To be able to compute on the result in R we need to collect() it.

db_nrows <- tbl %>% 
    summarize(n = n()) %>%
    collect()

db_nrows
# A tibble: 1 × 1
         n
     <int>
1 10000000
pull(db_nrows) == nrow(data)
[1] TRUE

We have now checked that our data was fully written to our database table.

Filtering, selecting and summarising

As mentioned, many of dplyr’s verbs, as well as a number of aggregating and arithmetic functions, can be translated to SQL by dbplyr. For greatest efficiency it’s good to try and perform as many operations as possible in SQL before collecting the data. These are performed by the database’s SQL engine, which is generally more efficient when working with large data.

Let’s try a few examples.

Let’s put together a query that filters on values in a few columns and then selects a few columns to return:

filter(tbl,
    age > 30,
    sex == "MALE",
    sport == TRUE
) %>%
    select(income, age, marital)
# Source:   SQL [?? x 3]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
   income   age marital
    <dbl> <dbl> <chr>  
 1   2000    47 MARRIED
 2   1200    56 MARRIED
 3   1260    86 MARRIED
 4   1400    59 MARRIED
 5   1100    78 MARRIED
 6   1500    34 MARRIED
 7     NA    44 SINGLE 
 8   1500    50 MARRIED
 9     NA    46 MARRIED
10     NA    56 MARRIED
# … with more rows

Again, running the query without collecting does not return the full query result, but the preview can help you check what your query is doing.

filter(tbl,
       age > 30,
       sex == "MALE",
       sport == TRUE) %>%
    select(income, age, marital) %>%
    show_query()
<SQL>
SELECT `income`, `age`, `marital`
FROM `synthpop`
WHERE (`age` > 30.0) AND (`sex` = 'MALE') AND (`sport` = 1)

And adding show_query() to the end of the pipe shows the SQL translation of the query.

Query 1

Let’s try another one:

filter(tbl,
       age > 50L & age < 60L, 
       income < 300) %>%
    arrange(bmi, age, income, nociga, sex) 
# Source:     SQL [?? x 12]
# Database:   sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
# Ordered by: bmi, age, income, nociga, sex
   sex     age agegr socprof     income marital edu   sport smoke nociga alcab…¹
   <chr> <dbl> <chr> <chr>        <dbl> <chr>   <chr> <int> <int>  <dbl>   <int>
 1 MALE     51 45-59 OTHER ECON…    200 MARRIED VOCA…     1     0     NA       0
 2 MALE     51 45-59 OTHER ECON…    200 MARRIED VOCA…     1     0     NA       0
 3 MALE     51 45-59 OTHER ECON…    200 DIVORC… VOCA…     1     0     NA       0
 4 MALE     51 45-59 OTHER ECON…    200 MARRIED VOCA…     0     0     NA       0
 5 MALE     51 45-59 OTHER ECON…    200 LEGALL… VOCA…     1     1     10       0
 6 MALE     51 45-59 OTHER ECON…    200 SINGLE  PRIM…     1     1     10       0
 7 MALE     51 45-59 OTHER ECON…    200 LEGALL… VOCA…     1     1     10       1
 8 MALE     51 45-59 OTHER ECON…    200 LEGALL… VOCA…     1     1     20       1
 9 MALE     51 45-59 OTHER ECON…    200 DIVORC… VOCA…     1     1     20       1
10 MALE     51 45-59 OTHER ECON…    200 LEGALL… PRIM…     1     1     20       1
# … with more rows, 1 more variable: bmi <dbl>, and abbreviated variable name
#   ¹​alcabuse
filter(tbl,
       age > 50L & age < 60L, 
       income < 300) %>%
    arrange(bmi, age, income, nociga, sex) %>%
    show_query()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

Let’s wrap this query in a function we can use to benchmark how long it takes to execute.

query_1 <- function(tbl) {
    filter(tbl,
       age > 50L & age < 60L, 
       income < 300) %>%
    arrange(bmi, age, income, nociga, sex)
}

Query 2

Let’s put together one more example to use for benchmarking which includes some aggregating and arithmetic functions.

filter(tbl,
       age > 65L,
       sex == "MALE",
       sport == TRUE,
       !is.na(income),
       !is.na(marital)) %>%
    group_by(marital) %>%
    summarise(min_income = min(income),
              max_income = max(income),
              mean_income = mean(income))
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
# Source:   SQL [6 x 4]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
  marital            min_income max_income mean_income
  <chr>                   <dbl>      <dbl>       <dbl>
1 DE FACTO SEPARATED        300       7000       1227.
2 DIVORCED                  300      10000       1403.
3 LEGALLY SEPARATED         300       1350        969.
4 MARRIED                   158      10000       1323.
5 SINGLE                    300      10000       1334.
6 WIDOWED                   158      10000       1485.

Let’s look at the SQL translation:

filter(tbl,
       age > 65L,
       sex == "MALE",
       sport == TRUE,
       !is.na(income),
       !is.na(marital)) %>%
    group_by(marital) %>%
    summarise(min_income = min(income),
              max_income = max(income),
              mean_income = mean(income)) %>%
    show_query()
<SQL>
SELECT
  `marital`,
  MIN(`income`) AS `min_income`,
  MAX(`income`) AS `max_income`,
  AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
  (`age` > 65) AND
  (`sex` = 'MALE') AND
  (`sport` = 1) AND
  (NOT((`income` IS NULL))) AND
  (NOT((`marital` IS NULL)))
GROUP BY `marital`

And again, wrap it in a function:

query_2 <- function(tbl) {
    filter(tbl,
    age > 65L,
       sex == "MALE",
       sport == TRUE,
       !is.na(income),
       !is.na(marital)) %>%
    group_by(marital) %>%
    summarise(min_income = min(income),
              max_income = max(income),
              mean_income = mean(income)) %>%
        arrange(marital)
}

OK, let’s now run some starting benchmarks against running the same query on the data in memory:

Query 1
bench::mark(
    df = query_1(data),
    sqlite = query_1(tbl) %>%
        collect(),
    check = FALSE
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df            121ms    121ms      8.30  422.77MB     33.2
2 sqlite        606ms    606ms      1.65    3.49MB      0  

Note I’ve turned off checking for this benchmark because of the difference in how dplyr handles NAs when arranging data.frames (NAs at the end) vs how SQLite’s engine does (NAs at the top).

Query 2
bench::mark(
    df = query_2(data),
    sqlite = query_2(tbl) %>%
        collect()
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df            787ms    787ms      1.27     435MB    10.2 
2 sqlite        681ms    681ms      1.47     567KB     1.47

In this first test of performance, the database comes out slower. That shouldn’t surprise us though. Working with in-memory data (provided there is still enough memory left for the computation) will always be faster because, once the data has been loaded, there is no I/O cost to the query, whereas executing and collecting the query from the database involves reading the data from disk. We can see though that working with a database is much more memory efficient, which, given that the topic of this chapter is working with data that does not fit into memory, shows it is a good approach for this use case.

Indexing

Indexes are a way to improve the performance of your read queries, particularly ones with filters (WHERE) on them. They’re data structures that exist in your database engine, outside of whatever table they work with, that point to data you’re trying to query.

They work similarly to the index at the back of a book. They contain the ordered values of the column(s) you create them on, along with information about the location of the rows containing each value in the original table.

So just like you might use an index to find a recipe instead of flicking through an entire recipe book, database indexes allow you to look up the values in columns and the location of the rows containing them in your original table without scanning the full table. A well crafted index can produce impressive query speed ups!

This does come at a cost. Indexes take up space within your database, increasing its overall size, and they also slow down updates to any tables containing them, as the indexes must be updated too. Crafting indexes is also a bit of an art, as creating an index that speeds up a given query might actually slow another one down!

The details of a good indexing strategy are a big topic, well beyond the scope of this workshop.

So that we can benchmark against the non-indexed version, let’s first make a copy of our database to create indexes in.

file.copy(
    from = "data/db.sqlite",
    to = "data/db_idx.sqlite"
)
[1] TRUE

Let’s connect to the database we’re going to index as well as create a connection to the "synthpop" table.

con_idx <- dbConnect(RSQLite::SQLite(), "data/db_idx.sqlite")

tbl_idx <- tbl(con_idx, "synthpop")

Now let’s create our first index to try and improve the performance of the filter (WHERE) operation in query 1.

Query 1

Let’s remind ourselves what the query is actually doing. This time we’ll use another dplyr function, explain().

query_1(tbl_idx) %>%
  explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused                       detail
1  3      0       0                SCAN synthpop
2 33      0       0 USE TEMP B-TREE FOR ORDER BY

explain() translates to the EXPLAIN QUERY PLAN command in SQLite databases. It includes the SQL translation of our query but is used primarily to obtain a high-level description of the strategy or plan that SQLite uses to implement a specific SQL query. Most significantly, EXPLAIN QUERY PLAN reports on the way in which the query uses database indices. The relevant information is found in the detail column of the bottom table of the output.

The output of piping query 1 into explain() indicates that the SQLite engine is using a full scan of the "synthpop" table to locate the rows matching our filter (WHERE) condition. It then uses a temporary sorting B-tree for ordering the data. When you use ORDER BY without an index on the column(s) to be sorted, SQLite builds up a temporary data structure containing the sorted results each time the query is executed. That data structure uses up space in memory (or on disk, depending on the circumstances) and is torn down after the query is executed.

Tip

To find out more about the SQLite EXPLAIN QUERY PLAN command, head to the SQLite documentation.

Time to create an index. To do so we use function dbExecute() on our database connection and pass it a character string of the SQL statement to create an index.

dbExecute(con_idx,
          "CREATE INDEX synthpop_age_inc_idx ON synthpop (age, income);")
[1] 0

Let’s break down the statement:

  • CREATE INDEX is the command to create an index.

  • synthpop_age_inc_idx is the name of the index we want to create. It’s good practice to include an indication of the table as well as the columns used in the index in the index’s name.

  • ON synthpop indicates that the index is being created on table synthpop.

  • (age, income) the parentheses contain the columns we want to include in our index. Indexes can be created on one or multiple columns. Here, because our filter statement involves searching for values on both age and income, we include both columns for better performance. Note however that this inevitably takes more disk space and more time to create (and in future update) the index.
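
As an aside, if you want to check which indexes currently exist on a table, SQLite’s PRAGMA commands can be sent like any other query (a small sketch, not part of the original walkthrough):

dbGetQuery(con_idx, "PRAGMA index_list('synthpop');")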

OK, let’s check the query plan to see if SQLite plans to use our index:

query_1(tbl_idx) %>%
    explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused
1  4      0       0
2 32      0       0
                                                              detail
1 SEARCH synthpop USING INDEX synthpop_age_inc_idx (age>? AND age<?)
2                                       USE TEMP B-TREE FOR ORDER BY

Indeed! The query is not SCANning the full table anymore but is instead using our index to SEARCH for values in the index matching our filter statement.

Let’s see whether we’ve improved our query’s performance:

bench::mark(
   no_index = query_1(tbl) %>%
    collect(),
   index = query_1(tbl_idx) %>%
    collect()
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 no_index      617ms    617ms      1.62    3.44MB        0
2 index         126ms    126ms      7.89    3.44MB        0

Indeed we have, roughly a 5x speed up. Not bad! But we could do better!

Because indexes are sorted data structures and their benefit comes from how binary search works, it’s important to ensure that our indexed columns have what is called “high cardinality”. All this means is that the indexed data has a lot of uniqueness.

While our age column has 79 unique values, our income column has 406, i.e. income has higher cardinality than age.
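
These counts can be checked against the database itself, since dplyr’s n_distinct() translates to SQL’s COUNT(DISTINCT ...) (a small sketch; output not shown):

tbl_idx %>%
    summarise(n_age = n_distinct(age),
              n_income = n_distinct(income)) %>%
    collect()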

A multi-column index will initially use the first column to search, then the second and so on. So instead of putting age at the front of our index, let’s drop our first index using the DROP INDEX command and instead create a new index with income first:

dbExecute(con_idx,
          "DROP INDEX synthpop_age_inc_idx;")
[1] 0
dbExecute(con_idx,
          "CREATE INDEX synthpop_inc_age_idx ON synthpop (income, age);")
[1] 0

Let’s inspect our query plan which reveals that, indeed, our index now searches through income first:

query_1(tbl_idx) %>%
    explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused                                                      detail
1  4      0       0 SEARCH synthpop USING INDEX synthpop_inc_age_idx (income<?)
2 35      0       0                                USE TEMP B-TREE FOR ORDER BY

Let’s run our benchmarks again:

bench::mark(
   no_index = query_1(tbl) %>%
    collect(),
   index = query_1(tbl_idx) %>%
    collect()
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 no_index    622.9ms  622.9ms      1.61    3.44MB     0   
2 index        66.9ms   68.8ms     14.4     3.44MB     2.06

Much better! We’re now approaching a 10x speed up!

So, do you think we can speed up the query even more? What about the arrange part of the query?

You may have noticed that the ORDER BY part of the query is still using a temporary B-TREE.

query_1(tbl_idx) %>%
    explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused                                                      detail
1  4      0       0 SEARCH synthpop USING INDEX synthpop_inc_age_idx (income<?)
2 35      0       0                                USE TEMP B-TREE FOR ORDER BY

An index can be used to speed up sorting only if the query can return the rows in the order in which they are stored in the index. Because our index does not include many of the columns we are using in the sort operation, and most importantly not the first one (bmi), the index is ignored by ORDER BY.

We might consider creating another index to take care of the ORDER BY operation, including all the columns involved in the order that we want them sorted.

dbExecute(con_idx,
          "CREATE INDEX synthpop_arrange1_idx ON synthpop (bmi, age, income, nociga, sex);")
[1] 0

Let’s see if that improves performance:

bench::mark(
   no_index = query_1(tbl) %>%
    collect(),
   index = query_1(tbl_idx) %>%
    collect()
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 no_index      611ms    611ms      1.64    3.44MB     0   
2 index         486ms    486ms      2.06    3.44MB     2.06

Oh dear! The query is now much slower and not a huge improvement over our non-indexed database! What’s going on?

Let’s inspect our query plan:

query_1(tbl_idx) %>%
    explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused                                          detail
1  4      0       0 SCAN synthpop USING INDEX synthpop_arrange1_idx

Now what we see is that the engine is indeed using our synthpop_arrange1_idx index but is only using that one. Not only that, it is now performing a full SCAN of the arrange index table.

An important thing to note is that, in SQLite, each table in the FROM clause of a query can use at most one index and SQLite strives to use at least one index on each table. So it cannot use one index for the WHERE part of the query and another for the ORDER BY part.

In this case, the engine determines that the least costly query plan is to just use the synthpop_arrange1_idx index because all the information it needs is stored within and therefore does not require a lookup in the original synthpop table to retrieve further data. It knows the data is stored in the correct order but to perform the WHERE operation, it does need to scan the full index table.

But why does this in practice end up slower? Because the WHERE operation actually returns a much smaller subset of the data, optimising that part of the query and then using a B-tree for sorting ends up much faster in practice. However, the query optimiser has no way of knowing this upfront (and it may not be the case if the WHERE operation returned a much bigger subset), so it concludes (wrongly in our case) that using the synthpop_arrange1_idx index is most efficient.
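
As an aside (not used in the benchmarks here), SQLite’s ANALYZE command gathers statistics about tables and indexes into internal sqlite_stat tables, which gives the query planner more information to make these kinds of decisions:

dbExecute(con_idx, "ANALYZE;")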

So, at least for this query, the synthpop_arrange1_idx index does more harm than good, so let’s drop it.

dbExecute(con_idx,
          "DROP INDEX synthpop_arrange1_idx;")
[1] 0
query_1(tbl_idx) %>%
    explain()
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`

<PLAN>
  id parent notused                                                      detail
1  4      0       0 SEARCH synthpop USING INDEX synthpop_inc_age_idx (income<?)
2 35      0       0                                USE TEMP B-TREE FOR ORDER BY

Now the optimiser goes back to using the synthpop_inc_age_idx index.

Query 2

So we’ve made Query 1 faster but what about query 2?

Let’s check whether it also helps with query 2:

bench::mark(
   no_index = query_2(tbl) %>%
    collect(),
   index = query_2(tbl_idx) %>%
    collect()
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 no_index      752ms    752ms      1.33     528KB     1.33
2 index         714ms    714ms      1.40     525KB     1.40

Well that’s not good! The index seems to have made query 2 slower?! If we use explain() to dig into it we see it’s still doing a full scan, but now the optimiser also has to evaluate a potential query plan that might involve our synthpop_inc_age_idx index.

query_2(tbl_idx) %>%
    explain()
<SQL>
SELECT
  `marital`,
  MIN(`income`) AS `min_income`,
  MAX(`income`) AS `max_income`,
  AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
  (`age` > 65) AND
  (`sex` = 'MALE') AND
  (`sport` = 1) AND
  (NOT((`income` IS NULL))) AND
  (NOT((`marital` IS NULL)))
GROUP BY `marital`
ORDER BY `marital`

<PLAN>
  id parent notused                       detail
1  7      0       0                SCAN synthpop
2 21      0       0 USE TEMP B-TREE FOR GROUP BY

Let’s create an index to improve the performance of query 2. Let’s focus again on the WHERE part of the query.

We might start by creating an index using all the columns involved, in order of decreasing cardinality:

dbExecute(con_idx,
          "CREATE INDEX synthpop_inc_age_mar_sex_sp_idx ON synthpop (income, age, marital, sex, sport);")
[1] 0

Let’s check our query plan and benchmark:

query_2(tbl_idx) %>%
    explain()
<SQL>
SELECT
  `marital`,
  MIN(`income`) AS `min_income`,
  MAX(`income`) AS `max_income`,
  AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
  (`age` > 65) AND
  (`sex` = 'MALE') AND
  (`sport` = 1) AND
  (NOT((`income` IS NULL))) AND
  (NOT((`marital` IS NULL)))
GROUP BY `marital`
ORDER BY `marital`

<PLAN>
  id parent notused
1  7      0       0
2 21      0       0
                                                              detail
1 SCAN synthpop USING COVERING INDEX synthpop_inc_age_mar_sex_sp_idx
2                                       USE TEMP B-TREE FOR GROUP BY
bench::mark(
   no_index = query_2(tbl) %>%
    collect(),
   index = query_2(tbl_idx) %>%
    collect()
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 no_index      672ms    672ms      1.49     524KB     1.49
2 index         526ms    526ms      1.90     524KB     1.90

We see a small improvement. At least the query is not slower now!

Indexing Take Aways:

Indexes can be a useful strategy for improving specific query performance. HOWEVER:

  • We have only scraped the surface of the types of indexes available as well as how to determine when and how to deploy them.

  • They are fiddly to create and can have unexpected effects on different queries.

  • They take time to create and update, and take up space on disk (our indexed database is now roughly 1.24 GB compared to roughly 822 MB for our original database!). A quick way to check this yourself is sketched after this list.

  • Trying to create new indexes to optimise each new query quickly gets out of hand and requires a lot of knowledge/experimentation.
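
To check the sizes on disk yourself, the fs package (also used later in this chapter) can report file sizes (a small sketch):

fs::file_size("data/db.sqlite")      # original database
fs::file_size("data/db_idx.sqlite")  # indexed copy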

But! The next section provides some useful perspective!

DuckDB

While SQLite is ubiquitous in the world of embedded databases, and it does support complex analytical queries, it is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution.

There is, however, a more recent embedded (flat-file) database called DuckDB.

DuckDB can be far more efficient for the complex analytical queries on large amounts of data that are more common in analytics workflows.

From the DuckDB website:

DuckDB is designed to support analytical query workloads, also known as Online analytical processing (OLAP).

These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables.

Changes to the data are expected to be rather large-scale as well, with several rows being appended, or large portions of tables being changed or added at the same time.

DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially

It also has a nice R API provided through the duckdb package. I highly recommend checking the DuckDB documentation to learn more about its features, but in general you can interact with DuckDB databases in R as you would any other database.

So let’s create a DuckDB database with the same data and benchmark our queries against it.

Again we can use dbConnect() to both create a database using a duckdb::duckdb() driver and open a connection to it.

con_duckdb <- dbConnect(duckdb::duckdb(), "data/db.duckdb")
con_duckdb
<duckdb_connection fadd0 driver=<duckdb_driver 923e0 dbdir='data/db.duckdb' read_only=FALSE bigint=numeric>>
dbWriteTable(con_duckdb, "synthpop", data)
dbListTables(con_duckdb)
[1] "synthpop"
dbListFields(con_duckdb, "synthpop")
 [1] "sex"      "age"      "agegr"    "socprof"  "income"   "marital" 
 [7] "edu"      "sport"    "smoke"    "nociga"   "alcabuse" "bmi"     
tbl_duckdb <- tbl(con_duckdb, "synthpop")
tbl_duckdb
# Source:   table<synthpop> [?? x 12]
# Database: DuckDB 0.6.2-dev1166 [root@Darwin 21.4.0:R 4.2.1/data/db.duckdb]
   sex      age agegr socprof    income marital edu   sport smoke nociga alcab…¹
   <chr>  <dbl> <chr> <chr>       <dbl> <chr>   <chr> <lgl> <lgl>  <dbl> <lgl>  
 1 MALE      47 45-59 FARMER       2000 MARRIED SECO… TRUE  FALSE     NA FALSE  
 2 FEMALE    43 35-44 OTHER ECO…     NA MARRIED POST… FALSE FALSE     NA FALSE  
 3 MALE      26 25-34 EMPLOYED …   1400 SINGLE  SECO… FALSE TRUE      15 FALSE  
 4 FEMALE    51 45-59 EMPLOYED …     NA DIVORC… POST… FALSE FALSE     NA FALSE  
 5 FEMALE    67 65+   LONG-TERM…    750 WIDOWED VOCA… TRUE  TRUE      20 FALSE  
 6 MALE      56 45-59 UNEMPLOYED   1200 MARRIED POST… TRUE  FALSE     NA FALSE  
 7 MALE      86 65+   RETIRED      1260 MARRIED SECO… TRUE  FALSE     NA FALSE  
 8 MALE      59 45-59 LONG-TERM…   1400 MARRIED POST… TRUE  FALSE     NA FALSE  
 9 FEMALE    52 45-59 EMPLOYED …   1500 MARRIED SECO… TRUE  TRUE      30 TRUE   
10 FEMALE    22 16-24 PUPIL OR …    600 SINGLE  SECO… FALSE FALSE     NA FALSE  
# … with more rows, 1 more variable: bmi <dbl>, and abbreviated variable name
#   ¹​alcabuse

Benchmark Queries

Now let’s go ahead and run our queries again, this time including running them on the duckdb database we just created.

bench::mark(
    df = query_1(data),
    sqlite = query_1(tbl) %>%
        collect(),
    sqlite_idx = query_1(tbl_idx) %>%
        collect(),
    duckdb = query_1(tbl_duckdb) %>%
        collect(),
    check = FALSE
)
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df            117ms  119.7ms      7.99  269.51MB     2.66
2 sqlite        635ms    635ms      1.57    3.45MB     0   
3 sqlite_idx     65ms   65.4ms     15.3     3.44MB     2.18
4 duckdb         89ms   89.8ms     11.1     1.65MB     2.22
bench::mark(
    df = query_2(data),
    sqlite = query_2(tbl) %>%
        collect(),
    sqlite_idx = query_2(tbl_idx) %>%
        collect(),
    duckdb = query_2(tbl_duckdb) %>%
        collect()
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 4 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df          919.4ms  919.4ms      1.09     435MB     9.79
2 sqlite      663.9ms  663.9ms      1.51     529KB     0   
3 sqlite_idx  541.4ms  541.4ms      1.85     527KB     1.85
4 duckdb       94.8ms   96.4ms     10.4      861KB    10.4 

Wow! DuckDB is much faster than SQLite, can compete with and beat an indexed SQLite database and can be even faster than running the queries on in-memory data!! And still very memory efficient. And all this without even having to think about indexes!! 🤩 🎉

This is definitely a database type you should know about!

Accessing data through the arrow package

Arrow is a software development platform for building high performance applications. As we’ve seen already, the arrow R package provides functionality for fast reading and writing of flat files as well as more efficient binary file formats.

It also provides functions for opening connections to files as well as directories of files, much like we did with databases, and because it has deep integration with dplyr, it allows us to perform queries on out of memory data as we’ve been doing with our databases.

Accessing single files as arrow tables

We can read in a single large csv, arrow or parquet file using the appropriate arrow function, but instead of reading it in as a data.frame we can use as_data_frame = FALSE to open it as an arrow table. Because of how Arrow allocates memory, arrow tables are a much more memory efficient representation of tabular data, which means data that won’t fit into memory as a data.frame may still fit as an arrow table.

arrow_tbl <- arrow::read_parquet("data/synthpop_10000000.parquet", 
                      as_data_frame = FALSE)


arrow_tbl
Table
10000000 rows x 13 columns
$sex <string>
$age <int32>
$agegr <string>
$socprof <string>
$income <int32>
$marital <string>
$edu <string>
$sport <bool>
$smoke <bool>
$nociga <int32>
$alcabuse <bool>
$bmi <double>
$location <string>

Many dplyr verbs can be used to interrogate this arrow table. To demonstrate, let’s execute query 1 on our data.

query_1(arrow_tbl)
Table (query)
sex: string
age: int32
agegr: string
socprof: string
income: int32
marital: string
edu: string
sport: bool
smoke: bool
nociga: int32
alcabuse: bool
bmi: double
location: string

* Filter: (((age > 50) and (age < 60)) and (income < 300))
* Sorted by bmi [asc], age [asc], income [asc], nociga [asc], sex [asc]
See $.data for the source Arrow object

Just like with databases, the query does not return a tibble. We again need to collect() the results of our query for them to be converted to a tibble:

query_1(arrow_tbl) %>%
    collect()
# A tibble: 15,427 × 13
   sex      age agegr socprof income marital edu      sport smoke nociga alcab…¹
   <chr>  <int> <chr> <chr>    <int> <chr>   <chr>    <lgl> <lgl>  <int> <lgl>  
 1 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       5 FALSE  
 2 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       8 FALSE  
 3 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       8 FALSE  
 4 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       8 FALSE  
 5 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       8 FALSE  
 6 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE       8 FALSE  
 7 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE      10 FALSE  
 8 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE      10 FALSE  
 9 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE      10 FALSE  
10 FEMALE    51 45-59 FARMER     250 MARRIED VOCATIO… TRUE  TRUE      10 FALSE  
# … with 15,417 more rows, 2 more variables: bmi <dbl>, location <chr>, and
#   abbreviated variable name ¹​alcabuse

Given that the arrow_tbl is actually in memory, we can compare query execution time to that on the in-memory data.frame:

bench::mark(in_mem_csv = query_1(data),
            arrow_tbl = query_1(arrow_tbl) %>%
    collect(),
    check = FALSE)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 in_mem_csv    117ms    117ms      8.54     270MB    25.6 
2 arrow_tbl    33.8ms   34.7ms     27.9      282KB     2.15

Wow! That’s much faster than performing the query even on an in-memory data.frame. Impressive!

Accessing data as arrow datasets

Another way to access files through R is by opening them as a dataset with function arrow::open_dataset().

We can open a single file or a whole directory of files, formatted in any of the formats arrow can handle.

This does not load the data into memory. Instead open_dataset() scans the content of the file(s) to identify the name of the columns and their data types.

Accessing single files as arrow datasets

Let’s open a single file as a dataset first. To do so we supply the path to the file as well as the format it’s stored in.

arrow_dt_csv <- arrow::open_dataset("data/synthpop_10000000.parquet", format = "parquet")
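
Because only metadata has been scanned, we can inspect what open_dataset() picked up without pulling any data into memory (a small aside):

arrow_dt_csv$schema    # column names and Arrow data types inferred from the file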


bench::mark(
    df = query_1(data),
    sqlite = query_1(tbl) %>%
        collect(),
    duckdb = query_1(tbl_duckdb) %>%
        collect(),
    arrow_dt_csv = query_1(arrow_dt_csv) %>%
    collect(),
    check = FALSE
)
# A tibble: 4 × 6
  expression        min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df            117.8ms    121ms      8.30  269.51MB     8.30
2 sqlite        611.8ms    612ms      1.63    3.44MB     0   
3 duckdb         87.2ms     88ms     11.4     1.58MB     2.27
4 arrow_dt_csv  329.3ms    346ms      2.89  279.69KB     0   

Accessing directories as arrow datasets

We can also use the same function to open a directory of files stored in the same format. This might be appropriate when your data generation involves creating data in batches that end up in separate files and for some reason you don’t want to be writing them to a database.

The directory structure can help improve performance of queries too depending on how data is partitioned across directories. In some ways you can think of the physical partitioning as a physical index that can be used in a query to completely skip certain files.

Let’s have a look at what this means by actually creating such a directory structure from our dataset.

We can use the function arrow::write_dataset() to write our data out to a directory (data/arrow_dataset), partitioned according to any variables we specify in the partitioning argument. Here we choose to partition by age. Let’s also write the data out as efficient parquet files.

arrow::write_dataset(data, 
                     path = "data/arrow_dataset",
                     format = "parquet",
                     partitioning = "age")

Let’s use fs::dir_tree() to see the structure of the arrow_dataset directory we just created:

fs::dir_tree("data/arrow_dataset/")
data/arrow_dataset/
├── age=16
│   └── part-0.parquet
├── age=17
│   └── part-0.parquet
├── age=18
│   └── part-0.parquet
├── age=19
│   └── part-0.parquet
├── age=20
│   └── part-0.parquet
├── age=21
│   └── part-0.parquet
├── age=22
│   └── part-0.parquet
├── age=23
│   └── part-0.parquet
├── age=24
│   └── part-0.parquet
├── age=25
│   └── part-0.parquet
├── age=26
│   └── part-0.parquet
├── age=27
│   └── part-0.parquet
├── age=28
│   └── part-0.parquet
├── age=29
│   └── part-0.parquet
├── age=30
│   └── part-0.parquet
├── age=31
│   └── part-0.parquet
├── age=32
│   └── part-0.parquet
├── age=33
│   └── part-0.parquet
├── age=34
│   └── part-0.parquet
├── age=35
│   └── part-0.parquet
├── age=36
│   └── part-0.parquet
├── age=37
│   └── part-0.parquet
├── age=38
│   └── part-0.parquet
├── age=39
│   └── part-0.parquet
├── age=40
│   └── part-0.parquet
├── age=41
│   └── part-0.parquet
├── age=42
│   └── part-0.parquet
├── age=43
│   └── part-0.parquet
├── age=44
│   └── part-0.parquet
├── age=45
│   └── part-0.parquet
├── age=46
│   └── part-0.parquet
├── age=47
│   └── part-0.parquet
├── age=48
│   └── part-0.parquet
├── age=49
│   └── part-0.parquet
├── age=50
│   └── part-0.parquet
├── age=51
│   └── part-0.parquet
├── age=52
│   └── part-0.parquet
├── age=53
│   └── part-0.parquet
├── age=54
│   └── part-0.parquet
├── age=55
│   └── part-0.parquet
├── age=56
│   └── part-0.parquet
├── age=57
│   └── part-0.parquet
├── age=58
│   └── part-0.parquet
├── age=59
│   └── part-0.parquet
├── age=60
│   └── part-0.parquet
├── age=61
│   └── part-0.parquet
├── age=62
│   └── part-0.parquet
├── age=63
│   └── part-0.parquet
├── age=64
│   └── part-0.parquet
├── age=65
│   └── part-0.parquet
├── age=66
│   └── part-0.parquet
├── age=67
│   └── part-0.parquet
├── age=68
│   └── part-0.parquet
├── age=69
│   └── part-0.parquet
├── age=70
│   └── part-0.parquet
├── age=71
│   └── part-0.parquet
├── age=72
│   └── part-0.parquet
├── age=73
│   └── part-0.parquet
├── age=74
│   └── part-0.parquet
├── age=75
│   └── part-0.parquet
├── age=76
│   └── part-0.parquet
├── age=77
│   └── part-0.parquet
├── age=78
│   └── part-0.parquet
├── age=79
│   └── part-0.parquet
├── age=80
│   └── part-0.parquet
├── age=81
│   └── part-0.parquet
├── age=82
│   └── part-0.parquet
├── age=83
│   └── part-0.parquet
├── age=84
│   └── part-0.parquet
├── age=85
│   └── part-0.parquet
├── age=86
│   └── part-0.parquet
├── age=87
│   └── part-0.parquet
├── age=88
│   └── part-0.parquet
├── age=89
│   └── part-0.parquet
├── age=90
│   └── part-0.parquet
├── age=91
│   └── part-0.parquet
├── age=92
│   └── part-0.parquet
├── age=96
│   └── part-0.parquet
└── age=97
    └── part-0.parquet

As we can see, a folder has been created for each value of age, and the rows of the original data matching each value are contained in parquet files within.

The partitioned dataset directory is nonetheless still much smaller on disk than the original csv.

# parquet arrow data set
fs::dir_info("data/arrow_dataset", recurse = TRUE)$size %>% sum()
140M
# original csv
fs::file_size("data/synthpop_10000000.csv")
NA

Now that we’ve got a partitioned directory of our data, let’s go ahead and open it as an arrow dataset.

arrow_dir_dataset <- arrow::open_dataset("data/arrow_dataset", format = "parquet")
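
Because the dataset is partitioned by age, a query that filters on age can skip all the non-matching sub-directories entirely rather than scanning every file. A sketch of such a query (not benchmarked separately here):

arrow_dir_dataset %>%
    filter(age > 50L, age < 60L) %>%   # only the age=51 to age=59 partitions need to be read
    collect()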

Summary Benchmarks

bench::mark(
    df = query_1(data),
    sqlite = query_1(tbl) %>%
        collect(),
    sqlite_idx = query_1(tbl_idx) %>%
        collect(),
    duckdb = query_1(tbl_duckdb) %>%
        collect(),
    arrow_tbl = query_1(arrow_tbl) %>%
    collect(),
    arrow_csv_dataset = query_1(arrow_dt_csv) %>%
    collect(),
    arrow_dir_dataset = query_1(arrow_dir_dataset) %>%
        collect(),
    check = FALSE
)
# A tibble: 7 × 6
  expression             min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df                 117.2ms  117.3ms      8.53  269.51MB     8.53
2 sqlite             609.7ms  609.7ms      1.64    3.44MB     0   
3 sqlite_idx          65.7ms   67.2ms     14.6     3.44MB     2.09
4 duckdb              87.2ms   87.9ms     11.4     1.58MB     5.69
5 arrow_tbl           34.5ms   35.2ms     28.1   281.53KB     5.11
6 arrow_csv_dataset  346.7ms  366.9ms      2.73  267.62KB     0   
7 arrow_dir_dataset  653.8ms  653.8ms      1.53  426.72KB     0   
bench::mark(
    df = query_2(data),
    sqlite = query_2(tbl) %>%
        collect(),
    sqlite_idx = query_2(tbl_idx) %>%
        collect(),
    duckdb = query_2(tbl_duckdb) %>%
        collect(),
    arrow_tbl = query_2(arrow_tbl) %>%
    collect(),
    arrow_csv_dataset = query_2(arrow_dt_csv) %>%
    collect(),
    arrow_dir_dataset = query_2(arrow_dir_dataset) %>%
        collect(),
    check = FALSE
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 7 × 6
  expression             min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 df                   801ms    801ms      1.25     435MB    13.7 
2 sqlite             669.2ms  669.2ms      1.49     530KB     1.49
3 sqlite_idx         527.5ms  527.5ms      1.90     528KB     1.90
4 duckdb              94.7ms   95.5ms     10.3      856KB    12.1 
5 arrow_tbl           59.2ms   62.7ms     15.7      762KB     5.88
6 arrow_csv_dataset  221.1ms    223ms      4.48     189KB     1.49
7 arrow_dir_dataset  798.2ms  798.2ms      1.25     222KB     0   

Overall Take Aways

  • DuckDB can be a very efficient database format for complex queries involving large amounts of data due to its OLAP-oriented, columnar-vectorised execution engine.

  • Indexing can improve queries in SQLite and other OLTP-type databases. However, indexes are not flexible, take a lot of knowledge and experimentation to craft, increase disk usage and can even reduce performance on other queries if mis-applied.

  • The arrow package provides another option for loading or opening connections to files or directories of data and has deep integration with dplyr for performing queries.

  • Partitioning can improve querying of directories of data opened as arrow datasets. Partitions are however inflexible and represent a single physical index applied to the whole dataset.

  • Arrow tables allow loading large datasets in a more memory efficient way and support very fast querying.

Batch processing

In the previous sections we were focusing on a narrow set of operations, in particular the efficiency of accessing, filtering, selecting, ordering and aggregating subsets of data from data too large to fit into memory. But often we need to perform some processing on the whole dataset, as we saw in the example of populating our database in batches.

Other times our analyses, for example fitting a model, might require the full dataset to produce a result, which can be problematic even if we can just about load our data into memory, as that may leave us with little RAM for the computation itself.

An option in this case would be to use algorithms that can compute on chunks or batches of the data. These algorithms are known as external memory algorithms (EMA), or batch processing.

Here’s a simple example of how we might write a batch processing algorithm to calculate a mean across multiple files, specifically the parquet files we just created in data/arrow_dataset/.

batch_mean <- function(file_name) {
    dat <- arrow::read_parquet(file_name, col_select = "income")
    
    income_vct <- na.omit(dat[["income"]])
    
    c(mean = mean(income_vct),
      n = length(income_vct))
}

In this function we are given a file name. For each file, we load only the column we are interested in (income), remove NAs and calculate the mean. To be able to aggregate the mean across all files, we also record n, the number of values used to calculate the mean.

We can then apply the function to a list of file names and aggregate the results into a tibble using purrr::map_dfr().

file_names <- fs::dir_ls("data/arrow_dataset", 
                         recurse = TRUE,
                         type = "file")

means <- purrr::map_dfr(file_names, ~batch_mean(.x))
means
# A tibble: 79 × 2
    mean      n
   <dbl>  <dbl>
 1  543.   5138
 2  530.  17675
 3  548.  20947
 4  678.  21002
 5  770.  17801
 6 1109.  71922
 7 1107.  89002
 8 1275.  95972
 9 1411.  92471
10 1555. 115432
# … with 69 more rows

Now that we’ve got our batched means we can calculate a weighted mean, using the n column as the weight, which indeed gives us the same value as calculating the mean on the whole dataset.

weighted.mean(x = means$mean, w = means$n)
[1] 1639.416
mean(data$income, na.rm = TRUE)
[1] 1639.416
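
Equivalently, the overall mean is just the sum of each batch mean multiplied by its count, divided by the total count, which is exactly what weighted.mean() computes:

sum(means$mean * means$n) / sum(means$n)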

Specialised R packages

There are a number of R packages that provide EMA solutions for analysing bigger-than-memory data.

For example, the function biglm from package biglm allows for fitting a linear model in batches.

In the following example from the package documentation, a linear model is fitted initially to a small subset of the data with the function biglm. The model is subsequently updated with additional chunks of data using update().

data(trees)
ff<-log(Volume)~log(Girth)+log(Height)

chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]

library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)

coef(a)
(Intercept)  log(Girth) log(Height) 
  -6.631617    1.982650    1.117123 

The R packages available are numerous and their suitability varies according to the data and analysis you need to perform.

  • bigmemory: Manage Massive Matrices with Shared Memory and Memory-Mapped Files. The package uses memory mapping, where RAM addresses are mapped to a file on disk. While inevitably reducing performance, this extends the memory available for computation to memory on disk.

    • A number of analytics packages build on bigmemory including:

      • bigtabulate: Extend the bigmemory package with ‘table’, ‘tapply’, and ‘split’ support for ‘big.matrix’ objects.

      • bigalgebra: For matrix operation.

      • biganalytics: Extend the ‘bigmemory’ package with various analytics, e.g. bigkmeans.

      • bigFastlm: for (fast) linear models.

      • biglasso: extends lasso and elastic nets.

      • GHap: Haplotype calling from phased SNP data.

      • oem: Penalized Regression for Big Tall Data.

      • bigstep: Uses the bigmemory framework to perform stepwise model selection, when the data cannot fit into RAM.

  • ff: The ff package provides data structures that are stored on disk in a binary format but behave (almost) as if they were in RAM, by transparently mapping only a section (pagesize) into main memory. These data structures lend themselves to efficient chunking. Unlike bigmemory, which only supports numeric data types, ff supports all of R’s vector types including factors (to which any character data is converted for memory efficiency). A short sketch follows this list.

    • ffbase: extends the ff package with a number of methods for working with ff objects.

    • Package biglm also has methods for ff type objects so is not limited to fitting on numeric data.
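
As a rough sketch of what getting started with ff might look like (assuming the 1,000,000-row csv from the Data Input/Output section exists at this path; read.csv.ffdf reads the file in chunks and stores the result on disk):

library(ff)

synth_ff <- read.csv.ffdf(file = "data/write/synthpop_1000000.csv",
                          next.rows = 100000)  # read and append 100,000 rows at a time
dim(synth_ff)  # behaves much like a data.frame, but lives on disk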

A good place to find up to date information on available packages is the CRAN Task View on High-Performance and Parallel Computing with R, especially the section on Large memory and out-of-memory data.

Note

I should acknowledge that this brief section is a summary of the chapter on Efficient Memory from BGU’s Department of Industrial Engineering and Management “R” course by Jonathan D. Rosenblatt. For more information I highly recommend reviewing it, as well as the chapter on Sparse Representations.