We’ve looked at approaches for working with in-memory data more efficiently. But what if the data we want to work with simply cannot fit into memory?
There are a number of approaches for dealing with this situation, and the right one depends on what exactly we need from the data to perform our operation.
For example:
We might need to process the whole dataset but can somehow split our computation to work on smaller chunks of the data, i.e. batch processing. Perhaps our data is split across many individual csv files, in which case we could write a function that works on a single file at a time and use an apply function to process all files and aggregate the results. This sort of processing is highly amenable to parallelisation.
We might need a subset of our data which we define through filtering, selecting and other aggregating functions. In this situation, storing our data in a database and using SQL to query it for the subset of data of interest is our best option.
A harder problem is when we indeed require all data in memory. Here our choice might require using distributed memory between many machines (e.g. on an HPC platform using MPI) as well as considering options like using single precision floats and mathematical optimisations of our algorithms.
Databases
Databases are an appropriate choice if you have large amounts of data that can’t fit into memory but from which you only require subsets that can be extracted using queries.
There are many types of databases, most of which are beyond the scope of this workshop. What we will focus on here is simple relational databases that store tabular data in single flat files (a.k.a. embedded databases), as opposed to databases run through a server like MySQL, Microsoft SQL Server or PostgreSQL, or databases that do not store tabular data, for example MongoDB, which stores data as documents.
We also focus on databases that can be queried using SQL. SQL (which stands for Structured Query Language) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them.
It’s good to have an idea of SQL basics when interacting with databases, but in R, many of the dplyr verbs are inspired by SQL commands, while the dbplyr package can take dplyr operations and translate them to SQL, letting you query databases much as you would data.frames or tibbles.
As such we can build up our queries using a connection to a database and only collect our data explicitly when we are ready for R to execute the query.
Let’s start our experimentation by creating a simple SQLite database with a single table.
The data we will use is contained in data/synthpop_10000000.parquet and represents characteristics of 10,000,000 individuals from a synthetic population.
I know this section is about data too big for memory, and this is not an approach you can use for data truly larger than memory. But because I want to benchmark against in-memory data and it fits into memory on my machine, I will actually load it into memory and write it directly to the database. However, I also show a way to populate the database in batches if you find the file is overloading your memory. You can also choose to use one of the smaller synthpop parquet files (e.g. data/synthpop_1000000.parquet, which contains 1,000,000 rows).
sex age agegr socprof income marital
1 MALE 47 45-59 FARMER 2000 MARRIED
2 FEMALE 43 35-44 OTHER ECONOMICALLY INACTIVE NA MARRIED
3 MALE 26 25-34 EMPLOYED IN PRIVATE SECTOR 1400 SINGLE
4 FEMALE 51 45-59 EMPLOYED IN PUBLIC SECTOR NA DIVORCED
5 FEMALE 67 65+ LONG-TERM SICK/DISABLED 750 WIDOWED
6 MALE 56 45-59 UNEMPLOYED 1200 MARRIED
edu sport smoke nociga alcabuse bmi location
1 SECONDARY TRUE FALSE NA FALSE 27.45982 Birmingham
2 POST-SECONDARY OR HIGHER FALSE FALSE NA FALSE 22.30815 Grimsby
3 SECONDARY FALSE TRUE 15 FALSE 20.58967 Liverpool
4 POST-SECONDARY OR HIGHER FALSE FALSE NA FALSE 28.93407 Colchester
5 VOCATIONAL/GRAMMAR TRUE TRUE 20 FALSE 27.34375 Caerdydd
6 POST-SECONDARY OR HIGHER TRUE FALSE NA FALSE 25.71166 Welwyn
Now let’s go ahead and write our data to a table in our database. For this we can use DBI’s function dbWriteTable, which will both create the table in our database and write the data to it (see the sketch after the argument list below). The arguments we need to provide are:
conn (the first argument) where we provide the connection to the database we want to write to.
name the name of the table we want to create.
value the object containing the data we want to write to the table. This must be a data.frame or an object coercible to a data.frame.
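Putting these arguments together, a minimal sketch of creating and populating the table might look like this (the object names con and data are assumptions, as is reading the full parquet file into memory first):

library(DBI)

# open (or create) the SQLite database file and connect to it
con <- dbConnect(RSQLite::SQLite(), "data/db.sqlite")

# load the synthetic population into memory and write it to a "synthpop" table
data <- arrow::read_parquet("data/synthpop_10000000.parquet")
dbWriteTable(con, name = "synthpop", value = data)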
If the data is too big to load into memory and then write to a database, an option is to populate it in chunks. This involves using readr::read_csv_chunked to populate a database in batches detailed in the following blogpost by Michael Culshaw-Maurer.
Here’s some example code of how we could populate our database from the 1,000,000 row csv file we created in the Data Input/Output section:
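The code below is a hedged sketch of that approach rather than the exact code from the blogpost: readr reads the csv in chunks and a side-effect callback appends each chunk to the synthpop table. The csv file name and chunk size are assumptions.

library(DBI)
library(readr)

con <- dbConnect(RSQLite::SQLite(), "data/db.sqlite")

# append each chunk to the "synthpop" table as it is read from disk
append_chunk <- function(chunk, pos) {
  dbWriteTable(con, "synthpop", chunk, append = TRUE)
}

read_csv_chunked(
  "data/synthpop_1000000.csv",
  callback = SideEffectChunkCallback$new(append_chunk),
  chunk_size = 100000
)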
dbplyr is the database backend for dplyr. It allows you to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL.
All dplyr calls are evaluated lazily, generating SQL that is only sent to the database when you request the data.
So let’s start using our connection to access some data. For that we can use function tbl(). Just as dbConnect() opens a connection to a database, we can think of tbl() as opening a connection to a single table in the database, in this case `“synthpop”`.
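For example (a minimal sketch, assuming the connection object created earlier is called con):

tbl <- dplyr::tbl(con, "synthpop")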
If we print tbl we can see all columns in the database and the first 10 rows, which looks a bit like printing a tibble, but if we look at the header of information above the data, we can see the database source as well as [?? x 12] in the dimensions summary. That’s because tbl does not contain the full table data, just a connection to it, and therefore is not aware of the number of rows of the complete table.
tbl
# Source: table<synthpop> [?? x 12]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
sex age agegr socprof income marital edu sport smoke nociga alcab…¹
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <int> <int> <dbl> <int>
1 MALE 47 45-59 FARMER 2000 MARRIED SECO… 1 0 NA 0
2 FEMALE 43 35-44 OTHER ECO… NA MARRIED POST… 0 0 NA 0
3 MALE 26 25-34 EMPLOYED … 1400 SINGLE SECO… 0 1 15 0
4 FEMALE 51 45-59 EMPLOYED … NA DIVORC… POST… 0 0 NA 0
5 FEMALE 67 65+ LONG-TERM… 750 WIDOWED VOCA… 1 1 20 0
6 MALE 56 45-59 UNEMPLOYED 1200 MARRIED POST… 1 0 NA 0
7 MALE 86 65+ RETIRED 1260 MARRIED SECO… 1 0 NA 0
8 MALE 59 45-59 LONG-TERM… 1400 MARRIED POST… 1 0 NA 0
9 FEMALE 52 45-59 EMPLOYED … 1500 MARRIED SECO… 1 1 30 1
10 FEMALE 22 16-24 PUPIL OR … 600 SINGLE SECO… 0 0 NA 0
# … with more rows, 1 more variable: bmi <dbl>, and abbreviated variable name
# ¹alcabuse
Let’s have a look at the tbl class. The important class identifiers are "tbl_dbi" and "tbl_sql" which indicate any data manipulation on the tbl will be translated to SQL, will be lazy and will return another "tbl_dbi", not the actual result of the query.
# Source: SQL [1 x 1]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
n
<int>
1 10000000
summarise(n()) gets translated to the SQL COUNT function, which is an SQL aggregate function that returns a single value, hence what is returned to us is another tbl_dbi of 1 x 1 dimensions.
Note
To learn more about which R functions are translated by dbplyr to SQL have a look at the package’s vignettes on Verb and Function translation.
We can inspect the SQL statement generated by dbplyr by piping the query to dplyr’s show_query() function.
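For example, to see the SQL generated for the row count above, we could pipe the same query into show_query() (a sketch, using the tbl object from earlier):

library(dplyr)

tbl %>%
  summarise(n = n()) %>%
  show_query()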
We have now checked that our data was fully written to our database table.
Filtering, selecting and summarising
As mentioned, many of dplyr’s verbs as well as a number of aggregating and arithmetic functions can be translated to SQL by dbplyr. For greatest efficiency it’s good to try and perform as many operations as possible in SQL before collecting the data. These operations are performed by the database’s SQL engine, which is generally more efficient when working with large data.
Let’s try a few examples.
Let’s put together a query that filters values in a few columns and then selects a few columns to return:
# Source: SQL [?? x 3]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
income age marital
<dbl> <dbl> <chr>
1 2000 47 MARRIED
2 1200 56 MARRIED
3 1260 86 MARRIED
4 1400 59 MARRIED
5 1100 78 MARRIED
6 1500 34 MARRIED
7 NA 44 SINGLE
8 1500 50 MARRIED
9 NA 46 MARRIED
10 NA 56 MARRIED
# … with more rows
Again, running the query without collecting does not return the full query result but can help check what your query is doing.
# Source: SQL [?? x 12]
# Database: sqlite 3.40.0 [/Users/Anna/Documents/workflows/OHID/optimise-r/data/db.sqlite]
# Ordered by: bmi, age, income, nociga, sex
sex age agegr socprof income marital edu sport smoke nociga alcab…¹
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <int> <int> <dbl> <int>
1 MALE 51 45-59 OTHER ECON… 200 MARRIED VOCA… 1 0 NA 0
2 MALE 51 45-59 OTHER ECON… 200 MARRIED VOCA… 1 0 NA 0
3 MALE 51 45-59 OTHER ECON… 200 DIVORC… VOCA… 1 0 NA 0
4 MALE 51 45-59 OTHER ECON… 200 MARRIED VOCA… 0 0 NA 0
5 MALE 51 45-59 OTHER ECON… 200 LEGALL… VOCA… 1 1 10 0
6 MALE 51 45-59 OTHER ECON… 200 SINGLE PRIM… 1 1 10 0
7 MALE 51 45-59 OTHER ECON… 200 LEGALL… VOCA… 1 1 10 1
8 MALE 51 45-59 OTHER ECON… 200 LEGALL… VOCA… 1 1 20 1
9 MALE 51 45-59 OTHER ECON… 200 DIVORC… VOCA… 1 1 20 1
10 MALE 51 45-59 OTHER ECON… 200 LEGALL… PRIM… 1 1 20 1
# … with more rows, 1 more variable: bmi <dbl>, and abbreviated variable name
# ¹alcabuse
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL>
SELECT
`marital`,
MIN(`income`) AS `min_income`,
MAX(`income`) AS `max_income`,
AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
(`age` > 65) AND
(`sex` = 'MALE') AND
(`sport` = 1) AND
(NOT((`income` IS NULL))) AND
(NOT((`marital` IS NULL)))
GROUP BY `marital`
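For reference, dplyr code along the following lines would generate SQL like the statement above (this is a reconstruction from the SQL shown, not the original code, and the object name query2 is an assumption):

query2 <- tbl %>%
  filter(
    age > 65,
    sex == "MALE",
    sport == TRUE,      # sport is stored as 0/1 in SQLite
    !is.na(income),
    !is.na(marital)
  ) %>%
  group_by(marital) %>%
  summarise(
    min_income = min(income),
    max_income = max(income),
    mean_income = mean(income)
  )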
Note I’ve turned off checking for this benchmark because of the difference in how dplyr handles NAs when arranging data in data.frames (NA’s at the end) vs how SQLite’s engine does (NA’s at the top).
In this first test of performance, databases come out slower. That shouldn’t surprise us though. Working with in-memory data (where there is still enough memory available for the computation) will always be faster because there is no I/O cost to the query once the data has been loaded, whereas executing and collecting the query from the database involves returning the data from disk. We can see though that working with a database is much more memory efficient, which, given that the topic of this chapter is working with data that does not fit into memory, shows it is a good approach for this use case.
Indexing
Indexes are a way to improve the performance of your read queries, particularly ones with filters (WHERE) on them. They’re data structures that exist in your database engine, outside of whatever table they work with, that point to data you’re trying to query.
They work similarly to the index at the back of a book. They contain the ordered values of the column you create them on, along with information about the location of the rows containing each value in the original table.
So just like you might use an index to find a recipe instead of flicking through an entire recipe book, database indexes allow you to look up the values in columns and the location of the rows containing them in your original table without scanning the full table. A well crafted index can produce impressive query speed ups!
This does come at a cost. They take up space within your database, increasing its overall size, and they also slow down updates to any tables containing indexes, as the indexes must be updated too. Crafting indexes is also a bit of an art, as creating an index that speeds up a given query might actually slow another one down!
The details of good indexing strategy are a big topic that is well beyond the scope of this workshop.
file.copy(from = "data/db.sqlite", to = "data/db_idx.sqlite")
[1] TRUE
Let’s connect to the database we’re going to index as well as create a connection to the "synthpop" table.
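A minimal sketch of those two steps (con_idx matches the connection name used in the dbExecute() calls below; the table object name tbl_idx is an assumption):

con_idx <- DBI::dbConnect(RSQLite::SQLite(), "data/db_idx.sqlite")
tbl_idx <- dplyr::tbl(con_idx, "synthpop")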
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`
<PLAN>
id parent notused detail
1 3 0 0 SCAN synthpop
2 33 0 0 USE TEMP B-TREE FOR ORDER BY
explain() translates to the EXPLAIN QUERY PLAN command in SQLite databases. It includes the SQL translation of our query but is used primarily to obtain a high-level description of the strategy or plan that SQLite uses to implement a specific SQL query. Most significantly, EXPLAIN QUERY PLAN reports on the way in which the query uses database indices. The relevant information is found in the detail column of the bottom table of the output.
The output of piping query 1 into explain() indicates that the SQLite engine is using a full scan of the "synthpop" table to locate the rows matching our filter (WHERE) condition. It then uses a temporary sorting B-Tree for ordering the data. When you use ORDER BY without an index on the column to be sorted, SQLite builds a temporary data structure containing the sorted results each time the query is executed. That data structure uses up space in memory (or on disk, depending on the circumstances) and is torn down after the query is executed.
Tip
To find out more about the SQLite EXPLAIN QUERY PLAN command, head to the SQLite documentation.
Time to create an index. To do so we use function dbExecute() on our database connection and pass it a character string of the SQL statement to create an index.
dbExecute(con_idx,"CREATE INDEX synthpop_age_inc_idx ON synthpop (age, income);")
[1] 0
Let’s break down the statement:
CREATE INDEX is the command to create an index.
synthpop_age_inc_idx is the name of the index we want to create. It’s good practice to include an indication of the table as well as the columns used to create the index in the name of the index.
ON synthpop indicates that the index is being created on table synthpop.
(age, income) the parentheses indicate the columns we want to include in our index. Indexes can be created using one or multiple columns. Here, because our filter statement involves searching for values on both age and income, we include both columns for better performance. Note however that this inevitably takes up more disk space and means more time to create (and in future update) the index.
OK, let’s check the query plan to see if SQLite plans to use our index:
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`
<PLAN>
id parent notused
1 4 0 0
2 32 0 0
detail
1 SEARCH synthpop USING INDEX synthpop_age_inc_idx (age>? AND age<?)
2 USE TEMP B-TREE FOR ORDER BY
Indeed! The query is not SCANning the full table anymore but is instead using our index to SEARCH for values in the index matching our filter statement.
Let’s see whether we’ve improved our query’s performance:
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 no_index 617ms 617ms 1.62 3.44MB 0
2 index 126ms 126ms 7.89 3.44MB 0
Indeed we have, roughly a 5x speed up. Not bad! But we could do better!
Because indexes are sorted data structures and their benefit comes from how binary search works, it’s important to ensure that our indexed columns have what is called “high cardinality”. All this means is that the indexed data has a lot of uniqueness.
While our age column has 79 unique values, our income column has 406, i.e. income has higher cardinality than age.
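One way to check this from R is to let dbplyr translate n_distinct() into COUNT(DISTINCT ...) (a sketch, assuming the table connection object is called tbl_idx):

library(dplyr)

tbl_idx %>%
  summarise(
    n_age = n_distinct(age),
    n_income = n_distinct(income)
  )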
A multi-column index will initially use the first column to search, then the second, and so on. So instead of putting age at the front of our index, let’s drop our first index using the DROP INDEX command and instead create a new index with income first:
dbExecute(con_idx,"DROP INDEX synthpop_age_inc_idx;")
[1] 0
dbExecute(con_idx,"CREATE INDEX synthpop_inc_age_idx ON synthpop (income, age);")
[1] 0
Let’s inspect our query plan which reveals that, indeed, our index now searches through income first:
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`
<PLAN>
id parent notused detail
1 4 0 0 SEARCH synthpop USING INDEX synthpop_inc_age_idx (income<?)
2 35 0 0 USE TEMP B-TREE FOR ORDER BY
An index can be used to speed up sorting only if the query can return the rows in the order in which they are stored in the index. Because our index does not include many of the columns we are using in the sort operation, and most importantly does not include the first one (bmi), the index is ignored by ORDER BY.
We might consider creating another index to take care of the ORDER BY operation and include all the columns involved in the order that we want them sorted.
dbExecute(con_idx,"CREATE INDEX synthpop_arrange1_idx ON synthpop (bmi, age, income, nociga, sex);")
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`
<PLAN>
id parent notused detail
1 4 0 0 SCAN synthpop USING INDEX synthpop_arrange1_idx
Now what we see is that the engine is indeed using our synthpop_arrange1_idx index but is only using that one. Not only that, it is now performing a full SCAN of the arrange index table.
An important thing to note is that, in SQLite, each table in the FROM clause of a query can use at most one index and SQLite strives to use at least one index on each table. So it cannot use one index for the WHERE part of the query and another for the ORDER BY part.
In this case, the engine determines that the least costly query plan is to just use the synthpop_arrange1_idx index because all the information it needs is stored within and therefore does not require a lookup in the original synthpop table to retrieve further data. It knows the data is stored in the correct order but to perform the WHERE operation, it does need to scan the full index table.
But why does this end up slower in practice? Because the WHERE operation actually returns a much smaller subset of the data, optimising that part of the query and then using a B-TREE for sorting ends up much faster in practice. However, the query optimiser has no way of knowing this upfront (and it may not be the case if the WHERE operation returns a much bigger subset), so it concludes, wrongly in our case, that using the synthpop_arrange1_idx index is most efficient.
So at least for this query the synthpop_arrange1_idx index is not worth keeping, so let’s drop it.
dbExecute(con_idx,"DROP INDEX synthpop_arrange1_idx;")
<SQL>
SELECT *
FROM `synthpop`
WHERE (`age` > 50 AND `age` < 60) AND (`income` < 300.0)
ORDER BY `bmi`, `age`, `income`, `nociga`, `sex`
<PLAN>
id parent notused detail
1 4 0 0 SEARCH synthpop USING INDEX synthpop_inc_age_idx (income<?)
2 35 0 0 USE TEMP B-TREE FOR ORDER BY
Now the optimiser goes back to using the synthpop_inc_age_idx index.
Query 2
So we’ve made Query 1 faster but what about query 2?
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 no_index 752ms 752ms 1.33 528KB 1.33
2 index 714ms 714ms 1.40 525KB 1.40
Well that’s not good! The index seems to have made query 2 slower?! If we use explain() to dig into it, we see it’s still doing a full scan, but now the optimiser also has to evaluate a potential query plan involving our synthpop_inc_age_idx index.
<SQL>
SELECT
`marital`,
MIN(`income`) AS `min_income`,
MAX(`income`) AS `max_income`,
AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
(`age` > 65) AND
(`sex` = 'MALE') AND
(`sport` = 1) AND
(NOT((`income` IS NULL))) AND
(NOT((`marital` IS NULL)))
GROUP BY `marital`
ORDER BY `marital`
<PLAN>
id parent notused detail
1 7 0 0 SCAN synthpop
2 21 0 0 USE TEMP B-TREE FOR GROUP BY
Let’s create an index to improve the performance of query 2, focusing again on the WHERE part of the query.
We might start by creating an index using all the columns involved, in order of decreasing cardinality:
dbExecute(con_idx,"CREATE INDEX synthpop_inc_age_mar_sex_sp_idx ON synthpop (income, age, marital, sex, sport);")
<SQL>
SELECT
`marital`,
MIN(`income`) AS `min_income`,
MAX(`income`) AS `max_income`,
AVG(`income`) AS `mean_income`
FROM `synthpop`
WHERE
(`age` > 65) AND
(`sex` = 'MALE') AND
(`sport` = 1) AND
(NOT((`income` IS NULL))) AND
(NOT((`marital` IS NULL)))
GROUP BY `marital`
ORDER BY `marital`
<PLAN>
id parent notused
1 7 0 0
2 21 0 0
detail
1 SCAN synthpop USING COVERING INDEX synthpop_inc_age_mar_sex_sp_idx
2 USE TEMP B-TREE FOR GROUP BY
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 no_index 672ms 672ms 1.49 524KB 1.49
2 index 526ms 526ms 1.90 524KB 1.90
We see a small improvement. At least the query is not slower now!
Indexing Take Aways:
Indexes can be a useful strategy for improving the performance of specific queries. HOWEVER:
We have only scraped the surface of the types of indexes available as well as how to determine when and how to deploy them.
They are fiddly to create and can have unexpected effects on different queries.
They take time to create and update and take up space on disk (our indexed database is now roughly 1.24 GB compared to roughly 822 MB for our original database!).
Trying to create new indexes to optimise each new query quickly gets out of hand and requires a lot of knowledge/experimentation.
But! The next section provides some useful perspective!
DuckDB
While SQLite is ubiquitous in the world of embedded databases and does support complex analytical queries, it is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution.
There is, however, a relatively recent embedded (flat) database called DuckDB.
DuckDB can be far more efficient for the complex analytical queries on large amounts of data that are common in analytics workflows.
DuckDB is designed to support analytical query workloads, also known as Online analytical processing (OLAP).
These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables.
Changes to the data are expected to be rather large-scale as well, with several rows being appended, or large portions of tables being changed or added at the same time.
DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This greatly reduces the overhead present in traditional systems such as PostgreSQL, MySQL or SQLite, which process each row sequentially.
It also has a nice R API provided through the duckdb package. I highly recommend checking the DuckDB documentation to learn more about its features, but in general, you can interact with DuckDB databases in R as you would any other database.
So let’s create a DuckDB database with the same data and benchmark our queries against it.
Again we can use dbConnect() to both create a database using a duckdb::duckdb() driver and open a connection to it.
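A minimal sketch (the database file name db.duckdb and the object names are assumptions; data is the in-memory data.frame used earlier):

library(DBI)

# create a DuckDB database file and connect to it
con_duckdb <- dbConnect(duckdb::duckdb(), dbdir = "data/db.duckdb")

# write the same data to a "synthpop" table and open a connection to it
dbWriteTable(con_duckdb, "synthpop", data)
tbl_duckdb <- dplyr::tbl(con_duckdb, "synthpop")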
Wow! DuckDB is much faster than SQLite, can compete with and beat an indexed SQL database and can be even faster than running the queries on in-memory data!! And still very memory efficient. And all this without even having to think about indexes!! 🤩 🎉
This is definitely a database type you should know about!
Accessing data through the arrow package
Arrow is a software development platform for building high performance applications. As we’ve seen already, the arrow R package provides functionality for fast reading and writing of flat files as well as of more efficient binary file formats.
It also provides functions for opening connections to files as well as directories of files, much like we did with databases, and because it has deep integration with dplyr, it allows us to perform queries on out of memory data as we’ve been doing with our databases.
Accessing single files as arrow tables
We can read in a single large csv, arrow or parquet file using the appropriate arrow function, but instead of reading it in as a data.frame we can use as_data_frame = FALSE to open it as an arrow table. Because of how Arrow allocates memory, arrow tables are much more memory efficient representations of tabular data, which means data that won’t fit into memory as a data.frame may well fit as an arrow table.
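For example, a sketch of opening the parquet file from earlier as an arrow table rather than a data.frame (the object name is an assumption):

arrow_table <- arrow::read_parquet(
  "data/synthpop_10000000.parquet",
  as_data_frame = FALSE
)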
Wow! That’s much faster than performing the query even on an in-memory data.frame. Impressive!
Accessing data as arrow datasets
Another way to access files through R is by opening them as a dataset with function arrow::open_dataset().
We can open a single file or a whole directory of files, formatted in any of the formats arrow can handle.
This does not load the data into memory. Instead open_dataset() scans the content of the file(s) to identify the name of the columns and their data types.
Accessing single files as arrow datasets
Let’s open a single file as a dataset first. To do so we supply the path to the file as well as the format it’s stored in.
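A sketch of this (the object name is an assumption):

arrow_single_dataset <- arrow::open_dataset(
  "data/synthpop_10000000.parquet",
  format = "parquet"
)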
We can also use the same function to open a directory of files stored in the same format. This might be appropriate when your data generation involves creating data in batches that end up in separate files and for some reason you don’t want to be writing them to a database.
The directory structure can help improve performance of queries too depending on how data is partitioned across directories. In some ways you can think of the physical partitioning as a physical index that can be used in a query to completely skip certain files.
Let’s have a look at what this means by actually creating such a directory structure from our dataset.
First let’s create a directory to partition it into. Then we can use function arrow::write_dataset() to write out our data partitioned according to any variables we specify in the partitioning argument. Here we choose to partition across age. Let’s also write data out in efficient parquet files.
arrow::write_dataset(data, path = "data/arrow_dataset", format = "parquet", partitioning = "age")
Let’s use fs::dir_tree() to see the structure of the arrow_dataset directory we just created:
As we can see, a folder has been created for each value of age and the rows where the original data matched that condition are contained in parquet files within.
The dataset directory is nonetheless still more efficient than the original csv.
# parquet arrow dataset
fs::dir_info("data/arrow_dataset", recurse = TRUE)$size %>% sum()
140M
# original csv
fs::file_size("data/synthpop_10000000.csv")
NA
Now that we’ve got a partitioned directory of our data, let’s go ahead and open it as an arrow dataset.
arrow_dir_dataset <- arrow::open_dataset("data/arrow_dataset", format = "parquet")
DuckDB can be a very efficient database format for complex queries involving large amounts of data, due to its OLAP nature and its columnar-vectorised execution engine.
Indexing can improve queries in SQLite and other OLTP-type databases. However, indexes are not flexible, take a lot of knowledge and experimentation, increase disk space and can also reduce performance on other queries if mis-applied.
The arrow package provides another option for loading or opening connections to files or directories of data and has deep integration with dplyr for performing queries.
Partitioning can improve querying directories of data as arrow datasets. Partitions are however inflexible and represent a single physical index applied to the whole dataset.
Arrow tables allow loading large datasets in a more memory efficient way and support very fast querying.
Batch processing
In the previous sections we were focusing on a narrow set of operations, in particular the efficiency of accessing, filtering, selecting, ordering and aggregating subsets of data from data too large to fit into memory. But often we need to perform some processing on the whole dataset, as we saw in the example of populating our database in batches.
Other times our analyses, for example fitting a model, might require the full dataset to produce a result, which can be problematic even if we can just about load our data into memory, as that may leave us with little RAM for the computation itself.
An option in this case would be to use algorithms that can compute on chunks or batches of the data. These algorithms are known as external memory algorithms (EMA), or batch processing.
Here’s a simple example of how we might write a batch processing algorithm to calculate the mean across multiple files, specifically the parquet files we just created in data/arrow_dataset.
batch_mean <- function(file_name) {
  dat <- arrow::read_parquet(file_name, col_select = "income")
  income_vct <- na.omit(dat[["income"]])
  c(mean = mean(income_vct), n = length(income_vct))
}
In this function we are given a file name. For each file, we load only the column we are interested in (income), remove NAs and calculate the mean. To be able to aggregate the mean across all files, we also record n, the number of values used to calculate the mean.
We can then apply the function to a list of file names and aggregate the results in a tibble using purrr::map_dfr().
file_names <- fs::dir_ls("data/arrow_dataset", recurse = TRUE, type = "file")
means <- purrr::map_dfr(file_names, ~ batch_mean(.x))
means
# A tibble: 79 × 2
mean n
<dbl> <dbl>
1 543. 5138
2 530. 17675
3 548. 20947
4 678. 21002
5 770. 17801
6 1109. 71922
7 1107. 89002
8 1275. 95972
9 1411. 92471
10 1555. 115432
# … with 69 more rows
Now that we’ve got our batched means we can calculate a weighted mean, using the n column as the weight, which indeed gives us the same mean as if we had calculated it on the whole dataset.
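As a sketch, that final aggregation step might look like this, using the mean and n columns of the means tibble:

weighted.mean(means$mean, w = means$n)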
There are a number of R packages that provide EMA solutions for analysing bigger-than-memory data.
For example function biglm from package biglm allows for fitting a linear model in batches.
In the following example from the package documentation, an lm model is fitted initially to a small subset of the data with the function biglm. The model is subsequently updated with additional chunks of the data using update.
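A sketch based on the biglm documentation example, using the built-in trees dataset split into three chunks:

library(biglm)

data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)

# fit on the first chunk, then update the fit with subsequent chunks
chunk1 <- trees[1:10, ]
chunk2 <- trees[11:20, ]
chunk3 <- trees[21:31, ]

a <- biglm(ff, chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)

summary(a)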
The R packages available are numerous and their suitability varies according to the data and analysis you need to perform.
bigmemory: Manage Massive Matrices with Shared Memory and Memory-Mapped Files. The package uses memory mapping, where RAM addresses are mapped to a file on disk. While this inevitably reduces performance, it extends the memory available for computation to memory on disk.
A number of analytics packages build on bigmemory including:
bigtabulate: Extend the bigmemory package with ‘table’, ‘tapply’, and ‘split’ support for ‘big.matrix’ objects.
bigstep: Uses the bigmemory framework to perform stepwise model selection, when the data cannot fit into RAM.
ff: The ff package provides data structures that are stored on disk in a binary format but behave (almost) as if they were in RAM, by transparently mapping only a section (pagesize) into main memory. These data structures lend themselves to efficient chunking. Unlike bigmemory, which only supports numeric data types, ff supports all of R’s vector types, including factors (to which any character data is converted for memory efficiency).
ffbase: extends the ff package with a number of methods for working with ff objects.
Package biglm also has methods for ff type objects so is not limited to fitting on numeric data.
I should acknowledge that this brief section has been a summarisation of the chapter on Efficient Memory from BGU’s Department of Industrial Engineering and Management “R” course by Jonathan D. Rosenblatt. For more information I highly recommend reviewing it, as well as the chapter on Sparse Representations.