Efficient Data Input/Output (I/O)
Before we can work with data within R, we first have to be able to read it in. Conversely, once we’ve finished processing or analysing our data, we might need to write out final or intermediate results.
Many factors will go into deciding which format and which read & write functions we might choose for our data. For example:
- File size
- Portability
- Interoperability
- Human readability
In this section we'll look at a number of the most common file formats for (primarily tabular) data and summarise their characteristics. We'll also compare and benchmark the functions and packages available in R for reading and writing them.
File formats
Flat files
Some of the most common file formats we might be working with when dealing with tabular data are flat delimited text files. Such files store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. A couple of well known examples are:
- Comma-Separated Values files (CSVs): use a comma to separate values.
- Tab-Separated Values files (TSVs): use a tab to separate values.
They are ubiquitous and human readable but, as you will see, they take up comparatively more disk space and can be slow to read and write when dealing with large files.
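To make the delimiter distinction concrete, here's a minimal sketch using base R (the file names are just for illustration):

# The same two-column table written with two different delimiters:
df <- data.frame(name = c("Ana", "Ben"), score = c(9.5, 7.2))
write.csv(df, "scores.csv", row.names = FALSE)                # comma-separated
write.table(df, "scores.tsv", sep = "\t", row.names = FALSE)  # tab-separated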
Packages/functions that can read/write delimited text files:
Relevant functions

Read:
- read.csv()/read.delim(): base R
- readr::read_csv()/readr::read_delim()
- data.table::fread()
- arrow::read_csv_arrow()

Write:
- write.csv()/write.table(): base R
- readr::write_csv()/readr::write_delim()
- data.table::fwrite()
- arrow::write_csv_arrow()
Binary files
If you look at Wikipedia for a definition of Binary files, you get:
A binary file is a computer file that is not a text file 😜
You’ll also learn that binary files are usually thought of as being a sequence of bytes, and that some binary files contain headers, blocks of metadata used by a computer program to interpret the data in the file. Because they are stored in bytes, they are not human readable unless viewed through specialised viewers.
The process of writing out data to a binary format is called binary serialisation, and different formats can use different serialisation methods.
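To make this concrete, base R's serialize() shows what binary serialisation looks like at the lowest level; a minimal sketch:

# serialize() converts an R object into a sequence of raw bytes --
# the same mechanism saveRDS() uses under the hood.
bytes <- serialize(mtcars, connection = NULL)
head(bytes)                   # raw bytes -- not human readable
object <- unserialize(bytes)  # round-trips back to the original object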
Let’s look at some binary formats you might consider as an R user.
RData/RDS formats

.RData and .rds files are binary formats specific to R that can be used to store and restore complete R objects, so they are not restricted to tabular data. They can therefore be good options for storing more complicated objects like models. .RData files can store multiple objects while .rds files are designed to contain a single object. Pertinent characteristics of such files:
- Can be faster to restore the data to R (but not necessarily as fast to write).
- Can preserve R specific information encoded in the data (e.g., attributes, variable types, etc.).
- Are R specific so not interoperable outside of R environments.
In R 3.6.0, the default serialisation version used to write .RData and .rds binary files changed from 2 to 3. This means that files serialised with version 3 cannot be read by others running R < 3.5.0, which limits interoperability even between R users.
Overall, while good for writing R objects, I would reserve writing such files only for ephemeral intermediate results or for more complex objects, where other formats are not appropriate. Be mindful of the serialisation version you use if you want users running R < 3.5.0 to be able to read them.
Relevant functions

Write:
- save(): for writing .RData files.
- saveRDS(): for writing .rds files.

Read:
- load(): for reading .RData files.
- readRDS(): for reading .rds files.
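A minimal sketch of the difference in behaviour (the object and file names are illustrative):

# .RData can bundle several objects; load() restores them into the
# environment under their original names:
x <- 1:10
fit <- lm(dist ~ speed, data = cars)
save(x, fit, file = "objects.RData", version = 2)  # version 2 is readable by R < 3.5.0
load("objects.RData")                              # recreates `x` and `fit`

# .rds stores a single object; readRDS() returns it for you to assign:
saveRDS(fit, "fit.rds")
fit2 <- readRDS("fit.rds")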
Apache parquet/arrow

While different file formats, I've bundled these two together because they are both Apache Foundation data formats. We also use the same R package (arrow) to read and write them.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
The formats, as well as the arrow R package to interact with them, are part of the Apache Arrow software development platform for building high performance applications that process and transport large data sets.
Relevant functions

Write:
- arrow::write_parquet(): for writing Apache parquet files.
- arrow::write_feather(): for writing arrow IPC format files (arrow represents version 2 of the feather format, hence the confusing name of the function).

Read:
- arrow::read_parquet(): for reading Apache parquet files.
- arrow::read_feather(): for reading arrow IPC format files.
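One practical consequence of the column-oriented layout is that you can read just the columns you need. A minimal sketch (the file and column names are hypothetical):

# Reading a subset of columns from a parquet file only touches the
# corresponding column chunks on disk:
df <- arrow::read_parquet("synthpop.parquet", col_select = c("age", "sex"))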
fst
The fst package for R is based on a number of C++ libraries and provides a fast, easy and flexible way to serialise data frames into the fst binary format. With access speeds of multiple GB/s, fst is specifically designed to unlock the potential of the high speed solid state disks found in most modern computers.

The fst file format provides full random access to stored datasets, allowing retrieval of subsets of both columns and rows from a file. Files are also compressed.
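A hedged sketch of that random access (the file and column names are hypothetical):

# Read only two columns and the first 1,000 rows without scanning
# the rest of the file:
df <- fst::read_fst("synthpop.fst",
                    columns = c("age", "sex"),
                    from = 1, to = 1000)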
Relevant functions

Write:
- fst::write_fst(): for writing fst files.

Read:
- fst::read_fst(): for reading fst files.
qs
Package qs provides an interface for quickly saving and reading objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the saveRDS and readRDS functions in R.

saveRDS and readRDS are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, fst is extremely fast, but only works on data.frames and certain column types. qs is both extremely fast and general: it can serialize any R object like saveRDS and is just as fast as, and sometimes faster than, fst.
Relevant functions

Write:
- qs::qsave(): for serialising R objects to qs files.

Read:
- qs::qread(): for reading qs files.
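A minimal sketch of the round trip for a non-tabular object (the file name is illustrative):

# qs can serialise arbitrary R objects, not just data frames:
fit <- lm(dist ~ speed, data = cars)
qs::qsave(fit, "fit.qs")
fit2 <- qs::qread("fit.qs")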
Benchmarks
Now that we’ve discussed a bunch of relevant file formats and the packages used to read and write them, let’s go ahead and test out the comparative performance of reading and writing them, as well as the file sizes of different formats.
Writing data
Let’s start by comparing write efficiency.
Before we start, we'll need some data to write, so let's load one of the parquet files from the course materials. Let's go for the file with 1,000,000 rows. If you want to speed up the testing, you can use the file with 100,000 rows by changing the value of n_rows.

n_rows <- 1000000L
data <- arrow::read_parquet(here::here("data", paste0("synthpop_", n_rows, ".parquet")))

Let's also load dplyr for the pipe and other helpers:

library(dplyr)
Let’s now create a directory to write our data to:
out_dir <- here::here("data", "write")
fs::dir_create(out_dir)
To compare each file format and function combination (where appropriate), I've written a function that uses the value of the format argument and the switch() function to deploy a different write function/format combination for writing out the data.
write_dataset <- function(data,
                          format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                     "parquet", "arrow", "rdata", "rds", "fst", "qs"),
                          out_dir,
                          file_name = paste0("synthpop_", n_rows, "_")) {
  format <- match.arg(format)
  switch(format,
         ## FLAT FILES ##
         # write csv using base
         csv = write.csv(data,
                         file = fs::path(out_dir,
                                         paste0(file_name, format),
                                         ext = "csv"),
                         row.names = FALSE),
         # write csv using readr
         csv_readr = readr::write_csv(data,
                                      file = fs::path(out_dir,
                                                      paste0(file_name, format),
                                                      ext = "csv")),
         # write csv using data.table
         csv_dt = data.table::fwrite(data,
                                     file = fs::path(out_dir,
                                                     paste0(file_name, format),
                                                     ext = "csv")),
         # write csv using arrow
         csv_arrow = arrow::write_csv_arrow(data,
                                            file = fs::path(out_dir,
                                                            paste0(file_name, format),
                                                            ext = "csv")),
         ## BINARY FILES ##
         # write parquet using arrow
         parquet = arrow::write_parquet(data,
                                        sink = fs::path(out_dir,
                                                        paste0(file_name, format),
                                                        ext = "parquet")),
         # write arrow IPC using arrow
         arrow = arrow::write_feather(data,
                                      sink = fs::path(out_dir,
                                                      paste0(file_name, format),
                                                      ext = "arrow")),
         # write RData using base
         rdata = save(data,
                      file = fs::path(out_dir,
                                      paste0(file_name, format),
                                      ext = "RData"),
                      version = 2),
         # write rds using base
         rds = saveRDS(data,
                       file = fs::path(out_dir,
                                       paste0(file_name, format),
                                       ext = "rds"),
                       version = 2),
         # write fst using fst
         fst = fst::write_fst(data,
                              path = fs::path(out_dir,
                                              paste0(file_name, format),
                                              ext = "fst")),
         # write qs using qs
         qs = qs::qsave(data,
                        file = fs::path(out_dir,
                                        paste0(file_name, format),
                                        ext = "qs"))
  )
}
I've also written a function to process the bench::mark() output, removing unnecessary information, arranging the results in ascending order of median execution time and printing the result as a gt() table.
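The original helper isn't shown here; a minimal sketch of what print_bm() might look like, assuming the bench::press output columns shown in the tables below:

# Hypothetical reconstruction of print_bm(): trims bench::mark() output
# down to the columns of interest, sorts fastest-first and renders a gt table.
print_bm <- function(bm) {
  bm |>
    dplyr::select(format, median, `itr/sec`, mem_alloc, `gc/sec`,
                  n_itr, n_gc, total_time) |>
    dplyr::arrange(median) |>
    gt::gt()
}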
We’re now ready to run our benchmarks. I’ve set them up as a bench::press()
so we can run the same function every time but vary the format
argument for each test:
bench::press(
format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
"parquet", "arrow", "rdata", "rds", "fst", "qs"),
{
bench::mark(write_dataset(data, format = format, out_dir = out_dir))
}
) %>%
print_bm()
format | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time |
---|---|---|---|---|---|---|---|
arrow | 58.73ms | 16.9895018 | 40.38KB | 0.0000000 | 9 | 0 | 529.74ms |
csv_dt | 236.50ms | 4.2072572 | 1.44MB | 0.0000000 | 3 | 0 | 713.05ms |
fst | 294.15ms | 3.3996278 | 3.54MB | 0.0000000 | 2 | 0 | 588.30ms |
parquet | 315.53ms | 3.1692389 | 2.27MB | 0.0000000 | 2 | 0 | 631.07ms |
csv_arrow | 340.96ms | 2.9329298 | 3.34MB | 0.0000000 | 2 | 0 | 681.91ms |
qs | 345.49ms | 2.8944557 | 176.88KB | 0.0000000 | 2 | 0 | 690.98ms |
csv_readr | 517.42ms | 1.9326511 | 66.58MB | 1.9326511 | 1 | 1 | 517.42ms |
rds | 3.12s | 0.3208688 | 8.63KB | 0.0000000 | 1 | 0 | 3.12s |
rdata | 3.12s | 0.3201837 | 8.63KB | 0.3201837 | 1 | 1 | 3.12s |
csv | 3.24s | 0.3082182 | 61.33MB | 3.3904006 | 1 | 11 | 3.24s |
We see that:

- The fastest write format by quite some margin is the arrow format using arrow::write_feather().
- All arrow package functions are actually quite efficient, featuring in the top 5 for speed regardless of format.
- For csv formats, however, there is a clear winner: data.table::fwrite().
- Both qs and fst are, as advertised, quite fast, and qs in particular should definitely be considered when needing to store more complex R objects.
- Base functions write.csv(), save() and saveRDS() are often orders of magnitude slower.
Size on disk
Let’s also check how much space each file format takes up on disk:
tibble::tibble(file = basename(fs::dir_ls(out_dir)),
size = file.size(fs::dir_ls(out_dir))) |>
arrange(size) |>
mutate(size = gdata::humanReadable(size,
standard="SI",
digits=1)) |>
gt::gt()
file | size |
---|---|
synthpop_1000000_parquet.parquet | 7.1 MB |
synthpop_1000000_rds.rds | 11.9 MB |
synthpop_1000000_rdata.RData | 11.9 MB |
synthpop_1000000_qs.qs | 16.2 MB |
synthpop_1000000_arrow.arrow | 47.8 MB |
synthpop_1000000_fst.fst | 48.4 MB |
synthpop_1000000_csv_dt.csv | 106.4 MB |
synthpop_1000000_csv_readr.csv | 110.8 MB |
synthpop_1000000_csv.csv | 120.4 MB |
synthpop_1000000_csv_arrow.csv | 120.8 MB |
It's clear that binary formats take up a lot less space on disk than csv text files. At the extremes, parquet files take up over 17 times less space than a csv file written out with write.csv() or arrow::write_csv_arrow().
Reading data
Let's now use the files we created to test how efficient different formats and functions are at reading data in.

Just like I did before with write_dataset(), I've written a function to read the appropriate file using the appropriate function according to the value of the format argument:
read_dataset <- function(format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                    "parquet", "arrow", "rdata", "rds", "fst", "qs"),
                         out_dir,
                         file_name = paste0("synthpop_", n_rows, "_")) {
  format <- match.arg(format)
  switch(format,
         ## FLAT FILES ##
         # read csv using base
         csv = read.csv(file = fs::path(out_dir,
                                        paste0(file_name, format),
                                        ext = "csv")),
         # read csv using readr
         csv_readr = readr::read_csv(file = fs::path(out_dir,
                                                     paste0(file_name, format),
                                                     ext = "csv")),
         # read csv using data.table
         csv_dt = data.table::fread(file = fs::path(out_dir,
                                                    paste0(file_name, format),
                                                    ext = "csv")),
         # read csv using arrow
         csv_arrow = arrow::read_csv_arrow(file = fs::path(out_dir,
                                                           paste0(file_name, format),
                                                           ext = "csv")),
         ## BINARY FILES ##
         # read parquet using arrow
         parquet = arrow::read_parquet(file = fs::path(out_dir,
                                                       paste0(file_name, format),
                                                       ext = "parquet")),
         # read arrow IPC using arrow
         arrow = arrow::read_feather(file = fs::path(out_dir,
                                                     paste0(file_name, format),
                                                     ext = "arrow")),
         # read RData using base
         rdata = load(file = fs::path(out_dir,
                                      paste0(file_name, format),
                                      ext = "RData")),
         # read rds using base
         rds = readRDS(file = fs::path(out_dir,
                                       paste0(file_name, format),
                                       ext = "rds")),
         # read fst using fst
         fst = fst::read_fst(path = fs::path(out_dir,
                                             paste0(file_name, format),
                                             ext = "fst")),
         # read qs using qs (qread() is the counterpart to qsave())
         qs = qs::qread(file = fs::path(out_dir,
                                        paste0(file_name, format),
                                        ext = "qs"))
  )
}
And again, I've set up our benchmarks as a bench::press() so we can run the same function every time but vary the format argument for each test. Let's see how fast our format/function combos are at reading!
bench::press(
format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
"parquet", "arrow", "rdata", "rds", "fst", "qs"),
{
bench::mark(
read_dataset(format = format, out_dir = out_dir),
relative = FALSE)
}
) %>%
print_bm()
format | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time |
---|---|---|---|---|---|---|---|
arrow | 10.56ms | 94.3128761 | 12.0MB | 6.7366340 | 42 | 3 | 445.33ms |
parquet | 30.57ms | 33.0283989 | 11.5MB | 2.2018933 | 15 | 1 | 454.15ms |
csv_arrow | 56.60ms | 17.5962129 | 25.0MB | 2.5137447 | 7 | 1 | 397.81ms |
fst | 232.94ms | 4.2929415 | 76.3MB | 2.1464707 | 2 | 1 | 465.88ms |
csv_dt | 309.57ms | 3.2302349 | 97.7MB | 3.2302349 | 1 | 1 | 309.57ms |
qs | 342.91ms | 2.9162241 | 76.3MB | 2.9162241 | 2 | 2 | 685.82ms |
csv_readr | 624.16ms | 1.6021469 | 91.0MB | 3.2042939 | 1 | 2 | 624.16ms |
rdata | 1.08s | 0.9246915 | 76.3MB | 0.9246915 | 1 | 1 | 1.08s |
rds | 1.10s | 0.9105128 | 76.3MB | 0.9105128 | 1 | 1 | 1.10s |
csv | 1.96s | 0.5095255 | 378.9MB | 2.5476277 | 1 | 5 | 1.96s |
Results of our experiments show that:

- The arrow format using arrow::read_feather() is again the fastest.
- Again, all arrow functions are the fastest for reading regardless of format, occupying the top 3.
- data.table::fread() is again very competitive for reading CSVs.
- qs is also highly performant, and a good package to know given it can be used for more complex objects.
- Base functions for reading files, whether binary or CSV, are again the slowest by quite some margin.

It should be noted that both readr::read_csv() and read.csv() can be made much faster by pre-specifying the data type for each column when reading, as shown in the sketch below.
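A hedged sketch (the file, column names and types here are illustrative, not the actual synthpop columns):

# Supplying col_types lets readr skip the type-guessing pass entirely:
df <- readr::read_csv(
  "data.csv",
  col_types = readr::cols(
    id  = readr::col_integer(),
    age = readr::col_double(),
    sex = readr::col_character()
  )
)

# The base R equivalent is colClasses:
df <- read.csv("data.csv",
               colClasses = c(id = "integer", age = "numeric", sex = "character"))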