Efficient Data Input/Output (I/O)

Before we can work with data within R, we first have to be able to read it in. Conversely, once we’ve finished processing or analysing our data, we might need to write out final or intermediate results.

Many factors will go into deciding which format and which read and write functions we might choose for our data, for example: speed of reading and writing, size on disk, and interoperability with other tools and languages.

In this section we’ll review a number of the most common file formats for (primarily tabular) data and summarise their characteristics.

We’ll also compare and benchmark functions and packages available in R for reading and writing them.

File formats

Flat files

Some of the most common file formats we might be working with when dealing with tabular data are flat delimited text files. Such files store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. A couple of well known examples are:

  • Comma-Separated Values files (CSVs): use a comma to separate values.

  • Tab-Separated Values files (TSVs): use a tab to separate values.

They are ubiquitous and human readable but, as you will see, they take up comparatively more disk space and can be slow to read and write when dealing with large files.
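For example, the first few lines of a small CSV file might look like this (made-up data, not one of the course files):

id,name,age
1,Alice,34
2,Bob,27
3,Carol,45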

Packages/functions that can read/write delimited text files (these are the combinations we benchmark later in this section):

Relevant functions
Read
  • read.csv(): base R.

  • readr::read_csv(): from the readr package.

  • data.table::fread(): from the data.table package.

  • arrow::read_csv_arrow(): from the arrow package.

Write
  • write.csv(): base R.

  • readr::write_csv(): from the readr package.

  • data.table::fwrite(): from the data.table package.

  • arrow::write_csv_arrow(): from the arrow package.

Binary files

If you look at Wikipedia for a definition of Binary files, you get:

A binary file is a computer file that is not a text file 😜

You’ll also learn that binary files are usually thought of as being a sequence of bytes, and that some binary files contain headers, blocks of metadata used by a computer program to interpret the data in the file. Because they are stored in bytes, they are not human readable unless viewed through specialised viewers.
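To make this concrete, here’s a minimal sketch of peeking at a binary file’s leading bytes from R, assuming the 1,000,000-row parquet file from the course materials is present in data/ (parquet files begin with the 4-byte magic number "PAR1"):

# open a connection to the file in raw binary mode
con <- file(here::here("data", "synthpop_1000000.parquet"), "rb")
readBin(con, what = "raw", n = 4)  # 50 41 52 31: the ASCII bytes for "PAR1"
close(con)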

The process of writing out data to a binary format is called binary serialisation, and different formats can use different serialisation methods.

Let’s look at some binary formats you might consider as an R user.

RData/RDS formats

.RData and .rds files are binary formats specific to R that can be used to store complete R objects, so they are not restricted to tabular data. They can therefore be good options for storing more complicated objects like models. .RData files can store multiple objects while .rds files are designed to contain a single object. Pertinent characteristics of such files:

  • Can be faster to restore the data to R (but not necessarily as fast to write).

  • Can preserve R specific information encoded in the data (e.g., attributes, variable types, etc).

  • Are R specific so not interoperable outside of R environments.

  • In R 3.6, the default serialisation version used to write .RData and .rds binary files changed from 2 to 3. This means that files serialised with version 3 cannot be read by users running R < 3.5.0, which limits interoperability even between R users.

Overall, while good for writing R objects, I would reserve such files for ephemeral intermediate results or for more complex objects where other formats are not appropriate. Be mindful of the serialisation version you use if you want users running R < 3.5.0 to be able to read them.

Relevant functions
Write
  • save(): writes one or more objects to a .RData file.

  • saveRDS(): writes a single object to an .rds file.

Read
  • load(): reads .RData files, restoring the stored objects by name.

  • readRDS(): reads .rds files, returning the stored object.
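A quick sketch of these functions (the model object and file names here are illustrative):

model <- lm(mpg ~ wt, data = mtcars)

# version = 2 keeps the file readable by users on R < 3.5.0
# (since R 3.6, the default serialisation version is 3)
saveRDS(model, file = "model.rds", version = 2)
model <- readRDS("model.rds")

# .RData files can hold multiple objects; load() restores them by name
save(model, mtcars, file = "analysis.RData", version = 2)
load("analysis.RData")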

Apache parquet/arrow

Though they are different file formats, I’ve bundled these two together because they are both Apache Foundation data formats, and we use the same R package (arrow) to read and write them.

  • Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

  • Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.

Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.

The formats, as well as the arrow R package to interact with them, are part of the Apache Arrow software development platform for building high performance applications that process and transport large data sets.

Note

You may have noticed that the files I shared in data/ as part of the course materials were all parquet files. That’s because the compression of parquet files meant I could write a 10,000,000-row table of data to a ~67 MB file (compared to over 1 GB in csv format!), which allowed me to share it through GitHub (and you to download it in a more acceptable time frame!).

Relevant functions
Write
  • arrow::write_parquet(): for writing Apache parquet files.

  • arrow::write_feather(): for writing arrow IPC format files (the arrow IPC format is version 2 of the feather format, hence the function’s somewhat confusing name).

Read
  • arrow::read_parquet(): for reading Apache parquet files.

  • arrow::read_feather(): for reading arrow IPC format files.
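A minimal usage sketch of these four functions (file names illustrative):

# write and read back a parquet file
arrow::write_parquet(mtcars, sink = "mtcars.parquet")
df <- arrow::read_parquet("mtcars.parquet")

# parquet's columnar layout means we can read just the columns we need
df_sub <- arrow::read_parquet("mtcars.parquet", col_select = c("mpg", "wt"))

# write and read back an arrow IPC (feather v2) file
arrow::write_feather(mtcars, sink = "mtcars.arrow")
df2 <- arrow::read_feather("mtcars.arrow")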

fst

The fst package for R is based on a number of C++ libraries and provides a fast, easy and flexible way to serialize data frames into the fst binary format. With access speeds of multiple GB/s, fst is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers.

The fst file format provides full random access to stored datasets allowing retrieval of subsets of both columns and rows from a file. Files are also compressed.

Relevant functions
Write
  • fst::write_fst(): writes data frames to the fst binary format.

Read
  • fst::read_fst(): reads fst files, optionally retrieving only a subset of rows and/or columns.
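A brief sketch of that random access (file name illustrative):

fst::write_fst(mtcars, path = "mtcars.fst")

# retrieve two columns and rows 1-10 without reading the whole file
fst::read_fst("mtcars.fst", columns = c("mpg", "wt"), from = 1, to = 10)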

qs

Package qs provides an interface for quickly saving and reading objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the saveRDS and readRDS functions in R.

saveRDS and readRDS are the standard for serialization of R data, but these functions are not optimized for speed. On the other hand, fst is extremely fast, but only works on data.frames and certain column types.

qs is both extremely fast and general: it can serialize any R object like saveRDS and is just as fast and sometimes faster than fst.

Relevant functions
Write
  • qs::qsave(): serialises any R object to a qs file.

Read
  • qs::qread(): reads qs files back into R.
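Because qs handles arbitrary R objects, a sketch might look like this (object contents illustrative):

# any R object, not just a data frame
obj <- list(fit = lm(mpg ~ wt, data = mtcars), notes = "fitted on mtcars")

qs::qsave(obj, file = "obj.qs")
obj2 <- qs::qread("obj.qs")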

Benchmarks

Now that we’ve discussed a bunch of relevant file formats and the packages used to read and write them, let’s go ahead and test out the comparative performance of reading and writing them, as well as the file sizes of different formats.

Writing data

Let’s start by comparing write efficiency.

Before we start, we’ll need some data to write. So let’s load one of the parquet files from the course materials. Let’s go for the file with 1,000,000 rows. If you want to speed up the testing you can use the file with 100,000 rows by changing the value of n_rows.

n_rows <- 1000000L
data <- arrow::read_parquet(here::here("data", paste0("synthpop_", n_rows, ".parquet")))

Let’s also load dplyr for the pipe and other helpers:
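library(dplyr)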

Let’s now create a directory to write our data to:

out_dir <- here::here("data", "write")
fs::dir_create(out_dir)

To compare each file format and function combination (where appropriate), I’ve written a function that uses the value of the format argument and the switch() function to deploy a different write function/format combination for writing out the data.

write_dataset <- function(data, 
                          format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                     "parquet", "arrow", "rdata", "rds", "fst", "qs"),
                          out_dir, 
                          file_name = paste0("synthpop_", n_rows, "_")) {
    
    
    switch (format,
            ## FLAT FILES ###
            # write csv using base
            csv = write.csv(data, 
                            file = fs::path(out_dir, 
                                            paste0(file_name, format), 
                                            ext = "csv"),
                            row.names = FALSE),
            # write csv using readr
            csv_readr = readr::write_csv(data, 
                                         file = fs::path(
                                             out_dir, 
                                             paste0(file_name, format), 
                                             ext = "csv")),
            # write csv using data.table
            csv_dt = data.table::fwrite(data, 
                                        file = fs::path(
                                            out_dir, 
                                            paste0(file_name, format), 
                                            ext = "csv")),
            # write csv using arrow
            csv_arrow = arrow::write_csv_arrow(data, 
                                               file = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "csv")),
            ## BINARY FILES ###
            # write parquet using arrow
            parquet = arrow::write_parquet(data, sink = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "parquet")),
            # write arrow IPC using arrow
            arrow = arrow::write_feather(data, sink = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "arrow")),
            # write RData using base
            rdata = save(data, file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "RData"),
                         version = 2),
            # write rds using base
            rds = saveRDS(data, file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "rds"),
                          version = 2),
            # write fst using fst
            fst = fst::write_fst(data, path = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "fst")),
            # write qs using qs
            qs = qs::qsave(data, file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "qs"))
            
            
    )
}
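For example, to write the data out as parquet with it:

write_dataset(data, format = "parquet", out_dir = out_dir)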

I’ve also written a function to process the bench::mark() output, removing unnecessary information, arranging the results in ascending order of median time and printing the result as a gt() table.

print_bm <- function(benchmark) {
    benchmark[, c("expression", "min", "result", "memory", "time", "gc")] <- NULL
    benchmark %>%
        arrange(median) %>%
        gt::gt()
}

We’re now ready to run our benchmarks. I’ve set them up as a bench::press() so we can run the same function every time but vary the format argument for each test:

bench::press(
    format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
               "parquet", "arrow", "rdata", "rds", "fst", "qs"),
    {
        bench::mark(write_dataset(data, format = format, out_dir = out_dir))
    }
) %>%
    print_bm()
format     median    itr/sec     mem_alloc  gc/sec     n_itr  n_gc  total_time
arrow       58.73ms  16.9895018   40.38KB   0.0000000      9     0    529.74ms
csv_dt     236.50ms   4.2072572    1.44MB   0.0000000      3     0    713.05ms
fst        294.15ms   3.3996278    3.54MB   0.0000000      2     0    588.30ms
parquet    315.53ms   3.1692389    2.27MB   0.0000000      2     0    631.07ms
csv_arrow  340.96ms   2.9329298    3.34MB   0.0000000      2     0    681.91ms
qs         345.49ms   2.8944557  176.88KB   0.0000000      2     0    690.98ms
csv_readr  517.42ms   1.9326511   66.58MB   1.9326511      1     1    517.42ms
rds          3.12s    0.3208688    8.63KB   0.0000000      1     0      3.12s
rdata        3.12s    0.3201837    8.63KB   0.3201837      1     1      3.12s
csv          3.24s    0.3082182   61.33MB   3.3904006      1    11      3.24s

We see that:

  • The fastest write format, by quite some margin, is the arrow IPC format written with arrow::write_feather().

  • All the arrow package functions are actually quite efficient, all featuring in the top five for speed regardless of format.

  • For csv formats, however, there is a clear winner: data.table::fwrite().

  • Both qs and fst are, as advertised, quite fast, and qs in particular should definitely be considered when needing to store more complex R objects.

  • Base functions write.csv(), save() and saveRDS() are often an order of magnitude or more slower.

Size on disk

Let’s also check how much space each file format takes up on disk:

tibble::tibble(file = basename(fs::dir_ls(out_dir)),
               size = file.size(fs::dir_ls(out_dir))) |>
    arrange(size) |>
    mutate(size = gdata::humanReadable(size,
                                       standard="SI",
                                       digits=1)) |>
    gt::gt()
file                               size
synthpop_1000000_parquet.parquet     7.1 MB
synthpop_1000000_rds.rds            11.9 MB
synthpop_1000000_rdata.RData        11.9 MB
synthpop_1000000_qs.qs              16.2 MB
synthpop_1000000_arrow.arrow        47.8 MB
synthpop_1000000_fst.fst            48.4 MB
synthpop_1000000_csv_dt.csv        106.4 MB
synthpop_1000000_csv_readr.csv     110.8 MB
synthpop_1000000_csv.csv           120.4 MB
synthpop_1000000_csv_arrow.csv     120.8 MB

It’s clear that binary formats take up a lot less space on disk than csv text files. At the extremes, parquet files take up over 17 times less space than a csv file written out with write.csv() or arrow::write_csv_arrow().

Reading data

Let’s now use the files we created to test how efficient the different formats and functions are at reading data back in.

Just like I did before with write_dataset(), I’ve written a function to read the appropriate file using the appropriate function according to the value of the format argument:

read_dataset <- function(format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
                                    "parquet", "arrow", "rdata", "rds", "fst", "qs"),
                         out_dir,
                         file_name = paste0("synthpop_", n_rows, "_")) {
    
    
    switch (format,
            ## FLAT FILES ###
            # read csv using base
            csv = read.csv(file = fs::path(out_dir, 
                                            paste0(file_name, format), 
                                            ext = "csv")),
            # read csv using readr
            csv_readr = readr::read_csv(file = fs::path(
                                             out_dir, 
                                             paste0(file_name, format), 
                                             ext = "csv")),
            # read csv using data.table
            csv_dt = data.table::fread(file = fs::path(
                                            out_dir, 
                                            paste0(file_name, format), 
                                            ext = "csv")),
            # read csv using arrow
            csv_arrow = arrow::read_csv_arrow(file = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "csv")),
            ## BINARY FILES ###
            # read parquet using arrow
            parquet = arrow::read_parquet(file = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "parquet")),
            # read arrow using arrow
            arrow = arrow::read_feather(file = fs::path(
                                                   out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "arrow")),
            # read RData using base
            rdata = load(file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "RData")),
            # read rds using base
            rds = readRDS(file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "rds")),
            # read fst using fst
            fst = fst::read_fst(path = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "fst")),
            # read qs using qs
            qs = qs::qread(file = fs::path(out_dir, 
                                                   paste0(file_name, format), 
                                                   ext = "qs"))
            
            
    )
}

And again, I’ve set up our benchmarks as a bench::press() so we can run the same function every time but vary the format argument for each test:

Let’s see how fast our format/function combos are at reading!

bench::press(
    format = c("csv", "csv_readr", "csv_dt", "csv_arrow",
            "parquet", "arrow", "rdata", "rds", "fst", "qs"),
    {
    bench::mark(
        read_dataset(format = format, out_dir = out_dir),
        relative = FALSE)
    }
) %>%
    print_bm()
format     median    itr/sec     mem_alloc  gc/sec     n_itr  n_gc  total_time
arrow       10.56ms  94.3128761   12.0MB    6.7366340     42     3    445.33ms
parquet     30.57ms  33.0283989   11.5MB    2.2018933     15     1    454.15ms
csv_arrow   56.60ms  17.5962129   25.0MB    2.5137447      7     1    397.81ms
fst        232.94ms   4.2929415   76.3MB    2.1464707      2     1    465.88ms
csv_dt     309.57ms   3.2302349   97.7MB    3.2302349      1     1    309.57ms
qs         342.91ms   2.9162241   76.3MB    2.9162241      2     2    685.82ms
csv_readr  624.16ms   1.6021469   91.0MB    3.2042939      1     2    624.16ms
rdata        1.08s    0.9246915   76.3MB    0.9246915      1     1      1.08s
rds          1.10s    0.9105128   76.3MB    0.9105128      1     1      1.10s
csv          1.96s    0.5095255  378.9MB    2.5476277      1     5      1.96s

Results of our experiments show that:

  • The arrow format using arrow::read_feather() is again the fastest.

  • Again, the arrow functions are the fastest for reading regardless of format, occupying the top three spots.

  • data.table::fread() is again very competitive for reading CSVs.

  • qs is also highly performant, and a good package to know given it can be used for more complex objects.

  • Base functions for reading files, whether binary or CSV, are again the slowest by quite some margin.

  • It should be noted that both readr::read_csv() and read.csv() can be made much faster by pre-specifying the data type of each column when reading, as sketched below.
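For instance, a sketch of pre-specifying column types (the column names and types here are illustrative, not those of the synthpop data):

# readr: supply a column specification up front
readr::read_csv(
    "data.csv",
    col_types = readr::cols(
        id = readr::col_integer(),
        name = readr::col_character(),
        age = readr::col_double()
    )
)

# base R equivalent using colClasses
read.csv("data.csv", colClasses = c(id = "integer", name = "character", age = "numeric"))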

Takeaways
  • The arrow package offers some of the fastest functions for writing both flat (e.g. CSV) and binary files like parquet and arrow.

  • The arrow format is especially fast to read and write.

  • Functions from the data.table package are also solid contenders for reading and writing CSV files.

  • Functions in package qs are also quite performant, especially given they can read and write more complex R objects.

  • Binary files are the most disk space efficient, particularly the parquet file format.