---
layout: default
title: Working with big data in R and Python
---
```{r setup, include=FALSE}
options(replace.assign=TRUE, width=65)
knitr::opts_chunk$set(eval = FALSE, message = FALSE) ## turned off message() output
library(DBI)
SLOWEVAL <- TRUE
```
This section aims to provide an overview of working with large datasets in R and (to a lesser extent) Python. Given the range of topics covered, it is not meant to be a detailed treatment of any one of them.
We'll start with a refresher on data frames in R and Python and some discussion of the *dplyr* package, whose standard operations are similar to using SQL syntax. Note that what is referred to as split-apply-combine functionality in dplyr in R and in pandas in Python is the same concept as the use of SQL's GROUP BY combined with aggregation operations such as MIN, MAX, AVG, COUNT.
The CSV files for the 2016 Stack Overflow data and the space-delimited files for the Wikipedia traffic data used in the examples below can be obtained [here](http://www.stat.berkeley.edu/share/paciorek/tutorial-databases-data.zip).
## 1 Data frames in R and Python
### 1.1 Data frames in R
A data frame in R is essentially the same as a table in SQL. The notion of a data frame has been essential to the success of R and its existence inspired Python's Pandas package.
R's data frames are stored in memory, but there are now packages (such as dplyr with an SQL backend, `arrow`, `SparkR` and `h2o`) that allow you to treat an external data source as if it were an actual R data frame, using familiar syntax to operate on the data frame.
This tutorial assumes you're familiar with basic data frame functionality in R or Python, so I won't go into more details here.
dplyr, which will be discussed later, allows you to operate on data frames using functionality that is similar to SQL, in particular selecting columns, filtering rows, aggregation operations on subsets, and joining multiple data frames.
But base R syntax can be used for all of these operations too. Here's the base R syntax corresponding to SQL's SELECT, WHERE, GROUP BY, and JOIN functionality.
```{r}
users <- read.csv(file.path('data', 'users-2016.csv'))
questions <- read.csv(file.path('data', 'questions-2016.csv'))
users[ , c('userid', 'upvotes')] # select columns
users[users$upvotes > 10000, ] # filter by row (i.e., SQL WHERE)
aggregate(upvotes ~ age, data = users, FUN = median) # group by (i.e., aggregation)
joined <- merge(users, questions, by.x = 'userid', by.y = 'ownerid',
                all.x = FALSE, all.y = FALSE) # inner join
```
### 1.2 Using SQL syntax with R data frames: `sqldf`
The *sqldf* package provides the ability to use SQL queries on R data frames (via `sqldf`) and on-the-fly when reading from CSV files (via `read.csv.sql`). The latter can help you avoid reading in the entire dataset into memory in R if you just need a subset of it.
The basic sequence of operations that happens is that the data frame (if using `sqldf`) or the file (if using `read.csv.sql`) is read temporarily into a database and then the requested query is performed on the database, returning the result as a regular R data frame. So you might find things to be a bit slow because of the time involved in creating the database.
The following illustrates usage, but the `read.csv.sql` part won't actually work on this particular input file, because sqldf treats quotes as part of the text rather than as delimiting fields. (The Stack Overflow CSVs quote their fields because some fields contain commas.)
```{r}
library(sqldf)
## sqldf
users <- read.csv(file.path('data','users-2016.csv'))
oldUsers <- sqldf("select * from users where age > 75")
## read.csv.sql with data read into an in-memory database
oldUsers <- read.csv.sql(file.path('data', 'users-2016.csv'),
                         sql = "select * from file where age > 75",
                         dbname = NULL, header = TRUE)
## read.csv.sql with data read into temporary database on disk
oldUsers <- read.csv.sql(file.path('data', 'users-2016.csv'),
                         sql = "select * from file where age > 75",
                         dbname = tempfile(), header = TRUE)
```
Note that with `read.csv.sql` only the result of the query is returned to R, so if you use an on-disk temporary database (`dbname = tempfile()`), you avoid ever holding the full dataset in R's memory.
### 1.3 Data frames in Python
The Pandas package has nice functionality for doing dataset manipulations akin to SQL queries including group by/aggregation operations, using a data structure called a DataFrame inspired by R's data frames. Furthermore, Pandas was designed from the start for computational efficiency, in contrast to standard data frames in R (but see below for newer R functionality that is much more efficient).
Here are some examples:
```{python}
import pandas as pd
import os
users = pd.read_csv(os.path.join('data', 'users-2016.csv'))
questions = pd.read_csv(os.path.join('data', 'questions-2016.csv'))
type(users)
users[['userid', 'upvotes']] # select columns
users[users.upvotes > 10000] # filter by row (i.e., sql WHERE)
# group by (i.e., aggregation)
users.groupby('age')['upvotes'].agg(med='median', avg='mean')
joined = pd.merge(users, questions, how='inner', left_on='userid',
                  right_on='ownerid')
```
[Polars](https://docs.pola.rs/user-guide) is a newer dataframe package that provides a Python interface and is designed to be fast.
Here are some examples:
```{python}
import polars as pl
import os
users = pl.read_csv(os.path.join('data', 'users-2016.csv'))
tags = pl.read_csv(os.path.join('data', 'questions_tags-2016.csv'))
questions = pl.read_csv(os.path.join('data', 'questions-2016.csv'))
type(users)
users.select(pl.col('userid','upvotes')) # select columns
users.filter(pl.col('upvotes') > 10000) # filter by row (i.e., sql WHERE)
# group by (i.e., aggregation)
tags.group_by('tag').agg(
    pl.col('*').count().alias('n'))
tags.group_by('tag').agg(
    pl.col('*').count().alias('n')).sort('n', descending=True).head(8)
## The users' userid was read in as an integer but the questions' ownerid as a string,
## so re-read users with userid as a string before joining.
users.schema
users = pl.read_csv(os.path.join('data', 'users-2016.csv'), dtypes = {'userid': str})
joined = users.join(questions, how='inner', left_on='userid',
                    right_on='ownerid')
```
Here's a [further example](./polars-vs-pandas.py) that compares Pandas and Polars, building off of [this blog post](https://www.r-bloggers.com/2024/02/why-pandas-feels-clunky-when-coming-from-r/?utm_source=phpList&utm_medium=email&utm_campaign=R-bloggers-daily&utm_content=HTML#google_vignette) comparing R's dplyr to Pandas. As illustrated in the example and above, I think the Polars interface (API) is easier to read and use than that of Pandas.
### 1.4 Distributed data frames in Dask in Python
The Dask package provides the ability to divide data frames across multiple workers (and across nodes), allowing one to handle very large datasets, as discussed in [this tutorial](https://berkeley-scf.github.io/tutorial-dask-future/python-dask#4-dask-distributed-datastructures-and-automatic-parallel-operations-on-them).
## 2 dplyr in R
### 2.1 Overview
dplyr is part of the [tidyverse](http://tidyverse.org/), a set of R packages spearheaded by Hadley Wickham. You can think of dplyr as providing the functionality of SQL (selecting columns, filtering rows, transforming columns, aggregation, and joins) on R data frames using a clean syntax that is easier to use than base R operations.
There's lots to dplyr, but here we'll just illustrate the basic operations by analogy with SQL.
Here we'll read the data in and do some basic subsetting. In reading the data in we'll use another part of the tidyverse: the `readr` package, which provides `read_csv` as a faster version of `read.csv`.
```{r, dplyr1, eval = TRUE}
library(dplyr)
users <- readr::read_csv(file.path('data', 'users-2016.csv'))
result <- select(users, userid, displayname) # select columns
dim(result)
result <- filter(users, age > 75) # filter by row (i.e., SQL WHERE)
dim(result)
```
### 2.2 Piping
dplyr is often combined with piping, which allows you to build up a sequence of operations (from left to right), as if you were using UNIX pipes or reading a series of instructions. Here's a very simple example where we combine column selection and filtering in a readable way:
```{r, dplyr2, eval = TRUE}
result <- users %>% select(displayname, userid, age) %>% filter(age > 75)
## Or using the new pipe operator from base R:
result <- users |> select(displayname, userid, age) |> filter(age > 75)
```
What happens here is that the operations are run from left to right (except for the assignment into `result`), and the result of the left-hand side of a `%>%` is passed to the function on the right-hand side as its first argument. So this one-liner is equivalent to:
```{r, dplyr3, eval = TRUE}
tmp <- select(users, displayname, userid, age)
result2 <- filter(tmp, age > 75)
identical(result, result2)
```
and also equivalent to:
```{r, dplyr4, eval = TRUE}
result3 <- filter(select(users, displayname, userid, age), age > 75)
identical(result, result3)
```
We'll use pipes in the remainder of the dplyr examples.
### 2.3 Functionality
Here's how one can do stratified analysis with aggregation operations. In the dplyr world, this is known as split-apply-combine but in the SQL world this is just a GROUP BY with some aggregation operation.
```{r, dplyr5, eval = TRUE}
medianVotes <- users %>% group_by(age) %>% summarize(
    median_upvotes = median(upvotes),
    median_downvotes = median(downvotes))
head(medianVotes)
```
You can also create new columns, sort, and do joins, as illustrated here:
```{r, dplyr6, eval = TRUE}
## create new columns
users2 <- users %>% mutate(year = substring(creationdate, 1, 4),
                           month = substring(creationdate, 6, 7))
## sorting (here by age, then by upvotes in descending (non-default) order)
users2 <- users %>% arrange(age, desc(upvotes))
## joins
questions <- readr::read_csv(file.path('data', 'questions-2016.csv'))
questionsOfAge <- users %>% filter(age > 75) %>%
    inner_join(questions, by = c("userid" = "ownerid"))
head(questionsOfAge)
```
> **Challenge**: Why did I first filter and then do the join, rather than the reverse?
The join functions include `inner_join`, `left_join`, `right_join`, and `full_join` (recent versions of dplyr also provide `cross_join`).
In addition to operating directly on data frames, dplyr can also operate on databases and data.table objects as the back-end storage, as we'll see next.
### 2.4 Cautionary notes
Note that dplyr and other packages in the tidyverse use a modified form of data frame (the tibble). In some cases you may want to convert back to a standard data frame using `as.data.frame`. For example:
```{r, dplyr7, eval=TRUE}
as.data.frame(head(questionsOfAge, 3))
```
Note that dplyr and other tidyverse packages make heavy use of "non-standard evaluation". The thing to pay attention to here is that column names are not quoted, which means you cannot use a variable to stand in for a column. So the following wouldn't work, because dplyr would literally look for a column named "colname" in the data frame. There is a [system for addressing this](https://dplyr.tidyverse.org/articles/programming.html); a small example of one workaround follows the code below, but I won't go into it further here.
```{r, dplyr8, eval=FALSE}
## this won't work because of non-standard evaluation!
myfun <- function(df, colname)
    select(df, colname)
myfun(users, 'age')
```
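If you do need to write a function like this, one option (a minimal sketch; see the programming vignette linked above for the full story) is to pass the column name as a string and select it with the tidyselect helper `all_of`:
```{r, eval=FALSE}
## pass the column name as a string and select it with all_of()
myfun2 <- function(df, colname)
    select(df, all_of(colname))
myfun2(users, 'age')
```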
### 2.5 dplyr with SQL and databases
We can connect to an SQLite or Postgres database and then query it using dplyr syntax:
```{r, eval=TRUE}
library(RSQLite)
drv <- dbDriver("SQLite")
db <- dbConnect(drv, dbname = file.path('data', 'stackoverflow-2016.db'))
users <- tbl(db, 'users')
highVotes <- users %>% filter(upvotes > 10000)
head(highVotes)
```
> **Note**: dplyr uses lazy evaluation when interfacing with databases -- it only does the query and return results when the results are needed (in this case when we call `head`).
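
Because of that lazy evaluation, `head(highVotes)` only retrieves a few rows. When you want the full result in R as a regular data frame, you can request it explicitly with `collect` (a minimal sketch):
```{r, eval=FALSE}
## force the query to run and bring the full result into R
highVotesDF <- highVotes %>% collect()
dim(highVotesDF)
dbDisconnect(db)
```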
## 3 Manipulating datasets quickly in memory
### 3.1 `data.table` in R
The *data.table* package provides a lot of functionality for fast manipulation of datasets in memory. data.table can do the standard SQL operations such as indexing, merges/joins, assignment, grouping, etc. Plus data.table objects are data frames (i.e., they inherit from data frames) so they are compatible with R code that uses data frames.
If you've got enough memory, data.table can be effective with pretty large datasets (e.g., 10s of gigabytes).
To illustrate without the example taking too long, we'll only read in a subset of the Wikipedia webtraffic data.
Let's read in the dataset, specifying the column classes so that fread() doesn't have to detect what they are (which will take additional time and might cause errors). Note that we can read directly from a UNIX operation piped into R.
```{r}
library(data.table)
colClasses <- c('numeric', 'numeric', 'character',
                'character', 'numeric', 'numeric')
colNames <- c('date', 'hour', 'site', 'page', 'count', 'size')
system.time(wikiDT <- fread('gzip -cd data/part-0000?.gz',
    col.names = colNames, colClasses = colClasses, header = FALSE,
    quote = ""))
## 30 sec. for 300 MB zipped
```
Now let's do some basic subsetting. We'll see that setting a key (equivalent to setting an index in SQL) can improve lookup speed dramatically.
```{r}
## without a key (i.e., index)
system.time(sub <- subset(wikiDT, count == 512)) # .27 sec.
system.time(setkey(wikiDT, count, size)) # 3 sec.
## with a key (i.e., index)
system.time(sub2 <- wikiDT[.(512), ]) # essentially instantaneous
```
data.table has a lot of functionality and can be used to do a variety of sophisticated queries and manipulations (including aggregation operations), but it has its own somewhat involved syntax and concepts. The above just scratches the surface of what you can do with it.
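For instance, here's a small sketch of data.table's `DT[i, j, by]` form for a grouped aggregation, using the `site` and `count` columns read in above:
```{r, eval=FALSE}
## total hits and number of rows per site, sorted by total hits
agg <- wikiDT[ , .(totalHits = sum(count), n = .N), by = site]
head(agg[order(-totalHits)])
```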
### 3.2 Using dplyr syntax with data.table in R
Rather than learning data.table's own syntax, one can use dplyr syntax directly with data.table objects, illustrated here with our existing `wikiDT` data table.
```{r}
system.time(sub <- wikiDT %>% filter(count == 512))
```
One can also use `dtplyr` to use a data.table as the back end for dplyr manipulations. Using `lazy_dt` allows dtplyr to do some optimization as it generates the translation from dplyr syntax to data.table syntax, though this simple example doesn't illustrate the usefulness of that.
```{r}
library(dtplyr)
wikiDT2 <- lazy_dt(wikiDT)
system.time(sub <- wikiDT2 %>% filter(count == 512)) # 0.1 sec.
```
Finally, the `tidytable` package also allows you to use dplyr syntax, as well as other tidyverse syntax such as `tidyr` functions, with data.table as the back end.
### 3.3 Polars dataframes in Python
As mentioned earlier, [Polars](https://docs.pola.rs/user-guide) is a newer dataframe package that provides a Python interface, operates in memory, and is designed to be fast. It uses the Arrow columnar format. It also provides a lazy execution model like Spark or Dask that allows for automatic optimization of queries.
### 3.4 DuckDB
With DuckDB, you can run queries against existing R and Python data frames, collections of files in the Parquet file format and other file formats, and Arrow objects, without having to copy the data or import it into an actual database.
In R, use the `duckdb_register` and `duckdb_register_arrow` functions to 'register' the data frame or Arrow data source.
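Here's a minimal sketch of the registration approach in R, assuming the Stack Overflow `users` data frame from earlier (the virtual table name `users_df` is just a name I chose):
```{r, eval=FALSE}
library(duckdb)
con <- dbConnect(duckdb::duckdb())          # in-memory DuckDB instance
users <- read.csv(file.path('data', 'users-2016.csv'))
duckdb_register(con, "users_df", users)     # expose the data frame without copying it
dbGetQuery(con, "select age, count(*) as n from users_df group by age order by n desc limit 5")
dbDisconnect(con, shutdown = TRUE)
```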
For Python, see the [example syntax in the DuckDB documentation](https://duckdb.org/docs/api/python/data_ingestion#dataframes-arrow-tables) to query Pandas and Polars data frames and Arrow objects.
Alternatively, you can read the data from files on disk into a DuckDB database table and then run queries against the database you've created.
```{python}
import duckdb
data = duckdb.read_parquet("data/*.parquet")
data.to_table("wikistats")
duckdb.sql("select * from wikistats limit 5")
duckdb.sql("select count(*) from wikistats")
```
## 4 Working with large datasets on disk
There are a variety of packages in R and Python that allow you to work with very large datasets on disk without loading them fully into memory. Some of these are also very good at compressing files to reduce disk storage.
I recommend first considering Arrow as it works well with the usual dataframe manipulations, but the other packages mentioned here may also be useful.
### 4.1 Arrow
Apache Arrow provides efficient data structures for working with data in memory, usable in R via the `arrow` package and the `PyArrow` package in Python. Data are stored by column, with values in a column stored sequentially and in such a way that one can access a specific value without reading the other values in the column (O(1) lookup).
In general Arrow will only read data from disk as needed, avoiding keeping the entire dataset in memory (how much has to be read depends on the file format, with the native arrow format best in this regard), which can reduce I/O and memory usage.
You can use Apache Arrow to read and write from datasets stored as one or (often) more files in various formats, including:
- parquet: a space-efficient, standard format;
- arrow format: data are stored in the same format on disk (called the 'feather' format) as in memory, improving I/O speed; and
- text/csv files.
See this [very useful discussion](https://stackoverflow.com/questions/56472727/difference-between-apache-parquet-and-arrow) of file formats, comparing the parquet and arrow formats. And note that if you're going to be reading the data frequently off disk, storing the files in text/CSV is not a good idea as it will be much faster to read from the Parquet or Arrow formats.
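For instance, if you'll be re-reading a dataset many times, a one-time conversion to Parquet can pay off. Here's a sketch using the arrow package (the output file name is just an example):
```{r, eval=FALSE}
library(arrow)
## one-time conversion: read a CSV and store it as Parquet for faster repeated reads
questions <- read_csv_arrow(file.path('data', 'questions-2016.csv'))
write_parquet(questions, file.path('data', 'questions-2016.parquet'))
```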
Here's a bit of what you can do with the *PyArrow* package for Python, illustrating working with data from a collection of Parquet files.
```{python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
tbl = pq.read_table("data/parquet")
## give the columns meaningful names
tbl = tbl.rename_columns(["date","hour","lang","site","hits","size"])
df = tbl.filter(ds.field('hits') > 10000).to_pandas()
```
For R, there's a [nice vignette](https://cran.r-project.org/web/packages/arrow/vignettes/dataset.html) covering basic usage, which involves dplyr syntax. Here's an example of reading from a single file; the vignette's `open_dataset` approach handles multi-file datasets (see the sketch after the code).
```{r}
library(arrow)
tbl <- read_parquet("data/parquet/part-00000.parquet")
tbl <- read_feather("data/arrow/part-00000.arrow")
```
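For multi-file datasets, `open_dataset` lets you point at a directory of files and then use dplyr verbs, with the work pushed down to the files. A sketch, assuming the Parquet files have a `hits` column:
```{r, eval=FALSE}
library(arrow)
library(dplyr)
dset <- open_dataset("data/parquet", format = "parquet")
## only the needed columns/row groups are read; collect() materializes the result in R
result <- dset %>% filter(hits > 10000) %>% collect()
```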
### 4.2 DuckDB
With DuckDB, for some file formats (e.g., CSV, Parquet) you can run queries against files on disk without reading the entire dataset into memory.
Here's an example in Python.
```{python}
import duckdb
duckdb.sql("select * from 'data/*.parquet' limit 5")
duckdb.sql("select count(*) from 'data/*.parquet'")
df = duckdb.sql("select * from 'data/*.parquet' limit 5").to_df()
```
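The equivalent queries can be run from R through the DBI interface (a sketch, assuming the same Parquet files):
```{r, eval=FALSE}
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbGetQuery(con, "select * from 'data/*.parquet' limit 5")
dbGetQuery(con, "select count(*) from 'data/*.parquet'")
dbDisconnect(con, shutdown = TRUE)
```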
### 4.3 fst
The *fst* package for R provides the ability to quickly read and write data frames in parallel from data stored on disk in the efficient fst format. A key feature in terms of reducing memory use is that data can be quickly accessed by column or by row (O(1) lookup), allowing one to easily subset the data when reading, rather than reading the entire dataset into memory, which is what would otherwise happen.
Here's an example, starting by reading data (some of the Wikipedia traffic data, 2.3 GB of data) into an initial data frame in R (which might defeat the purpose if the dataset size is too big to fit in memory).
```{r, fst-write, eval=FALSE}
library(fst)
## read in Wikistats data
wikiDF <- readr::read_table(file = pipe("gzip -cd data/part-0000?.gz"),
                            col_names = c('day','hour','language','site','hits','size'),
                            col_types = c('nnccnn'))
system.time(write_fst(wikiDF, file.path('/tmp', 'data.fst'))) ## 8.9 seconds
```
The size of the compressed file is 790 MB based on the default compression, but one can choose different compression levels.
```{r, fst-read}
system.time(wikiDF <- read_fst(file.path('/tmp','data.fst'))) ## 8.4 seconds
```
The 8 seconds to read the data compares to 55 seconds to read the data from the gzipped files via a connection using `readr::read_table` and 29 seconds via `data.table::fread`.
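To take advantage of fst's subsetting, you can also read just some columns or a range of rows rather than the whole file; here's a sketch using the column names from the write step above:
```{r, eval=FALSE}
## read only two columns and the first million rows
sub <- read_fst(file.path('/tmp', 'data.fst'), columns = c('site', 'hits'),
                from = 1, to = 1000000)
```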
### 4.4 Additional packages in R (ff, LaF, bigmemory)
#### 4.4.1 ff
ff stores datasets in columnar format, with one file per column, on disk, so is not limited by memory (with the caveat below). It then provides fast access to the dataset from R.
To create the disk-based ff dataset, you'll need to first read in the data from its original home. Note the arguments are similar to those for `read.table` and `read.csv`. `read.table.ffdf` reads the data in chunks.
```{r, eval = FALSE}
library(ff)
colClasses <- c('numeric','numeric','character', 'character','numeric','numeric')
colClasses[colClasses == 'character'] <- 'factor' # 'character' not allowed in ff
## read in Wikistats data; this will take a while.
wikiff <- read.table.ffdf(file = pipe("gzip -cd data/part-0000?.gz"),
                          colClasses = colClasses, sep = ' ')
```
Now, one can save the ff dataset into permanent storage on disk that can be much more quickly loaded than the original reading of the data above.
```{r, eval = FALSE}
system.time(ffsave(wikiff, file = 'wikistats')) ## 80 sec.
rm(wikiff)
```
Here's how one loads the dataset back in.
```{r, eval = FALSE}
system.time(ffload('wikistats')) ## 20 sec.
```
In the above operations, we used `ffsave` to write a copy of the data in ff's binary format, which `ffload` can read back into R much more quickly than the original reading of the text files. Also note the reduced size of the binary file compared to the original input. It's good to be aware of where the binary ff file is stored, since for large datasets it will itself be large; with ff (bigmemory handles this differently, I believe) it appears to be stored in an R temporary directory under `/tmp`. As we work with large files we need to be more aware of the filesystem, in this case making sure that `/tmp` has enough space.
To use ff effectively, you want to use functions designed to manipulate ff objects; otherwise R will convert the ff dataset into a standard data frame and defeat the purpose as this will put the entire dataset in memory.
You can look at the ff and ffbase packages to see what functions are available, using `library(help = ff)` and `library(help = ffbase)`. Notice that there is a `merge.ff` function for joins. Here we use the ff-specific table function:
```{r, eval = FALSE}
table.ff(wikiff$hour)
```
##### Miscellanea
Note that a copy of an ff object appears to be a shallow copy: if you modify the copy it will change the data in the original object.
Note that `ff` stores factor levels *in memory*, so having many factor levels can be a limitation. Furthermore, character columns are not allowed, so one is forced to use factors. Thus with textual data or the like, one can easily run into this limitation. With the Wikistats data this is a big problem, since the `page` column has a huge number of unique values.
Also, I've encountered problems when there are more than about 1000 columns because each column is a separate file and there can be limitations in R on how many files it has open at once.
#### 4.4.2 LaF package
The LaF package is designed to quickly read in data from CSV and FWF (fixed-width format) input files, efficiently handling cases where you only want some of the rows or columns. It requires unzipped text files as input, so one can't unzip input files on the fly via piping.
```{r, laF, eval=FALSE}
library(LaF)
colClasses <- c('numeric','numeric','character', 'character','numeric','numeric')
colNames <- c('date', 'hour', 'site', 'page', 'count', 'size')
## read in Wikistats data
datLaF <- laf_open_csv(file.path('data', 'part-00000.txt'), sep = ' ',
                       column_types = colClasses, column_names = colNames) ## returns immediately
sub <- datLaF[datLaF$count[] == 635, ]
```
If you run this, you'll see that `laf_open_csv` takes essentially no time, indicating that LaF uses lazy evaluation.
#### 4.4.3 bigmemory for matrices
`bigmemory` is similar to ff in providing the ability to load datasets into R without having
them in memory, but rather stored in clever ways on disk that allow for fast access.
bigmemory provides a `big.matrix` class, so it appears to be
limited to datasets with a single type for all the variables. However,
one nice feature is that one can use `big.matrix` objects with foreach
(one of R's parallelization tools) without
passing a copy of the matrix to each worker. Rather the workers can
access the matrix stored on disk.
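Here's a minimal sketch of creating a file-backed `big.matrix`; the file names and the assumption of an all-numeric CSV are mine, not part of the original example:
```{r, eval=FALSE}
library(bigmemory)
## create a file-backed big.matrix from an all-numeric CSV (hypothetical file)
x <- read.big.matrix('data/numeric-data.csv', header = TRUE, type = 'double',
                     backingfile = 'numeric-data.bin',
                     descriptorfile = 'numeric-data.desc')
x[1:5, ]   # rows are pulled from the file-backed matrix as accessed
```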
The `biglm` package provides the ability to fit linear models and GLMs to big datasets, with
integration with ff and bigmemory.
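As a sketch of how biglm's chunked fitting works (here reading chunks manually from the Stack Overflow users file, with a formula chosen just for illustration):
```{r, eval=FALSE}
library(biglm)
con <- file(file.path('data', 'users-2016.csv'), open = 'r')
chunk <- read.csv(con, nrows = 100000, header = TRUE)
cols <- names(chunk)
fit <- biglm(upvotes ~ age, data = na.omit(chunk[ , c('upvotes', 'age')]))
chunk <- read.csv(con, nrows = 100000, header = FALSE, col.names = cols)
fit <- update(fit, na.omit(chunk[ , c('upvotes', 'age')]))   # fold in the next chunk
close(con)
summary(fit)
```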
### 4.5 Strategies in Python
Python provides a variety of packages and approaches you can use
to avoid reading large datasets fully into memory. Here is a brief
overview of a few approaches:
- Use the [Dask package](https://berkeley-scf.github.io/tutorial-dask-future/python-dask#4-dask-distributed-datastructures-and-automatic-parallel-operations-on-them) to break up datasets into chunks. Dask processes the data in chunks, so one often doesn't need a lot of memory, even just on one machine.
- Use `numpy.load` with the `mmap_mode` argument to access a numpy array (stored in a .npy file) on disk via memory mapping, reading only the pieces of the array that you need into memory, as discussed [here](https://numpy.org/doc/stable/reference/generated/numpy.load.html).
See [here](https://pythonspeed.com/articles/mmap-vs-zarr-hdf5) for more discussion of accessing data on disk from Python.
### 4.6 Online (batch) processing of data in R and Python
Another approach is to manually process the data in batches, reading in only a chunk of data that fits in memory, doing some computation (or writing back out to disk), and then reading in the next chunk. When taking this approach, you want to ensure that the code you are using can skip directly to the point in the file where the next chunk starts (i.e., random access) rather than reading all the data up to that point and simply discarding it.
Not surprisingly there is a ton more functionality than shown below (in both Python and R) for reading chunks from files as well as skipping ahead in a file via a file connection or stream.
#### 4.6.1 Online processing in R
In R, various input functions can read in a subset of a file or can skip ahead. In general, the critical step is to use a *connection* rather than passing the file name directly, as this allows one to read the data efficiently in chunks.
I've put these in separate chunks as a reminder that, for more accurate time comparisons, they should be run in separate R sessions, since there are some caching effects (though it's surprising that closing R has an effect, as I would think the file would be cached by the OS regardless).
First we'll see that skipping ahead when not using a connection is costly -- R needs to read all the earlier rows before getting to the data of interest:
```{r}
fn <- file.path('data', 'questions-2016.csv')
system.time(dat1 <- read.csv(fn, nrows = 100000, header = TRUE)) # 0.3 sec.
system.time(dat2 <- read.csv(fn, nrows = 100000, skip = 100001, header = FALSE)) # 0.5 sec.
system.time(dat3 <- read.csv(fn, nrows = 1, skip = 100001, header = FALSE)) # 0.15 sec.
system.time(dat4 <- read.csv(fn, nrows = 100000, skip = 1000001, header = FALSE)) # 3.7 sec.
```
If we use a connection, this cost is avoided (although there is still a cost to skipping ahead compared to reading in chunks, picking up where the last chunk left off):
```{r}
fn <- file.path('data', 'questions-2016.csv')
con <- file(fn, open = 'r')
system.time(dat1c <- read.csv(con, nrows = 100000, header = TRUE)) # 0.3 sec.
system.time(dat2c <- read.csv(con, nrows = 100000, header = FALSE)) # 0.3 sec.
system.time(dat3c <- read.csv(con, nrows = 1, header = FALSE)) # .001 sec.
system.time(dat5c <- read.csv(con, nrows = 1, skip = 100000, header = FALSE)) # .15 sec
```
You can use `gzfile`, `bzfile`, `url`, and `pipe` to open connections to zipped files, files on the internet, and inputs processed through UNIX-style piping.
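Putting this together, here's a sketch of a full batch-processing loop over a CSV using a connection (assuming no fields contain embedded newlines):
```{r, eval=FALSE}
con <- file(file.path('data', 'questions-2016.csv'), open = 'r')
header <- read.csv(con, nrows = 1, header = TRUE)   # grab the column names (consumes one data row)
repeat {
    chunk <- readLines(con, n = 100000)             # next batch of raw lines
    if (length(chunk) == 0) break
    dat <- read.csv(text = chunk, header = FALSE, col.names = names(header))
    ## ... process 'dat' here, e.g., accumulate summaries or write results out ...
}
close(con)
```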
`read_csv` is generally somewhat faster and seems able to skip ahead efficiently even though it is not using a connection (which surprises me, since with a CSV file you don't know in advance how long each line is, so one would think each line has to be processed in some fashion).
```{r}
library(readr)
fn <- file.path('data', 'questions-2016.csv')
system.time(dat1r <- read_csv(fn, n_max = 100000, col_names = TRUE)) # 0.4 sec.
system.time(dat2r <- read_csv(fn, n_max = 100000, skip = 100001, col_names = FALSE)) # 0.13 sec
system.time(dat3r <- read_csv(fn, n_max = 1, skip = 200001, col_names = FALSE)) # 0.07 sec
system.time(dat4r <- read_csv(fn, n_max = 100000, skip = 1000001, col_names = FALSE)) # 0.18 sec
```
Note that `read_csv` can handle zipped inputs, but does not handle a standard text file connection.
#### 4.6.2 Online processing in Python
Pandas' `read_csv` has similar functionality in terms of reading a fixed number of rows and skipping rows, and it can decompress zipped files on the fly.
```{python}
import pandas as pd
import timeit
import os
fn = os.path.join('data', 'users-2016.csv')
## here's the approach I'd recommend, as it's what 'chunksize' is intended for
start_time = timeit.default_timer()
chunks = pd.read_csv(fn, chunksize = 100000, header = 0) # 0.003 sec.
elapsed = timeit.default_timer() - start_time
elapsed
type(chunks)
## read first chunk
start_time = timeit.default_timer()
dat1c = chunks.get_chunk()
elapsed = timeit.default_timer() - start_time
elapsed # 0.2 sec.
## read next chunk
start_time = timeit.default_timer()
dat2c = chunks.get_chunk() # 0.25 sec.
elapsed = timeit.default_timer() - start_time
elapsed # 0.2 sec.
## this also works but is less elegant
start_time = timeit.default_timer()
dat1 = pd.read_csv(fn, header = 0, nrows = 100000)
elapsed = timeit.default_timer() - start_time
elapsed # 0.3 sec.
start_time = timeit.default_timer()
dat2 = pd.read_csv(fn, nrows = 100000, header = None, skiprows=100001)
elapsed = timeit.default_timer() - start_time
elapsed # 0.3 sec.
```