---
layout: default
title: SQL
---
```{r setup, include=FALSE}
options(replace.assign=TRUE, width=65)
knitr::opts_chunk$set(eval = FALSE, message = FALSE) ## turned off message() output
library(DBI)
SLOWEVAL <- TRUE
```
## 1 Introduction to SQL
### 1.1 Getting started
Here is a simple query that selects the first five rows (and all columns, based on the `*` wildcard) from the questions table.
```
select * from questions limit 5
```
To run this from R we provide the SQL syntax as a string as the second argument to `dbGetQuery`.
```{r, start, eval=TRUE}
library(RSQLite)
drv <- dbDriver("SQLite")
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2021.db'
db <- dbConnect(drv, dbname = file.path(dir, dbFilename))
dbGetQuery(db, "select * from questions limit 5")
```
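To get oriented with an unfamiliar database, it can help to list the tables and the fields of a table before querying. Here's a quick sketch using standard DBI functions:
```{r, eval=FALSE}
## List the tables in the database and the columns of one table.
dbListTables(db)
dbListFields(db, "questions")
```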
Now let's see some more interesting usage of other SQL syntax.
First we get the questions that are viewed the most by filtering to the rows for which the 'viewcount' is greater than 100000. We'll limit the results to the first 5 rows so we don't print too much out.
```{r, eval=TRUE}
dbGetQuery(db, "select * from questions where viewcount > 100000 limit 5")
```
Next, let's find the number of views for the 15 questions viewed the most.
```{r, eval=TRUE}
dbGetQuery(db, "select viewcount from questions
order by viewcount desc limit 15")
```
Let's lay out the various verbs in SQL. Here's the form of a standard query (but note that the sorting done by ORDER BY is computationally expensive and should be used sparingly):
```
SELECT <column(s)> FROM <table> WHERE <condition(s) on column(s)> ORDER BY <column(s)>
```
SQL keywords are often written in ALL CAPITALS by convention, although I won't necessarily do that in this tutorial.
And here is a table of some important keywords:
| Keyword | What it does
|------------------------|------------------------
| SELECT | select columns
| FROM | which table to operate on
| WHERE | filter (choose) rows satisfying certain conditions
| LIKE, IN, <, >, =, <=, >=, !=, etc. | used as part of filtering conditions
| ORDER BY | sort based on columns
Some other keywords are: DISTINCT, ON, JOIN, GROUP BY, AS, USING, UNION, INTERSECT, HAVING, SIMILAR TO (not available in SQLite), SUBSTR in SQLite and SUBSTRING in PostgreSQL.
> **Challenge**: Return a few rows from the users, questions, answers, and tags tables so you can get a sense for what the entries in the tables are like.
> **Challenge**: Find the users in the database with the most upvotes.
### 1.2 Getting unique results (DISTINCT)
A useful SQL keyword is DISTINCT, which allows you to eliminate duplicate rows from any table (or duplicate values when selecting a single column).
```{r, eval=TRUE}
## Find the unique tags:
dbGetQuery(db, "select distinct tag from questions_tags limit 15")
## Count the number of unique tags:
dbGetQuery(db, "select count(distinct tag) from questions_tags")
```
### 1.3 Grouping / stratifying (GROUP BY)
A common pattern of operation is to stratify the dataset, i.e., collect it into mutually exclusive and exhaustive subsets. One would then generally do some aggregation operation on each subset. The aggregation is always done within each of the groups. In SQL this is done with the GROUP BY keyword.
Here's a basic example where we count the occurrences of different tags.
```{r, eval=FALSE}
dbGetQuery(db, "select tag, count(*) as n from questions_tags
group by tag order by n desc limit 100")
```
Also note the use of `as` to define a name for the new column.
> **Challenge**: What specifically does that query do? Describe the table that would be returned.
In general `GROUP BY` statements will involve some aggregation operation on the subsets. Options include: COUNT, MIN, MAX, AVG, SUM.
The result of a query that uses `group by` is a table with as many rows as groups.
> **Warning**: To filter the result of a grouping operation, we need to use `having` rather than `where`. (`where` would filter before the application of the `group by`).
```{r}
dbGetQuery(db, "select tag, count(*) as n from questions_tags
group by tag having n > 100000 limit 10")
```
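The other aggregation functions follow the same pattern as `count`. Here's a minimal sketch (not one of the original examples) computing several summaries of `viewcount` per owner in a single query:
```{r, eval=FALSE}
## Several aggregations computed per group in one query.
dbGetQuery(db, "select ownerid, count(*) as n, avg(viewcount) as mean_views,
                min(viewcount) as min_views, max(viewcount) as max_views
                from questions group by ownerid limit 10")
```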
> **Warning**: Determining what fields can be selected when using `group by` can be tricky, because it varies by DBMS. For example, with Postgres, you can only select fields created by aggregation, the fields that `group by` is applied to, and fields determined by those via a [functional dependency](https://www.postgresql.org/docs/current/sql-select.html#SQL-GROUPBY). SQLite allows more flexibility. For example, the following can be done in SQLite to find user and answer information for the answer to each question from the user with the highest reputation, but Postgres gives the error 'ERROR: column "u.userid" must appear in the GROUP BY clause or be used in an aggregate function'.
```{r}
dbGetQuery(db, "select *, max(reputation) from users U join answers A
on A.ownerid = U.userid group by A.questionid limit 5")
```
> **Challenge**: Write a query that will count the number of answers for each question, returning the IDs of the most answered questions. Hint: consider which field in the "answers" table we do the grouping on (and you shouldn't need to use the "questions" table).
> **Tip**: When applied to a specific field, `COUNT` will not count elements that are `NULL`. That can be useful in cases such as determining the number of non-matches in an outer join. In contrast, `COUNT(*)` will count the number of rows, regardless of the contents.
### 1.4 Joins
#### 1.4.1 Introduction to joins
Suppose in the example of students in classes, we want a result that has the grades of all students in 9th grade. For this we need information from the Student table (to determine grade level) and information from the ClassAssignment table (to determine the class grade for each class a student takes). Getting information from multiple tables, where a row in one table is matched with one or more rows in another table is called a *join*. In this case the join would look for all rows in the ClassAssignment table that match a given row (i.e., student) in the Student table, using the column in each of the tables containing the student ID to do the matching of rows.
The syntax generally looks like this (again the WHERE and ORDER BY are optional):
```
SELECT <column(s)> FROM <table1> JOIN <table2> ON <columns to match on>
WHERE <condition(s) on column(s)> ORDER BY <column(s)>
```
Let's see an example join on the Stack Overflow database. In particular let's select only the questions with the tag "python".
```{r, eval=SLOWEVAL}
result1 <- dbGetQuery(db, "select * from questions join questions_tags
on questions.questionid = questions_tags.questionid
where tag = 'python'")
head(result1)
```
It's also possible to get the same exact result without using the JOIN keyword, but you'll need the WHERE keyword to ensure that the rows get matched correctly.
```{r, eval=SLOWEVAL}
result2 <- dbGetQuery(db, "select * from questions, questions_tags
where questions.questionid = questions_tags.questionid
and tag = 'python'")
identical(result1, result2)
```
We'll explain what is going on in the next section.
Here's a three-way join (both with and without the JOIN keyword) with some additional use of *aliases* to abbreviate table names. What does this query ask for?
```{r, eval=SLOWEVAL}
result1 <- dbGetQuery(db, "select * from questions Q
join questions_tags T on Q.questionid = T.questionid
join users U on Q.ownerid = U.userid
where tag = 'python' and upvotes > 100")
```
Once again, we could do that without JOIN and using WHERE to match the rows appropriately.
```{r}
result2 <- dbGetQuery(db, "select * from questions Q, questions_tags T, users U
where Q.questionid = T.questionid
and Q.ownerid = U.userid
and tag = 'python' and upvotes > 100")
```
> **Challenge**: Write a query that would return all the answers to questions with the Python tag.
> **Challenge**: Write a query that will count the number of answers for each question, returning the most answered questions and their information. Note that this extends the question in the previous section.
> **Challenge**: Write a query that would return the users who have answered a question with the Python tag.
#### 1.4.2 Types of joins
We've seen a bunch of joins but haven't discussed the full taxonomy of types of joins. There are various possibilities for how to do a join depending on whether there are rows in one table that do not match any rows in another table.
*Inner joins*: In database terminology an inner join is when the result has a row for each match of a row in one table with the rows in the second table, where the matching is done on the columns you indicate. If a row in one table corresponds to more than one row in another table, you get all of the matching rows in the second table, with the information from the first table duplicated for each of the resulting rows. For example in the Stack Overflow data, an inner join of questions and answers would pair each question with each of the answers to that question. However, questions without any answers or (if this were possible) answers without a corresponding question would not be part of the result.
*Outer joins*: Outer joins add additional rows from one table that do not match any rows from the other table as follows. A *left outer join* gives all the rows from the first table but only those from the second table that match a row in the first table. A *right outer join* is the converse, while a *full outer join* includes at least one copy of all rows from both tables. So a left outer join of the Stack Overflow questions and answers tables would, in addition to the matched questions and their answers, include a row for each question without any answers, as would a full outer join. In this case there should be no answers that do not correspond to a question, so a right outer join should be the same as an inner join. Note that one cannot do a right outer join (or a full outer join) in SQLite; you'll need to switch the order of the tables and do a left outer join.
*Cross joins*: A cross join gives the Cartesian product of the two tables, namely the pairwise combination of every row from each table, analogous to `expand.grid` in R. I.e., take a row from the first table and pair it with each row from the second table, then repeat that for all rows from the first table. Since cross joins pair each row in one table with all the rows in another table, the resulting table can be quite large (the product of the number of rows in the two tables). In the Stack Overflow database, a cross join would pair each question with every answer in the database, regardless of whether the answer is an answer to that question.
Here's a table of the different kinds of joins:
| Type of join | Rows from first table | Rows from second table
|------------------------|--------------------------|---------------------------------------------
| inner (default) | all that match on specified condition | all that match on specified condition
| left outer | all | all that match first table
| right outer | all that match second table | all
| full outer | all | all
| cross | all combined pairwise with second table | all combined pairwise with first table
A 'natural' join is an inner join that doesn't require you to specify the common columns between tables on which to enforce equality, but it's often good practice to not use a natural join and to explicitly indicate which columns are being matched on.
Simply listing two or more tables separated by commas as we saw earlier is the same as a *cross join*. Alternatively, listing two or more tables separated by commas, followed by conditions that equate rows in one table to rows in another is the same as an *inner join*.
In general, an inner join can be seen as a cross join followed by a condition that enforces matching between the rows of the tables. More broadly, here are five queries that all perform the equivalent of an inner join:
```
select * from table1 join table2 on table1.id = table2.id ## explicit inner join
select * from table1, table2 where table1.id = table2.id ## without explicit JOIN
select * from table1 cross join table2 where table1.id = table2.id
select * from table1 join table2 using(id)
select * from table1 natural join table2
```
Note that in the last query the join would be based on all common columns, which could be a bit dangerous if you don't look carefully at the schema of both tables. Assuming `id` is the common column, then the last of these queries is the same as the others.
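The outer joins described above use the same basic syntax, with `left outer join` in place of `join`. Here's a minimal sketch on this schema that keeps questions with no answers (their answer fields come back as NULL):
```{r, eval=FALSE}
## Left outer join: questions with no answers are retained, with NULL answer fields.
dbGetQuery(db, "select Q.questionid, Q.title, A.answerid
                from questions Q left outer join answers A
                on Q.questionid = A.questionid limit 10")
```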
> **Challenge**: Create a view with one row for every question-tag pair, including questions without any tags.
> **Challenge**: Write a query that would return the displaynames of all of the users who have *never* posted a question. The NULL keyword will come in handy -- it's like `NA` in R. Hint: NULLs should be produced if you do an outer join.
> **Challenge**: How many questions tagged with 'random-forest' were unanswered? (You should need two different kinds of joins to answer this.)
#### 1.4.3 Counting NULLs
We've seen that one can use outer joins as a way to find rows in one table that do not appear in another table. If you want to count the number of entries by group and get a count of 0 when a group doesn't appear in a table, make sure to apply the count only to a field produced by the outer join that contains NULL values, and not to all fields.
For example if we wanted to count the number of answers provided by each user and make sure to include people who have answered no questions, assigning them a value of 0, we could do it like this by counting the `answerid` field, which will have NULL for users with no answers and will be counted as a 0:
```{r}
dbGetQuery(db, "select userid, count(answerid) as n_answers from users left outer join answers on userid=ownerid group by userid order by userid limit 10")
```
If we had done `count(*)`, then each row for a user with no answers would be incorrectly assigned a 1 (since they have a row associated with them) instead of a 0 (note user #56 below).
```{r}
dbGetQuery(db, "select userid, count(*) as n_answers from users left outer join answers on userid=ownerid group by userid order by userid limit 10")
```
#### 1.4.4 Joining a table with itself (self join)
Sometimes we want to query information across rows of the same table. For example, suppose we want to analyze the time lags between when the same person posts a question. Do people tend to post in bursts or do they tend to post uniformly over the year? To do this we need contrasts between the times of the different posts. (One can also address this using window functions, discussed later.)
So we need to join two copies of the same table, which means dealing with resolving the multiple copies of each column.
This would look like this:
```{r}
dbGetQuery(db, "select * from questions Q1 join questions Q2
on Q1.ownerid = Q2.ownerid")
```
That should create a new table with all pairs of questions asked by a single person.
Actually, there's a problem here.
> **Challenge**: What kinds of rows will we get that we don't want?
A solution to that problem of having the same question paired with itself is:
```{r}
dbGetQuery(db, "create view question_contrasts as
select * from questions Q1 join questions Q2
on Q1.ownerid = Q2.ownerid
where Q1.creationdate != Q2.creationdate")
```
> **Challenge**: There's actually a further similar problem. What is the problem and how can we fix it by changing two characters in the query above? Hint, even as character strings, the creationdate column has an ordering.
### 1.5 Temporary tables and views
You can think of a view as a temporary table that is the result of a query and can be used in subsequent queries. In any given query you can use both views and tables. The advantage is that they provide modularity in our querying. For example, if a given operation (portion of a query) is needed repeatedly, one could abstract that as a view and then make use of that view.
Suppose we always want the upvotes and displayname of question owners available. Once we have the view we can query it like a regular table.
```{r, eval=FALSE}
## note there is a creationdate in users too, hence disambiguation
dbExecute(db, "create view questions_plus as
select questionid, questions.creationdate, score, viewcount,
title, ownerid, upvotes, displayname
from questions join users on questions.ownerid = users.userid")
## don't be confused by the "0" response --
## it just means that nothing is returned to R; the view _has_ been created
dbGetQuery(db, "select * from questions_plus where upvotes > 100 limit 5")
```
One use of a view would be to create a mega table that stores all the information from multiple tables in the (unnormalized) form you might have if you simply had one data frame in R or Python.
```{r, eval=FALSE, include=TRUE}
dbExecute(db, "drop view questions_plus") # drop the view if we no longer need it
```
If you want to create a temporary table just for a single query, you can use a subquery or a WITH clause, as discussed in [Section 3.2](#32-subqueries).
## 2 Additional SQL topics
### 2.1 Creating database tables
Often one would create tables from within R or Python (though one can [create tables from within the `sqlite` and `psql` command line interfaces](db-management) as well). Here's the syntax from R.
```{r}
## Option 1: pass directly from CSV to database
dbWriteTable(conn = db, name = "student", value = "student.csv", row.names = FALSE,
header = TRUE)
## Option 2: pass from data in an R data frame
## First create your data frame:
# student_df <- data.frame(...)
## or
# student_df <- read.csv(...)
dbWriteTable(conn = db, name = "student", value = student_df, row.names = FALSE,
append = FALSE)
```
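One can also create a table with an explicit schema and insert rows directly via SQL. Here's a minimal sketch; the column names are hypothetical, and it assumes the table doesn't already exist:
```{r, eval=FALSE}
## Option 3 (sketch): explicit schema plus insertion; column names are hypothetical.
dbExecute(db, "create table student (
                 studentid integer primary key,
                 name text,
                 gradelevel integer)")
dbExecute(db, "insert into student (studentid, name, gradelevel)
               values (1, 'Ada', 9)")
```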
### 2.2 String processing and creating new fields
We can do some basic matching with LIKE, using % as a wildcard and _ to stand in for any single character:
```{r, eval=TRUE}
dbGetQuery(db, "select * from questions_tags where tag like 'r-%' limit 10")
```
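For comparison, here's a sketch using the `_` wildcard, which matches exactly one character, so this looks for two-character tags starting with 'r' (there may be few or none):
```{r, eval=FALSE}
## `_` matches a single character, so this matches two-character tags starting with 'r'.
dbGetQuery(db, "select distinct tag from questions_tags where tag like 'r_' limit 10")
```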
In Postgres, in addition to the basic use of LIKE to match character strings, one can use regular expression syntax with SIMILAR TO.
SIMILAR TO is not available in SQLite, so the following can only be done in the Postgres instance of our example database. Here we'll look for all tags that start with "r-", end with "-r", are exactly "r", or contain "-r-". SQL uses % as a wildcard (this is not standard regular expression syntax).
```{r, eval=FALSE}
## Try in postgreSQL, not SQLite
result <- dbGetQuery(db, "select * from questions_tags
where tag similar to 'r-%|%-r|r|%-r-%' limit 10")
## Standard regex for 'any character' doesn't seem to work:
## result <- dbGetQuery(db, "select * from questions_tags
##                     where tag SIMILAR TO 'r-.*|.*-r|r|.*-r-.*' limit 10")
```
> **Note**: The matching is not on substrings unless one uses wildcards at the beginning and end of the pattern, so "r" will only find "r" and not, for example, "dplyr".
To extract substrings we can use SUBSTRING in Postgres. Here's a basic example:
```{r}
dbGetQuery(db, "select substring(creationdate from '^[[:digit:]]{4}') as year
from questions limit 3")
```
If you need to specify the pattern to be extracted relative to the surrounding characters, Postgres requires that the pattern to be extracted be surrounded by `#"` (one could use another character in place of `#`). For use from R, we need to escape the double quote with a backslash so it is treated as part of the string passed to Postgres rather than R treating it as the end of the string. We also need the % wildcard character when extracting in this way.
```{r}
dbGetQuery(db, "select substring(creationdate from
'%-#\"[[:digit:]]{2}#\"-%' for '#') as month
from questions limit 3")
```
> **Warning**: SQLite provides SUBSTR for substrings, but the flexibility of SUBSTR seems to be much less than use of SUBSTRING in PostgreSQL.
Here is some [documentation on string functions in PostgreSQL](https://www.postgresql.org/docs/current/functions-string.html).
> **Challenge**: Select the questions that have "java" but not "javascript" in their titles using regular expression syntax.
> **Challenge**: Figure out how to calculate the length (in characters) of the title of each question.
> **Challenge**: Process the creationdate field to create year, day, and month fields in a new view. Note that this would be good practice for string manipulation, but you would want to handle dates and times using the material in the next section rather than string processing.
### 2.3 Dates and times
Here we'll see how you can work with dates and times in SQLite, but the functionality should be similar in other DBMS.
SQLite doesn't have specific date-time types, but it's standard to store date-times as strings in a text field in the ISO-8601 format: YYYY-MM-DD HH:MM:SS.SSS. That's the format of the dates in the Stack Overflow database:
```{r, eval=TRUE}
dbGetQuery(db, "select distinct creationdate from questions limit 5")
```
Then SQLite provides some powerful functions for manipulating and extracting information in such fields. Here are just a few examples, noting that `strftime` is particularly powerful. Other DBMS should have similar functionality, but I haven't investigated further.
```{r, eval=TRUE}
## Julian days (decimal days since noon UTC/Greenwich time November 24, 4714 BC (Yikes!)).
output <- dbGetQuery(db, "select creationdate, julianday(creationdate)
from questions limit 5")
output
## Julian day is decimal-valued:
formatC(output[ , 2], 6, format = 'f')
## Convert to local time
dbGetQuery(db, "select distinct creationdate, datetime(creationdate, 'localtime')
from questions limit 5")
## Eastern time, manually, ignoring daylight saving time
dbGetQuery(db, "select distinct creationdate, datetime(creationdate, '-05:00')
from questions limit 5")
## day of week: Jan 1 2021 was a Friday (0=Sunday, 6=Saturday)
dbGetQuery(db, "select creationdate, strftime('%w', creationdate)
from questions limit 5")
```
Unfortunately I'm not sure if the actual dates in the database are Greenwich time or some US time zone, but we'll ignore that complication here.
Let's put it all together to do something meaningful.
```{r, eval=TRUE, fig.cap="", fig.width=5, fig.height=4}
result <- dbGetQuery(db, "select strftime('%H', creationdate) as hour,
count() as n from questions group by hour")
head(result)
plot(as.numeric(result$hour), result$n, xlab = 'hour of day (UTC/Greenwich???)',
ylab = 'number of questions')
```
Here's some [documentation of the syntax for these functions, including `strftime`](https://www.sqlite.org/lang_datefunc.html).
## 3 More advanced SQL
### 3.1 Set operations: UNION, INTERSECT, EXCEPT
You can do set operations like union, intersection, and set difference using the UNION, INTERSECT, and EXCEPT keywords on tables that have the same schema (same column names and types), though most often these would be used on single columns (i.e., single-column tables).
> **Note**: While one can often set up an equivalent query without using INTERSECT or UNION, set operations can be very handy.
Here's an example of a query that can be done with or without an intersection. Suppose we want to know the names of all individuals who have asked both an R question and a Python question. We can do this with INTERSECT:
```{r, eval=TRUE}
system.time(
result1 <- dbGetQuery(db, "select displayname, userid from
questions Q join users U on U.userid = Q.ownerid
join questions_tags T on Q.questionid = T.questionid
where tag = 'r'
intersect
select displayname, userid from
questions Q join users U on U.userid = Q.ownerid
join questions_tags T on Q.questionid = T.questionid
where tag = 'python'")
)
```
Alternatively we can do a self-join. Note that the syntax gets complicated as we are doing multiple joins.
```{r, eval=TRUE}
system.time(
result2 <- dbGetQuery(db, "select displayname, userid from
(questions Q1 join questions_tags T1
on Q1.questionid = T1.questionid)
join
(questions Q2 join questions_tags T2
on Q2.questionid = T2.questionid)
on Q1.ownerid = Q2.ownerid
join users on Q1.ownerid = users.userid
where T1.tag = 'r' and T2.tag = 'python'")
)
identical(result1, result2)
```
> **Challenge**: Those two queries return equivalent information, but the results are not exactly the same. What causes the difference? How can we modify the second query to get the exact same results as the first?
Which is faster? The second one looks more involved in terms of the joins, so the timing results seen above make sense.
We could use UNION or EXCEPT to find people who have asked either or only one type of question, respectively.
> **Challenge**: Find the users who have asked either an R question or a Python question.
> **Challenge**: Find the users who have asked but not answered a question.
### 3.2 Subqueries
A subquery is a full query that is embedded in a larger query.
#### 3.2.1 Subqueries in the FROM statement
We can use subqueries in the FROM statement to create a temporary table to use in a query. Here we'll do it in the context of a join.
> **Challenge**: What does the following do?
```{r, eval=FALSE}
dbGetQuery(db, "select * from questions join answers A
on questions.questionid = A.questionid
join
(select ownerid, count(*) as n_answered from answers
group by ownerid order by n_answered desc limit 1000) most_responsive
on A.ownerid = most_responsive.ownerid")
```
It might be hard to just come up with that full query all at once. A good strategy is probably to think about creating a view that is the result of the inner query and then have the outer query use that. You can then piece together the complicated query in a modular way. For big databases, you are likely to want to submit this as a single query and not two queries so that the SQL optimizer can determine the best way to do the operations. But you want to start with code that you're confident will give you the right answer!
Note we could also have done that query using a subquery in the WHERE statement, as discussed in the next section.
> **Challenge**: Write a query that finds the number of answers per question, but only answers from users with at least 100 upvotes.
Finally one can use subqueries in the SELECT clause to create new variables, but we won't go into that here.
#### 3.2.2 Subqueries in the WHERE statement
Instead of a join, we can use subqueries as a way to combine information across tables, with the subquery involved in a WHERE statement. The subquery creates a set and we then can check for inclusion in (or exclusion from with `not in`) that set.
For example, suppose we want to know the average number of UpVotes for users who have posted a question with the tag "python".
```{r, eval=TRUE}
dbGetQuery(db, "select avg(upvotes) from users where userid in
(select distinct ownerid from
questions join questions_tags
on questions.questionid = questions_tags.questionid
where tag = 'python')")
```
In some cases one can do a join rather than using a subquery, but in the following example, it fails.
> **Challenge**: What's wrong with the following query as an attempt to answer the question above? (See if you can figure it out before looking at the answer below.)
```{r}
dbGetQuery(db, "select avg(upvotes) from questions, questions_tags, users
where questions.questionid = questions_tags.questionid and
questions.ownerid = users.userid and
tag = 'python'")
```
(Answer: In the subquery, we find the IDs of the users we are looking for and then average over the UpVotes of those individuals. In the join version we found all the questions that had a Python tag and averaged over the UpVotes of the individuals associated with those questions. So the latter includes multiple UpVotes values from individuals who have posted multiple Python questions.)
> **Challenge**: Write a query that would return the user information for users who have answered a question with the Python tag. We've seen this challenge before, but do it now based on a subquery.
> **Challenge**: Find the users who have asked but not answered a question. We've seen this before, but now make use of subqueries instead of a join.
> **Challenge**: How would you find all the answers associated with the user with the most upvotes?
> **Challenge**: Create a frequency list of the tags used in the top 100 most answered questions. Note there is a way to do this with a JOIN and a way without a JOIN.
#### 3.2.3 Using WITH
The WITH clause allows you to create a temporary table to then use in an associated SELECT statement. So it provides similar functionality to using a view but without it being a persistent part of the database. The temporary table is only available within the associated SELECT statement. WITH can only occur as part of a query with SELECT.
Let's see use of WITH to accomplish what we did with a subquery in the FROM statement above.
```{r, eval=FALSE}
dbGetQuery(db, "with most_responsive as (
select ownerid, count(*) as n_answered from answers
group by ownerid order by n_answered desc limit 1000
)
select * from questions join answers A
on questions.questionid = A.questionid
join most_responsive on A.ownerid = most_responsive.ownerid")
```
One could also replace the subquery in the WHERE statement above using WITH.
```{r, eval=FALSE}
dbGetQuery(db, "with tmp as (select distinct ownerid from
questions join questions_tags
on questions.questionid = questions_tags.questionid
where tag = 'python')
select avg(UpVotes) from users where userid in
tmp")
```
Finally, you can create multiple temporary tables in the WITH clause, separated by commas. This can help make your query more modular without the complication of creating views that will only be used once.
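Here's a minimal sketch (on the same schema) with two temporary tables in one WITH clause, used to count users who have asked both an R and a Python question:
```{r, eval=FALSE}
## Two temporary tables in a single WITH clause, separated by a comma.
dbGetQuery(db, "with
  r_askers as (
    select distinct ownerid from questions join questions_tags
      on questions.questionid = questions_tags.questionid where tag = 'r'),
  python_askers as (
    select distinct ownerid from questions join questions_tags
      on questions.questionid = questions_tags.questionid where tag = 'python')
  select count(*) from r_askers join python_askers using(ownerid)")
```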
### 3.3 Window functions
[Window functions](https://www.postgresql.org/docs/current/functions-window.html) provide the ability to perform calculations across sets of rows that are related to the current query row.
Comments:
- The result of applying a window function is the same number of rows as the input, even though the functionality is similar to `group by`. Hint: think about the result of `group by` + `mutate` in dplyr in R.
- One can apply a window function within groups or across the whole table.
- The functions one can apply include standard aggregation functions such as `avg` and `count` as well as non-standard functions (specific to using window functions) such as `rank`, `row_number`, and `cume_dist`.
- Unless you're simply grouping into categories, you'll generally need to order the rows for the window function to make sense.
The syntax is a bit involved, so let's see with a range of examples:
- Aggregate within groups but with one output value per input row
```{r, eval=TRUE}
## Total number of questions for each owner
dbGetQuery(db, "select ownerid,
count() over (partition by ownerid) as n
from questions where ownerid is not NULL limit 10")
```
- Compute cumulative calculations; note the need for ORDER BY within the PARTITION clause (the other ORDER BY is just for display purposes here):
```{r, eval=TRUE}
## Rank (based on ordering by creationdate) of questions by owner
dbGetQuery(db, "select *,
rank() over (partition by ownerid order by creationdate) as rank
from questions order by ownerid desc limit 10")
```
(Sidenote: we rely here on the fact that ordering alphabetically by `creationdate` is equivalent to time ordering.)
- Do a lagged analysis
```{r, eval=TRUE}
## Get previous value (based on ordering by creationdate) by owner
dbGetQuery(db, "select ownerid, creationdate,
lag(creationdate, 1) over
(partition by ownerid order by creationdate)
as previous_date
from questions order by ownerid desc limit 5")
```
So one could now calculate the difference between the previous and current date to analyze the time gaps between users posting questions.
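For example, here's a sketch computing that gap in (decimal) days using `julianday`:
```{r, eval=FALSE}
## Days elapsed since the owner's previous question (NULL for the first question).
dbGetQuery(db, "select ownerid, creationdate,
                julianday(creationdate) -
                  julianday(lag(creationdate, 1) over
                    (partition by ownerid order by creationdate)) as gap_days
                from questions order by ownerid desc limit 5")
```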
- Do an analysis within an arbitrary window of rows based on the values in one of the columns
```{r, eval=TRUE}
## Summarize questions within 15 days of current question
dbGetQuery(db, "select ownerid, creationdate,
count() over
(partition by ownerid order by julianday(creationdate)
range between 15 preceding and 15 following)
as n_window
from questions where ownerid is not null limit 30")
```
There the '15 preceding' and '15 following' mean to include all rows within each ownerid
that are within 15 Julian days (based on 'creationdate') of each row.
So one could now analyze bursts of activity.
One can also choose a fixed number of rows by replacing 'range' with 'rows'. The ROWS and RANGE syntax allow one to specify the *window frame* in more flexible ways than simply the categories of a categorical variable.
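As a sketch of the ROWS variant, here's a moving average of `viewcount` over the current row and the two rows preceding it (within each owner):
```{r, eval=FALSE}
## A moving average over a fixed number of rows rather than a range of values.
dbGetQuery(db, "select ownerid, creationdate,
                avg(viewcount) over
                  (partition by ownerid order by creationdate
                   rows between 2 preceding and current row) as moving_avg
                from questions where ownerid is not null limit 10")
```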
Ranking becomes more complicated when there are ties. `RANK` will assign the same value to any ties and then skip ahead based on the number of ties; e.g., you can get `1, 1, 1, 4` if the three lowest values are tied. `DENSE_RANK` will avoid skipping values, giving `1, 1, 1, 2` in the same situation. `ROW_NUMBER` simply numbers the rows consecutively, ignoring ties (so the ordering among tied rows is ambiguous), giving `1, 2, 3, 4` in the same situation.
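Here's a sketch comparing the three ranking functions side by side on the same ordering:
```{r, eval=FALSE}
## With ties in viewcount, compare rank(), dense_rank(), and row_number().
dbGetQuery(db, "select viewcount,
                rank() over (order by viewcount) as rnk,
                dense_rank() over (order by viewcount) as dense_rnk,
                row_number() over (order by viewcount) as rownum
                from questions limit 10")
```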
So the syntax of a window function will generally have these elements:
- a call to some function that calculates within the window and assigns value(s) to the rows in the window
- OVER
- PARTITION BY (optional)
- ORDER BY (optional)
- RANGE or ROWS (optional)
- AS (optional)
You can also name window functions, which comes in handy if you want multiple functions applied to the same window:
```{r, eval=TRUE}
dbGetQuery(db, "select ownerid, creationdate,
lag(creationdate, 1) over w as lag1,
lag(creationdate, 2) over w as lag2
from questions where ownerid is not null
window w as (partition by ownerid order by creationdate)
order by ownerid limit 5")
```
What does that query do?
Finally, you can use window functions on the entire table, without partitioning.
```{r, eval=TRUE}
## Summarize questions within 15 (decimal) days of current question
dbGetQuery(db, "select ownerid, creationdate,
count() over
(order by julianday(creationdate)
range between 15 preceding and 15 following)
as n_window
from questions where ownerid is not null limit 30")
```
> **Challenge**: Use a window function to compute the average viewcount for each ownerid for the 10 questions preceding each question.
> **Challenge**: For each question, get the answer given by the user with the maximum reputation amongst users answering the question.
Hint: you'll need to first create a subquery that determines the maximum reputation amongst the answers for each question and then use that to get the answer of interest for each question.
> **Challenge (hard)**: Find the users who have asked one question that is highly-viewed (viewcount > 1000) with their remaining questions not highly-viewed (viewcount < 20 for all other questions).
### 3.4 Putting it all together to do complicated queries
Here are some real-world style questions one might try to create queries to answer. The context for these questions is a situation in which you have data on user sessions on a website or data on messages between users.
(1) Given a table of user sessions with the format
```
date | session_id | user_id | session_time
```
calculate the distribution of the average daily total session time in the last month. I.e., you want to get each user's daily average and then find the distribution over users. The output should be something like:
```
minutes_per_day | number_of_users
```
(2) Consider a table of messages of the form
```
sender_id | receiver_id | message_id
```
For each user, find the three users they message the most.
(3) Suppose you are running an online experiment and have a table on the experimental design:
```
user_id | test_group | date_first_exposed
```
Suppose you also have a messages table that indicates if each message
was sent on web or mobile:
```
date | sender_id | receiver_id | message_id | interface (web or mobile)
```
For each test group, what is the average (over users) of the average number of messages sent per day, considering only users who have sent messages exclusively on mobile in the last month?
#### Challenge questions with the Stack Overflow data
If we take the three challenges above and translate them into problems for the Stack Overflow data, one could consider the following three problems.
1) For each user who has asked at least one question find the average number of questions per month. Then determine the distribution of that average across the users. (I.e., determine the values that would go into a histogram of the average number of questions per month across users.) The output should be something like:
```
number of questions per month (rounded) | number of users
```
Next try to figure out how to include the users who have asked no questions. Hint: think about how to get NULL values included and then count a column containing such values.
2) For each user, find the three most common tags they apply to their questions.
The output should be something like:
```
userid | tag | count
```
Hints: You'll need to use subqueries and the final selection of just the top 3 tags will need to be done in its own query and not as part of defining a field using a window function.
3) Consider trying to determine whether users who answer a lot of questions also ask a lot of questions. Grouping the users based on the number of questions they've answered (0, 1, 2, etc.), determine the average number of questions per month for each group.
The output should be something like:
```
number of answers | average number of questions per month
```
You'll want to work through this in pieces. Try to think about the initial tables you would need and then build up your query in a nested fashion.
### 3.5 A summary of SQL syntax by example
This section demonstrates the syntax of various queries by example. It may be useful to test your understanding by figuring out (with or without running the query) what each query does.
#### Selecting columns
```
select ownerid, title from questions
select ownerid, title from questions limit 5
select * from questions
select * from questions order by answercount desc
select count(*) as n from questions
select count(ownerid) as n from questions
select sum(answercount) from questions
```
#### Using `distinct`
```
select distinct tag from questions_tags limit 15
select distinct ownerid, answercount from questions limit 15
select count(distinct tag) from questions_tags limit 15
```
#### Filtering rows with `where`
```
select * from questions where answercount > 40
select * from questions where answercount > 40 order by answercount desc
select * from questions where answercount = 10 limit 5
select * from questions_tags where tag like 'r-%' limit 10
select * from questions_tags where tag similar to 'r-%|%-r|r|%-r-%' limit 10
select * from questions_tags where tag in ('java','r','python') limit 10
```
#### Grouping and reduction/aggregation
```
select tag, count(*) as n from questions_tags \
group by tag
select tag, count(*) as n from questions_tags \
group by tag having n > 1000
select ownerid, count(*) as n from questions \
group by ownerid order by n desc limit 15
select ownerid, sum(viewcount) as viewed from questions \
group by ownerid
select *, sum(viewcount) as viewed from questions \
group by ownerid
select answercount, commentcount, count(*) as n from questions \
group by answercount, commentcount
select tag, count(*) as n from questions_tags \
where tag like 'python%' group by tag having n > 1000
```
The query above starting with `select *, sum(viewcount)` behaves differently in SQLite and DuckDB.
#### Joins
Inner joins
```
select * from questions join questions_tags \
on questions.questionid = questions_tags.questionid
select * from questions Q join questions_tags T \
on Q.questionid = T.questionid
select * from questions Q join questions_tags T \
using(questionid)
select * from questions Q, questions_tags T \
where Q.questionid = T.questionid
```
Outer joins
```
select * from questions Q left outer join answers A \
on Q.questionid = A.questionid
select * from questions Q left outer join answers A \
on Q.questionid = A.questionid \
where A.creationdate is NULL
-- Note: no right outer join in SQLite, so we reverse the order of answers and questions \
select * from answers A left outer join questions Q \
on Q.questionid = A.questionid \
where Q.creationdate is NULL
select Q.questionid, count(*) as n_tags from questions Q join questions_tags T \
on Q.questionid = T.questionid \
group by Q.questionid
```
Self joins
First we'll set up a view (a temporary table) that combines questions and tags, for ease of illustrating ideas around self joins.
```
create view QT as select * from questions join questions_tags using(questionid)
```
In small groups, discuss what these queries do.
```
select * from QT as QT1 join QT as QT2 \
using(questionid)
select * from QT as QT1 join QT as QT2 \
using(questionid) where QT1.tag < QT2.tag
select QT1.tag, QT2.tag, count(*) as n from QT as QT1 join QT as QT2 \
using(questionid) where QT1.tag < QT2.tag \
group by QT1.tag, QT2.tag order by n desc limit 10
select * from QT as QT1 join QT as QT2 using(ownerid)
```
#### Set operations
```
select ownerid from QT where tag='python' \
intersect \
select ownerid from QT where tag='r'
select ownerid from QT where tag='python' \
except \
select ownerid from QT where tag='r'
select ownerid from QT where tag='python' \
union \
select ownerid from QT where tag='r'
select userid, displayname, location from users \
where location like '%United States%' \
intersect \
select userid, displayname, location from users \
where reputation > 10
```
#### Subqueries
```
select * from \
answers A \
join \
( select ownerid, count(*) as n_answered from answers \
group by ownerid order by n_answered desc limit 1000 ) most_responsive \
on A.ownerid = most_responsive.ownerid
```
```
select avg(upvotes) from users \
where userid in \
( select distinct ownerid from \
questions join questions_tags using(questionid) \
where tag = 'python' )
```
## 4 Efficient SQL queries
### 4.1 Overview
In general, your DBMS should examine your query and try to implement it in the fastest way possible.
Some tips for faster queries include:
- use indexes on fields used in WHERE and JOIN clauses (see next section)
- try to avoid wildcards at the start of LIKE string comparison when you have an index on the field (as this requires looking at all of the rows)
- similarly, try to avoid using functions on indexed columns in a WHERE clause, as this requires doing the calculation on all the rows in order to check the condition (see the sketch after this list)
- only select the columns you really need
- create (temporary) tables to store intermediate results that you need to query repeatedly
- use filtering (WHERE clauses) in inner statements when you have nested subqueries
- use LIMIT as seen in the examples here if you only need some of the rows a query returns
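As an illustration of the tip about functions on indexed columns, here's a sketch, assuming one had an index on `creationdate`: the first query compares the raw (ISO-formatted) column to constants, so an index could be used, while the second applies `strftime` to every row, forcing a full scan.
```{r, eval=FALSE}
## Index-friendly: compares the raw column to constants (usable with an index).
dbGetQuery(db, "select count(*) from questions
                where creationdate >= '2021-06-01' and creationdate < '2021-07-01'")
## Not index-friendly: strftime must be applied to every row.
## (Equivalent here only because the database covers a single year.)
dbGetQuery(db, "select count(*) from questions
                where strftime('%m', creationdate) = '06'")
```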
### 4.2 Indexes
An index is an ordering of rows based on one or more fields. DBMS use indexes to look up values quickly, either when filtering (if the index is involved in the WHERE condition) or when doing joins (if the index is involved in the JOIN condition). So in general you want your tables to have indexes.
DBMS use indexing to provide sub-linear time lookup. Without indexes, a database needs to scan through every row sequentially (linear time lookup): if there are $n$ rows, the lookup is $O(n)$ in computational cost. With indexes, lookup may be logarithmic, $O(\log n)$ (for tree-based indexes), or constant time, $O(1)$ (for hash-based indexes). A binary tree-based search is logarithmic: at each step through the tree you can eliminate half of the possibilities.
Here's how we create an index, with some time comparison for a simple query.
```{r}
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000")) # 2.4 seconds
system.time(dbExecute(db, "create index count_index on questions (viewcount)")) # 5.6 seconds
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000")) # 0.9 seconds
## restore earlier state by removing index
system.time(dbExecute(db, "drop index count_index"))
```
In many contexts (but not the example above), an index can save huge amounts of time. So if you're working with a database and speed is important, check to see if there are indexes.
One downside of indexes is that creation of indexes can be very time-consuming, as seen above. And if the database is updated frequently, this could be detrimental.
Finally, using indexes in a lookup is not always advantageous, as discussed next.
#### 4.2.1 Index lookup vs. sequential scan
Using an index is good in that the computer can go directly to the data needed, using random access to the disk locations holding the data of interest. But if the query requires examining a large number of rows, an index may not beat a sequential scan. An advantage of sequential scan is that it makes good use of the CPU cache, reading chunks of data and then accessing the individual pieces of data quickly.
For example, if you change the query above that filters on viewcount to use a much smaller threshold than 10000, you will probably see that the query takes more time with the index than without it.
Ideally you'd do sequential scan of exactly the subset of the rows that you need, with that subset available in contiguous storage.
#### 4.2.2 How indexes work
Indexes are often implemented using tree-based methods. For example, in Postgres, b-tree indexes are used for fields that have an ordering. Trees are basically like decision trees: at each node in the tree, a condition sends one down the left or right branch (there might also be more than two branches). Eventually one reaches the leaves of the tree, which hold the actual values being looked for; associated with each value is the address of where that row of data is stored. With a tree-based index, the time cost of b-tree lookup is logarithmic (based on the binary lookup), so it does grow with the number of elements in the table, but slowly. The lookup process is: given a value (often referred to as a *key*), one walks down the tree, comparing the value to the condition at each split, until one finds the elements corresponding to the value, and then gets the addresses of where the desired rows are stored.
Here's [some information](https://use-the-index-luke.com/sql/anatomy/the-tree) on how such trees are constructed and searched.
In SQLite, indexes are implemented by creating a separate index table that maps from the value to the row index in the indexed table, allowing for fast lookup of a row.
### 4.3 SQL query plans and EXPLAIN
You can actually examine the query plan that the system is going to use for a query using the EXPLAIN keyword. I'd suggest trying this in Postgres as the output is more interpretable than SQLite.
```{r}
dbGetQuery(db, "explain select * from webtraffic where count > 500")
```
In PostgreSQL that gives the following:
```
QUERY PLAN
1 Gather (cost=1000.00..388634.17 rows=8513 width=61)
2 Workers Planned: 2
3 -> Parallel Seq Scan on webtraffic (cost=0.00..386782.88 rows=3547 width=61)
4 Filter: (count > 500)
```
The "Workers Planned: 2" seems to indicate that there will be some parallelization used, even without us asking for that.
Now let's see what query plan is involved in a join and when using indexes.
```{r}
dbGetQuery(db, "explain select * from questions join questions_tags on
questions.questionid = questions_tags.questionid")
```
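In SQLite, the analogous command is EXPLAIN QUERY PLAN, which gives a higher-level (and more readable) summary than plain EXPLAIN; a sketch:
```{r, eval=FALSE}
## SQLite's higher-level query-plan summary.
dbGetQuery(db, "explain query plan select * from questions join questions_tags on
                questions.questionid = questions_tags.questionid")
```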