Analytical Tools: Background Requirements

To be productive in the N3C Enclave during this short course, we recommend that you (a) are comfortable with basic SQL and (b) know at least a little Python or R.

SQL

From https://github.com/National-COVID-Cohort-Collaborative/book-of-n3c-v1

Almost every N3C project starts with SQL. For the past 30 years, SQL has been the de facto language for large datasets like the N3C. It is well suited to efficiently (a) selecting patients according to exacting selection criteria, (b) joining a variety of predictor and outcome variables from multiple tables, and (c) producing a dataset better suited for analysis. Consequently, it is a common skill among people in data science and IT.
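The three steps above (select, join, reshape) can be sketched in a small runnable example. This is only an illustration: it uses Python's built-in `sqlite3` module rather than the Enclave, and the `person` and `visit` tables, their columns, and their rows are hypothetical toy data, not the actual N3C schema.

```python
import sqlite3

# In-memory SQLite database with two hypothetical toy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE visit  (visit_id INTEGER PRIMARY KEY, person_id INTEGER, visit_type TEXT);

INSERT INTO person VALUES (1, 1950), (2, 1985), (3, 1962), (4, 1940);
INSERT INTO visit VALUES
    (10, 1, 'inpatient'),
    (11, 1, 'outpatient'),
    (12, 2, 'inpatient'),
    (13, 3, 'inpatient'),
    (14, 3, 'inpatient'),
    (15, 4, 'outpatient');
""")

query = """
SELECT
    p.person_id,
    p.year_of_birth,
    COUNT(v.visit_id) AS inpatient_visit_count  -- (c) one row per patient
FROM person AS p
LEFT JOIN visit AS v                             -- (b) join a second table
    ON  v.person_id  = p.person_id
    AND v.visit_type = 'inpatient'
WHERE p.year_of_birth < 1970                     -- (a) selection criterion
GROUP BY p.person_id, p.year_of_birth
ORDER BY p.person_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The `LEFT JOIN` keeps patients who have no inpatient visits, so they appear with a count of zero instead of disappearing from the analysis dataset.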

Our rule of thumb: transform the data in SQL if SQL can comfortably do so. Otherwise, use R or Python.

SQLBolt (https://sqlbolt.com/) is the easiest way to get started on your own. This website teaches and evaluates you in small, manageable chunks. Complete the first 12 lessons under "Interactive Tutorial", and then the Union lesson under "More Topics". The later lessons are good too, but they are not used in N3C workflows.
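As a preview of what a union does, here is a minimal sketch using Python's built-in `sqlite3` module. The `site_a` and `site_b` tables and their contents are hypothetical, chosen only to show the difference between `UNION` and `UNION ALL`.

```python
import sqlite3

# Two hypothetical tables that share the same column layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site_a (patient_id INTEGER);
CREATE TABLE site_b (patient_id INTEGER);
INSERT INTO site_a VALUES (1), (2), (3);
INSERT INTO site_b VALUES (3), (4);
""")

# UNION stacks the rows of both queries and removes duplicates.
union_rows = conn.execute("""
    SELECT patient_id FROM site_a
    UNION
    SELECT patient_id FROM site_b
    ORDER BY patient_id
""").fetchall()

# UNION ALL stacks the rows and keeps duplicates.
union_all_rows = conn.execute("""
    SELECT patient_id FROM site_a
    UNION ALL
    SELECT patient_id FROM site_b
    ORDER BY patient_id
""").fetchall()

print(union_rows)      # patient 3 appears once
print(union_all_rows)  # patient 3 appears twice
```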

If you'd like to learn the basic syntax more thoroughly, consider one of these options:

The resources above use SQLite, which is almost completely compatible with the flavor of SQL used in N3C. The N3C Enclave uses Spark SQL. During the course, refer to these sources if your code looks correct but the "SQL Transform" throws an error:

Python

Python has many resources for all levels of experience.

Some general ones for getting started:

If you like Python and want to learn more, check out:

R

R also has many books for all levels of experience. These three books are free online and are also available in paperback.

If you like R and want to learn more, check out: