diff --git a/.github/workflows/update_build_environment.yml b/.github/workflows/update_build_environment.yml index 9e555fb2..707fe21a 100644 --- a/.github/workflows/update_build_environment.yml +++ b/.github/workflows/update_build_environment.yml @@ -2,6 +2,8 @@ name: Rebuild and publish new ubcdsci/py-intro-to-ds image on DockerHub on: pull_request: types: [opened, synchronize] + branches: + - 'main' jobs: rebuild-docker: diff --git a/source/classification1.md b/source/classification1.md index 471c0ae7..6727e44b 100755 --- a/source/classification1.md +++ b/source/classification1.md @@ -16,6 +16,7 @@ kernelspec: :tags: [remove-cell] from chapter_preamble import * from IPython.display import HTML +from IPython.display import Image from sklearn.metrics.pairwise import euclidean_distances import numpy as np import plotly.express as px @@ -281,6 +282,7 @@ perimeter and concavity variables. Recall that the default palette in `altair` is colorblind-friendly, so we can stick with that here. ```{code-cell} ipython3 +:tags: ["remove-output"] perim_concav = alt.Chart(cancer).mark_circle().encode( x=alt.X("Perimeter").title("Perimeter (standardized)"), y=alt.Y("Concavity").title("Concavity (standardized)"), @@ -289,12 +291,16 @@ perim_concav = alt.Chart(cancer).mark_circle().encode( perim_concav ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] +glue("fig:05-scatter", perim_concav) +``` + +:::{glue:figure} fig:05-scatter :name: fig:05-scatter -:figclass: caption-hack Scatter plot of concavity versus perimeter colored by diagnosis label. 
-``` +::: +++ @@ -855,7 +861,11 @@ for neighbor_df in neighbor_df_list: # tight layout fig.update_layout(margin=dict(l=0, r=0, b=0, t=1), template="plotly_white") -glue("fig:05-more", fig) +# if HTML, use the plotly 3d image; if PDF, use static image +if "BOOK_BUILD_TYPE" in os.environ and os.environ["BOOK_BUILD_TYPE"] == "PDF": + glue("fig:05-more", Image("img/classification1/plot3d_knn_classification.png")) +else: + glue("fig:05-more", fig) ``` ```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 @@ -1432,6 +1442,7 @@ The new imbalanced data is shown in {numref}`fig:05-unbalanced`, and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 +:tags: ["remove-output"] rare_cancer = pd.concat(( cancer[cancer["Class"] == "Benign"], cancer[cancer["Class"] == "Malignant"].head(3) @@ -1445,12 +1456,16 @@ rare_plot = alt.Chart(rare_cancer).mark_circle().encode( rare_plot ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] +glue("fig:05-unbalanced", rare_plot) +``` + +:::{glue:figure} fig:05-unbalanced :name: fig:05-unbalanced -:figclass: caption-hack Imbalanced data. -``` +::: ```{code-cell} ipython3 rare_cancer["Class"].value_counts() @@ -1947,16 +1962,15 @@ unscaled_plot + prediction_plot ``` ```{code-cell} ipython3 -:tags: [remove-input] +:tags: [remove-cell] glue("fig:05-workflow-plot", (unscaled_plot + prediction_plot)) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +:::{glue:figure} fig:05-workflow-plot :name: fig:05-workflow-plot -:figclass: caption-hack Scatter plot of smoothness versus area where background color indicates the decision of the classifier. -``` +::: +++ @@ -1974,6 +1988,7 @@ found in {numref}`Chapter %s `. This will ensure that and guidance that the worksheets provide will function as intended. 
+++ + ## References ```{bibliography} diff --git a/source/classification2.md b/source/classification2.md index 45f18b46..1b0d0fe1 100755 --- a/source/classification2.md +++ b/source/classification2.md @@ -395,20 +395,20 @@ the `random_state` argument that is available in many `pandas` and `scikit-learn functions. Those functions will then use your `Generator` to generate random numbers instead of `numpy`'s default generator. For example, we can reproduce our earlier example by using a `Generator` object with the `seed` value set to 1; we get the same lists of numbers once again. -```{code} +```python from numpy.random import Generator, PCG64 rng = Generator(PCG64(seed=1)) random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rng).to_list() random_numbers1_third ``` -```{code} +```text array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5]) ``` -```{code} +```python random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rng).to_list() random_numbers2_third ``` -```{code} +```text array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) ``` @@ -432,6 +432,7 @@ You will also notice that we set the random seed using the `np.random.seed` func as described in {numref}`randomseeds`. ```{code-cell} ipython3 +:tags: ["remove-output"] # load packages import altair as alt import pandas as pd @@ -462,11 +463,18 @@ perim_concav = alt.Chart(cancer).mark_circle().encode( perim_concav ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] +glue("fig:06-precode", perim_concav) +``` + +:::{glue:figure} fig:06-precode :name: fig:06-precode Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label. -``` +::: + + +++ @@ -2205,10 +2213,10 @@ and guidance that the worksheets provide will function as intended. text, it requires a bit more mathematical background than we require. 
-## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/clustering.md b/source/clustering.md index f87355db..c44d9e00 100755 --- a/source/clustering.md +++ b/source/clustering.md @@ -1063,10 +1063,10 @@ and guidance that the worksheets provide will function as intended. learning, it covers *principal components analysis (PCA)*, which is a very popular technique for reducing the number of predictors in a data set. -## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/img/classification1/plot3d_knn_classification.png b/source/img/classification1/plot3d_knn_classification.png new file mode 100644 index 00000000..ea9818eb Binary files /dev/null and b/source/img/classification1/plot3d_knn_classification.png differ diff --git a/source/img/regression1/plot3d_knn_regression.png b/source/img/regression1/plot3d_knn_regression.png new file mode 100644 index 00000000..81ea7bb4 Binary files /dev/null and b/source/img/regression1/plot3d_knn_regression.png differ diff --git a/source/img/regression2/plot3d_linear_regression.png b/source/img/regression2/plot3d_linear_regression.png new file mode 100644 index 00000000..3cb5ffb9 Binary files /dev/null and b/source/img/regression2/plot3d_linear_regression.png differ diff --git a/source/inference.md b/source/inference.md index b280ec0b..ced12c67 100755 --- a/source/inference.md +++ b/source/inference.md @@ -645,7 +645,7 @@ was \$`r round(mean(airbnb$price),2)`. --> ```{code-cell} ipython3 -:tags: [remove-input] +:tags: ["remove-cell"] glue( "fig:11-example-means5", @@ -681,12 +681,12 @@ glue( ) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +:::{glue:figure} fig:11-example-means5 :name: fig:11-example-means5 -:figclass: caption-hack Comparison of population distribution, sample distribution, and sampling distribution. 
-``` +::: + +++ @@ -699,7 +699,7 @@ sampling distribution of the sample mean. We indicate the mean of the sampling distribution with a vertical line. ```{code-cell} ipython3 -:tags: [remove-input] +:tags: ["remove-cell"] # Plot sampling distributions for multiple sample sizes base = alt.Chart( @@ -753,12 +753,11 @@ glue( ) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +:::{glue:figure} fig:11-example-means7 :name: fig:11-example-means7 -:figclass: caption-hack Comparison of sampling distributions, with mean highlighted as a vertical line. -``` +::: +++ @@ -963,8 +962,7 @@ one_sample ``` ```{code-cell} ipython3 -:tags: [] - +:tags: ["remove-output"] one_sample_dist = alt.Chart(one_sample).mark_bar().encode( x=alt.X("price") .bin(maxbins=30) @@ -975,12 +973,17 @@ one_sample_dist = alt.Chart(one_sample).mark_bar().encode( one_sample_dist ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping1", one_sample_dist) +``` + +:::{glue:figure} fig:11-bootstrapping1 :name: fig:11-bootstrapping1 -:figclass: caption-hack Histogram of price per night (dollars) for one sample of size 40. -``` +::: +++ @@ -1002,7 +1005,7 @@ Since we need to sample with replacement when bootstrapping, we change the `replace` parameter to `True`. ```{code-cell} ipython3 -:tags: [] +:tags: ["remove-output"] boot1 = one_sample.sample(frac=1, replace=True) boot1_dist = alt.Chart(boot1).mark_bar().encode( @@ -1015,12 +1018,17 @@ boot1_dist = alt.Chart(boot1).mark_bar().encode( boot1_dist ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping3", boot1_dist) +``` + +:::{glue:figure} fig:11-bootstrapping3 :name: fig:11-bootstrapping3 -:figclass: caption-hack Bootstrap distribution. 
-``` +::: ```{code-cell} ipython3 boot1["price"].mean() @@ -1055,10 +1063,10 @@ boot20000 Let's take a look at the histograms of the first six replicates of our bootstrap samples. ```{code-cell} ipython3 -:tags: [] +:tags: ["remove-output"] six_bootstrap_samples = boot20000.query("replicate < 6") -alt.Chart(six_bootstrap_samples, height=150).mark_bar().encode( +six_bootstrap_fig = alt.Chart(six_bootstrap_samples, height=150).mark_bar().encode( x=alt.X("price") .bin(maxbins=20) .title("Price per night (dollars)"), @@ -1067,14 +1075,20 @@ alt.Chart(six_bootstrap_samples, height=150).mark_bar().encode( "replicate:N", # Recall that `:N` converts the variable to a categorical type columns=2 ) +six_bootstrap_fig +``` + +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping-six-bootstrap-samples", six_bootstrap_fig) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +:::{glue:figure} fig:11-bootstrapping-six-bootstrap-samples :name: fig:11-bootstrapping-six-bootstrap-samples -:figclass: caption-hack Histograms of the first six replicates of the bootstrap samples. -``` +::: +++ @@ -1125,7 +1139,7 @@ boot20000_means ``` ```{code-cell} ipython3 -:tags: [] +:tags: ["remove-output"] boot_est_dist = alt.Chart(boot20000_means).mark_bar().encode( x=alt.X("mean_price") @@ -1137,12 +1151,17 @@ boot_est_dist = alt.Chart(boot20000_means).mark_bar().encode( boot_est_dist ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping5", boot_est_dist) +``` + +:::{glue:figure} fig:11-bootstrapping5 :name: fig:11-bootstrapping5 -:figclass: caption-hack Distribution of the bootstrap sample means. -``` +::: +++ @@ -1150,10 +1169,10 @@ Let's compare the bootstrap distribution—which we construct by taking many the true sampling distribution—which corresponds to taking many samples from the population. 
```{code-cell} ipython3 -:tags: [remove-input] +:tags: [remove-cell] sampling_distribution.encoding.x["bin"]["extent"] = (90, 250) -alt.vconcat( +bootstr6fig = alt.vconcat( alt.layer( sampling_distribution, alt.Chart(sample_estimates).mark_rule(color="black", size=1.5, strokeDash=[6]).encode(x="mean(mean_price)"), @@ -1175,12 +1194,19 @@ alt.vconcat( ) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping6", bootstr6fig) +``` + +:::{glue:figure} fig:11-bootstrapping6 :name: fig:11-bootstrapping6 -:figclass: caption-hack Comparison of the distribution of the bootstrap sample means and sampling distribution. -``` +::: + + ```{code-cell} ipython3 :tags: [remove-cell] @@ -1277,7 +1303,7 @@ the middle 95\% of the sample mean prices in the bootstrap distribution. We can visualize the interval on our distribution in {numref}`fig:11-bootstrapping9`. ```{code-cell} ipython3 -:tags: [remove-input] +:tags: [remove-cell] # Create the annotation for for the 2.5th percentile rule_025 = alt.Chart().mark_rule(color="black", size=1.5, strokeDash=[6]).encode( x=alt.datum(ci_bounds[0.025]) @@ -1301,15 +1327,22 @@ text_975 = text_025.encode( rule_975 = rule_025.encode(x=alt.datum(ci_bounds[0.975])) # Layer the annotations on top of the distribution plot -boot_est_dist + rule_025 + text_025 + rule_975 + text_975 +bootstr9fig = boot_est_dist + rule_025 + text_025 + rule_975 + text_975 +``` + +```{code-cell} ipython3 +:tags: ["remove-cell"] + +glue("fig:11-bootstrapping9", bootstr9fig) ``` -```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 +:::{glue:figure} fig:11-bootstrapping9 :name: fig:11-bootstrapping9 -:figclass: caption-hack Distribution of the bootstrap sample means with percentile lower and upper bounds. -``` +::: + + +++ @@ -1365,8 +1398,6 @@ and guidance that the worksheets provide will function as intended. 
## References -+++ - ```{bibliography} :filter: docname in docnames ``` diff --git a/source/intro.md b/source/intro.md index 1c2f207d..8c08ade0 100755 --- a/source/intro.md +++ b/source/intro.md @@ -367,7 +367,6 @@ and we set `name` to the string `"Alice"`. ```{code-cell} ipython3 my_number = 1 + 2 - name = "Alice" ``` @@ -397,12 +396,13 @@ Other symbols won't work since they have their own meanings in Python. For examp `-` is the subtraction symbol; if we try to assign a name with the `-` symbol, Python will complain and we will get an error! -``` +```{code-cell} ipython3 +:tags: ["remove-output"] my-number = 1 ``` - -``` -SyntaxError: cannot assign to expression here. Maybe you meant '==' instead of '='? +```{code-cell} ipython3 +:tags: ["remove-input"] +print("SyntaxError: cannot assign to expression here. Maybe you meant '==' instead of '='?") ``` ```{index} object; naming convention @@ -705,7 +705,7 @@ The `ten_lang_percent` data frame shows that the ten Aboriginal languages in the `ten_lang` data frame were spoken as a mother tongue by between 0.008% and 0.18% of the Canadian population. -## Combining analysis steps with chaining and multiline expressions +## Combining steps with chaining and multiline expressions It took us 3 steps to find the ten Aboriginal languages most often reported in 2016 as mother tongues in Canada. Starting from the `can_lang` data frame, we: @@ -1233,10 +1233,10 @@ make sure to follow the instructions for computer setup found in {numref}`Chapter %s `. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended. -## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/jupyter.md b/source/jupyter.md index bdd5fd16..928e18a3 100755 --- a/source/jupyter.md +++ b/source/jupyter.md @@ -134,7 +134,7 @@ To run a code cell independently, the cell needs to first be activated. This is done by clicking on it with the cursor. 
Jupyter will indicate a cell has been activated by highlighting it with a blue rectangle to its left. After the cell has been activated ({numref}`activate-and-run-button`), the cell can be run by either pressing -the **Run** (▸) button in the toolbar, or by using a keyboard shortcut of +the **Run** (⏵) button in the toolbar, or by using a keyboard shortcut of `Shift + Enter`. ```{figure} img/jupyter/activate-and-run-button-annotated.png @@ -231,7 +231,7 @@ To edit a Markdown cell in Jupyter, you need to double click on the cell. Once you do this, the unformatted (or *unrendered*) version of the text will be shown ({numref}`markdown-cell-not-run`). You can then use your keyboard to edit the text. To view the formatted -(or *rendered*) text ({numref}`markdown-cell-run`), click the **Run** (▸) button in the toolbar, +(or *rendered*) text ({numref}`markdown-cell-run`), click the **Run** (⏵) button in the toolbar, or use the `Shift + Enter` keyboard shortcut. ```{figure} img/jupyter/markdown-cell-not-run.png @@ -285,7 +285,7 @@ As you might know (or at least imagine) by now, Jupyter notebooks are great for interactively editing, writing and running Python code; this is what they were designed for! Consequently, Jupyter notebooks are flexible in regards to code cell execution order. This flexibility means that code cells can be run in any -arbitrary order using the **Run** (▸) button. But this flexibility has a downside: +arbitrary order using the **Run** (⏵) button. But this flexibility has a downside: it can lead to Jupyter notebooks whose code cannot be executed in a linear order (from top to bottom of the notebook). A nonlinear notebook is problematic because a linear order is the conventional way code documents are run, and @@ -294,7 +294,7 @@ code is used in some automated process, it will need to run in a linear order, from top to bottom of the notebook. 
The most common way to inadvertently create a nonlinear notebook is to rely solely -on using the (▸) button to execute cells. For example, +on using the (⏵) button to execute cells. For example, suppose you write some Python code that creates a Python object, say a variable named `y`. When you execute that cell and create `y`, it will continue to exist until it is deliberately deleted with Python code, or when the Jupyter @@ -515,3 +515,12 @@ underscore (`_`). formatting, two good places to start are CommonMark's [Markdown cheatsheet](https://commonmark.org/help/) and [Markdown tutorial](https://commonmark.org/help/tutorial/). + ++++ + +## References + +```{bibliography} +:filter: docname in docnames +``` + diff --git a/source/reading.md b/source/reading.md index c0f5ec37..e0d4c5f1 100755 --- a/source/reading.md +++ b/source/reading.md @@ -109,14 +109,16 @@ So in this case, `happiness_report.csv` would be reached by starting at the root then the `dsci-100` folder, then the `project3` folder, and then finally the `data` folder. So its absolute path would be `/home/dsci-100/project3/data/happiness_report.csv`. We can load the file using its absolute path as a string passed to the `read_csv` function from `pandas`. -```python +```{code-cell} ipython3 +:tags: ["remove-output"] happy_data = pd.read_csv("/home/dsci-100/project3/data/happiness_report.csv") ``` If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current working directory to the file, with slashes `/` separating each step. Since we are currently in the `project3` folder, we just need to enter the `data` folder to reach our desired file. Hence the relative path is `data/happiness_report.csv`, and we can load the file using its relative path as a string passed to `read_csv`. 
-```python +```{code-cell} ipython3 +:tags: ["remove-output"] happy_data = pd.read_csv("data/happiness_report.csv") ``` Note that there is no forward slash at the beginning of a relative path; if we accidentally typed `"/data/happiness_report.csv"`, @@ -147,13 +149,13 @@ all of the folders between the computer's root, represented by `/`, and the file across different computers. For example, suppose Fatima and Jayden are working on a project together on the `happiness_report.csv` data. Fatima's file is stored at -``` +```text /home/Fatima/project3/data/happiness_report.csv ``` while Jayden's is stored at -``` +```text /home/Jayden/project3/data/happiness_report.csv ``` @@ -275,11 +277,13 @@ With this extra information being present at the top of the file, using into Python. In the case of this file, Python just prints a `ParserError` message, indicating that it wasn't able to read the file. -```python +```{code-cell} ipython3 +:tags: ["remove-output"] canlang_data = pd.read_csv("data/can_lang_meta-data.csv") ``` -```text -ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 6 +```{code-cell} ipython3 +:tags: ["remove-input"] +print("ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 6") ``` ```{index} ParserError @@ -841,7 +845,8 @@ be able to connect to a database using this information. ```{index} ibis; postgres, ibis; connect ``` -```python +```{code-cell} ipython3 +:tags: ["remove-output"] conn = ibis.postgres.connect( database="can_mov_db", host="fakeserver.stat.ubc.ca", @@ -859,12 +864,14 @@ connecting to and working with an SQLite database. 
For example, we can again use ```{index} ibis; list_tables ``` -```python +```{code-cell} ipython3 +:tags: ["remove-output"] conn.list_tables() ``` -```text -["themes", "medium", "titles", "title_aliases", "forms", "episodes", "names", "names_occupations", "occupation", "ratings"] +```{code-cell} ipython3 +:tags: ["remove-input"] +print('["themes", "medium", "titles", "title_aliases", "forms", "episodes", "names", "names_occupations", "occupation", "ratings"]') ``` We see that there are 10 tables in this database. Let's first look at the @@ -874,16 +881,20 @@ database. ```{index} ibis; table ``` -```python +```{code-cell} ipython3 +:tags: ["remove-output"] ratings_table = conn.table("ratings") ratings_table ``` -```text +```{code-cell} ipython3 +:tags: ["remove-input"] +print(""" AlchemyTable: ratings title string average_rating float64 num_votes int64 +""") ``` ```{index} ibis; [] @@ -892,12 +903,15 @@ AlchemyTable: ratings To find the lowest rating that exists in the data base, we first need to select the `average_rating` column: -```python +```{code-cell} ipython3 +:tags: ["remove-output"] avg_rating = ratings_table[["average_rating"]] avg_rating ``` -```text +```{code-cell} ipython3 +:tags: ["remove-input"] +print(""" r0 := AlchemyTable: ratings title string average_rating float64 @@ -906,6 +920,7 @@ r0 := AlchemyTable: ratings Selection[r0] selections: average_rating: r0.average_rating +""") ``` ```{index} database; ordering, ibis; order_by, ibis; head @@ -914,7 +929,8 @@ Selection[r0] Next we use the `order_by` function from `ibis` order the table by `average_rating`, and then the `head` function to select the first row (i.e., the lowest score). 
-```python +```{code-cell} ipython3 +:tags: ["remove-output"] lowest = avg_rating.order_by("average_rating").head(1) lowest.execute() ``` @@ -925,7 +941,6 @@ lowest = pd.DataFrame({"average_rating" : [1.0]}) lowest ``` - We see the lowest rating given to a movie is 1, indicating that it must have been a really bad movie... @@ -1250,7 +1265,8 @@ page we want to scrape by providing its URL in quotations to the `requests.get` function. This function obtains the raw HTML of the page, which we then pass to the `BeautifulSoup` function for parsing: -```python +```{code-cell} ipython3 +:tags: ["remove-output"] import requests import bs4 @@ -1338,7 +1354,8 @@ below that `read_html` found 17 tables on the Wikipedia page for Canada. ```{index} read function; read_html ``` -```python +```{code-cell} ipython3 +:tags: ["remove-output"] canada_wiki_tables = pd.read_html("https://en.wikipedia.org/wiki/Canada") len(canada_wiki_tables) ``` @@ -1514,7 +1531,8 @@ response using the `json` method. -```python +```{code-cell} ipython3 +:tags: ["remove-output"] import requests nasa_data_single = requests.get( @@ -1539,7 +1557,8 @@ in an object called `nasa_data`; now the response will take the form of a Python list. 
Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), and there will be 74 items total, one for each day between the start and end dates: -```python +```{code-cell} ipython3 +:tags: ["remove-output"] nasa_data = requests.get( "https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&start_date=2023-05-01&end_date=2023-07-13" ).json() @@ -1548,6 +1567,10 @@ len(nasa_data) ```{code-cell} ipython3 :tags: [remove-input] +# need to secretly re-load the nasa data again because the above running code destroys it +# see PR 341 for why we need to do things this way (essentially due to PDF build) +with open("data/nasa.json", "r") as f: + nasa_data = json.load(f) len(nasa_data) ``` @@ -1626,10 +1649,10 @@ and guidance that the worksheets provide will function as intended. - [extracting the data for apartment listings on Craigslist](https://www.youtube.com/embed/YdIWI6K64zo), and - [extracting Canadian city names and populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk). 
-## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/references.bib b/source/references.bib index e135c672..d306de25 100644 --- a/source/references.bib +++ b/source/references.bib @@ -336,7 +336,7 @@ @inproceedings{kluyver2016jupyter title={Jupyter Notebooks: a publishing format for reproducible computational workflows}, author={Kluyver, Thomas and Ragan-Kelley, Benjamin and P{\'e}rez, Fernando and Granger, Brian and Bussonnier, Matthias and Frederic, Jonathan and Kelley, Kyle and Hamrick, Jessica and Grout, Jason and Corlay, Sylvain and Ivanov, Paul and Avila, Dami{\'a}n and Abdalla, Safia and Willing, Carol and {Jupyter Development Team}}, year={2016}, - booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas: Proceedings of the $20^{\text{th}}$ international conference on electronic publishing}, + booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas: Proceedings of the 20th international conference on electronic publishing}, publisher = {IOS Press}, volume = {87}, address = {Amsterdam} @@ -508,7 +508,7 @@ @article{requests @article{rhoophiuchi, title = {Rho Ophiuchi cloud complex}, author = {{NASA} and {ESA} and {CSA} and {STScI} and {K. Pontoppidan (STScI)} and {A. 
Pagan (STScI)}}, - year = {2023}, + year = {Accessed Online: 2023}, journal={URL: https://esawebb.org/images/weic2316a/}} diff --git a/source/regression1.md b/source/regression1.md index 87ce1371..a4ed7633 100755 --- a/source/regression1.md +++ b/source/regression1.md @@ -20,6 +20,7 @@ kernelspec: from chapter_preamble import * from IPython.display import HTML +from IPython.display import Image import plotly.express as px import plotly.graph_objects as go ``` @@ -1148,7 +1149,11 @@ fig.update_layout( template="plotly_white", ) -glue("fig:07-knn-mult-viz", fig) +# if HTML, use the plotly 3d image; if PDF, use static image +if "BOOK_BUILD_TYPE" in os.environ and os.environ["BOOK_BUILD_TYPE"] == "PDF": + glue("fig:07-knn-mult-viz", Image("img/regression1/plot3d_knn_regression.png")) +else: + glue("fig:07-knn-mult-viz", fig) ``` ```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 @@ -1208,8 +1213,6 @@ and guidance that the worksheets provide will function as intended. ## References -+++ - ```{bibliography} :filter: docname in docnames ``` diff --git a/source/regression2.md b/source/regression2.md index 67160ece..9db8e2c0 100755 --- a/source/regression2.md +++ b/source/regression2.md @@ -20,6 +20,7 @@ kernelspec: from chapter_preamble import * from IPython.display import HTML +from IPython.display import Image import numpy as np import plotly.express as px import plotly.graph_objects as go @@ -827,7 +828,11 @@ fig.update_layout( template="plotly_white", ) -glue("fig:08-3DlinReg", fig) +# if HTML, use the plotly 3d image; if PDF, use static image +if "BOOK_BUILD_TYPE" in os.environ and os.environ["BOOK_BUILD_TYPE"] == "PDF": + glue("fig:08-3DlinReg", Image("img/regression2/plot3d_linear_regression.png")) +else: + glue("fig:08-3DlinReg", fig) ``` ```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 @@ -1395,10 +1400,10 @@ and guidance that the worksheets provide will function as intended. 
covered earlier are indeed more flexible but become very slow when given lots of data. -## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/setup.md b/source/setup.md index 4ef17991..c45a0980 100755 --- a/source/setup.md +++ b/source/setup.md @@ -195,7 +195,8 @@ found in [the online Docker documentation](https://docs.docker.com/desktop/insta ### Ubuntu **Installation** To install Docker on Ubuntu, open the terminal and enter the following five commands. -``` +```{code-cell} +:tags: ["remove-output"] sudo apt update sudo apt install ca-certificates curl gnupg curl -fsSL https://get.docker.com -o get-docker.sh @@ -207,7 +208,8 @@ sudo sh get-docker.sh and look for the line `FROM ubcdsci/py-dsci-100:` followed by a tag consisting of a sequence of numbers and letters. Then in the terminal, navigate to the directory where you want to run JupyterLab, and run the following command, replacing `TAG` with the *tag* you found earlier. -``` +```{code-cell} +:tags: ["remove-output"] docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/py-dsci-100:TAG jupyter lab ``` The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the @@ -283,7 +285,8 @@ A JupyterLab Desktop session, showing the Terminal option at the bottom. In this terminal, run the following commands: -``` +```{code-cell} +:tags: ["remove-output"] pip install --upgrade jupyterlab-git conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/main/environment.yml ``` @@ -304,7 +307,8 @@ correctly set up and ready for use. Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY)) and type the following command: -``` +```{code-cell} +:tags: ["remove-output"] xcode-select --install ``` Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). 
@@ -326,18 +330,21 @@ the various Python software packages needed for the worksheets. **Installation** First, we will install Git for version control. Open the terminal and type the following commands: -``` +```{code-cell} +:tags: ["remove-output"] sudo apt update sudo apt install git ``` Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). Download the `JupyterLab-Setup-Debian.deb` installer file for Ubuntu/Debian. Open a terminal, navigate to where the installer file was downloaded, and run the command -``` +```{code-cell} +:tags: ["remove-output"] sudo dpkg -i JupyterLab-Setup-Debian.deb ``` Run JupyterLab Desktop using the command -``` +```{code-cell} +:tags: ["remove-output"] jlab ``` diff --git a/source/version-control.md b/source/version-control.md index 941ee49d..48c49026 100755 --- a/source/version-control.md +++ b/source/version-control.md @@ -1130,10 +1130,10 @@ you can expand your knowledge through the resources listed below: is an excellent additional resource to consult if you need help generating and using personal access tokens. -## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/viz.md b/source/viz.md index f82d85c9..e062a679 100755 --- a/source/viz.md +++ b/source/viz.md @@ -851,8 +851,8 @@ and is just added for readability. ```{code-cell} ipython3 canadian_population = 35_151_728 -can_lang["mother_tongue_percent"] = can_lang["mother_tongue"] / canadian_population * 100 -can_lang["most_at_home_percent"] = can_lang["most_at_home"] / canadian_population * 100 +can_lang["mother_tongue_percent"] = can_lang["mother_tongue"]/canadian_population*100 +can_lang["most_at_home_percent"] = can_lang["most_at_home"]/canadian_population*100 can_lang[["mother_tongue_percent", "most_at_home_percent"]] ``` @@ -2063,10 +2063,10 @@ and guidance that the worksheets provide will function as intended. 
is where you should look if you want to learn about `date` and `time`, including how to create them, and how to use them to effectively handle durations, etc -## References - +++ +## References + ```{bibliography} :filter: docname in docnames ``` diff --git a/source/wrangling.md b/source/wrangling.md index 998f6bc0..aa5fe42d 100755 --- a/source/wrangling.md +++ b/source/wrangling.md @@ -1490,8 +1490,9 @@ in our case, we get an error. region_lang["most_at_home":"lang_known"].groupby("region").max() ``` -``` -KeyError: "region" +```{code-cell} ipython3 +:tags: ["remove-input"] +print('KeyError: "region"') ``` This is because when we use `[]` we selected only the columns between @@ -1727,23 +1728,20 @@ Instead of using the `assign` method we can directly modify the `english_lang` d This would be a more natural choice in this particular case, since the syntax is more convenient for simple column modifications and additions. ```{code-cell} ipython3 -:tags: [remove-cell] -english_lang["city_pops"] = [4098927, 5928040, 1392609, 1321426, 2463431] -``` -```python +:tags: [remove-output] english_lang["city_pops"] = [4098927, 5928040, 1392609, 1321426, 2463431] english_lang ``` -```text +```{code-cell} ipython3 +:tags: ["remove-input"] +print(""" /tmp/ipykernel_12/2654974267.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy english_lang["city_pops"] = [4098927, 5928040, 1392609, 1321426, 2463431] -``` -```{code-cell} ipython3 -:tags: [remove-input] +""") english_lang ``` @@ -1910,10 +1908,10 @@ and guidance that the worksheets provide will function as intended. what you want. In that case, you may consider using [a for loop](https://wesmckinney.com/book/python-basics.html#control_for) {cite:p}`mckinney2012python`. 
-## References
-
 +++
 
+## References
+
 ```{bibliography}
 :filter: docname in docnames
 ```