# Unit 1
## Lesson 1
### Assignment 1: Data, Engineering and Machine Learning
- Technology stack- collection of elements that make up a product
- front end - interface that users interact with
- back end - servers, services, databases etc; heavy lifting that feeds data to the front end
- Related roles
- Data Analyst: draw conclusions and generate reports (proto data scientist)
- Data Engineer: gather and store data, create and manage data pipelines, databases; less interpretation of data
- Machine Learning Engineer: algorithms and modeling with an emphasis on algorithm design and efficiency
- Data Scientist: whatever the recruiter says it is
### Assignment 2: The data science toolkit
- Python
- Packages
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- StatsModels
- SQL- Structured Query Language
- access and preprocess data
### Assignment 3: Thinking like a data scientist
- curiosity, practicality (have to define questions with limited scope that can be answered with data), skepticism (how confident are we in the results we see; lies, damn lies, and statistics)
- sometimes the way an abstract question is translated into a concrete one is dictated by the available data
- finding and evaluating data sources: data archives, repositories, web scraping, logs, documents...
- evaluating uncertainty: assess how certain we are that conclusions based on a particular statistic are valid; are there flaws in the source of the sample (sampling method, representative nature of the sample for the population etc), size of the sample, noise (variance) in the data
### Assignment 4: Drill: What can data science do?
Take the following scenarios and describe how you would make it testable and translate it from a general question into something statistically rigorous
1. You work at an e-commerce company that sells three goods: widgets, doodads, and fizzbangs. The head of advertising asks you which they should feature in their new advertising campaign. You have data on individual visitors' sessions (activity on a website, pageviews, and purchases), as well as whether or not those users converted from an advertisement for that session. You also have the cost and price information for the goods.
- a person who saw an ad has four outcomes: buy nothing, buy w, buy d, or buy f
- we want to know which ad-buy pair results in the highest sales and whether that number is statistically significant
- Not clear what a follow up test would be given that it appears all the combinations have already been tested...
2. You work at a web design company that offers to build websites for clients. Signups have slowed, and you are tasked with finding out why. The onboarding funnel has three steps: email and password signup, plan choice, and payment. On a user level you have information on what steps they have completed as well as timestamps for all of those events for the past 3 years. You also have information on marketing spend on a weekly level.
- Did sign-ups slow because fewer people started the sign-up process or because more people fell out of the pipeline on the way to completion? What ad campaign was on when the person started the process?
- find the time when the max number of people got through step 3, calculate whether that is statistically different from the last several weeks. Then look at the relationship between the number of people who got through step 1 and 2 for both timeframes to see where people fell out of the pipeline. Then consider the marketing strategy during that time (assuming they are statistically different) and possibly reimplement that approach for a few weeks, track the numbers and compare against current status.
3. You work at a hotel website and currently the website ranks search results by price. For simplicity's sake, let's say it's a website for one city with 100 hotels. You are tasked with proposing a better ranking system. You have session information, price information for the hotels, and whether each hotel is currently available.
4. You work at a social network, and the management is worried about churn (users stopping using the product). You are tasked with finding out if their churn is atypical. You have three years of data for users with an entry for every time they've logged in, including the timestamp and length of session.
- examine the average length of time between sign up and last login of users grouped by sign-up week
### Assignment 5: Challenge: Personal goals
- academic style problems, but also some of these churn questions (I reckon)
- wrangling big datasets (dip into spark), wrangling small datasets (what constitutes over-mining data)
- hypothesis testing
- signal processing
## Lesson 2: SQL: data access methods
### Assignment 1: Introduction to Databases
- Database contexts
- operational layer- part of the application responsible for delivering the core user experience; server and client-side application code
- storage layer- database used to store information rather than storing locally in text files of some kind
- databases can be distributed across many machines and have higher capacity
- multiple users can access a remote database at the same time
- databases can take data from multiple sources
- analytics layer
- database structure
- relational databases consist of a series of tables, with each table having its own defined schema and a number of records (rows)
- each table has rows and columns; each column has a name and a data type associated with it
- database schema is a particular configuration of tables with columns
- schema can change, but migrating (systematic update) data to new structure is costly (tricky to make sure that things won't break in the move and that no data is lost)
- each record (row) must conform to the schema and ideally will have a value in every field (column) for efficiency (though realistically there are NULLs, blanks and N/As in tables)
- three kinds of tables:
- raw- contain simple, relatively unprocessed data (closely resembles the data produced by the operational services)
- processed tables- contain data that's been cleaned and transformed to be more readable/useable
- roll-up tables- specific kind of processed table that take data and aggregate it @TODO
- SQL- Structured Query Language
- language used to create, retrieve, update and delete database records
- there are several flavors of SQL with meaningful minor differences; PSQL, SQLite, MySQL
- good to remember that SQL sees the world in rows, and we'll want to think about attributes at a row level (with some grouping and aggregating)
### Assignment 2: Setting up PSQL (already done)
- psql -d database_name -f file.sql
### Project 3: SQL Basics
- The CREATE clause - to create a new table
- lowercase for table names, no spaces (underscores)
- each column has a column name and a TYPE (example below has arbitrary types filled in), but can also have constraints like prohibiting null values; column creation lines are separated by commas
CREATE TABLE table_name (
some_column_name FLOAT column_constraint,
some_second_column_name TEXT,
...
last_column_name TYPE);
- The SELECT and FROM clause
- SELECT retrieves rows FROM a table
- select all rows:
SELECT * FROM someTable;
- select specific rows
SELECT some_column_name, last_column_name FROM table_name;
- aliasing
- can SELECT a column and return it with a different label; useful when selecting across multiple tables that reuse column names
SELECT some_column_name AS col1 FROM table_name;
- Filtering with WHERE
- allows us to specify a set of conditions that the results must meet
- LIKE - for pattern matching when analyzing string data
- BETWEEN - check if a value is between a pair of values
- AND and OR - linking conditions together
- Ordering with ORDER BY
- control the order in which results are returned
- can link together multiple ordering conditions
ORDER BY some_column_name DESC;
- Limiting with LIMIT
- limit the number of results returned, for example to get the top 5
- Formatting notes:
- new lines for everything, indent column specific instructions
- ALL CAPS for SQL instructions, whatever casing is relevant for other names
#### Exercises:
https://gist.github.com/jordanplanders/298877231bb950a192223c681754dd56
- - 1. The IDs and durations for all trips of duration greater than 500, ordered by duration.
SELECT
trip_id,
duration
FROM
trips
WHERE
duration >500
ORDER BY
duration;
- - 2. Every column of the stations table for station id 84.
SELECT *
FROM
stations
WHERE
station_id = 84;
- - 3. The min temperatures of all the occurrences of rain in zip 94301.
SELECT
mintemperaturef
FROM
weather
WHERE
events = 'Rain'
AND zip = 94301;
### Project 4: Aggregating and grouping
- GROUP BY-
- comes after WHERE clause and before ORDER BY clauses;
- without aggregating function, just gets rid of duplicate entries
- all columns in GROUP BY clause must also be in SELECT statement
- can use col numbers instead of col names
- Aggregators
- functions that take a collection of values and return a single value
- return a column labelled by the function rather than the column name, so you need to alias it
- AVG, MIN, MAX, COUNT(*)
#### Exercises: https://gist.github.com/jordanplanders/1c3e994b6246aa51cdba8b4e65166f28
-- 1. What was the hottest day in our data set? Where was that?
SELECT
maxtemperaturef,
zip
FROM
weather
ORDER BY
maxtemperaturef DESC
LIMIT 1;
-- 2. How many trips started at each station?
SELECT
COUNT(trip_id),
start_station
FROM
trips
GROUP BY
start_station
ORDER BY
COUNT(trip_id) DESC;
-- 3. What's the shortest trip that happened?
SELECT *
FROM
trips
ORDER BY
duration
LIMIT 1;
-- 4. What is the average trip duration, by end station?
SELECT
end_station,
avg(duration)
FROM
trips
GROUP BY
end_station
ORDER BY
avg(duration);
### Project 5: Joins and CTEs
- Basic Joins
- in a join clause indicate one or more pairs of columns you want to join the two tables on
- by default SQL performs an inner join (only returns rows that are successfully joined from the two tables)
- comes after the FROM statement (if multiple tables, order matters), followed by an ON clause that specifies the table.columns that should be the same (to link the two)
SELECT
table1.col1,
table1.col2,
table2.col3,
table2.col4
FROM
table1
JOIN
table2
ON
table1.col2 = table2.col2;
- Table aliases
- in a join, often useful to alias table to simplify table names or add a table name (in the case of a self join) in the ON clause
- Types of Joins
- (INNER) JOIN: only returns rows that are successfully joined
- LEFT (OUTER) JOIN: returns all rows from the left table even if there are no common rows in the right table; rows without a match will be filled with NULL
- RIGHT (OUTER) JOIN: same as a LEFT (OUTER) JOIN if you reverse the table order in the FROM and JOIN clauses
- (FULL) OUTER JOIN: returns all rows with NULLs in all places where the join doesn't fill in data
- CTEs (Common Table Expressions)
- since a join statement returns a table, you can join the result of a join statement to other tables or to the results of other queries
- note: JOINs happen before aggregate functions so if you want aggregate information about one table and information from the other table, but you don't want information from the other table to weight the results of the aggregate, create the first table with aggregation, THEN join to the second table
- if you join stations with trips and calculate the average lat and lon of all start stations in a city, it will be the average location weighted by the number of trips; however, if you calculate the average station location from the stations table first and then join, it will be the average location over the set of unique stations in that city.
WITH intermediate_table_name AS (query1)
- multiple joins are common to collect information from multiple tables
- Case
- set up conditions then take action in a column based on them
- CASE WHEN condition THEN value ELSE value END
- CASE statements go in the SELECT clause, indicate what value to return given a conditional statement, and are then aliased
#### Exercises:
https://gist.github.com/jordanplanders/f5d960fa280c27d5772a93b6bd268bf0
-- 1. What are the three longest trips on rainy days?
SELECT
trips.trip_id,
weather.date,
trips.duration,
weather.events
FROM
trips
JOIN
weather
ON
weather.date = SUBSTRING(trips.start_date, 0, 11)
WHERE
weather.events = 'Rain'
GROUP BY
trips.trip_id,
weather.date,
trips.duration,
weather.events
ORDER BY
weather.date
LIMIT 300;
-- 2. Which station is full most often?
--FIND WHICH STATION IS FULL (DOCKS_AVAILABLE = BIKES_AVAILABLE) MOST OFTEN (NUMBER OF STATUS UPDATES WHERE THIS WAS TRUE)
WITH station_full
AS(
SELECT
station_id,
COUNT(station_id) as times
FROM
status
WHERE
status.bikes_available = status.docks_available
GROUP BY
station_id
ORDER BY
COUNT(station_id) DESC
LIMIT 1)
-- MATCHING A STATION NAME WITH THE STATION_ID FROM ABOVE
SELECT
stations.name,
station_full.times
FROM
stations
JOIN
station_full
ON
stations.station_id = station_full.station_id;
-- 3. Return a list of stations with a count of number of trips starting at that station but ordered by dock count.
-- Query to find number of trips started at each station
WITH
stations2 AS
(
SELECT
stations.station_id AS station_id,
stations.name, COUNT(*) as trips_started
FROM
stations
JOIN
trips
ON
stations.name = trips.start_station
GROUP BY
trips.start_station, stations.station_id, stations.name)
-- QUERY TO ORDER STATIONS2 BY NUMBER OF DOCKS AVAILABLE
SELECT
status.docks_available,
stations2.name,
stations2.trips_started
FROM
stations2
JOIN
status
ON
stations2.station_id = status.station_id
ORDER BY
status.docks_available DESC
LIMIT 10;
-- 4. (Challenge) What's the length of the longest trip for each day it rains anywhere?
-- QUERY TO ONLY TRIPS WHEN IT'S RAINING
WITH raining AS (
SELECT
weather.date AS date,
trips.start_date AS trip_timestamp,
trips.duration AS duration
FROM
weather
JOIN
trips
ON
weather.date = SUBSTRING(trips.start_date, 0, 11)
WHERE
events = 'Rain')
SELECT
max(duration)/3600 AS max_duration_hrs,
date AS start_date
FROM
raining
GROUP BY
date
ORDER BY
date
LIMIT 100;
### Project 6: Airbnb Cities
- What's the most expensive listing? What else can you tell me about the listing?
WITH max_price_list AS (
SELECT * from calendar
WHERE cast(price as float) >0
ORDER BY cast(price AS float) DESC
LIMIT 1)
SELECT
listings.neighbourhood,
listings.room_type,
listings.minimum_nights,
max_price_list.date,
max_price_list.price
FROM listings
JOIN max_price_list
ON cast(max_price_list.listing_id AS int) = listings.id;
Calabasas Entire home/apt 2 2018-12-22 61000.00
- What neighborhoods seem to be the most popular?
WITH taken_list as (
SELECT
count(*) AS num_taken_list,
listing_id
FROM calendar
WHERE available = 't'
AND substring(date, 0, 5)= '2018'
GROUP BY listing_id )
SELECT
listings.neighbourhood,
SUM(taken_list.num_taken_list) AS num_taken
FROM listings
JOIN taken_list
ON cast(taken_list.listing_id as int) = listings.id
GROUP BY listings.neighbourhood
ORDER BY SUM(taken_list.num_taken_list) DESC
LIMIT 5;
neighborhood num_taken_nights
Hollywood 181607
Venice 158504
Downtown 107685
Long Beach 86799
Santa Monica 67395
- What time of year is the cheapest time to go to your city? November
SELECT
AVG(cast(price AS float)),
substring(date, 6, 2) AS mo
FROM calendar
GROUP BY substring(date, 6, 2);
average price month
231.06087495398208 11
- What about the busiest? November
WITH freerooms AS(
SELECT
COUNT(*) AS free,
substring(date, 6, 2) AS mo
FROM calendar
WHERE available = 't'
GROUP BY substring(date, 6, 2)),
busyrooms AS(
SELECT
COUNT(*) AS busy,
substring(date, 6, 2) AS mo
FROM calendar
WHERE available = 'f'
GROUP BY substring(date, 6, 2))
SELECT
cast(busyrooms.busy as float)/(cast(freerooms.free as float) +cast(busyrooms.busy as float)) AS busy_room_rate,
busyrooms.mo
FROM busyrooms
JOIN freerooms
ON busyrooms.mo = freerooms.mo;
## Lesson 3: Intermediate visualization
### Assignment 1: The basics of plotting review
- Basic plot types
- line plots- data over some continuous variable
- scatter plots- relationship between two variables
- histograms- distribution of a continuous dataset
- bar plot- counts of categorical variables
- QQ plot- how close a variable is to a known distribution & outliers
- box plot- compare groups and identify differences in variance & outliers
### Assignment 2: Formatting, subplots, and seaborn
- seaborn @TODO let's talk about the structure of the seaborn package
- sns.load_dataset() @TODO what form does the data need to be in for it to load properly?
- sns.set(style = )
- sns.despine()
- sns.FacetGrid()
- .map(plottype, variable_to_be_plotted)
- sns.boxplot(x = , y = , hue = , data = )
- sns.factorplot(x= , y= , hue= , data= , kind=* , ci=, join= , dodge= )
- bar: bar plot
- point: like a bar plot but more efficient; good for points that have error bars; may or may not be connected
- sns.lmplot(x= , y= , hue= , data= , fit_reg= , ci= , lowess= , scatter_kws={})
- scatter plot
- fit_reg: with/without a regression line
- ci: with/without confidence interval error cloud @TODO I don't know how to calculate this error cloud manually
- lowess: using local weighting to fit a line @TODO
- the col= parameter results in the data being split out by category and plotted in subplots @TODO
#### Drill 3: Presenting the same data multiple ways
https://github.com/jordanplanders/Thinkful/blob/master/Bootcamp/Unit%201/bike_data/Kevin_bike_datavis.ipynb
### Assignment 4: Cleaning Data
- Finding Dirt
- anomalous values
- fake answers (straightlining, repeating answer sequence, time to finish below some threshold)
- Cleaning Dirt
- replace with NULL or None
- map to a valid response (an extreme value maps to the highest non-outlier response, winsorizing )
- remove (duplicate entries)
- other (data issues that are systemic, widespread, or present for a particular data-related reason)
- clean with code so that there's a record of the cleaning process
- don't alter original data, keep a separate "clean copy"
### Assignment 5: Manipulating strings in Python
- string methods
- re (regular expressions or regex)
- a regular expression is a sequence of characters that defines a search pattern
- not always more efficient than string methods
- extracting different categories of character from a string
- isdigit()
- isalpha()
- isnumeric()
- isspace()
- isalnum()
- Apply: .apply() allows one to apply a method to each element in a data frame or series
- lambda functions: small, temporary, unnamed function of the format:
lambda x: f(x) (if [condition] else [alternative])
- one line functions that would usually be:
def function(x):
return f(x)
- filter:
- returns an iterator over the elements of a series or string (based on a function that returns booleans), keeping only the entries/characters for which the function returns True. @TODO @WTF
- list(filter(lambda x: boolean function, series))
- ex: list(filter(lambda x: str(x).isdigit(), money))
- series.apply(lambda x: ''.join(list(filter(boolean_function, str(x)))))
- ex: money.apply(lambda x: ''.join(list(filter(str.isdigit, str(x)))))
- splitting strings apart: split a string into a list of strings; by default splits at spaces, but can be split at some other character or string
- pandas has its own version: series.str.split(delimiter, expand=True) that will split the series of strings on the delimiter or regex pattern and return a set of series that correspond to the first, second, third, ... nth element in the split
- ex: word_split = words.str.split('$', expand=True)
names = word_split[0]
emails = word_split[1]
- replace: replace specific characters or strings with a new string
- pandas has its own version: series.str.replace(str1, str2)
- changing case: often it will be useful to standardize the case with .lower(), .upper(), or .capitalize()
- stripping whitespace: string.strip(), also lstrip(), rstrip()
- pandas has its own version for whitespace: series.str.strip()
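A small sketch pulling several of these pieces together on a made-up money series (the values and column name are illustrative, not from the source):

import pandas as pd

money = pd.Series(['$1,200', ' 450 dollars ', 'n/a', '$3.50'])

# keep only digit characters using string methods + filter
digits_only = money.apply(lambda x: ''.join(filter(str.isdigit, str(x))))

# or extract a full numeric pattern (decimals included) with a regex, then convert
amounts = (money.str.strip()
                .str.replace(',', '', regex=False)
                .str.extract(r'(\d+\.?\d*)')[0])
amounts = pd.to_numeric(amounts, errors='coerce')  # non-matches become NaN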
### Assignment 6: Exercise on cleaning data
### Assignment 7: Missing data
- missing data can be systematically missing, which raises questions about dataset reliability
- even if missingness is random, analysis can break because of missing values --> df.dropna() is a built in method in pandas to drop all rows with a missing value
- When does missingness matter?
- so many rows have to be thrown out that the set loses statistical power
- systematic missing values causes systematic subsets of data to be thrown out making the dataset biased
- MCAR- missing completely at random: a three-year-old inserts crayons into a random server in a server room and corrupts a drive (it could have been a three-year-old from anywhere given it's random that one would be there in the first place, and they picked a random server and a random place to put crayons)
- MAR- missing at random: if a particular group is likely to skip a question regardless of response and we know this, we can explain the absence of the data and carefully work around it
- check correlation between missing scores and various variables to identify what is lurking
- MNAR- missing not at random: if samples that are likely to have a particular value are absent, stop
- Imputing Data- guessing at what values would fill empty fields
- replacing with the mode, median, or mean works for keeping central tendency the same, but reduces the variance and alters correlations with other variables
- can group existing entries into similar groups and impute strategically within each group
- Beyond Imputation
- sometimes it's possible to collect more data, either in a focused way targeting the MAR group, or more broadly if it's an MCAR problem
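A minimal sketch of these options in pandas, on a made-up frame (column names and the grouping column in the comment are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [34, np.nan, 29, 41, np.nan],
                   'income': [50, 62, np.nan, 75, 58]})

print(df.isnull().sum())          # how much is missing, per column

dropped = df.dropna()             # drop incomplete rows (costs statistical power)
imputed = df.fillna(df.median())  # impute the median (keeps central tendency, shrinks variance)

# group-based imputation: fill within similar subgroups instead of globally
# (assumes a hypothetical 'group' column)
# df['age'] = df.groupby('group')['age'].transform(lambda s: s.fillna(s.median()))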
## Lesson 4: Experimental Design
### Project 1: A/B tests
- One of many possible experimental designs used to identify whether one version of an object of interest is better at producing a desired outcome
- Components:
- Two versions of something whose effects will be compared (preferably a control version and an alternate, though realistically it is sometimes two unknowns)
- a sample (representative of the population), divided into two groups (that are the same in composition and preferably randomly chosen)
- a hypothesis stating an expectation of what will happen
- identified outcome(s) of interest; the measurable key metric that will be used to identify/characterize change
- other measured variables: measure the hell out of everything to help check that the two groups were sufficiently similar, and to identify other responses to the change
- Getting a good sample: the sample has to be representative of the population and any differences in outcome should be due to differences in treatment
- easy when there's a constant flow and it's possible to just sample the flow
- hard when it has to be all or nothing (music on, music off)
- Key to key metrics
- metric as close to the business goal as possible; a metric that reflects an intermediate step and doesn't measure the final outcome doesn't really help
- metrics that are reliably measurable, preferably something passively measured and not based on specific engagement with subjects or self-reported data
- metrics may have different time windows; it may take a few months for something to surface as a win or a loss.
#### Exercises
- Does a new supplement help people sleep better?
Hypothesis: would presumably be that the supplement would help people sleep better
Sample: It's difficult to sample the population at large, but perhaps by getting people to opt in via their PCPs, making sure that the percentage of people who say they don't sleep well matches the national average would be a start.
Experiment: The experiment would involve sleep studies of all people without the supplement as a baseline, measuring reported quality of sleep as well as EEG and other biometrics. Then the group would be split; half the sample would be given a placebo and half would be given the supplement, and the sleep study would be repeated. Ideally there would be more than one night on each side, but realistically it would be two nights per subject.
Key Metric: fewer bouts of restlessness, less EEG activity, or self reported ratings. (As a lay person in the field, I would probably have to identify the key metric from the baseline night reports)
- Will new uniforms help a gym's business?
Hypothesis: new uniforms will help a gym's business.
Sample: People who walk into a gym
Experiment: This is an all-on, all-off scenario, but potentially by taking September as a control month and measuring October with treatment applied, one could avoid seasonal effects of upcoming holidays, or post-holiday season, or pre-summer, or summer vacation lulls.
Key Metric: "Help" would likely be based on revenue per month, or perhaps revenue per quarter if there are discounts on new sign ups that would need take time to manifest.
Additional measured variables: new information seekers, new sign ups, number of services added to existing subscriptions, number of people re-upping their memberships, potentially surveying about quality of service/professionalism as a back of the envelope impression
- Will a new homepage improve my online exotic pet rental business?
Hypothesis: a new homepage will help with exotic pet rental business
Sample: visitors to exotic pet rental site
Experiment: Use a splitting server to show some people the new site and some people the old site and track the outcomes (e.g. split.io)
Key Metric: increase in monthly revenue?
Additional: number of new rentals, number of extended rentals, differences in the types of pets rented
- If I put 'please read' in the email subject will more people read my emails?
Hypothesis: adding 'please read' in email subject lines will prompt more people to read emails
Sample: listserve
Experiment: Create a sample set from the email list and send half the altered subject line, and half the standard subject line for the same email with the same send date/time.
Key Metric: "people who read emails" as those who clicked on something in the email or followed up
Additional variables: open rate, unsubscribe rate
### Assignment 2: Simpson's paradox
- phenomenon in which the average over a number of groups shows one trend, but the average for each individual group shows the opposite or no trend (lurking variable paradox: an unaccounted-for variable changes the relationship between two other variables)
- using randomization to make sure splits don't have lurking demographic tendencies can help
- confirm the groups are similar before interpreting your results.
- Make it a habit to look at subgroups within your A/B test to make sure the overall trend is reflected in the subgroups.
- If the subgroups differ from the overall trend, your question should guide whether you report conclusions based on the overall sample, the subgroups, or both. You don't want to advocate for condition A, even if it performs better overall, if condition B actually works better within every subgroup
### Project 3: Bias and A/A Testing
- bias- anything that causes a sample to systematically differ from the population
- sampling bias/selection bias: when the sample differs from the population in a systematic way
- assignment bias: when the sample is split in a way that makes the make-up of the groups differ
- contextual bias: when a feature of the environment of testing prompts people in one group to have a different experience (and thus to potentially feel differently about the situation than they would otherwise)
- observer bias: when the tester/interviewer interferes with the testing (which is to say, interacts with the subjects substantially)
- A/A Testing- comparing the outcome of choice between two identical versions of something. Sets a baseline for what the difference might be between groups even in a situation in which nothing was different
- testing method errors can be exposed
- sample split errors can be exposed
- sample size errors can be exposed (perhaps the event is too rare to detect in the planned sample size)
#### Drill: Am I biased?
- You're testing advertising emails for a bathing suit company and you test one version of the email in February and the other in May.
- The design of the study does not appear to take into account the seasonality and effect of geography on bathing suit sales. Are the subjects in the northern or southern hemisphere? Are they from a cold place where people are more likely to go on holiday to warm places in March, or is summer the only bathing suit season? Anyway around it, there are contextual biases lurking.
- You open a clinic to treat anxiety and find that the people who visit show a higher rate of anxiety than the general population.
- Primarily people concerned about anxiety will visit the clinic so the visiting population won't reflect the composition of the population at large leading to selection bias.
- You launch a new ad billboard based campaign and see an increase in website visits in the first week.
- A billboard campaign will disproportionately affect people local to a particular region, and/or people in cars passing by. These two groups may or may not be representative of the population at large
- You launch a loyalty program but see no change in visits in the first week.
- A week is likely too short a window for measuring shifts in behavior prompted by a program that likely involves accruing points. Without more information about the program, unless people were automatically enrolled and there was a huge marketing push around awareness, it is unlikely to sway behavior immediately (particularly if users have to opt-in explicitly).
### Project 1-4-4
https://docs.google.com/document/d/1RK7Uil3IxYxlqCezrhFsfb0VaHFaqWGoQR3kAjMK9vk/edit?usp=sharing
### Project 1-4-5: The research proposal
- The problem:
- define the question or problem
- justify why the problem should be studied
- review what we already know about the problem
- The potential solution
- a potential solution is also a hypothesis about what might solve the problem
- The method of testing the solution
- design of the experiment
- analysis plan
- benchmarks (key metrics, points of interest)
- Why bother?
- catch and fix a disconnect between the question and the study design
- catch and fix a study design that will not generate usable data
- account for false positives
- prevents mixed expectations about what will be done and how it will be executed
#### Drill:
Prompt:
To prevent cheating, a teacher writes three versions of a test. She stacks the three versions together, first all copies of Version A, then all copies of Version B, then all copies of Version C. As students arrive for the exam, each student takes a test. When grading the test, the teacher finds that students who took Version B scored higher than students who took either Version A or Version C. She concludes from this that Version B is easier, and discards it.
Plan:
Problem: Students cheat on exams. By using three versions of an exam, it is more difficult for students to look at each other's papers for answers because the papers are not necessarily the same type. However, by administering three versions of the test, it is also possible that one of those tests will be notably easier or harder than the other two, making the grades incomparable.
Potential Solution: Using multiple exam versions is a reasonable strategy for combatting this problem, however it should be executed and calibrated as carefully as possible. Exams should be collated ABC and passed out to students only once all are present and seated (preventing potential clusters of A exams, for example).
Experiment: Once the test has been administered, results should be examined for student subgroups. Did one test have notably higher or lower scores than the other two? (ANOVA?) Did students deviate notably and inexplicably from their historical performance? (paired t-tests between past exam scores and current scores?) Making sure each test taker was surrounded by students taking different versions by passing out the exam carefully should significantly reduce the probability of cheating occurring on a particular version.
### Assignment 1-4-6: AB Testing and t-tests
- Evaluating A/B Test Data using tests
- t-test is a statistical test that calculates the size of the difference between two means given [their variance and sample size] noise in the data
- t = (y1_mean - y2_mean)/(s1^2/N1 +s2^2/N2)^(1/2)
- y1_mean, y2_mean are the central tendencies of the two datasets
- s1, s2 are the standard deviations of the datasets
- N1, N2 are the sizes (number of individuals) of the two datasets
- larger t values indicate more significant differences in the means relative to the noise and lead to small p values--> the two samples in question were not drawn from the same population;
- depending on the problem, we choose a threshold of improbability called a significance level (alpha); if the p-value is smaller than alpha, there is a significant difference between the two sets of samples
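A quick sketch of running this comparison with scipy on simulated control/treatment outcomes (the numbers are made up for illustration); Welch's t-test uses the standard-error formula above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # control group metric
group_b = rng.normal(loc=10.5, scale=2.0, size=200)  # treatment group metric

# Welch's t-test: difference in means relative to the noise and sample sizes
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)  # small p-value -> unlikely both samples came from one population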
### Assignment 1-4-7: Null-hypothesis significance and testing
- Null Hypothesis testing
- tester has
- a hypothesis that describes what they think the data will look like if their expectations are confirmed
- a null hypothesis that describes what the data will look like if expectations are not confirmed
- data is then compared to the null hypothesis
- calculate a t-value, which is situated in a t-distribution that represents the t-values you would get if the null hypothesis were true; the farther the calculated t-value is from the center, the less likely that the null hypothesis is true
- the area under the curve beyond our t-value (in the tails) is the p-value
- the p-value represents the probability of getting a t-value that large or larger if the null hypothesis is true
- p<.05
- is the rule of thumb; that means that there is a 1 in 20 chance of returning a false positive
- corresponds to the two sigma (standard deviation) mark
- not ubiquitous; there are fields where you have to be much more sure than p = .05 and the threshold value will be much lower
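A sketch of how a t-value maps to a p-value via the tail area of the t-distribution (the t-value and degrees of freedom are made-up numbers):

from scipy import stats

t_value = 2.3   # example t statistic
dof = 398       # degrees of freedom (roughly N1 + N2 - 2 for a two-sample test)

# two-tailed p-value: area in both tails beyond |t|
p_two_tailed = 2 * stats.t.sf(abs(t_value), dof)
print(p_two_tailed)  # compare against the chosen significance level, e.g. 0.05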
### Assignment 1-4-8: T-tests and Philosophy of NHST
- T-values
- the default t-test is two-tailed, which is to say it's the probability of getting a more extreme value in either direction.
- if a negative result is impossible for some reason (for example), one can use a one-tailed t-test
- Philosophy of NHST (Null Hypothesis significance testing)
- p-value represents the probability of getting the data you have if the null hypothesis were true in the population; put another way, the probability of pulling this data by random chance from the same population that is unaffected
- no mention of an "actual hypothesis"; rather, rejecting the null hypothesis is treated as tantamount to accepting the hypothesis
- However, you can't truly limit the possibility space to two outcomes, so we can't prove that our hypothesis is what produced the effect and that the effect wasn't due to some other factor. Instead we stick with disproving a null hypothesis and stating that the results support the hypothesis we put forward
### Assignment 1-4-9: Experimentation guided example
# Unit 2
## Lesson 1: Preparing to Model
### Assignment 2-1-1: What is a model?
- statistical model: a simplified mathematical representation of the data scientist's best guess about the underlying processes that created the data
- simplified: prioritize information dense features, viz. the ones that explain the most variance. Probably exclude low variance features in the name of making the model computationally cheaper
- mathematical representation: series of formulas
- best guess: based on our best understanding and testing of the available information (likely will need to be updated as more information becomes available)
- underlying process: @todo
- Models and Math
- pick models that are suited to characteristics of the problem (continuous data, categorical data, two-variable, multivariate)
### Assignment 2-1-2: Formulating a research question
- model: mathematical expression of a research question; different types of research questions beg for different kinds of models
- What is already known about this topic? Check out how others have approached similar questions
- What sort of data, or ways to collect data, are available to me on this topic? Do not engage without data to work with...
- What skills do I have? Don't take on questions you can't cope with in the allotted time
- Can this question be answered using quantities or probabilities? The question must be amenable to a numeric solution.
- Can this research question be asked in one sentence? If not, refine it.
### Project 2-1-3: Drill: Formulating a good research question https://docs.google.com/document/d/1FFr1JjqG21LxhLtED01rYJMPCGTQ-8WWcqUIZKPbzEY/edit?usp=sharing
1. What is the 1994 rate of juvenile delinquency in the U.S.? [Good; regression]
2. What can we do to reduce juvenile delinquency in the U.S.? [What are the most effective approaches to reducing juvenile delinquency in the U.S.?]
3. Does education play a role in reducing juvenile delinquents' return to crime? [Good; binary classifier]
4. How many customers does AT&T currently serve in Washington, DC? [Bad? Isn't this a fact, rather than a research question?]
5. What factors lead consumers to choose AT&T over other service providers? [Good; PCA]
6. How can AT&T attract more customers? [Bad; Which of the following methods are most effective in attracting customers?]
7. Why did the Challenger Shuttle explode? [Good?]
8. Which genes are associated with increased risk of breast cancer? [Good]
9. Is it better to read to children at night or in the morning? [Good]
10. How does Google’s search algorithm work? [Bad, though I'm not sure how to fix it]
### Assignment 2-1-4: Exploring the data
- Univariate (looking at one variable at a time)
- how many variables?
- how many data points?
- what kind (categorical, continuous, ordinal)
- do any variables have known distributions
- missing data? How much and what kind?
- variance in each of the variables
- Bivariate
- continuous-continuous
- scatterplot (scatterplot matrix)
- sns.PairGrid(df.dropna(), diag_sharey= False)
- lmplot (scatter plot with regression line and r^2 value)
- sns.lmplot(data = df, x = "", y = "")
- correlation ranges from -1 (strong negative relationship: as one goes up, the other goes down) to 1 (strong positive relationship: both go up together)
- sns.heatmap(df.corr())
- NB: check for two-dimensional outliers that represent unusual combinations of values
- continuous-categorical
- estimate the value of a continuous variable for each value of a categorical variable
- sns.boxplot()
- sns.violinplot()
- sns.stripplot()
- FacetGrid is a grid that shows a particular pair of variables broken out by one or two categories
- g = sns.FacetGrid(df, col = variable)
g = g.map(plottype, x_label, y_label)
- categorical-categorical
- relates the number of counts for a category for each label in another category
- sns.countplot(data = df, x = variable, hue = variable2)
- pd.crosstab(df.variable1, df.variable2) table of counts giving the number of datapoints for each combination
- chi-square test- indicates whether one combination of levels is significantly larger or smaller than the rest (rather than compare the means of two datasets, compare the counts of a variable in two datasets to see if they could have come from the same population); see the sketch after this list
- NB: check for subgroups with very small counts
- Interpreting pairwise plots and stats
- flag two-dimensional outliers
- identify variables that are redundant to each other (variables that are strongly correlated)
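A small sketch of the categorical-categorical checks above, again using seaborn's tips dataset as a stand-in (an assumption):

import pandas as pd
import seaborn as sns
from scipy import stats

tips = sns.load_dataset("tips")

# counts for each combination of two categorical variables
counts = pd.crosstab(tips["day"], tips["smoker"])
print(counts)

# chi-square test: could these counts plausibly come from independent variables?
chi2, p, dof, expected = stats.chi2_contingency(counts)
print(chi2, p)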
### Project 2-1-5: Feature Engineering
- Feature- variable that has been transformed in such a way as to make it well suited to work within a model and explain variance in the outcome of interest
- Working with categorical variables
- translate a categorical variable with x labels into x-1 numerical features
- reference value- level without a feature
- can group categorical variables (e.g. Norway, Sweden into Nordic)
- Changing variable types
- can make binaries out of continuous variables (e.g. some are less than some value and some are more than that value)
- Numerical variable types
- Ordinal variables- indicate rank order only; they don't give information about the distance between one value and the next, just that one was first and the next came after
- Interval variables- variables that indicate rank order and distance but don't have an absolute zero point
- ratio variables- variables that indicate rank, distance, and have a meaningful absolute zero value
#### Drill: categorize each variable from the ESS dataset
1. cntry (country)- categorical
2. year- numeric, ratio
3. idno (respondent's identification number)- categorical (no indication that these were assigned with order significance)
4. tvtot (tv watching per avg wkday)- numeric, ratio
5. ppltrst (most people can be trusted?)- numeric, ordinal
6. pplfair (most people are fair?)- numeric, ordinal
7. pplhlp (most people are helpful)- numeric, ordinal
8. happy (how happy are you?)- numeric, ordinal
9. sclmeet (how often do you meet friends etc?)- numeric, ordinal
10. sclact (take part in social activities)- numeric, ordinal
11. gndr (gender)- categorical
12. agea (age)- numeric, ratio
13. partner- categorical
- Combining two or more highly-correlated variables
- want minimum set of features that describe the space, therefore want features that are correlated with the outcome, but uncorrelated with each other
- average highly correlated variables or drop one
- use Principal Component Analysis (PCA) to reduce the correlated set of variables
- Dealing with non-normality
- if normality is a model-assumption (and it often is) it may be necessary to transform (e.g. log, sqrt, or invert) variables so that they have more normal distributions
- creating linear relationships
- many models assume the relationship between a feature and an outcome is linear so in order to accommodate it in a model it may be useful to work with it as a transformation (square, cube, etc)
- making variables easier to understand in light of the research question
- re-encode a variable into a feature that matches the terminology of the research question (make sure scaling is such that positive correlations are intuitive)
- Leveling the playing field
- some models assume all features are scaled to the same bounds, so may need to rescale accordingly (usually to a mean of 0 and a standard deviation of 1)
- preprocessing.scale(df)
- All about interactions
- may want to build interaction features by multiplying two features together, since the product may relate to the outcome in a way neither feature does alone
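A minimal sketch of a few of these transformations (dummy coding, a log transform, an interaction, and rescaling) on a made-up frame:

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'country': ['Norway', 'Sweden', 'France', 'Norway'],
                   'income': [52000, 61000, 48000, 150000],
                   'age': [34, 29, 41, 55]})

# categorical variable with x labels -> x-1 features (drop_first leaves out the reference level)
features = pd.get_dummies(df['country'], drop_first=True).astype(float)

# tame a skewed continuous variable with a log transform
features['log_income'] = np.log(df['income'])

# interaction feature: the product of two variables
features['age_x_income'] = df['age'] * df['income']

# level the playing field: rescale to mean 0, standard deviation 1
scaled = preprocessing.scale(features)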
### Assignment 2-1-6: Principle component analysis (PCA)
- What is PCA? complexity reduction technique that tries to reduce a set of variables down to a smaller set of components that represent most of the information in the variables
- identifies sets of variables that share variance and creates a component to represent that variance
- lose variance in exchange for a smaller set of features (computationally cheaper, better satisfies requirement of features not being correlated, less vulnerable to overfitting)
- Things get messy
- need variables to be normally distributed
- relationships between variables are linear
- correlations are weak (but non-zero) to moderately strong but less than ~.8
- things can get unstable if it's fewer than three variables that are fairly tightly correlated, but also if it's a lot of variables that are highly correlated
- PCA: Rotation in space
- take a dataset of n variables as an n-dimensional space
- PCA standardizes variables so that mean = 0 and standard dev = 1 (so all variables go through the origin and share variance)
- choose a set of axes so as to minimize the distance between the data points and the axis
- PCA: math (ish)
- identify the axes (eigenvectors) that capture the most shared variance (equivalently, minimize the distance between the points and the axis), then multiply the feature matrix by the transformation matrix
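A short sketch of the whole pipeline with scikit-learn, using the iris measurements as a stand-in for a set of correlated variables:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # four correlated numeric variables

X_std = StandardScaler().fit_transform(X)   # mean 0, std 1, as PCA expects
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)       # rotated, reduced feature matrix

print(pca.explained_variance_ratio_)        # share of variance each component keeps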
### Assignment 2-1-7: Feature Selection
- good practice to split dataset into a training and a test set and design feature selection on training set (not on both parts)
- Filter Methods: evaluate each feature separately and assign a score that is used to rank features such that scores above or below (or both) some point are discarded
- select relevant features but also likely to produce redundant features because they don't weed out features that are highly correlated to each other
- e.g. variance threshold, correlation to target variable
- Wrapper Methods: select sets of features based on performance; construct different sets and evaluate based on predictive power in a model (in comparison to performance of other sets)
- forward passes: algorithm begins with no features and they are added one at a time, keeping the features that have the highest predictive power
- backward passes: algorithm begins with all features and they are removed one at a time, removing the feature that has the least predictive power
- computationally expensive, feature set is never re-evaluated (?)
- Embedded methods: select features based on a fitting method, e.g. in regression where there's a penalty against complexity and the fitting objective is to minimize the cost function
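Hedged sketches of a filter method and a wrapper method in scikit-learn (the dataset and the choice of 10 features are arbitrary illustrations):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# design feature selection on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# filter method: score each feature separately, keep the top 10
filtered = SelectKBest(f_classif, k=10).fit(X_train, y_train)
print(filtered.get_support())    # boolean mask of the selected features

# wrapper method (backward passes): repeatedly drop the weakest feature
wrapper = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X_train, y_train)
print(wrapper.support_)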
## Lesson 2: Building your first model: Naive Bayes
### Assignment 1: Regression v. classification
- When building a model, there is always a kind of outcome we are interested in: a label (categorical) or a value. This determines whether we need a classifier or a regression model
- Classification: returns one (or more) categorical value, a discrete value from a specified set
- assigns a category to a given test observation
- assigns a probability measure for each category
- N.B. the only outcomes that can be returned are ones that have been seen in the training set
- Regression: returns a numeric value from either a bounded or unbounded number line
### Assignment 2: Algorithms intro.
- What is an algorithm? A set of instructions for a computer (efficient is better than not...)
- Algorithm efficiency and complexity
- scaling: number of steps = scaling factor (e.g. linear, quadratic, etc.)
- Big O Notation: a way to describe most inefficient performance (at worst, O(elements^num_steps)), or complexity
- how efficiently the algorithm scales with additional data; sets an upper bound on the size dataset one can reasonably work with
### Project 3: Drill: Regression or classification
1. The amount a person will spend on a given site in the next 24 months. (regression)
2. What color car someone is going to buy. (classifier)
3. How many children a family will have. (regression, though plausibly either)
4. If someone will sign up for a service. (classifier)
5. The number of times someone will get sick in a year. (regression, though plausibly either)
6. The probability someone will get sick in the next month. (classifier)
7. Which medicine will work best for a given patient. (classifier)
## Lesson 3: Evaluating classifiers
### Assignment 1: Accuracy and error types
- Success Rate: the most basic measure of success is obviously how often the model was correct (compare the target labels to the predicted labels). However, not all errors are created equal so we are concerned with what gave rise to the incorrect predictions; important to be able to inform how to deal with the error as well as potentially how to fix the error in the model (e.g. by adding more features, tuning parameters, etc.)
- Confusion Matrix: matrix that shows the count of each possible permutation of target and prediction (predicted outcome on one side, actual outcome on the other)— allows us to identify how many negative-negative, positive-positive and false positives (positive when should have been negative) and false negatives (negative when should have been positive)
- from sklearn.metrics import confusion_matrix; confusion_matrix(target, y_pred)
- False positive is also referred to as a “type I error” or “false alarm”
- False negative is also referred to as a “type II error” or “miss”
- Sensitivity: percentage of positives correctly identified (agree_pos/(agree_pos+false_neg))
- Specificity: percentage of negatives correctly identified (agree_neg/(agree_neg+false_pos))
- top row is [agree_neg, false_pos], second row is [false_neg, agree_pos]
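A tiny sketch of pulling sensitivity and specificity out of sklearn's confusion matrix (the label arrays are made up):

from sklearn.metrics import confusion_matrix

target = [0, 0, 0, 1, 1, 1, 1, 0]  # actual labels (made up)
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]  # model predictions (made up)

# rows are actual, columns are predicted: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of positives correctly identified
specificity = tn / (tn + fp)  # share of negatives correctly identified
print(sensitivity, specificity)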
### Assignment 2: Class imbalance
- Ideally the training set would contain an equal number of instances of each outcome so that the trained model has a good idea of what makes for each label and doesn’t over predict the dominant label. Put another way, if a rare instance has specific traits, but those traits are also seen in a dominant class, the model won’t learn to associate the rare traits with the rare class. (E.g. rare disease prediction, fraud detection)
- Baseline Performance: It’s important to consider the dominant class rate. If the model doesn’t do better than the percent of the dominant class represented in the training set, that means it could do as well or better by just predicting the dominant class all the time
- Dealing with class imbalance:
- Ignore it: engineer features that hopefully highlight rare class and hope for the best
- Change the sampling: deliberately over sample the minority class/under sample the majority class
- Probability outputs: some models like SVM or logistic regression can return the probability of different labels; use different cutoffs or more complex rules to decide what the final label should be
- Cost functions for errors: describes how errors are not equal so (for example) the cost of a type II error can be twice the cost of a type I error (not easy to do with the sklearn NB)
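A crude sketch of the "change the sampling" option with plain pandas (the frame is made up; libraries like imbalanced-learn offer more principled resampling):

import pandas as pd

# made-up imbalanced dataset: label 1 is the rare class
df = pd.DataFrame({'feature': range(100), 'label': [1] * 5 + [0] * 95})

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# naive over-sampling: draw the rare class with replacement until the classes match
balanced = pd.concat([majority,
                      minority.sample(len(majority), replace=True, random_state=42)])
print(balanced['label'].value_counts())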
### Assignment 3: In sample evaluation and cross validation
- Overfitting- when the model is excessively complex such that it describes the training set perfectly rather than the generalized underlying relationship; we need the model to work on the data we haven’t seen yet
- Holdout Groups: essentially set up a test set separate from a training set; the model needs to perform well on the test set which was not included in the training of the model. The higher the variance, the larger the training set needs to be.
- from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(data, target, ...)
- Cross Validation: if there is enough data, break the training set up into n folds (usually 5), and train the model on n-1 folds and test on the nth fold (Leave One Out is when the number of folds = the number of observations, useful if you're concerned that one observation will skew the model). Repeat this for every combination of folds and compare the accuracy. If accuracies are similar then the model is probably not overfitting
- from sklearn.model_selection import cross_val_score; cross_val_score(model_instance, data, target, cv=n_folds)
- By default it returns only accuracy, which doesn't offer a lot of insight into the type of error, so people often code up their own versions
- What’s a good score? check for class imbalance and the type of errors to really assess
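A short sketch combining a holdout split and cross-validation (the dataset and the Bernoulli Naive Bayes model are placeholders for whatever you are actually fitting):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = load_breast_cancer(return_X_y=True)

# holdout group: keep a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = BernoulliNB()
scores = cross_val_score(model, X_train, y_train, cv=5)  # accuracy on each of 5 folds
print(scores, scores.mean())                             # similar scores -> probably not overfitting

model.fit(X_train, y_train)
print(model.score(X_test, y_test))                       # final check on the holdout set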
### Project 4: Challenge: Iterate and evaluate your classifier
- test for overfitting using cross validation
- Already used test_train_split
- Already tuned feature set to remove words that appear in both the positive and negative tfidf sets
- Train model on different length ngrams
- Train model on positive words not subtracting off the negative words
- Train model on (positive words tfidf) - (negative words tfidf)
- Train model on (positive words tfidf) - (negative tfidf + count_vectorizer)
- Train model on (positive words tfidf) - (negative words count_vectorizer)
- Train model on (positive tfidf + count_vectorizer) - (negative tfidf)
- Train model on (positive tfidf + count_vectorizer) - (negative count_vectorizer)
- Train model on (positive words tfidf + count_vectorizer) - (negative tfidf + count_vectorizer)
- Run a loop and report accuracy of the model with all but one feature, for each feature, to assess how important each feature is to model performance
## Lesson 4: Linear Regression
### Assignment 1: Simple linear regression
- regression allows us to predict continuous variables
- Simple linear regression
- There are lots of regression techniques, but ordinary least squares (OLS) is far and away the most common and is often referred to as just “regression”
- Works by finding estimators of coefficients that describe the relationships between variables in a formula you define
- Simple linear regression is of the form y = alpha + beta*x where our objective is to estimate alpha and beta
- Least squares
- Seeks to estimate coefficients by minimizing error or residual (sum of the squared distances between each datapoint and the fit line— the type of regression is often named for the distance metric used to calculate the residual)
- regr = linear_model.LinearRegression(); regr.fit(X_train, y_train); y_pred = regr.predict(X_test)
regr.coef_, regr.intercept_ will yield info about the regression
- Predicting with Simple Linear Regression: with simple linear regression, estimates of the coefficients yield a y-intercept and a slope (colloquially). The domain of the equation is all reals even if that isn't the domain of the problem at hand, so heads up.
### Project 2: Multivariable Regression
- Multivariable Least Squares: when a least squares regression has more than one independent variable (AKA Multivariable least squares linear regression, multiple linear regression, Multivariable regression—NOT Multivariate regression which involves multiple dependent variables)
- All relationships between coefficients are linear; so an equation will look like y = alpha + c_1*x_1 + ...+ c_n *x_n
- Categorical Variables: each label gets its own variable and thus its own coefficient. For example, suppose we are predicting rent and our dataset has features: square footage, bedrooms, bathrooms, state. The first three are numeric, no problem; the categorical variable (state) has to be split into one variable for each state included in the data set, so the new feature set might look like: square footage, bedrooms, bathrooms, WA, CA, OR, with each data point having a 1 for the state where the property is and a 0 for the other state labels (labels should be mutually exclusive). By this reasoning linear regression can estimate a coefficient for each of the states and we can correctly note that CA is a financially insane place to rent (see the sketch after this list).
- Linear doesn't have to mean lines: while the relationships between the coefficients and the outcome are linear, it's fair game for the relationships between the variables to not be; e.g. y = alpha + c_1*x + c_2*x^2 (here x_1 = x and x_2 = x^2). But be careful not to overcomplicate the model to fit the training data; overfitting is a problem here too.
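A sketch of both ideas with a hypothetical rent dataframe (the column names and numbers are made up for illustration):

```python
import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({
    'sqft':  [700, 950, 600, 1200, 850, 1000],
    'beds':  [1, 2, 1, 3, 2, 2],
    'state': ['WA', 'CA', 'OR', 'CA', 'WA', 'OR'],
    'rent':  [1500, 3200, 1200, 4100, 1700, 1900],
})

# one 0/1 indicator column per state label (mutually exclusive)
X = pd.get_dummies(df[['sqft', 'beds', 'state']], columns=['state'])
X['sqft_squared'] = X['sqft'] ** 2   # nonlinear in the variable, still linear in the coefficients

regr = linear_model.LinearRegression()
regr.fit(X, df['rent'])
print(dict(zip(X.columns, regr.coef_)))
```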
### Project 3: Explanatory power: assumptions of linear regression @todo need to complete project
- The extraordinary power of explanatory power: multiple linear regression not only allows us to predict future outcomes, but also provides insight into the relationships between the underlying variables; the r^2 value (in [0,1]) gives a sense of how much of the variance in the data the model was able to explain and hence how much confidence we should have in our interpretation of the model.
- Low r^2 indicates a poor fit
- High r^2 is good unless it is very high in which case we should be concerned about overfitting
- Assumptions of multivariable linear regression
- Linear relationship: the outcome must be a linear function of the coefficients. Sometimes we can apply a nonlinear transformation to a feature to make its relationship with the outcome linear
- Multivariate normality: the error from the model (model predictions minus actual target values) should be normally distributed. Skewness in the distribution of the error can often be traced back to outliers in the data
- Homoscedasticity: error plotted against predicted values should be uniform in spread. It is concerning if there is considerably more error when predicting high or low values
- Low multicollinearity: correlations among features should be low or nonexistent (the model could attribute half the explanatory power to one variable and half to the other, but underneath the two variables are telling the same story, so this is not helpful if we are trying to explain the variance; as far as prediction goes, it might work fine)
- fixed by PCA (collapsing correlated variables into one component)
- or by dropping correlated features (a quick diagnostics sketch follows this list)
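A diagnostics sketch for the last three assumptions, assuming a fitted sklearn model regr, a feature DataFrame X, and a target y already exist (all names are assumptions):

```python
import matplotlib.pyplot as plt

predicted = regr.predict(X)
residuals = y - predicted

# Multivariate normality: the residuals should look roughly normal
plt.hist(residuals, bins=30)
plt.title('residual distribution')
plt.show()

# Homoscedasticity: residual spread should be uniform across predicted values
plt.scatter(predicted, residuals)
plt.title('residuals vs. predicted')
plt.show()

# Low multicollinearity: pairwise feature correlations should be modest
print(X.corr())
```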
### Project 4: Challenge: make your own regression model
## Lesson 5: Evaluating Linear Regression models
### Assignment 1: Test Statistics
- Test Statistics
- Can evaluate:
- Whether the model, as a whole, explains more variance in the outcome than a model with no features
- Whether each individual feature in the model adds explanatory power
- Whole model: F-test
- Compares the unexplained variance of the full model against that of a reduced (intercept-only) model used for comparison
- Unexplained model variance: SSE_F = sum over the n data points of (y_actual - y_predicted)^2
- Unexplained reduced model variance: SSE_R = sum over the n data points of (y_actual - y_mean)^2
- Number of parameters in the model = p_F (2 in the case of simple linear regression)
- Number of parameters in the reduced model = p_R (1 in this case)
- Number of data points: n
- Degrees of freedom of Unexplained model variance (SSE_F): df_F = n-p_F
- Degrees of freedom of Unexplained reduced model variance (SSE_R): df_R = n-p_R
- F = ((SSE_R - SSE_F)/(df_R - df_F)) * (df_F/SSE_F)
- A parameter is any predictor in the model (intercept and features)
- Degrees of freedom: amount of information untapped to estimate variability after all parameters have been estimated
- Low degrees of freedom are a flag for overfitting or small sample
- Degrees of freedom define the F-distribution for the F-test
- The position of the F value within that F-distribution is used to determine the p-value (the probability of getting an F value this large or larger if there is no relationship between the outcome and the parameters in the population). A significant p-value suggests that the model as a whole can explain some of the variance in the outcome
- Put another way, the f-test tests whether the r^2 of the model is different from zero
◦ Individual parameters: t-test: once there is a significant f-test, the next step is evaluating the performance of the individual parameters using a t-test that determines whether that parameter estimate is significantly different from zero
‣ Being statistically different from zero implies that the variable explains a significant amount of unique variance in the outcome after controlling for the variance explained by the other parameters
‣ Suppose there are three circles (red, blue, yellow) overlapping where yellow represents the outcome...
• The statistical test of significance of the blue circle on the outcome will be a matter of only the green area, and of the red circle, only the orange area, but neither one will include the area where all three overlap
‣ A model with high collinearity may very well yield an f-test that is significant without any significant features at all
‣ sklearn's LinearRegression can be used for multivariable linear regression but it's difficult to extract p-values for individual parameters, so we can use statsmodels as an alternative (different packages surface different pieces of information about a model; a combined sketch follows at the end of this assignment)
• Write out model formula in the following format:
◦ linear_formula = 'dependent_variable ~ indep_var1 + indep_var2 + ... + indep_varN'
• Fit the model with:
◦ lm = smf.ols(formula=linear_formula, data=data).fit()
• model parameters:
◦ lm.params
• p-values as a significance test for each of the coefficients (can usually drop any parameters with pvalues >.05):
◦ lm.pvalues
• r^2
◦ lm.rsquared
- Confidence Intervals
- the range of values within which our population parameter is likely to fall
- a 95% confidence interval means that if we were to resample the population and recompute the interval many times, the interval would contain the population parameter 95% of the time
- so this allows us to make a statement not only about what to expect, but how confident we are that it will happen
- the wider the confidence interval, the less certain the estimate
- from statsmodels.sandbox.regression.predstd import wls_prediction_std; prstd, iv_l, iv_u = wls_prediction_std(lm) gives the prediction standard deviation and the lower/upper interval bounds
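Putting the statsmodels pieces above together in one place, assuming a DataFrame named data with columns outcome, x1 and x2 (hypothetical names):

```python
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

linear_formula = 'outcome ~ x1 + x2'
lm = smf.ols(formula=linear_formula, data=data).fit()

print(lm.params)                 # coefficient estimates
print(lm.pvalues)                # t-test p-values for each parameter
print(lm.rsquared)               # share of variance explained
print(lm.fvalue, lm.f_pvalue)    # whole-model F-test
print(lm.conf_int())             # 95% confidence intervals for the coefficients

prstd, iv_l, iv_u = wls_prediction_std(lm)   # prediction-interval bounds per observation
```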
### Project 2: Challenge: Validating a linear regression
### Project 3: Dimensionality Reduction in Linear Regression
- Dimensionality reduction in Linear Regression
- Plight of too many features:
- Having lots of features makes a model take longer to train and longer to run and more prone to overfitting
- Variance in the features that is unrelated to the outcome may create noise in predictions (bonus problem if that irrelevant variance is shared among features, i.e. multicollinearity)... so adding more features tends to add more unrelated variance and thus noise
- Having more variables than data points leads to negative degrees of freedom (a no fly)
- For the sake of prediction (not helpful if you need to interpret the roles of individual parameters) use dimension reduction to make the feature space manageable.
- Take a matrix of features X and its reduced feature form R(X): the objective is to come up with a transformation R such that the expected value of Y (predicted value) given X is the same as the expected value of Y given R(X)
- Similar to PCA, but in this case we aren’t trying to preserve all the variance in X, but rather the variance in X that is shared with Y
- Partial least squares regression
- Basic idea: find the vector in X with the highest covariance with y, then choose a second vector orthogonal to the first that explains the highest covariance with y unexplained by first vector and so on, up to n vectors (the number of features in X)
- pls1= PLSRegression(n_components=desired_num_features)
- pls1.fit(X,y)
- Y_PLS_pred = pls1.predict(X)
- pls1.score(X,y)
- Doesn't work well if features are uncorrelated
- The trick is to pick the right number of components to collapse to (see the sketch below)
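A runnable version of the calls above, assuming X and y already exist; it simply tries several values of n_components and compares scores:

```python
from sklearn.cross_decomposition import PLSRegression

for n in (1, 2, 3):                      # candidate numbers of components
    pls1 = PLSRegression(n_components=n)
    pls1.fit(X, y)
    Y_PLS_pred = pls1.predict(X)
    print(n, pls1.score(X, y))           # R^2 using the reduced feature space
```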
### Assignment 4: The Gradient Descent Algorithm
- Regression finds the best model by minimizing the squared distance between each data point and the fitted line (squaring removes issues around negative distances and penalizes larger errors more heavily)
- When the model is simple, the cost function can be minimized exactly with a system of equations. Other models are too complex to be minimized analytically like this, so we have algorithms like gradient descent, which minimizes the cost function using derivatives.
- Put another way, given y = alpha + beta*x, we have a surface over (alpha, beta) whose height is the error (y - y_pred), and we are trying to find the minimum point on that surface. The same principle applies in higher dimensional spaces, but we can't visualize them as well.
- Initialize weights
- calculate the gradient at that point (the partial derivatives of the cost function with respect to each weight)
- move a set distance in that direction
- repeat until you can't go down any farther (or where all possible alternatives yield higher error than the current)
- Can be calculated with set distance or adaptive distance
- Run a loop that repeats these steps as long as the weights haven't converged
- Decision-points in Gradient Descent
- how to initialize the weights
- how far to move at each step (learning rate)
- what constitutes "convergence"? Set a threshold of minimal acceptable change, and probably a maximum number of iterations
- Things get messy when we land in a local minimum rather than the global minimum, so sometimes we need to try multiple starting places to rule this out (a bare-bones sketch follows this list)
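A bare-bones sketch of those decision points for y = alpha + beta*x, with an arbitrary fixed learning rate, convergence threshold, and iteration cap (all constants are illustrative choices, not prescriptions):

```python
import numpy as np

# toy data: true relationship is roughly y = 2 + 3*x
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

alpha, beta = 0.0, 0.0        # initialize the weights
learning_rate = 0.001         # how far to move at each step
threshold = 1e-6              # what counts as convergence
max_iters = 100000            # safety cap on iterations

for _ in range(max_iters):
    error = y - (alpha + beta * x)
    # partial derivatives of the mean squared error cost
    grad_alpha = -2 * error.mean()
    grad_beta = -2 * (error * x).mean()
    step_alpha = learning_rate * grad_alpha
    step_beta = learning_rate * grad_beta
    alpha -= step_alpha
    beta -= step_beta
    if max(abs(step_alpha), abs(step_beta)) < threshold:
        break                 # weights have converged

print(alpha, beta)            # should land near 2 and 3
```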
# Unit 3
## Lesson 1:
### Project 1: KNN classifier
• K Nearest Neighbors Classifiers: simplest example of calculating which data points are most similar to an observation
◦ Nearest Neighbor: find the closest known observation in our training data by some distance metric (typically we use Euclidean distance) and predict that our value has the same label
‣ Greedy because we have to calculate the distance from our point to every point in the set to find the data point with the smallest distance.
‣ Not a trainable model because it’s just based on the direct relationship between unlabeled data and labeled data
◦ K-Nearest Neighbors: look at the k nearest neighbors and label by majority rule (this also yields an estimate of the probability that the label is correct: votes_in_favor/k)
‣ Calculate distances to all points, choose k smallest, label data
‣ Mesh: surface for describing the zones of different labels
• Define the limits of the surface by solving for the min_x, max_x, min_y, max_y
• Initialize the mesh with a grid size between bounds
• Initialize and train the model where x1 and x2 are x and y coordinates and y is the label
• Run whole grid of points through the model
• Create and plot color mesh
‣ one would expect any new point falling in a given zone to be labeled accordingly (see the sketch below)
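A sketch of the mesh procedure on two made-up features (the toy data, grid step, and k are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

# toy data: two features, two classes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# define the limits of the surface and build the grid
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# run the whole grid of points through the model and plot the color mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, Z)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.show()
```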
### Project 2: Tuning KNN- normalizing distance, picking k
- Tuning KNN:
- Distance and Normalizing
- Implicit in the Euclidean distance metric is the assumption that a unit of distance means the same thing in every dimension (for example, a subjective "loudness" rating vs. measured "decibels"), and this is not always the case
- Also, if one dimension is on a wildly different scale than another, the model will effectively ignore the smaller-scale variable
- Normalization
- Rescale: rescale everything to be between 0 and 1 (only works if the distances are linear to begin with and the data is known to be bounded so that rescaling to be between 0 and 1 does not impose an artificial boundary)
- Recalculate so that distance is in terms of standard deviations (z-scores) from the mean
- Not good practice to mix them in most cases
- Weighting
- In the vanilla version of KNN, all votes are counted as equal, which is reasonable when the data is densely populated
- However, when the k nearest neighbors are vastly different distances from the point in question, it may be questionable for all votes to count the same—perhaps weight by distance so that points are influenced according to their inverse distance to point in question
- KNeighborsClassifier(n_neighbors=3, weights='distance')
- Choosing K
- the larger the k, the more smoothed the decision space will be, with more observations getting a vote on the prediction
- smaller k will pick up subtle deviations (which could just be noise, i.e. overfitting)
- best bet is to try multiple models and use validation techniques to choose (see the sketch below)
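A sketch of that "try multiple models" advice, assuming X and y already exist; it compares unweighted and distance-weighted KNN across several values of k:

```python
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

for k in (1, 3, 5, 10, 20):
    for weights in ('uniform', 'distance'):
        knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weights)
        scores = cross_val_score(knn, X, y, cv=5)      # 5-fold accuracy per configuration
        print(k, weights, scores.mean())
```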
#### Drill:
Drill: Let's say we work at a credit card company and we're trying to figure out if people are going to pay their bills on time. We have everyone's purchases, split into four main categories: groceries, dining out, utilities, and entertainment. What are some ways you might use KNN to create this model? What aspects of KNN would be useful? Write up your thoughts and submit a link below.
Assuming there is a column of labelled data regarding whether or not people did pay their bills on time, all four variables could be entered and knn could be applied. It would be useful as different combinations of features may play different roles in the distance between two points because different people have different spending habits, but there may still be a larger picture model for predicting ability to pay bills. It might be worth looking at the correlation between the variables to see if, for example dining out and entertainment might collapse into one variable. Given we are working in 4-space, it would be impossible to visualize with a mesh plot, but we could run a battery of options to test different values of k. Without seeing the data, it would be difficult to assess the need for weights or normalization, but adding these parameters as options for optimizing the model would be worth considering. Perhaps we could assess by making a plot with the k distances of the k nearest points per point. If there is a pattern of outliers, it would be worth considering a weighting scheme.
### Project 3: KNN regression
- KNN Regression
- Everything’s the same: Rather than taking the popular label, take the average of a given variable over the k closest points @what happens if there are more than k points with the same x value, how does the model choose which k are the closest?
- KNN= neighbors.KNeighborsRegressor(n_neighbors=k)
- KNN.fit(X, Y)
- (Can run a line with regular intervals through model to see regression line)
- (Can use the same weighting trick (weights=‘distance’))
- Validating KNN:
- Can still use all the normal methods (e.g. cross_val_score); see the sketch below
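A compact version of those calls on toy data (the sine-shaped target, k, and weighting are arbitrary illustrations):

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, (100, 1))
Y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
knn.fit(X, Y)
print(cross_val_score(knn, X, Y, cv=5).mean())   # R^2, the regressor's default score
```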
### Project 4: Challenge: Model Comparison
## Lesson 2
### Assignment 1: Decision trees
- Decision Trees
- Learning from questions
- Nodes: questions (either the root node (first node), interior nodes (follow up questions), leaf nodes (endpoints))
- The questions in the nodes are called rules
- Divides the data into a certain number (typically two) subgroups
- All data has to be accounted for, it cannot simply disappear
- Links between nodes are paths or branches
- Entropy
- Shannon Entropy H is a weighted sum: H = -sum over all outcomes of p(outcome) * log2(p(outcome)); e.g. a fair coin has H = 1 bit
- As we limit the possible number of outcomes, entropy decreases
- Information Gain: the entropy of the original state minus the weighted average entropy of the potential outcomes of the following state (a worked sketch appears at the end of this section)
- writing code:
from sklearn import tree
import pydotplus
from IPython.display import Image

dec_tree = tree.DecisionTreeClassifier(criterion='entropy', max_features=1, max_depth=4)
dec_tree.fit(X_train, y_train)
dot_data = tree.export_graphviz(dec_tree, out_file=None, feature_names=X_train.columns, class_names=None, filled=True)  # class_names=None, or pass a list of class labels
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
- for the code above, the tree specs have only one feature being used per node and there are only four decision levels to arrive at a classification
- Benefits
- the model visualizes well
- they are interpretable
- they can handle mixed data (numeric and categorical)
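A small sketch of the entropy and information gain calculations behind those rules (the example split is made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the observed label probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['yes'] * 5 + ['no'] * 5     # H = 1 bit for a 50/50 split
left   = ['yes'] * 4 + ['no'] * 1     # one branch after a candidate rule
right  = ['yes'] * 1 + ['no'] * 4     # the other branch

# information gain = parent entropy minus the weighted entropy of the children
weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent) - weighted_children)
```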