# Unit 1
## Lesson 1
### Assignment 1: Data, Engineering and Machine Learning
- Technology stack- collection of elements that make up a product
- front end - interface that users interact with
- back end - servers, services, databases etc; heavy lifting that feeds data to the front end
- Related roles
- Data Analyst: draw conclusions and generate reports (proto data scientist)
- Data Engineer: gather and store data, create and manage data pipelines, databases; less interpretation of data
- Machine Learning Engineer: algorithms and modeling with an emphasis on algorithm design and efficiency
- Data Scientist: whatever the recruiter says it is
### Assignment 2: The data science toolkit
- Python
- Packages
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- StatsModels
- SQL- Structured Query Language
- access and preprocess data
### Assignment 3: Thinking like a data scientist
- curiosity, practicality (have to define questions with limited scope that can be answered with data), skepticism (how confident are we in the results we see; lies, damn lies, and statistics)
- sometimes the way an abstract question is translated into a concrete one is dictated by the available data
- finding and evaluating data sources: data archives, repositories, web scraping, logs, documents...
- evaluating uncertainty: assess how certain we are that conclusions based on a particular statistic are valid; are there flaws in the source of the sample (sampling method, representative nature of the sample for the population etc), size of the sample, noise (variance) in the data
### Assignment 4: Drill: What can data science do?
Take the following scenarios and describe how you would make it testable and translate it from a general question into something statistically rigorous
1. You work at an e-commerce company that sells three goods: widgets, doodads, and fizzbangs. The head of advertising asks you which they should feature in their new advertising campaign. You have data on individual visitors' sessions (activity on a website, pageviews, and purchases), as well as whether or not those users converted from an advertisement for that session. You also have the cost and price information for the goods.
- a person who saw an ad has four outcomes: buy nothing, buy w, buy d, or buy f
- we want to know which ad-buy pair results in the highest sales and whether that number is statistically significant
- Not clear what a follow up test would be given that it appears all the combinations have already been tested...
2. You work at a web design company that offers to build websites for clients. Signups have slowed, and you are tasked with finding out why. The onboarding funnel has three steps: email and password signup, plan choice, and payment. On a user level you have information on what steps they have completed as well as timestamps for all of those events for the past 3 years. You also have information on marketing spend on a weekly level.
- Did sign-ups slow because fewer people started the sign-up process or because more people fell out of the pipeline on the way to completion? What ad campaign was on when the person started the process?
- find the time when the max number of people got through step 3, calculate whether that is statistically different from the last several weeks. Then look at the relationship between the number of people who got through step 1 and 2 for both timeframes to see where people fell out of the pipeline. Then consider the marketing strategy during that time (assuming they are statistically different) and possibly reimplement that approach for a few weeks, track the numbers and compare against current status.
3. You work at a hotel website and currently the website ranks search results by price. For simplicity's sake, let's say it's a website for one city with 100 hotels. You are tasked with proposing a better ranking system. You have session information, price information for the hotels, and whether each hotel is currently available.
4. You work at a social network, and the management is worried about churn (users stopping using the product). You are tasked with finding out if their churn is atypical. You have three years of data for users with an entry for every time they've logged in, including the timestamp and length of session.
- examine the average length of time between sign up and last login of users grouped by sign-up week
### Assignment 5: Challenge: Personal goals
- academic style problems, but also some of these churn questions (I reckon)
- wrangling big datasets (dip into spark), wrangling small datasets (what constitutes over-mining data)
- hypothesis testing
- signal processing
## Lesson 2: SQL: data access methods
### Assignment 1: Introduction to Databases
- Database contexts
- operational layer- part of the application responsible for delivering the core user experience; server and client-side application code
- storage layer- database used to store information rather than storing locally in text files of some kind
- databases can be distributed across many machines and have higher capacity
- multiple users can access a remote database at the same time
- databases can take data from multiple sources
- analytics layer
- database structure
- relational databases consist of a series of tables, with each table having its own defined schema and a number of records (rows)
- each table has rows and columns; each column has a name and a data type associated with it
- database schema is a particular configuration of tables with columns
- schema can change, but migrating (systematic update) data to new structure is costly (tricky to make sure that things won't break in the move and that no data is lost)
- each record (row) must conform to the schema and ideally will have a value in every field (column) for efficiency (though realistically there are NULLs, blanks and N/As in tables)
- three kinds of tables:
- raw- contain simple, relatively unprocessed data (closely resembles the data produced by the operational services)
- processed tables- contain data that's been cleaned and transformed to be more readable/useable
- roll-up tables- specific kind of processed table that take data and aggregate it @TODO
- SQL- Structured Query Language
- language used to create, retrieve, update and delete database records
- there are several flavors of SQL with meaningful minor differences; PSQL, SQLite, MySQL
- good to remember that SQL sees the world in rows, and we'll want to think about attributes at a row level (with some grouping and aggregating)
### Assignment 2: Setting up PSQL (already done)
- psql -d database_name -f file.sql
### Project 3: SQL Basics
- The CREATE clause - to create a new table
- lowercase for table names, no spaces (underscores)
- each column has a column name and a TYPE (example below has arbitrary types filled in), but can also have constraints like prohibiting null values; column creation lines are separated by commas
CREATE TABLE table_name (
some_column_name FLOAT column_constraint,
some_second_column_name TEXT,
...
last_column_name TYPE);
- The SELECT and FROM clause
- SELECT retrieves rows FROM a table
- select all rows:
SELECT * FROM someTable;
- select specific rows
SELECT some_column_name, last_column_name FROM table_name;
- aliasing
- can SELECT a column and return it with a different label; useful when selecting across multiple tables that reuse column names
SELECT some_column_name AS col1 FROM table_name;
- Filtering with WHERE
- allows us to specify a set of conditions that the results must meet
- LIKE - for pattern matching when analyzing string data
- BETWEEN - check if a value is between a pair of values
- AND and OR - linking conditions together
- Ordering with ORDER BY
- control the order in which results are returned
- can link together multiple ordering conditions
ORDER BY some_column_name DESC;
- Limiting with LIMIT
- limit the number of results returned, for example to get the top 5
- Formatting notes:
- new lines for everything, indent column specific instructions
- ALL CAPS for SQL instructions, whatever casing is relevant for other names
#### Exercises:
https://gist.github.com/jordanplanders/298877231bb950a192223c681754dd56
- - 1. The IDs and durations for all trips of duration greater than 500, ordered by duration.
SELECT
trip_id,
duration
FROM
trips
WHERE
duration >500
ORDER BY
duration;
- - 2. Every column of the stations table for station id 84.
SELECT *
FROM
stations
WHERE
station_id = 84;
- - 3. The min temperatures of all the occurrences of rain in zip 94301.
SELECT
mintemperaturef
FROM
weather
WHERE
events = 'Rain'
AND zip = 94301;
### Project 4: Aggregating and grouping
- GROUP BY-
- comes after WHERE clause and before ORDER BY clauses;
- without aggregating function, just gets rid of duplicate entries
- all columns in GROUP BY clause must also be in SELECT statement
- can use col numbers instead of col names
- Aggregators
- functions that take a collection of values and return a single value
- return a column labelled by the function rather than the column name, so you need to alias it
- AVG, MIN, MAX, COUNT(*)
#### Exercises: https://gist.github.com/jordanplanders/1c3e994b6246aa51cdba8b4e65166f28
-- 1. What was the hottest day in our data set? Where was that?
SELECT
maxtemperaturef,
zip
FROM
weather
ORDER BY
maxtemperaturef DESC
LIMIT 1;
-- 2. How many trips started at each station?
SELECT
COUNT(trip_id),
start_station
FROM
trips
GROUP BY
start_station
ORDER BY
COUNT(trip_id) DESC;
-- 3. What's the shortest trip that happened?
SELECT *
FROM
trips
ORDER BY
duration
LIMIT 1;
-- 4. What is the average trip duration, by end station?
SELECT
end_station,
avg(duration)
FROM
trips
GROUP BY
end_station
ORDER BY
avg(duration);
### Project 5: Joins and CTEs
- Basic Joins
- in a join clause indicate one or more pairs of columns you want to join the two tables on
- by default SQL performs an inner join (only returns rows that are successfully joined from the two tables)
- comes after the FROM statement (if multiple tables, order matters), followed by an ON clause that specifies the table.columns that should be the same (to link the two)
SELECT
table1.col1,
table1.col2,
table2.col3,
table2.col4
FROM
table1
JOIN
table2
ON
table1.col2 = table2.col2;
- Table aliases
- in a join, often useful to alias table to simplify table names or add a table name (in the case of a self join) in the ON clause
- Types of Joins
- (INNER) JOIN: only returns rows that are successfully joined
- LEFT (OUTER) JOIN: returns all rows from the left table even if there are no common rows in the right table; rows without a match will be filled with NULL
- RIGHT (OUTER) JOIN: same as a LEFT (OUTER) JOIN if you reverse the table order in the FROM and JOIN clauses
- (FULL) OUTER JOIN: returns all rows with NULLs in all places where the join doesn't fill in data
- CTEs (Common Table Expressions)
- since a join statement returns a table, you can join the result of a join statement to other tables or to the results of other queries
- note: JOINs happen before aggregate functions so if you want aggregate information about one table and information from the other table, but you don't want information from the other table to weight the results of the aggregate, create the first table with aggregation, THEN join to the second table
- if you join stations with trips and calculate the average lat and lon of all start stations in a city, it will be the average location weighted by the number of trips; however, if you calculate the average station location from the stations table first and then join, it will be the average location over the set of unique stations in that city.
WITH intermediate_table_name AS (query1)
- multiple joins are common to collect information from multiple tables
- Case
- set up conditions then take action in a column based on them
- CASE WHEN condition THEN value ELSE value END
- CASE statements go in the SELECT clause, indicate what value to return given a conditional statement, and are then aliased
#### Exercises:
https://gist.github.com/jordanplanders/f5d960fa280c27d5772a93b6bd268bf0
-- 1. What are the three longest trips on rainy days?
SELECT
trips.trip_id,
weather.date,
trips.duration,
weather.events
FROM
trips
JOIN
weather
ON
weather.date = SUBSTRING(trips.start_date, 0, 11)
WHERE
weather.events = 'Rain'
GROUP BY
trips.trip_id,
weather.date,
trips.duration,
weather.events
ORDER BY
weather.date
LIMIT 300;
-- 2. Which station is full most often?
--FIND WHICH STATION IS FULL (DOCKS_AVAILABLE = BIKES_AVAILABLE) MOST OFTEN (NUMBER OF STATUS UPDATES WHERE THIS WAS TRUE)
WITH station_full
AS(
SELECT
station_id,
COUNT(station_id) as times
FROM
status
WHERE
status.bikes_available = status.docks_available
GROUP BY
station_id
ORDER BY
COUNT(station_id) DESC
LIMIT 1)
-- MATCHING A STATION NAME WITH THE STATION_ID FROM ABOVE
SELECT
stations.name,
station_full.times
FROM
stations
JOIN
station_full
ON
stations.station_id = station_full.station_id;
-- 3. Return a list of stations with a count of number of trips starting at that station but ordered by dock count.
-- Query to find number of trips started at each station
WITH
stations2 AS
(
SELECT
stations.station_id AS station_id,
stations.name, COUNT(*) as trips_started
FROM
stations
JOIN
trips
ON
stations.name = trips.start_station
GROUP BY
trips.start_station, stations.station_id, stations.name)
-- QUERY TO ORDER STATIONS2 BY NUMBER OF DOCKS AVAILABLE
SELECT
status.docks_available,
stations2.name,
stations2.trips_started
FROM
stations2
JOIN
status
ON
stations2.station_id = status.station_id
ORDER BY
status.docks_available DESC
LIMIT 10;
-- 4. (Challenge) What's the length of the longest trip for each day it rains anywhere?
-- QUERY TO ONLY TRIPS WHEN IT'S RAINING
WITH raining AS (
SELECT
weather.date AS date,
trips.start_date AS trip_timestamp,
trips.duration AS duration
FROM
weather
JOIN
trips
ON
weather.date = SUBSTRING(trips.start_date, 0, 11)
WHERE
events = 'Rain')
SELECT
max(duration)/3600 AS max_duration_hrs,
date AS start_date
FROM
raining
GROUP BY
date
ORDER BY
date
LIMIT 100;
### Project 6: Airbnb Cities
- What's the most expensive listing? What else can you tell me about the listing?
WITH max_price_list AS (
SELECT * from calendar
WHERE cast(price as float) >0
ORDER BY cast(price AS float) DESC
LIMIT 1)
SELECT
listings.neighbourhood,
listings.room_type,
listings.minimum_nights,
max_price_list.date,
max_price_list.price
FROM listings
JOIN max_price_list
ON cast(max_price_list.listing_id AS int) = listings.id;
Calabasas Entire home/apt 2 2018-12-22 61000.00
- What neighborhoods seem to be the most popular?
WITH taken_list as (
SELECT
count(*) AS num_taken_list,
listing_id
FROM calendar
WHERE available = 't'
AND substring(date, 0, 5)= '2018'
GROUP BY listing_id )
SELECT
listings.neighbourhood,
SUM(taken_list.num_taken_list) AS num_taken
FROM listings
JOIN taken_list
ON cast(taken_list.listing_id as int) = listings.id
GROUP BY listings.neighbourhood
ORDER BY SUM(taken_list.num_taken_list) DESC
LIMIT 5;
neighborhood num_taken_nights
Hollywood 181607
Venice 158504
Downtown 107685
Long Beach 86799
Santa Monica 67395
- What time of year is the cheapest time to go to your city? November
SELECT
AVG(cast(price AS float)),
substring(date, 6, 2) AS mo
FROM calendar
GROUP BY substring(date, 6, 2);
average price month
231.06087495398208 11
- What about the busiest? November
WITH freerooms AS(
SELECT
COUNT(*) AS free,
substring(date, 6, 2) AS mo
FROM calendar
WHERE available = 't'
GROUP BY substring(date, 6, 2)),
busyrooms AS(
SELECT
COUNT(*) AS busy,
substring(date, 6, 2) AS mo
FROM calendar
WHERE available = 'f'
GROUP BY substring(date, 6, 2))
SELECT
cast(busyrooms.busy as float)/(cast(freerooms.free as float) +cast(busyrooms.busy as float)) AS busy_room_rate,
busyrooms.mo
FROM busyrooms
JOIN freerooms
ON busyrooms.mo = freerooms.mo;
## Lesson 3: Intermediate visualization
### Assignment 1: The basics of plotting review
- Basic plot types
- line plots- data over some continuous variable
- scatter plots- relationship between two variables
- histograms- distribution of a continuous dataset
- bar plot- counts of categorical variables
- QQ plot- how close a variable is to a known distribution & outliers
- box plot- compare groups and identify differences in variance & outliers
### Assignment 2: Formatting, subplots, and seaborn
- seaborn @TODO let's talk about the structure of the seaborn package
- sns.load_dataset() @TODO what form does the data need to be in for it to load properly?
- sns.set(style = )
- sns.despine()
- sns.FacetGrid()
- .map(plottype, variable_to_be_plotted)
- sns.boxplot(x = , y = , hue = , data = )
- sns.factorplot(x= , y= , hue= , data= , kind=* , ci=, join= , dodge= )
- bar: bar plot
- point: like a bar plot but more efficient; good for points that have error bars; may or may not be connected
- sns.lmplot(x= , y= , hue= , data= , fit_reg= , ci= , lowess= , scatter_kws={})
- scatter plot
- fit_reg: with/without a regression line
- ci: with/without confidence interval error cloud @TODO I don't know how to calculate this error cloud manually
- lowess: using local weighting to fit a line @TODO
- the col= parameter results in the data being split out by category and plotted in subplots @TODO
#### Drill 3: Presenting the same data multiple ways
https://github.com/jordanplanders/Thinkful/blob/master/Bootcamp/Unit%201/bike_data/Kevin_bike_datavis.ipynb
### Assignment 4: Cleaning Data
- Finding Dirt
- anomalous values
- fake answers (straightlining, repeating answer sequence, time to finish below some threshold)
- Cleaning Dirt
- replace with NULL or None
- map to a valid response (an extreme value maps to the highest non-outlier response, winsorizing )
- remove (duplicate entries)
- other (data issues that are systemic, widespread, or present for a particular data-related reason)
- clean with code so that there's a record of the cleaning process
- don't alter original data, keep a separate "clean copy"
### Assignment 5: Manipulating strings in Python
- string methods
- re (regular expressions or regex)
- a regular expression is a sequence of characters that defines a search pattern
- not always more efficient than string methods
- extracting different categories of character from a string
- isdigit()
- isalpha()
- isnumeric()
- isspace()
- isalnum()
- Apply: .apply() allows one to apply a method to each element in a data frame or series
- lambda functions: small, temporary, unnamed function of the format:
lambda x: f(x) (if [condition] else [alternative])
- one line functions that would usually be:
def function(x):
return f(x)
- filter:
- returns an iterator over the elements of a series or string (based on a function that returns booleans), keeping only the entries/characters for which the function returns True. @TODO @WTF
- list(filter(lambda x: boolean function, series))
- ex: list(filter(lambda x: str(x).isdigit(), money))
- series.apply(lambda x: ''.join(list(filter(boolean_function, str(x)))))
- ex: money.apply(lambda x: ''.join(list(filter(str.isdigit, str(x)))))
- splitting strings apart: split a string into a list of strings; by default splits at spaces, but can be split at some other character or string
- pandas has its own version: series.str.split(delimiter, expand=True) that will split the series of strings on the delimiter or regex pattern and return a set of series that correspond to the first, second, third, ... nth element in the split
- ex: word_split = words.str.split('$', expand=True)
names = word_split[0]
emails = word_split[1]
- replace: replace specific characters or strings with a new string
- pandas has its own version: series.str.replace(str1, str2)
- changing case: often it will be useful to standardize the case with .lower(), .upper(), or .capitalize()
- stripping whitespace: string.strip(), also lstrip(), rstrip()
- pandas has its own version for whitespace: series.str.strip()
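A small sketch pulling several of these pieces together on a made-up money series (the values and column name are illustrative, not from the source):

import pandas as pd

money = pd.Series(['$1,200', ' 450 dollars ', 'n/a', '$3.50'])

# keep only digit characters using string methods + filter
digits_only = money.apply(lambda x: ''.join(filter(str.isdigit, str(x))))

# or extract a full numeric pattern (decimals included) with a regex, then convert
amounts = (money.str.strip()
                .str.replace(',', '', regex=False)
                .str.extract(r'(\d+\.?\d*)')[0])
amounts = pd.to_numeric(amounts, errors='coerce')  # non-matches become NaN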
### Assignment 6: Exercise on cleaning data
### Assignment 7: Missing data
- missing data can be systematically missing, which raises questions about dataset reliability
- even if missingness is random, analysis can break because of missing values --> df.dropna() is a built in method in pandas to drop all rows with a missing value
- When does missingness matter?
- so many rows have to be thrown out that the set loses statistical power
- systematic missing values causes systematic subsets of data to be thrown out making the dataset biased
- MCAR- missing completely at random: a three-year-old inserts crayons into a random server in a server room and corrupts a drive (it could have been a three-year-old from anywhere given it's random that one would be there in the first place, and they picked a random server and a random place to put crayons)
- MAR- missing at random: if a particular group is likely to skip a question regardless of response and we know this, we can explain the absence of the data and carefully work around it
- check correlation between missing scores and various variables to identify what is lurking
- MNAR- missing not at random: if samples that are likely to have a particular value are absent, stop
- Imputing Data- guessing at what values would fill empty fields
- replacing with the mode, median, or mean works for keeping central tendency the same, but reduces the variance and alters correlations with other variables
- can group existing entries into similar groups and impute strategically within each group
- Beyond Imputation
- sometimes it's possible to collect more data, either in a focused way targeting the MAR group, or more broadly if it's an MCAR problem
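A minimal sketch of these options in pandas, on a made-up frame (column names and the grouping column in the comment are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [34, np.nan, 29, 41, np.nan],
                   'income': [50, 62, np.nan, 75, 58]})

print(df.isnull().sum())          # how much is missing, per column

dropped = df.dropna()             # drop incomplete rows (costs statistical power)
imputed = df.fillna(df.median())  # impute the median (keeps central tendency, shrinks variance)

# group-based imputation: fill within similar subgroups instead of globally
# (assumes a hypothetical 'group' column)
# df['age'] = df.groupby('group')['age'].transform(lambda s: s.fillna(s.median()))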
## Lesson 4: Experimental Design
### Project 1: A/B tests
- One of many possible experimental designs used to identify whether one version of an object of interest is better at producing a desired outcome
- Components:
- Two versions of something whose effects will be compared (preferably a control version and an alternate, though realistically it is sometimes two unknowns)
- a sample (representative of the population), divided into two groups (that are the same in composition and preferably randomly chosen)
- a hypothesis stating an expectation of what will happen
- identified outcome(s) of interest; the measurable key metric that will be used to identify/characterize change
- other measured variables: measure the hell out of everything to help check that the two groups were sufficiently similar, and to identify other responses to the change
- Getting a good sample: the sample has to be representative of the population and any differences in outcome should be due to differences in treatment
- easy when there's a constant flow and it's possible to just sample the flow
- hard when it has to be all or nothing (music on, music off)
- Key to key metrics
- metric as close to the business goal as possible; a metric that reflects an intermediate step and doesn't measure the final outcome doesn't really help
- metrics that are reliably measurable, preferably something passively measured and not based on specific engagement with subjects or self-reported data
- metrics may have different time windows; it may take a few months for something to surface as a win or a loss.
#### Exercises
- Does a new supplement help people sleep better?
Hypothesis: would presumably be that the supplement would help people sleep better
Sample: It's difficult to sample the population at large, but perhaps by getting people to opt in via their PCPs, making sure that the percentage of people who say they don't sleep well matches the national average would be a start.
Experiment: The experiment would involve sleep studies of all people without the supplement as a baseline, measuring reported quality of sleep as well as EEG and other biometrics. Then the group would be split; half the sample would be given a placebo and half would be given the supplement, and the sleep study would be repeated. Ideally there would be more than one night on each side, but realistically it would be two nights per subject.
Key Metric: fewer bouts of restlessness, less EEG activity, or self reported ratings. (As a lay person in the field, I would probably have to identify the key metric from the baseline night reports)
- Will new uniforms help a gym's business?
Hypothesis: new uniforms will help a gym's business.
Sample: People who walk into a gym
Experiment: This is an all-on, all-off scenario, but potentially by taking September as a control month and measuring October with treatment applied, one could avoid seasonal effects of upcoming holidays, or post-holiday season, or pre-summer, or summer vacation lulls.
Key Metric: "Help" would likely be based on revenue per month, or perhaps revenue per quarter if there are discounts on new sign ups that would need take time to manifest.
Additional measured variables: new information seekers, new sign ups, number of services added to existing subscriptions, number of people re-upping their memberships, potentially surveying about quality of service/professionalism as a back of the envelope impression
- Will a new homepage improve my online exotic pet rental business?
Hypothesis: a new homepage will help with exotic pet rental business
Sample: visitors to exotic pet rental site
Experiment: Use a splitting server to show some people the new site and some people the old site and track the outcomes (e.g. split.io)
Key Metric: increase in monthly revenue?
Additional: number of new rentals, number of extended rentals, differences in the types of pets rented
- If I put 'please read' in the email subject will more people read my emails?
Hypothesis: adding 'please read' in email subject lines will prompt more people to read emails
Sample: listserve
Experiment: Create a sample set from the email list and send half the altered subject line, and half the standard subject line for the same email with the same send date/time.
Key Metric: "people who read emails" as those who clicked on something in the email or followed up
Additional variables: open rate, unsubscribe rate
### Assignment 2: Simpson's paradox
- phenomenon in which the average over a number of groups shows one trend, but the average for each individual group shows the opposite or no trend (lurking variable paradox: an unaccounted-for variable changes the relationship between two other variables)
- using randomization to make sure splits don't have lurking demographic tendencies can help
- confirm the groups are similar before interpreting your results.
- Make it a habit to look at subgroups within your A/B test to make sure the overall trend is reflected in the subgroups.
- If the subgroups differ from the overall trend, your question should guide whether you report conclusions based on the overall sample, the subgroups, or both. You don't want to advocate for condition A, even if it performs better overall, if condition B actually works better within every subgroup
### Project 3: Bias and A/A Testing
- bias- anything that causes a sample to systematically differ from the population
- sampling bias/selection bias: when the sample differs from the population in a systematic way
- assignment bias: when the sample is split in a way that makes the make-up of the groups differ
- contextual bias: when a feature of the environment of testing prompts people in one group to have a different experience (and thus to potentially feel differently about the situation than they would otherwise)
- observer bias: when the tester/interviewer interferes with the testing (which is to say, interacts with the subjects substantially)
- A/A Testing- comparing the outcome of choice between two identical versions of something. Sets a baseline for what the difference might be between groups even in a situation in which nothing was different
- testing method errors can be exposed
- sample split errors can be exposed
- sample size errors can be exposed (perhaps the event is too rare to detect in the planned sample size)
#### Drill: Am I biased?
- You're testing advertising emails for a bathing suit company and you test one version of the email in February and the other in May.
- The design of the study does not appear to take into account the seasonality and effect of geography on bathing suit sales. Are the subjects in the northern or southern hemisphere? Are they from a cold place where people are more likely to go on holiday to warm places in March, or is summer the only bathing suit season? Anyway around it, there are contextual biases lurking.
- You open a clinic to treat anxiety and find that the people who visit show a higher rate of anxiety than the general population.
- Primarily people concerned about anxiety will visit the clinic so the visiting population won't reflect the composition of the population at large leading to selection bias.
- You launch a new ad billboard based campaign and see an increase in website visits in the first week.
- A billboard campaign will disproportionately affect people local to a particular region, and/or people in cars passing by. These two groups may or may not be representative of the population at large
- You launch a loyalty program but see no change in visits in the first week.
- A week is likely too short a window for measuring shifts in behavior prompted by a program that likely involves accruing points. Without more information about the program, unless people were automatically enrolled and there was a huge marketing push around awareness, it is unlikely to sway behavior immediately (particularly if users have to opt-in explicitly).
### Project 1-4-4
https://docs.google.com/document/d/1RK7Uil3IxYxlqCezrhFsfb0VaHFaqWGoQR3kAjMK9vk/edit?usp=sharing
### Project 1-4-5: The research proposal
- The problem:
- define the question or problem
- justify why the problem should be studied
- review what we already know about the problem
- The potential solution
- a potential solution is also a hypothesis about what might solve the problem
- The method of testing the solution
- design of the experiment
- analysis plan
- benchmarks (key metrics, points of interest)
- Why bother?
- catch and fix a disconnect between the question and the study design
- catch and fix a study design that will not generate usable data
- account for false positives
- prevents mixed expectations about what will be done and how it will be executed
#### Drill:
Prompt:
To prevent cheating, a teacher writes three versions of a test. She stacks the three versions together, first all copies of Version A, then all copies of Version B, then all copies of Version C. As students arrive for the exam, each student takes a test. When grading the test, the teacher finds that students who took Version B scored higher than students who took either Version A or Version C. She concludes from this that Version B is easier, and discards it.
Plan:
Problem: Students cheat on exams. By using three versions of an exam, it is more difficult for students to look at each other's papers for answers because the papers are not necessarily the same type. However, by administering three versions of the test, it is also possible that one of those tests will be notably easier or harder than the other two, making the grades incomparable.
Potential Solution: Using multiple exam versions is a reasonable strategy for combatting this problem, however it should be executed and calibrated as carefully as possible. Exams should be collated ABC and passed out to students only once all are present and seated (preventing potential clusters of A exams, for example).
Experiment: Once the test has been administered, results should be examined for student subgroups. Did one test have notably higher or lower scores than the other two? (ANOVA?) Did students deviate notably and inexplicably from their historical performance? (paired t-tests between past exam scores and current scores?) Making sure each test taker was surrounded by students taking different versions by passing out the exam carefully should significantly reduce the probability of cheating occurring on a particular version.
### Assignment 1-4-6: AB Testing and t-tests
- Evaluating A/B Test Data using tests
- t-test is a statistical test that calculates the size of the difference between two means given [their variance and sample size] noise in the data
- t = (y1_mean - y2_mean)/(s1^2/N1 +s2^2/N2)^(1/2)
- y1_mean, y2_mean are the central tendencies of the two datasets
- s1, s2 are the standard deviations of the datasets
- N1, N2 are the sizes (number of individuals) of the two datasets
- larger t values indicate more significant differences in the means relative to the noise and lead to small p values--> the two samples in question were not drawn from the same population;
- depending on the problem, we choose a threshold of improbability called a significance level (alpha); if the p-value is smaller than alpha, there is a significant difference between the two sets of samples
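A quick sketch of running this comparison with scipy on simulated control/treatment outcomes (the numbers are made up for illustration); Welch's t-test uses the standard-error formula above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # control group metric
group_b = rng.normal(loc=10.5, scale=2.0, size=200)  # treatment group metric

# Welch's t-test: difference in means relative to the noise and sample sizes
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)  # small p-value -> unlikely both samples came from one population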
### Assignment 1-4-7: Null-hypothesis significance and testing
- Null Hypothesis testing
- tester has
- a hypothesis that describes what they think the data will look like if their expectations are confirmed
- a null hypothesis that describes what the data will look like if expectations are not confirmed
- data is then compared to the null hypothesis
- calculate a t-value, which is situated in a t-distribution that represents the t-values you would get if the null hypothesis were true; the farther the calculated t-value is from the center, the less likely that the null hypothesis is true
- the area under the curve beyond our t-value (in the tails) is the p-value
- the p-value represents the probability of getting a t-value that large or larger if the null hypothesis is true
- p<.05
- is the rule of thumb; that means that there is a 1 in 20 chance of returning a false positive
- corresponds to the two sigma (standard deviation) mark
- not ubiquitous; there are fields where you have to be much more sure than p = .05 and the threshold value will be much lower
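A sketch of how a t-value maps to a p-value via the tail area of the t-distribution (the t-value and degrees of freedom are made-up numbers):

from scipy import stats

t_value = 2.3   # example t statistic
dof = 398       # degrees of freedom (roughly N1 + N2 - 2 for a two-sample test)

# two-tailed p-value: area in both tails beyond |t|
p_two_tailed = 2 * stats.t.sf(abs(t_value), dof)
print(p_two_tailed)  # compare against the chosen significance level, e.g. 0.05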
### Assignment 1-4-8: T-tests and Philosophy of NHST
- T-values
- the default t-test is two-tailed, which is to say it's the probability of getting a more extreme value in either direction.
- if a negative result is impossible for some reason (for example), one can use a one-tailed t-test
- Philosophy of NHST (Null Hypothesis significance testing)
- p-value represents the probability of getting the data you have if the null hypothesis were true in the population; put another way, the probability of pulling this data by random chance from the same population that is unaffected
- no mention of an "actual hypothesis"; rather, rejecting the null hypothesis is treated as tantamount to accepting the hypothesis
- However, you can't truly limit the possibility space to two outcomes, so we can't prove that our hypothesis is what produced the effect and that the effect wasn't due to some other factor. Instead we stick with disproving a null hypothesis and stating that the results support the hypothesis we put forward
### Assignment 1-4-9: Experimentation guided example
# Unit 2
## Lesson 1: Preparing to Model
### Assignment 2-1-1: What is a model?
- statistical model: a simplified mathematical representation of the data scientist's best guess about the underlying processes that created the data
- simplified: prioritize information dense features, viz. the ones that explain the most variance. Probably exclude low variance features in the name of making the model computationally cheaper
- mathematical representation: series of formulas
- best guess: based on our best understanding and testing of the available information (likely will need to be updated as more information becomes available)
- underlying process: @todo
- Models and Math
- pick models that are suited to characteristics of the problem (continuous data, categorical data, two-variable, multivariate)
### Assignment 2-1-2: Formulating a research question
- model: mathematical expression of a research question; different types of research questions beg for different kinds of models
- What is already known about this topic? Check out how others have approached similar questions
- What sort of data, or ways to collect data, are available to me on this topic? Do not engage without data to work with...
- What skills do I have? Don't take on questions you can't cope with in the allotted time
- Can this question be answered using quantities or probabilities? The question must be amenable to a numeric solution.
- Can this research question be asked in one sentence? If not, refine it.
### Project 2-1-3: Drill: Formulating a good research question https://docs.google.com/document/d/1FFr1JjqG21LxhLtED01rYJMPCGTQ-8WWcqUIZKPbzEY/edit?usp=sharing
1. What is the 1994 rate of juvenile delinquency in the U.S.? [Good; regression]
2. What can we do to reduce juvenile delinquency in the U.S.? [What are the most effective approaches to reducing juvenile delinquency in the U.S.?]
3. Does education play a role in reducing juvenile delinquents' return to crime? [Good; binary classifier]
4. How many customers does AT&T currently serve in Washington, DC? [Bad? Isn't this a fact, rather than a research question?]
5. What factors lead consumers to choose AT&T over other service providers? [Good; PCA]
6. How can AT&T attract more customers? [Bad; Which of the following methods are most effective in attracting customers?]
7. Why did the Challenger Shuttle explode? [Good?]
8. Which genes are associated with increased risk of breast cancer? [Good]
9. Is it better to read to children at night or in the morning? [Good]
10. How does Google’s search algorithm work? [Bad, though I'm not sure how to fix it]
### Assignment 2-1-4: Exploring the data
- Univariate (looking at one variable at a time)
- how many variables?
- how many data points?
- what kind (categorical, continuous, ordinal)
- do any variables have known distributions
- missing data? How much and what kind?
- variance in each of the variables
- Bivariate
- continuous-continuous
- scatterplot (scatterplot matrix)
- sns.PairGrid(df.dropna(), diag_sharey= False)
- lmplot (scatter plot with regression line and r^2 value)
- sns.lmplot(data = df, x = "", y = "")
- correlation ranges from -1 (strong negative relationship: as one goes up, the other goes down) to 1 (strong positive relationship: both go up together)
- sns.heatmap(df.corr())
- NB: check for two-dimensional outliers that represent unusual combinations of values
- continuous-categorical
- estimate the value of a continuous variable for each value of a categorical variable
- sns.boxplot()
- sns.violinplot()
- sns.stripplot()
- FacetGrid is a grid that shows a particular pair of variables broken out by one or two categories
- g = sns.FacetGrid(df, col = variable)
g = g.map(plottype, x_label, y_label)
- categorical-categorical
- relates the number of counts for a category for each label in another category
- sns.countplot(data = df, x = variable, hue = variable2)
- pd.crosstab(df.variable1, df.variable2) table of counts giving the number of datapoints for each combination
- chi-square test- indicates whether one combination of levels is significantly larger or smaller than the rest (rather than compare the means of two datasets, compare the counts of a variable in two datasets to see if they could have come from the same population); see the sketch after this list
- NB: check for subgroups with very small counts
- Interpreting pairwise plots and stats
- flag two-dimensional outliers
- identify variables that are redundant to each other (variables that are strongly correlated)
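A small sketch of the categorical-categorical checks above, again using seaborn's tips dataset as a stand-in (an assumption):

import pandas as pd
import seaborn as sns
from scipy import stats

tips = sns.load_dataset("tips")

# counts for each combination of two categorical variables
counts = pd.crosstab(tips["day"], tips["smoker"])
print(counts)

# chi-square test: could these counts plausibly come from independent variables?
chi2, p, dof, expected = stats.chi2_contingency(counts)
print(chi2, p)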
### Project 2-1-5: Feature Engineering
- Feature- variable that has been transformed in such a way as to make it well suited to work within a model and explain variance in the outcome of interest
- Working with categorical variables
- translate a categorical variable with x labels into x-1 numerical features
- reference value- level without a feature
- can group categorical variables (e.g. Norway, Sweden into Nordic)
- Changing variable types
- can make binaries out of continuous variables (e.g. some are less than some value and some are more than that value)
- Numerical variable types
- Ordinal variables- indicate rank order only; they don't give information about the distance between one value and the next, just that one was first and the next came after
- Interval variables- variables that indicate rank order and distance but don't have an absolute zero point
- ratio variables- variables that indicate rank, distance, and have a meaningful absolute zero value
#### Drill: categorize each variable from the ESS dataset
1. cntry (country)- categorical
2. year- numeric, ratio
3. idno (respondent's identification number)- categorical (no indication that these were assigned with order significance)
4. tvtot (tv watching per avg wkday)- numeric, ratio
5. ppltrst (most people can be trusted?)- numeric, ordinal
6. pplfair (most people are fair?)- numeric, ordinal
7. pplhlp (most people are helpful)- numeric, ordinal
8. happy (how happy are you?)- numeric, ordinal
9. sclmeet (how often do you meet friends etc?)- numeric, ordinal
10. sclact (take part in social activities)- numeric, ordinal
11. gndr (gender)- categorical
12. agea (age)- numeric, ratio
13. partner- categorical
- Combining two or more highly-correlated variables
- want minimum set of features that describe the space, therefore want features that are correlated with the outcome, but uncorrelated with each other
- average highly correlated variables or drop one
- use Principal Component Analysis (PCA) to reduce the correlated set of variables
- Dealing with non-normality
- if normality is a model-assumption (and it often is) it may be necessary to transform (e.g. log, sqrt, or invert) variables so that they have more normal distributions
- creating linear relationships
- many models assume the relationship between a feature and an outcome is linear so in order to accommodate it in a model it may be useful to work with it as a transformation (square, cube, etc)
- making variables easier to understand in light of the research question
- re-encode a variable into a feature that matches the terminology of the research question (make sure scaling is such that positive correlations are intuitive)
- Leveling the playing field
- some models assume all features are scaled to the same bounds, so may need to rescale accordingly (usually to a mean of 0 and a standard deviation of 1)
- preprocessing.scale(df)
- All about interactions
- may want to build interaction features by multiplying two features together, since the product may relate to the outcome in a way neither feature does alone
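A minimal sketch of a few of these transformations (dummy coding, a log transform, an interaction, and rescaling) on a made-up frame:

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'country': ['Norway', 'Sweden', 'France', 'Norway'],
                   'income': [52000, 61000, 48000, 150000],
                   'age': [34, 29, 41, 55]})

# categorical variable with x labels -> x-1 features (drop_first leaves out the reference level)
features = pd.get_dummies(df['country'], drop_first=True).astype(float)

# tame a skewed continuous variable with a log transform
features['log_income'] = np.log(df['income'])

# interaction feature: the product of two variables
features['age_x_income'] = df['age'] * df['income']

# level the playing field: rescale to mean 0, standard deviation 1
scaled = preprocessing.scale(features)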
### Assignment 2-1-6: Principle component analysis (PCA)
- What is PCA? complexity reduction technique that tries to reduce a set of variables down to a smaller set of components that represent most of the information in the variables
- identifies sets of variables that share variance and creates a component to represent that variance
- lose variance in exchange for a smaller set of features (computationally cheaper, better satisfies requirement of features not being correlated, less vulnerable to overfitting)
- Things get messy
- need variables to be normally distributed
- relationships between variables are linear
- correlations are weak (but non-zero) to moderately strong but less than ~.8
- things can get unstable if it's fewer than three variables that are fairly tightly correlated, but also if it's a lot of variables that are highly correlated
- PCA: Rotation in space
- take a dataset of n variables as an n-dimensional space
- PCA standardizes variables so that mean = 0 and standard dev = 1 (so all variables go through the origin and share variance)
- choose a set of axes so as to minimize the distance between the data points and the axis
- PCA: math (ish)
- identify the axes (eigenvectors) that capture the most shared variance (equivalently, minimize the distance between the points and the axis), then multiply the feature matrix by the transformation matrix
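A short sketch of the whole pipeline with scikit-learn, using the iris measurements as a stand-in for a set of correlated variables:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # four correlated numeric variables

X_std = StandardScaler().fit_transform(X)   # mean 0, std 1, as PCA expects
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)       # rotated, reduced feature matrix

print(pca.explained_variance_ratio_)        # share of variance each component keeps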
### Assignment 2-1-7: Feature Selection
- good practice to split dataset into a training and a test set and design feature selection on training set (not on both parts)
- Filter Methods: evaluate each feature separately and assign a score that is used to rank features such that scores above or below (or both) some point are discarded
- select relevant features but also likely to produce redundant features because they don't weed out features that are highly correlated to each other
- e.g. variance threshold, correlation to target variable
- Wrapper Methods: select sets of features based on performance; construct different sets and evaluate based on predictive power in a model (in comparison to performance of other sets)
- forward passes: algorithm begins with no features and they are added one at a time, keeping the features that have the highest predictive power
- backward passes: algorithm begins with all features and they are removed one at a time, removing the feature that has the least predictive power
- computationally expensive, feature set is never re-evaluated (?)
- Embedded methods: select features based on a fitting method, e.g. in regression where there's a penalty against complexity and the fitting objective is to minimize the cost function
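Hedged sketches of a filter method and a wrapper method in scikit-learn (the dataset and the choice of 10 features are arbitrary illustrations):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# design feature selection on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# filter method: score each feature separately, keep the top 10
filtered = SelectKBest(f_classif, k=10).fit(X_train, y_train)
print(filtered.get_support())    # boolean mask of the selected features

# wrapper method (backward passes): repeatedly drop the weakest feature
wrapper = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X_train, y_train)
print(wrapper.support_)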
## Lesson 2: Building your first model: Naive Bayes
### Assignment 1: Regression v. classification
- When building a model, there is always a kind of outcome we are interested in: a label (categorical) or a value. This determines whether we need a classifier or a regression model
- Classification: returns one (or more) categorical value, a discrete value from a specified set
- assigns a category to a given test observation
- assigns a probability measure for each category
- N.B. the only outcomes that can be returned are ones that have been seen in the training set
- Regression: returns a numeric value from either a bounded or unbounded number line
### Assignment 2: Algorithms intro.
- What is an algorithm? A set of instructions for a computer (efficient is better than not...)
- Algorithm efficiency and complexity
- scaling: number of steps = scaling factor (e.g. linear, quadratic, etc.)
- Big O Notation: a way to describe most inefficient performance (at worst, O(elements^num_steps)), or complexity
- how efficiently the algorithm scales with additional data; sets an upper bound on the size dataset one can reasonably work with
### Project 3: Drill: Regression or classification
1. The amount a person will spend on a given site in the next 24 months. (regression)
2. What color car someone is going to buy. (classifier)
3. How many children a family will have. (regression, though plausibly either)
4. If someone will sign up for a service. (classifier)
5. The number of times someone will get sick in a year. (regression, though plausibly either)
6. The probability someone will get sick in the next month. (classifier)
7. Which medicine will work best for a given patient. (classifier)
## Lesson 3: Evaluating classifiers
### Assignment 1: Accuracy and error types
- Success Rate: the most basic measure of success is obviously how often the model was correct (compare the target labels to the predicted labels). However, not all errors are created equal so we are concerned with what gave rise to the incorrect predictions; important to be able to inform how to deal with the error as well as potentially how to fix the error in the model (e.g. by adding more features, tuning parameters, etc.)
- Confusion Matrix: matrix that shows the count of each possible permutation of target and prediction (predicted outcome on one side, actual outcome on the other)— allows us to identify how many negative-negative, positive-positive and false positives (positive when should have been negative) and false negatives (negative when should have been positive)
- from sklearn.metrics import confusion_matrix; confusion_matrix(target, y_pred)
- False positive is also referred to as a “type I error” or “false alarm”
- False negative is also referred to as a “type II error” or “miss”
- Sensitivity: percentage of positives correctly identified (agree_pos/(agree_pos+false_neg))
- Specificity: percentage of negatives correctly identified (agree_neg/(agree_neg+false_pos))
- top row is [agree_neg, false_pos], second row is [false_neg, agree_pos]
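A tiny sketch of pulling sensitivity and specificity out of sklearn's confusion matrix (the label arrays are made up):

from sklearn.metrics import confusion_matrix

target = [0, 0, 0, 1, 1, 1, 1, 0]  # actual labels (made up)
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]  # model predictions (made up)

# rows are actual, columns are predicted: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of positives correctly identified
specificity = tn / (tn + fp)  # share of negatives correctly identified
print(sensitivity, specificity)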
### Assignment 2: Class imbalance
- Ideally the training set would contain an equal number of instances of each outcome so that the trained model has a good idea of what makes for each label and doesn’t over predict the dominant label. Put another way, if a rare instance has specific traits, but those traits are also seen in a dominant class, the model won’t learn to associate the rare traits with the rare class. (E.g. rare disease prediction, fraud detection)
- Baseline Performance: It’s important to consider the dominant class rate. If the model doesn’t do better than the percent of the dominant class represented in the training set, that means it could do as well or better by just predicting the dominant class all the time
- Dealing with class imbalance:
- Ignore it: engineer features that hopefully highlight rare class and hope for the best
- Change the sampling: deliberately over sample the minority class/under sample the majority class
- Probability outputs: some models like SVM or logistic regression can return the probability of different labels; use different cutoffs or more complex rules to decide what the final label should be
- Cost functions for errors: describes how errors are not equal so (for example) the cost of a type II error can be twice the cost of a type I error (not easy to do with the sklearn NB)
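A crude sketch of the "change the sampling" option with plain pandas (the frame is made up; libraries like imbalanced-learn offer more principled resampling):

import pandas as pd

# made-up imbalanced dataset: label 1 is the rare class
df = pd.DataFrame({'feature': range(100), 'label': [1] * 5 + [0] * 95})

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# naive over-sampling: draw the rare class with replacement until the classes match
balanced = pd.concat([majority,
                      minority.sample(len(majority), replace=True, random_state=42)])
print(balanced['label'].value_counts())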
### Assignment 3: In sample evaluation and cross validation
- Overfitting- when the model is excessively complex such that it describes the training set perfectly rather than the generalized underlying relationship; we need the model to work on the data we haven’t seen yet
- Holdout Groups: essentially set up a test set separate from a training set; the model needs to perform well on the test set which was not included in the training of the model. The higher the variance, the larger the training set needs to be.
- from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(data, target, ...)
- Cross Validation: if there is enough data, break the training set up into n folds (usually 5), and train the model on n-1 folds and test on the nth fold (Leave One Out is when the number of folds = the number of observations, useful if you're concerned that one observation will skew the model). Repeat this for every combination of folds and compare the accuracy. If accuracies are similar then the model is probably not overfitting
- from sklearn.model_selection import cross_val_score; cross_val_score(model_instance, data, target, cv=n_folds)
- By default it returns only accuracy, which doesn't offer a lot of insight into the type of error, so people often code up their own versions
- What’s a good score? check for class imbalance and the type of errors to really assess
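A short sketch combining a holdout split and cross-validation (the dataset and the Bernoulli Naive Bayes model are placeholders for whatever you are actually fitting):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = load_breast_cancer(return_X_y=True)

# holdout group: keep a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = BernoulliNB()
scores = cross_val_score(model, X_train, y_train, cv=5)  # accuracy on each of 5 folds
print(scores, scores.mean())                             # similar scores -> probably not overfitting

model.fit(X_train, y_train)
print(model.score(X_test, y_test))                       # final check on the holdout set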
### Project 4: Challenge: Iterate and evaluate your classifier
- test for overfitting using cross validation
- Already used test_train_split
- Already tuned feature set to remove words that appear in both the positive and negative tfidf sets
- Train model on different length ngrams
- Train model on positive words not subtracting off the negative words
- Train model on (positive words tfidf) - (negative words tfidf)
- Train model on (positive words tfidf) - (negative tfidf + count_vectorizer)
- Train model on (positive words tfidf) - (negative words count_vectorizer)
- Train model on (positive tfidf + count_vectorizer) - (negative tfidf)
- Train model on (positive tfidf + count_vectorizer) - (negative count_vectorizer)
- Train model on (positive words tfidf + count_vectorizer) - (negative tfidf + count_vectorizer)
- Run a loop and report accuracy of the model with all but one feature, for each feature, to assess how important each feature is to model performance
## Lesson 4: Linear Regression
### Assignment 1: Simple linear regression
- regression allows us to predict continuous variables
- Simple linear regression
- There are lots of regression techniques, but ordinary least squares (OLS) is far and away the most common and is often referred to as just “regression”
- Works by finding estimators of coefficients that describe the relationships between variables in a formula you define
- Simple linear regression is of the form y = alpha + beta*x where our objective is to estimate alpha and beta
- Least squares
- Seeks to estimate coefficients by minimizing error or residual (sum of the squared distances between each datapoint and the fit line— the type of regression is often named for the distance metric used to calculate the residual)
- regr = linear_model.LinearRegression(); regr.fit(X_train, y_train); y_pred = regr.predict(X_test)
regr.coef_, regr.intercept_ will yield info about the regression
- Predicting with Simple Linear Regression: with simple linear regression, estimates of the coefficients yield a y-intercept and a slope (colloquially). The domain of the equation is all reals even if that isn't the domain of the problem at hand, so heads up.
### Project 2: Multivariable Regression
- Multivariable Least Squares: when a least squares regression has more than one independent variable (AKA Multivariable least squares linear regression, multiple linear regression, Multivariable regression—NOT Multivariate regression which involves multiple dependent variables)
- All relationships between coefficients are linear; so an equation will look like y = alpha + c_1*x_1 + ...+ c_n *x_n
- Categorical Variables: each label gets its own variable and thus its own coefficient. For example, suppose we are predicting rent and our dataset has features: square footage, bedrooms, bathrooms, state. The first three are numeric, no problem; the categorical variable (state) has to be split into one variable for each state included in the data set, so the new feature set might look like: square footage, bedrooms, bathrooms, WA, CA, OR, with each data point having a 1 for the state where the property is and a 0 for the other state labels (labels should be mutually exclusive). By this reasoning linear regression can estimate a coefficient for each of the states and we can correctly note that CA is a financially insane place to rent (see the sketch after this list).
- Linear doesn't have to mean lines: while the relationships between the coefficients and the outcome are linear, it's fair game for the relationships between the variables to not be; e.g. y = alpha + c_1*x + c_2*x^2 (here x_1 = x and x_2 = x^2). But be careful not to overcomplicate the model to fit the training data; overfitting is a problem here too.
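A sketch of both ideas with a hypothetical rent dataframe (the column names and numbers are made up for illustration):

```python
import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({
    'sqft':  [700, 950, 600, 1200, 850, 1000],
    'beds':  [1, 2, 1, 3, 2, 2],
    'state': ['WA', 'CA', 'OR', 'CA', 'WA', 'OR'],
    'rent':  [1500, 3200, 1200, 4100, 1700, 1900],
})

# one 0/1 indicator column per state label (mutually exclusive)
X = pd.get_dummies(df[['sqft', 'beds', 'state']], columns=['state'])
X['sqft_squared'] = X['sqft'] ** 2   # nonlinear in the variable, still linear in the coefficients

regr = linear_model.LinearRegression()
regr.fit(X, df['rent'])
print(dict(zip(X.columns, regr.coef_)))
```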
### Project 3: Explanatory power: assumptions of linear regression @todo need to complete project
- The extraordinary power of explanatory power: multiple linear regression not only allows us to predict future outcomes, but also provides insight into the relationships between the underlying variables; the r^2 value (in [0,1]) gives a sense of how much of the variance in the data the model was able to explain and hence how much confidence we should have in our interpretation of the model.
- Low r^2 indicates a poor fit
- High r^2 is good unless it is very high in which case we should be concerned about overfitting
- Assumptions of multivariable linear regression
- Linear relationship: the outcome must be a linear function of the coefficients. Sometimes we can apply a nonlinear transformation to a feature to make its relationship with the outcome linear
- Multivariate normality: the error from the model (model predictions minus actual target values) should be normally distributed. Skewness in the distribution of the error can often be traced back to outliers in the data
- Homoscedasticity: error plotted against predicted values should be uniform in spread. It is concerning if there is considerably more error when predicting high or low values
- Low multicollinearity: correlations among features should be low or nonexistent (the model could attribute half the explanatory power to one variable and half to the other, but underneath the two variables are telling the same story, so this is not helpful if we are trying to explain the variance; as far as prediction goes, it might work fine)
- fixed by PCA (collapsing correlated variables into one component)
- or by dropping correlated features (a quick diagnostics sketch follows this list)
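A diagnostics sketch for the last three assumptions, assuming a fitted sklearn model regr, a feature DataFrame X, and a target y already exist (all names are assumptions):

```python
import matplotlib.pyplot as plt

predicted = regr.predict(X)
residuals = y - predicted

# Multivariate normality: the residuals should look roughly normal
plt.hist(residuals, bins=30)
plt.title('residual distribution')
plt.show()

# Homoscedasticity: residual spread should be uniform across predicted values
plt.scatter(predicted, residuals)
plt.title('residuals vs. predicted')
plt.show()

# Low multicollinearity: pairwise feature correlations should be modest
print(X.corr())
```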
### Project 4: Challenge: make your own regression model
## Lesson 5: Evaluating Linear Regression models
### Assignment 1: Test Statistics
- Test Statistics
- Can evaluate:
- Whether the model, as a whole, explains more variance in the outcome than a model with no features
- Whether each individual feature in the model adds explanatory power
- Whole model: F-test
- Compares the unexplained variance of the full model against that of a reduced (intercept-only) model used for comparison
- Unexplained model variance: SSE_F = sum over the n data points of (y_actual - y_predicted)^2
- Unexplained reduced model variance: SSE_R = sum over the n data points of (y_actual - y_mean)^2
- Number of parameters in the model = p_F (2 in the case of simple linear regression)
- Number of parameters in the reduced model = p_R (1 in this case)
- Number of data points: n
- Degrees of freedom of Unexplained model variance (SSE_F): df_F = n-p_F
- Degrees of freedom of Unexplained reduced model variance (SSE_R): df_R = n-p_R
- F = ((SSE_R - SSE_F)/(df_R - df_F)) * (df_F/SSE_F)
- A parameter is any predictor in the model (intercept and features)
- Degrees of freedom: amount of information untapped to estimate variability after all parameters have been estimated
- Low degrees of freedom are a flag for overfitting or small sample
- Degrees of freedom define the F-distribution for the F-test
- The position of the F value within that F-distribution is used to determine the p-value (the probability of getting an F value this large or larger if there is no relationship between the outcome and the parameters in the population). A significant p-value suggests that the model as a whole can explain some of the variance in the outcome
- Put another way, the f-test tests whether the r^2 of the model is different from zero
◦ Individual parameters: t-test: once there is a significant f-test, the next step is evaluating the performance of the individual parameters using a t-test that determines whether that parameter estimate is significantly different from zero
‣ Being statistically different from zero implies that the variable explains a significant amount of unique variance in the outcome after controlling for the variance explained by the other parameters
‣ Suppose there are three circles (red, blue, yellow) overlapping where yellow represents the outcome...
• The statistical test of significance of the blue circle on the outcome will be a matter of only the green area, and of the red circle, only the orange area, but neither one will include the area where all three overlap
‣ A model with high collinearity may very well yield an f-test that is significant without any significant features at all
‣ sklearn's LinearRegression can be used for multivariable linear regression but it's difficult to extract p-values for individual parameters, so we can use statsmodels as an alternative (different packages surface different pieces of information about a model; a combined sketch follows at the end of this assignment)
• Write out model formula in the following format:
◦ linear_formula = 'dependent_variable ~ indep_var1 + indep_var2 + ... + indep_varN'
• Fit the model with:
◦ lm = smf.ols(formula=linear_formula, data=data).fit()
• model parameters:
◦ lm.params
• p-values as a significance test for each of the coefficients (can usually drop any parameters with pvalues >.05):
◦ lm.pvalues
• r^2
◦ lm.rsquared
- Confidence Intervals
- the range of values within which our population parameter is likely to fall
- a 95% confidence interval means that if we were to resample the population and recompute the interval many times, the interval would contain the population parameter 95% of the time
- so this allows us to make a statement not only about what to expect, but how confident we are that it will happen
- the wider the confidence interval, the less certain the estimate
- from statsmodels.sandbox.regression.predstd import wls_prediction_std; prstd, iv_l, iv_u = wls_prediction_std(lm) gives the prediction standard deviation and the lower/upper interval bounds
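Putting the statsmodels pieces above together in one place, assuming a DataFrame named data with columns outcome, x1 and x2 (hypothetical names):

```python
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

linear_formula = 'outcome ~ x1 + x2'
lm = smf.ols(formula=linear_formula, data=data).fit()

print(lm.params)                 # coefficient estimates
print(lm.pvalues)                # t-test p-values for each parameter
print(lm.rsquared)               # share of variance explained
print(lm.fvalue, lm.f_pvalue)    # whole-model F-test
print(lm.conf_int())             # 95% confidence intervals for the coefficients

prstd, iv_l, iv_u = wls_prediction_std(lm)   # prediction-interval bounds per observation
```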
### Project 2: Challenge: Validating a linear regression
### Project 3: Dimensionality Reduction in Linear Regression
- Dimensionality reduction in Linear Regression
- Plight of too many features:
- Having lots of features makes a model take longer to train and longer to run and more prone to overfitting
- Variance in the features that is unrelated to the outcome may create noise in predictions (bonus problem if that irrelevant variance is shared among features, i.e. multicollinearity)... so adding more features tends to add more unrelated variance and thus noise
- Having more variables than data points leads to negative degrees of freedom (a no fly)
- For the sake of prediction (not helpful if you need to interpret the roles of individual parameters) use dimension reduction to make the feature space manageable.
- Take a matrix of features X and its reduced feature form R(X): the objective is to come up with a transformation R such that the expected value of Y (predicted value) given X is the same as the expected value of Y given R(X)
- Similar to PCA, but in this case we aren’t trying to preserve all the variance in X, but rather the variance in X that is shared with Y
- Partial least squares regression
- Basic idea: find the vector in X with the highest covariance with y, then choose a second vector orthogonal to the first that explains the highest covariance with y unexplained by first vector and so on, up to n vectors (the number of features in X)
- pls1= PLSRegression(n_components=desired_num_features)
- pls1.fit(X,y)
- Y_PLS_pred = pls1.predict(X)
- pls1.score(X,y)
- Doesn't work well if features are uncorrelated
- The trick is to pick the right number of components to collapse to (see the sketch below)
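A runnable version of the calls above, assuming X and y already exist; it simply tries several values of n_components and compares scores:

```python
from sklearn.cross_decomposition import PLSRegression

for n in (1, 2, 3):                      # candidate numbers of components
    pls1 = PLSRegression(n_components=n)
    pls1.fit(X, y)
    Y_PLS_pred = pls1.predict(X)
    print(n, pls1.score(X, y))           # R^2 using the reduced feature space
```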
### Assignment 4: The Gradient Descent Algorithm
- Regression finds the best model by minimizing the squared distance between each data point and the fitted line (squaring removes issues around negative distances and penalizes larger errors more heavily)
- When the model is simple, the cost function can be minimized exactly with a system of equations. Other models are too complex to be minimized analytically like this, so we have algorithms like gradient descent, which minimizes the cost function using derivatives.
- Put another way, given y = alpha + beta*x, we have a surface over (alpha, beta) whose height is the error (y - y_pred), and we are trying to find the minimum point on that surface. The same principle applies in higher dimensional spaces, but we can't visualize them as well.
- Initialize weights
- calculate the gradient at that point (the partial derivatives of the cost function with respect to each weight)
- move a set distance in that direction
- repeat until you can't go down any farther (or where all possible alternatives yield higher error than the current)
- Can be calculated with set distance or adaptive distance
- Run a loop that repeats these steps as long as the weights haven't converged
- Decision-points in Gradient Descent
- how to initialize the weights
- how far to move at each step (learning rate)
- what constitutes "convergence"? Set a threshold of minimal acceptable change, and probably a maximum number of iterations
- Things get messy when we land in a local minimum rather than the global minimum, so sometimes we need to try multiple starting places to rule this out (a bare-bones sketch follows this list)
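A bare-bones sketch of those decision points for y = alpha + beta*x, with an arbitrary fixed learning rate, convergence threshold, and iteration cap (all constants are illustrative choices, not prescriptions):

```python
import numpy as np

# toy data: true relationship is roughly y = 2 + 3*x
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

alpha, beta = 0.0, 0.0        # initialize the weights
learning_rate = 0.001         # how far to move at each step
threshold = 1e-6              # what counts as convergence
max_iters = 100000            # safety cap on iterations

for _ in range(max_iters):
    error = y - (alpha + beta * x)
    # partial derivatives of the mean squared error cost
    grad_alpha = -2 * error.mean()
    grad_beta = -2 * (error * x).mean()
    step_alpha = learning_rate * grad_alpha
    step_beta = learning_rate * grad_beta
    alpha -= step_alpha
    beta -= step_beta
    if max(abs(step_alpha), abs(step_beta)) < threshold:
        break                 # weights have converged

print(alpha, beta)            # should land near 2 and 3
```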
# Unit 3
## Lesson 1:
### Project 1: KNN classifier
• K Nearest Neighbors Classifiers: simplest example of calculating which data points are most similar to an observation
◦ Nearest Neighbor: find the closest known observation in our training data by some distance metric (typically we use Euclidean distance) and predict that our value has the same label
‣ Greedy because we have to calculate the distance from our point to every point in the set to find the data point with the smallest distance.
‣ Not a trainable model because it’s just based on the direct relationship between unlabeled data and labeled data
◦ K-Nearest Neighbors: look at the k nearest neighbors and label by majority rule (this also yields an estimate of the probability that the label is correct: votes_in_favor/k)
‣ Calculate distances to all points, choose k smallest, label data
‣ Mesh: surface for describing the zones of different labels
• Define the limits of the surface by solving for the min_x, max_x, min_y, max_y
• Initialize the mesh with a grid size between bounds
• Initialize and train the model where x1 and x2 are x and y coordinates and y is the label
• Run whole grid of points through the model
• Create and plot color mesh
‣ one would expect any new point falling in a given zone to be labeled accordingly (see the sketch below)
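A sketch of the mesh procedure on two made-up features (the toy data, grid step, and k are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

# toy data: two features, two classes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# define the limits of the surface and build the grid
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# run the whole grid of points through the model and plot the color mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, Z)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.show()
```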
### Project 2: Tuning KNN- normalizing distance, picking k
- Tuning KNN:
- Distance and Normalizing
- Implicit in the Euclidean distance metric is the assumption that a unit of distance means the same thing in every dimension (for example, a subjective "loudness" rating vs. measured "decibels"), and this is not always the case
- Also, if one dimension is on a wildly different scale than another, the model will effectively ignore the smaller-scale variable
- Normalization
- Rescale: rescale everything to be between 0 and 1 (only works if the distances are linear to begin with and the data is known to be bounded so that rescaling to be between 0 and 1 does not impose an artificial boundary)
- Recalculate so that distance is in terms of standard deviations (z-scores) from the mean
- Not good practice to mix them in most cases
- Weighting
- In the vanilla version of KNN, all votes are counted as equal, which is reasonable when the data is densely populated
- However, when the k nearest neighbors are vastly different distances from the point in question, it may be questionable for all votes to count the same—perhaps weight by distance so that points are influenced according to their inverse distance to point in question
- KNeighborsClassifier(n_neighbors=3, weights='distance')
- Choosing K
- the larger the k, the more smoothed the decision space will be, with more observations getting a vote on the prediction
- smaller k will pick up subtle deviations (which could just be noise, i.e. overfitting)
- best bet is to try multiple models and use validation techniques to choose (see the sketch below)
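A sketch of that "try multiple models" advice, assuming X and y already exist; it compares unweighted and distance-weighted KNN across several values of k:

```python
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

for k in (1, 3, 5, 10, 20):
    for weights in ('uniform', 'distance'):
        knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weights)
        scores = cross_val_score(knn, X, y, cv=5)      # 5-fold accuracy per configuration
        print(k, weights, scores.mean())
```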
#### Drill:
Drill: Let's say we work at a credit card company and we're trying to figure out if people are going to pay their bills on time. We have everyone's purchases, split into four main categories: groceries, dining out, utilities, and entertainment. What are some ways you might use KNN to create this model? What aspects of KNN would be useful? Write up your thoughts and submit a link below.
Assuming there is a column of labelled data regarding whether or not people did pay their bills on time, all four variables could be entered and knn could be applied. It would be useful as different combinations of features may play different roles in the distance between two points because different people have different spending habits, but there may still be a larger picture model for predicting ability to pay bills. It might be worth looking at the correlation between the variables to see if, for example dining out and entertainment might collapse into one variable. Given we are working in 4-space, it would be impossible to visualize with a mesh plot, but we could run a battery of options to test different values of k. Without seeing the data, it would be difficult to assess the need for weights or normalization, but adding these parameters as options for optimizing the model would be worth considering. Perhaps we could assess by making a plot with the k distances of the k nearest points per point. If there is a pattern of outliers, it would be worth considering a weighting scheme.
### Project 3: KNN regression
- KNN Regression
- Everything’s the same: Rather than taking the popular label, take the average of a given variable over the k closest points @what happens if there are more than k points with the same x value, how does the model choose which k are the closest?
- KNN= neighbors.KNeighborsRegressor(n_neighbors=k)
- KNN.fit(X, Y)
- (Can run a line with regular intervals through model to see regression line)
- (Can use the same weighting trick (weights=‘distance’))
- Validating KNN:
- Can still use all the normal methods (e.g. cross_val_score); see the sketch below
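A compact version of those calls on toy data (the sine-shaped target, k, and weighting are arbitrary illustrations):

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, (100, 1))
Y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
knn.fit(X, Y)
print(cross_val_score(knn, X, Y, cv=5).mean())   # R^2, the regressor's default score
```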
### Project 4: Challenge: Model Comparison
## Lesson 2
### Assignment 1: Decision trees
- Decision Trees
- Learning from questions
- Nodes: questions (either the root node (first node), interior nodes (follow up questions), leaf nodes (endpoints))
- The questions in the nodes are called rules
- Divides the data into a certain number (typically two) subgroups
- All data has to be accounted for, it cannot simply disappear
- Links between nodes are paths or branches
- Entropy
- Shannon Entropy H is a weighted sum: H = -sum over all outcomes of p(outcome) * log2(p(outcome)); e.g. a fair coin has H = 1 bit
- As we limit the possible number of outcomes, entropy decreases
- Information Gain: the entropy of the original state minus the weighted average entropy of the potential outcomes of the following state (a worked sketch appears at the end of this section)
- writing code:
from sklearn import tree
import pydotplus
from IPython.display import Image

dec_tree = tree.DecisionTreeClassifier(criterion='entropy', max_features=1, max_depth=4)
dec_tree.fit(X_train, y_train)
dot_data = tree.export_graphviz(dec_tree, out_file=None, feature_names=X_train.columns, class_names=None, filled=True)  # class_names=None, or pass a list of class labels
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
- for the code above, the tree specs have only one feature being used per node and there are only four decision levels to arrive at a classification
- Benefits
- the model visualizes well
- they are interpretable
- they can handle mixed data (numeric and categorical)
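A small sketch of the entropy and information gain calculations behind those rules (the example split is made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the observed label probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['yes'] * 5 + ['no'] * 5     # H = 1 bit for a 50/50 split
left   = ['yes'] * 4 + ['no'] * 1     # one branch after a candidate rule
right  = ['yes'] * 1 + ['no'] * 4     # the other branch

# information gain = parent entropy minus the weighted entropy of the children
weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent) - weighted_children)
```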