This repo contains all the material related to WorldQuant University's Data Science Summer 2020 Session, Unit 2: Machine Learning and Statistical Analysis.
- Introduction to machine learning and Scikit-learn API
- Regression, classification, & model selection (miniproject: ml)
- Feature engineering
- NLP and dimension reduction (miniproject: nlp)
- KNeighbors, clustering and ensemble models
- Support vector machines
- Time series analysis and anomaly detection
- Clustering
- Introduction to Machine Learning
- 1.2.1 Intro to Scikit-learn
- 1.2.2 Predictors
- 1.2.3 Transformers and Pipeline
- 1.2.4 Feature Unions
- 1.2.5 Custom Transformers
- 1.2.6 Custom Predictors
- 1.2.7 Exercise Distance Transformer
- 1.2.8 Exercise Majority Classifier
- 1.3.1 Persisting Your Model
- 1.3.2 Common Mistakes
- 2.1.1 Regression Metrics
- 2.2.1 Linear Regression Intro
- 2.2.2 Gradient Descent and Huber Loss
- 2.2.3 Multivariate Regression
- 2.2.4 Feature Importance
- 2.3.1 Classification Metrics
- 2.3.2 Probabilistic Models and Metrics
- 2.3.3 Logistic Regression
- 2.3.4 Multiclassification
- 2.4.1 Model Selection
- 2.4.2 Intro to Decision Trees
- 2.4.3 Underfitting and Overfitting
- 2.4.4 GridSearchCV
- 2.4.5 Comparing Two Models
- 2.5.1 Imputation
- 2.5.2 Categorical Data
- 2.6.1 GridSearchCV and Pipelines 1
- 2.6.2 GridSearchCV and Pipelines 2
- 2.6.3 RandomizedSearchCV
- 3.1.1 Feature Engineering and Extraction
- 3.1.2 Feature Transformation
- 3.1.3 Curse of Dimensionality
- 3.1.4 Regularization
- 3.1.5 Multicollinearity and PCA
- 3.1.6 Ensemble Models
- 3.2.1 Bias and Variance
- 3.2.2 Learning Curves
- 3.3.1 Intro KNN
- 3.3.2 KNN Bias and Variance
- 3.3.3 KNN Time Complexity
- 3.3.4 KD Trees and Weights
- 3.4.1 Intro to NLP
- 3.4.2 spaCy
- 3.4.3 Obtaining a Corpus
- 3.4.4 Bag of Words Model
- 3.4.5 Hashing Vectorizer
- 3.4.6 TF-IDF
- 3.4.7 Improving Signal
- 3.4.8 N-grams and Similarity
- 3.4.9 Word Usage Classifier
- 3.4.10 Exercise I
- 3.4.11 Exercise II
- 3.4.12 Exercise III and IV
- 4.1.1 Intro to Decision Trees
- 4.1.2 Tree Error Metrics
- 4.1.3 Trees for Regression
- 4.1.4 Training Trees and Hyperparameters
- 4.1.5 Geometric Interpretation and Time Complexity
- 4.1.6 Time Complexity Continued
- 4.1.7 Random Forests
- 4.1.8 Extreme Random Forests
- 4.1.9 Gradient Boosting Trees I
- 4.1.10 Gradient Boosting Trees II
- 4.1.11 Feature Importance
- 4.1.12 Exercises
- 5.1.1 Intro to SVM
- 5.1.2 Largest Margin Classifier
- 5.1.3 Soft Margin Classifier
- 5.1.4 SVM Kernels
- 5.1.5 SVM vs Logistic Regression
- 5.1.6 SVM Regression
- 5.1.7 SVM Lagrangian Dual
- 5.1.8 Kernel Trick
- 5.1.9 SVM Time Complexity and Multiclass
- 5.1.10 SVM Tuning Kernels Exercise Part I
- 5.1.11 SVM Tuning Kernels Exercise Part II
- 5.1.12 SVM Kernel Approximations
- 5.1.13 SVM Online Learning
- 5.1.14 SVM Online Learning Pipeline
- 5.2.1 Intro to Clustering
- 5.2.2 Metrics for Clustering
- 5.2.3 KMeans Clustering
- 5.2.4 Elbow Plots
- 5.2.5 Gaussian Mixture Models
- 5.2.6 Choosing Cluster Based on Silhouette
- 5.2.7 GMM Choosing Number of Components
- 6.1.1 Intro to Time Series
- 6.1.2 Cross-validation in Time Series
- 6.1.3 Stationary Signal
- 6.1.4 Modeling Drift
- 6.1.5 Fourier Transforms Part I
- 6.1.6 Fourier Transforms Part II
- 6.1.7 Fourier Components in our Model
- 6.1.8 Modeling Noise
- 6.1.9 Moving Statistics
- 6.1.10 Full Model
- 6.1.11 ARMA and ARIMA
- 6.1.12 AR Example
- 6.2.1 Intro to Dimension Reduction
- 6.2.2 Math of Projections
- 6.2.3 PCA
- 6.2.4 PCA in Scikit-learn
- 6.2.5 PCA Implementation Details
- 6.2.6 Choosing the Number of Components
- 6.2.7 Truncated SVD
- 6.2.8 NMF
- 6.2.9 Using PCA with Supervised ML
- 6.2.10 PCA for Visualization
- 6.2.11 NMF Exercise Part I
- 6.2.12 NMF Exercise Part II
- 6.2.13 Variants of PCA
- 7.1.1 Intro to Anomaly Detection
- 7.1.2 One-class SVM
- 7.1.3 Isolation Forest
- 7.1.4 Comparison Between One-class SVM and Isolation Forest
- 7.1.5 Intro to Case Study
- 7.1.6 Initial Baseline Model Part I
- 7.1.7 Initial Baseline Model Part II
- 7.1.8 Full Baseline Model
- 7.1.9 Z-score Detection
- 7.1.10 Rolling Z-score Detection
- 7.1.11 Using External Features Initial Model
- 7.1.12 Using External Features Tuning the Model
- 7.1.13 Packaging the Time Series Anomaly Detector
- 8.1.1 Model Considerations
- 8.1.2 Model Development
- 8.1.3 Flask App Local Development
- 8.1.4 GET Requests
- 8.1.5 Making GET Requests with Model
- 8.1.6 Using our Model with Twitter Web API
- 8.1.7 POST Requests and Flask Templates
- 8.1.8 Preparing for Deployment to the Web
- 8.1.9 Deploying our App to the Web with Heroku
- 8.2.1 Rethinking Model Tuning
- 8.2.2 Intro to Bayes Theorem
- 8.2.3 Bayesian Inference
- 8.2.4 Bayesian Optimization
- 8.3.1 End of Course Material