
Using Dask, a Python framework, I handle 900 million rows of S&P E-mini futures trade tick data directly on a local machine. Through exploratory data analysis, continuous series creation, and bar sampling, inspired by Marcos Lopez de Prado's work, I demonstrate efficient alternatives to costly data processing methods.


saeed349/Advances-in-Financial-Machine-Learning

Implementations of a few key topics from Chapter 2 of the book Advances in Financial Machine Learning by Marcos Lopez de Prado.

Overview

  1. 1_rolling.ipynb: Initial analysis and creation of the adjusted futures price series.
  2. 2_bar.ipynb: Sampling observations and creating tick, volume, and dollar-traded bars.
  3. 3_analysis.ipynb: Some statistical tests on the bars created.

The subsections of the notebooks are structured as follows (a short illustrative sketch follows each outline):

1_rolling.ipynb:

  • 1.1 Basic EDA (Exploratory Data Analysis)
    • 1.1.1 Initial Analysis
    • 1.1.2 Checking the full dataset to see how it's structured before rolling
  • 1.2 Rolling Futures Contracts
    • 1.2.1 Pandas Implementation - sample data
    • 1.2.2 Dask Implementation - sample data
    • 1.2.3 Rolling over entire dataset
    • 1.2.4 Plotting to check rolling adjustment

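As a rough illustration of the roll adjustment carried out in 1_rolling.ipynb, the sketch below keeps each day's highest-volume contract and back-adjusts prices at every roll so the stitched series has no jumps. The column names (`symbol`, `price`, `volume`) and the volume-based roll rule are assumptions for illustration, not the notebook's exact implementation.

```python
import pandas as pd

def backadjust(ticks: pd.DataFrame) -> pd.Series:
    """Build a continuous, back-adjusted price series from raw contract ticks.

    Assumed columns (not the notebook's actual schema): 'symbol' (contract
    code), 'price', 'volume', with a DatetimeIndex.  For each day, the
    contract with the highest traded volume is treated as the active one.
    """
    daily_vol = (ticks.groupby([pd.Grouper(freq="D"), "symbol"])["volume"]
                      .sum().unstack("symbol").fillna(0))
    active = daily_vol.idxmax(axis=1)            # active contract per day

    # Keep only the ticks that belong to that day's active contract.
    ticks = ticks.copy()
    ticks["active"] = active.reindex(ticks.index, method="ffill")
    cont = ticks[ticks["symbol"] == ticks["active"]]

    # Price gap between the old and the new contract at each roll,
    # accumulated backwards so earlier prices are shifted to remove jumps.
    roll = cont["symbol"] != cont["symbol"].shift(-1)
    gaps = (cont["price"].shift(-1) - cont["price"]).where(roll, 0.0).fillna(0.0)
    adjustment = gaps.to_numpy()[::-1].cumsum()[::-1]
    return cont["price"] + adjustment
```
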
2_bar.ipynb:

  • 2.1 Tick Bar
  • 2.2 Volume Bar
  • 2.3 Dollar Bar

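As an illustration of the sampling in 2_bar.ipynb, below is a minimal pandas sketch of dollar bars; tick and volume bars follow the same pattern with a different accumulator (tick count or contracts traded). The column names and the threshold are assumptions, not the values used in the notebook.

```python
import pandas as pd

def dollar_bars(ticks: pd.DataFrame, threshold: float = 1e7) -> pd.DataFrame:
    """Sample OHLCV bars that each contain roughly `threshold` dollars traded.

    Assumed columns: 'price' and 'volume'; the threshold is illustrative.
    """
    # Vectorised approximation: a new bar starts each time the cumulative
    # dollar value traded crosses another multiple of the threshold.
    dollars = (ticks["price"] * ticks["volume"]).cumsum()
    bar_id = (dollars // threshold).astype(int)

    grouped = ticks.groupby(bar_id)
    return pd.DataFrame({
        "open":   grouped["price"].first(),
        "high":   grouped["price"].max(),
        "low":    grouped["price"].min(),
        "close":  grouped["price"].last(),
        "volume": grouped["volume"].sum(),
    })
```
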
3_analysis.ipynb:

  • 3.1 Weekly Count
  • 3.2 Serial Correlation
  • 3.3 Monthly subset
  • 3.4 Check for normality

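The checks in 3_analysis.ipynb can be run along these lines. The sketch assumes a close-price series for one bar type and uses standard scipy/statsmodels tests, which may differ in detail from the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def bar_return_tests(close: pd.Series) -> None:
    """Serial-correlation and normality checks on one bar series' closes."""
    returns = np.log(close).diff().dropna()

    # Serial correlation: lag-1 autocorrelation plus a Ljung-Box test.
    print("lag-1 autocorrelation:", returns.autocorr(lag=1))
    print(acorr_ljungbox(returns, lags=[10]))

    # Normality: Jarque-Bera test on the log returns.
    jb = stats.jarque_bera(returns)
    print(f"Jarque-Bera statistic={jb.statistic:.1f}, p-value={jb.pvalue:.3g}")
```
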
Environment and Tools

Since the tick data has around 900 million rows, doing all of the computation in memory with Pandas is impractical on a single machine. For the first two parts of the project (rolling and bar sampling) we therefore use Dask, a Python library for parallel and distributed computing. Dask lets us write the rolling and sampling logic in ordinary Pandas and then scale it to the full dataset without changing the code.
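A minimal sketch of that workflow, assuming CSV input and placeholder file and column names (the project's actual data format may differ):

```python
import dask.dataframe as dd

# Point Dask at the full tick dataset lazily; nothing is read into memory yet.
ticks = dd.read_csv("es_ticks_*.csv", parse_dates=["datetime"])

# Pandas-style expressions only build a task graph;
# .compute() triggers the parallel, out-of-core execution.
ticks["dollar"] = ticks["price"] * ticks["volume"]
daily_dollar = ticks.groupby(ticks["datetime"].dt.date)["dollar"].sum()
print(daily_dollar.compute().head())
```
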

For this project all the data processing (rolling, bar sampling, and the Dask tasks) was done on my local machine (Intel i7, 4 cores, 16 GB RAM). The code was run on Python 3.8 with the following packages:

  • dask==2022.7.0
  • pandas==1.5.3
  • numpy==1.23.5
  • h5py==3.7.0
  • scikit-learn==1.2.1
  • scipy==1.10.0
  • statsmodels==0.13.5
  • matplotlib==3.7.0
  • plotly==5.9.0
  • seaborn==0.12.2
