Imperial x Optiver | Silver Medal

Optiver - Trading at the Close


Background

Each trading day on the NASDAQ Stock Exchange ends with the NASDAQ Closing Auction.

During the last ten minutes of the trading session, market makers such as Optiver combine traditional order book data with auction book data. The ability to integrate information from both sources is crucial for providing the optimal price to all market participants.

Dataset

The dataset contains historical data for the daily ten-minute closing auction on the NASDAQ stock exchange. The goal is to predict the future price movements of stocks relative to the future price movement of a synthetic index composed of NASDAQ-listed stocks.

The features (columns) include

stock_id, date_id, imbalance_size, imbalance_buy_sell_flag, reference_price, matched_size, far_price, near_price, [bid/ask]_price, [bid/ask]_size, wap, seconds_in_bucket

Feature Engineering

Feature is all you need. The primary focus in this competition has been on feature engineering, which is considered the key to alpha discovery.

A baseline gradient-boosting model (LightGBM) with feature engineering can achieve a competitive public leaderboard score without much effort spent tuning model parameters. In comparison, a baseline LightGBM without feature engineering scores even lower than constant zero-value predictions.

Our feature engineering process draws on many ideas from the code and discussion boards.

---Basic features---

We first compute some common financial statistics and indicators (e.g. bid-ask spread, trading volume) to reflect liquidity, volatility, pressure, urgency etc.

The market urgency defined below is the strongest feature found in the public kernel.

df["price_spread"] = df["ask_price"] - df["bid_price"]
df["liquidity_imbalance"] = df.eval("(bid_size-ask_size)/(bid_size+ask_size)")
df["market_urgency"] = df["price_spread"] * df["liquidity_imbalance"]

df["market_urgency_v2"] = (df["ask_price"]+df["bid_price"])/2 - (df["bid_price"]*df["bid_size"]+df["ask_price"]*df["ask_size"]) / (df["bid_size"]+df["ask_size"])

See Insight on market_urgency

---Imbalance features---

This idea comes from various notebooks on the code and discussion boards.

From empirical experience, the imbalance features can bring significant improvement in model prediction ability.

  • Doublet imbalance ratios:

    Take two feature columns $x$ and $y$ from the prices and sizes feature group, compute $(x - y) / (x + y)$.

  • Triplet imbalances:

    Take three feature columns from the prices and sizes feature group, compute $(\max - \mathrm{mid}) / (\mathrm{mid} - \min)$ (here min, mid, max are computed row-wise).

The triplet imbalance ratio feature computation is parallelised using numba.
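
The sketch below illustrates both imbalance groups on the competition's price and size columns; the helper names and the numba parallelisation details are illustrative rather than the exact implementation.

from itertools import combinations

import numba
import numpy as np
import pandas as pd

prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
sizes = ["matched_size", "bid_size", "ask_size", "imbalance_size"]

def add_doublet_imbalances(df: pd.DataFrame, cols) -> pd.DataFrame:
    # (x - y) / (x + y) for every pair of columns in the group
    for x, y in combinations(cols, 2):
        df[f"{x}_{y}_imb"] = (df[x] - df[y]) / (df[x] + df[y])
    return df

@numba.njit(parallel=True)
def triplet_imbalance(values):
    # row-wise (max - mid) / (mid - min) over three columns
    out = np.empty(values.shape[0])
    for i in numba.prange(values.shape[0]):
        row = np.sort(values[i])              # [min, mid, max]
        denom = row[1] - row[0]
        out[i] = np.nan if denom == 0 else (row[2] - row[1]) / denom
    return out

def add_triplet_imbalances(df: pd.DataFrame, cols) -> pd.DataFrame:
    for x, y, z in combinations(cols, 3):
        df[f"{x}_{y}_{z}_imb2"] = triplet_imbalance(df[[x, y, z]].to_numpy(np.float64))
    return df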

---Imbalance ratios interpretation---

Personally, I find some of the imbalance ratios difficult to interpret and of little financial meaning. Some features simply try out possible combinations in the hope of improving prediction ability. Based on public scores, the new features do, to some extent, decorrelate the complex information carried in the original data.

---Lagged features---

Use diff, shift, and pct_change in Pandas to compute lagged features for various columns (prices, sizes), grouped by stock_id, with various window periods (1, 2, 3, 10). Lagged features capture moving trends and periodic characteristics of the time series.
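
A minimal sketch of these lagged features, assuming the frame is sorted by stock_id, date_id and seconds_in_bucket (the column and window choices are illustrative):

def add_lagged_features(df, cols=("wap", "bid_price", "ask_price", "bid_size", "ask_size"),
                        windows=(1, 2, 3, 10)):
    grouped = df.groupby("stock_id")
    for col in cols:
        for w in windows:
            df[f"{col}_shift_{w}"] = grouped[col].shift(w)       # value w steps ago
            df[f"{col}_diff_{w}"] = grouped[col].diff(w)         # absolute change over w steps
            df[f"{col}_ret_{w}"] = grouped[col].pct_change(w)    # relative change over w steps
    return df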

---Statistical aggregations---

Compute various statistics (mean, standard deviation, skew, kurt, max) for the prices and sizes feature group.
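
For example (a sketch, reusing the prices and sizes lists from the imbalance-feature snippet above):

def add_group_stats(df):
    for name, cols in [("price", prices), ("size", sizes)]:
        df[f"all_{name}s_mean"] = df[cols].mean(axis=1)
        df[f"all_{name}s_std"] = df[cols].std(axis=1)
        df[f"all_{name}s_skew"] = df[cols].skew(axis=1)
        df[f"all_{name}s_kurt"] = df[cols].kurt(axis=1)
        df[f"all_{name}s_max"] = df[cols].max(axis=1)
    return df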

---Temporal features---

Based on the provided date_id and seconds_in_bucket, we create features indicating the day of the week and the second/minute within the closing auction.
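
A sketch of these features; treating date_id modulo 5 as the trading day of the week is an assumption borrowed from public kernels:

df["dow"] = df["date_id"] % 5                    # trading day of the week (assumed 5-day cycle)
df["minute"] = df["seconds_in_bucket"] // 60     # minute within the 10-minute auction
df["second"] = df["seconds_in_bucket"] % 60      # second within the current minute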

---Stock specific features---

Grouping by stock_id, we aggregate bid/ask prices and sizes over the full training history to create global stock-specific features such as median_size and std_price. These global features reflect longer-term market trends for each stock.
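
One possible construction (a sketch; exactly which columns feed each statistic is an assumption, not taken from the repository):

stock_stats = df.groupby("stock_id").agg(
    median_size=("bid_size", "median"),   # typical quoted size for the stock
    std_price=("bid_price", "std"),       # long-run price variability
).reset_index()
df = df.merge(stock_stats, on="stock_id", how="left")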

---Synthetic index based features---

The competition goal is to predict the future price movements of stocks relative to the future price movement of a synthetic index composed of NASDAQ-listed stocks.

It turns out that the weights of the synthetic index can be recovered by applying linear regression to stock and index returns. See Weights of the Synthetic Index.

  • Based on the synthetic index weights, we create features such as stock_weights (mapping stock_id to its weight) and weighted_wap (re-weighted wap), as sketched after this list.

  • Grouping weighted_wap by time_id, further features such as the index wap can be created.
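
A sketch of both steps, assuming the per-stock return matrix and the implied index returns have already been assembled (the variable names are illustrative):

from sklearn.linear_model import LinearRegression

def recover_index_weights(stock_returns, index_returns):
    # stock_returns: (n_times, n_stocks) per-stock wap returns
    # index_returns: (n_times,) index returns implied by target = stock return - index return
    reg = LinearRegression(fit_intercept=False, positive=True)
    reg.fit(stock_returns, index_returns)
    return reg.coef_ / reg.coef_.sum()    # normalise so the weights sum to 1

weights = recover_index_weights(stock_returns, index_returns)
df["stock_weights"] = df["stock_id"].map(dict(enumerate(weights)))   # assumes stock_id = 0..n-1
df["weighted_wap"] = df["stock_weights"] * df["wap"]
df["index_wap"] = df.groupby(["date_id", "seconds_in_bucket"])["weighted_wap"].transform("sum")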

---Memory optimisation---

This is also community work aimed at improving performance and reducing memory-related issues. The raw columns are loaded with default, wide data types; storage can be optimised by downcasting each column (e.g. to int8, int16 or float32) to the most memory-efficient type that still represents its values.

See Memory Optimization Function with Data Type Conversion.
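
A compact version of the usual downcasting helper (a sketch, not the exact function linked above):

import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")   # int64 -> int8/16/32 where possible
        elif np.issubdtype(df[col].dtype, np.floating):
            df[col] = df[col].astype(np.float32)                   # float32 is precise enough here
    return df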

Model & Inference

We produced two solutions. A single model is usually less prone to overfitting than an ensemble; however, an ensemble may provide better generalisation ability.

  1. LightGBM Robust Single Model.
  2. LightGBM & MLP Ensemble Model.

In this competition, models are trained on historical data and tested on the latest real-time data, so generalisation ability is crucial.

LightGBM

LightGBM (light gradient-boosting machine) is a popular gradient boosting framework developed by Microsoft. LightGBM grows decision trees leaf-wise rather than level-wise as in traditional boosting algorithms: at each step it splits the leaf that yields the largest decrease in loss, which can produce deeper, more asymmetric trees. LightGBM also implements histogram-based decision tree learning, which gives significant advantages in efficiency and memory consumption. (XGBoost's exact method uses a sort-based algorithm, searching for the best split point over pre-sorted feature values.)

---LightGBM training and fine-tuning---

Traditional k-fold cross-validation can introduce look-ahead bias because of the autocorrelation inherent in financial time series. To avoid data leakage, we adopted a time-based split (splitting the data at a specific time point, e.g. split_day = 435) to separate the dataset into training and validation sets.
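
A minimal sketch of the time-based split and LightGBM training on the engineered frame df with a target column; the hyperparameter values shown are illustrative, not the tuned ones:

import lightgbm as lgb

split_day = 435
feature_cols = [c for c in df.columns if c not in ("target", "date_id", "time_id", "row_id")]

train = df[df["date_id"] <= split_day]
valid = df[df["date_id"] > split_day]

model = lgb.LGBMRegressor(
    objective="regression_l1",   # the competition metric is mean absolute error
    n_estimators=3000,
    learning_rate=0.01,
    max_depth=11,
    num_leaves=256,
)
model.fit(
    train[feature_cols], train["target"],
    eval_set=[(valid[feature_cols], valid["target"])],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
)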

Some notebooks on the code board implemented k-fold cross-validation, e.g. the 5-Fold CV notebook implemented purged k-fold CV. The strategy is to divide the dataset into five distinct folds based on date_id and introduce a purge period (a gap between the training and validation sets) to prevent information leaking from the validation set back into the training set. For each fold, the model is trained on data occurring before the purge period and validated on data following it, as sketched below.
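
A sketch of that split logic (the fold count and purge length are illustrative):

import numpy as np

def purged_kfold_splits(date_ids, n_folds=5, purge_days=5):
    days = np.sort(np.unique(date_ids))
    blocks = np.array_split(days, n_folds)
    for block in blocks[1:]:                          # the first block has no prior data to train on
        valid_idx = np.where(np.isin(date_ids, block))[0]
        train_idx = np.where(date_ids < block.min() - purge_days)[0]   # purge gap before validation
        yield train_idx, valid_idx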

Compared with depth-wise (level-wise) growth, LightGBM's leaf-wise algorithm can converge much faster; however, leaf-wise growth may overfit if it is not used with appropriate parameters. We use the Optuna hyperparameter optimisation framework to fine-tune our LightGBM model. See LightGBM Hyperparameter Optimisation with Optuna.
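
A sketch of the Optuna loop on the time-based split above; the search space is illustrative, not the exact one used:

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_absolute_error

def objective(trial):
    params = {
        "objective": "regression_l1",
        "n_estimators": 1000,
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.05, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 64, 512),
        "max_depth": trial.suggest_int("max_depth", 6, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 200),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }
    model = lgb.LGBMRegressor(**params)
    model.fit(train[feature_cols], train["target"],
              eval_set=[(valid[feature_cols], valid["target"])],
              callbacks=[lgb.early_stopping(50)])
    preds = model.predict(valid[feature_cols])
    return mean_absolute_error(valid["target"], preds)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)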

To reduce notebook running time, we used pre-trained model weights (pickled) in the submission notebook.

MLP (multilayer perceptron)

A multilayer perceptron (MLP) is a neural network consisting of fully connected layers of neurons. MLPs can distinguish data that is not linearly separable and serve as a foundation for many more sophisticated neural network architectures. In theory, an MLP can approximate any continuous function given enough neurons and layers.

---MLP architecture---

Input: continuous and categorical features.

  • Each categorical input is passed through an embedding layer.

  • The continuous input and the embedding layer output are concatenated to create a combined input for the MLP.

  • Multiple dense layers (hidden_units = [128,54]) with batch normalization, ReLU activation, and dropout are applied sequentially.

  • L2 regularization is applied to the kernel weights.

graph TD;
    Continuous --> Input[MLP input];
    Categorical -- Embedding, flatten --> Input;
    Input --> dense[Dense layers] -- Normalisation, ReLU, dropout --> Output;
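
A Keras sketch of this architecture; the embedding dimension, dropout rate and L2 strength are assumptions, while hidden_units = [128, 54] follows the description above:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_mlp(n_continuous, cat_cardinalities, hidden_units=(128, 54),
              emb_dim=8, dropout=0.3, l2=1e-4):
    cont_in = layers.Input(shape=(n_continuous,), name="continuous")
    cat_ins, cat_embs = [], []
    for i, card in enumerate(cat_cardinalities):
        inp = layers.Input(shape=(1,), name=f"cat_{i}")
        cat_ins.append(inp)
        cat_embs.append(layers.Flatten()(layers.Embedding(card, emb_dim)(inp)))
    x = layers.Concatenate()([cont_in, *cat_embs])          # combined MLP input
    for units in hidden_units:
        x = layers.Dense(units, kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(dropout)(x)
    out = layers.Dense(1)(x)                                 # regression output
    model = tf.keras.Model(inputs=[cont_in, *cat_ins], outputs=out)
    model.compile(optimizer="adam", loss="mae")
    return model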

Inference

Several tricks can be implemented in the inference process to improve the public score.

---zero_sum post-processing---

import numpy as np

def zero_sum(prices, volumes):
    # Shift every prediction by the same number of "standard error" units
    # (std_error = sqrt(volume)) so that the adjusted predictions sum to zero.
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    out = prices - std_error * step
    return out

In gambling and economics, the favourite-longshot bias is an observed phenomenon where, on average, bettors tend to overvalue longshots and relatively undervalue favourites. The favourite-longshot bias is not limited to gambling markets; it also exists in stock markets.

zero_sum adjusts all predicted stock prices by the same number of standard-error units to ensure that the predicted stock prices, relative to the index price, sum to zero. This post-processing attempts to account for the favourite-longshot bias by utilising the wider standard errors implied for predictions on stocks with low trade volume, and vice versa.

zero_sum is an implementation of goto_conversion: Novel Conversion of Betting Odds to Probabilities. See goto_conversion + Optiver|Baseline|Models.

---cache last 21 days---

At inference time, features are generated from a cache of the last 21 days of data. This enables accurate evaluation of the lagged and global features on each new batch of test data.
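
A sketch of how such a cache can be maintained inside the inference loop (generate_features is a hypothetical name standing in for the feature pipeline above):

import pandas as pd

cache = pd.DataFrame()

def featurise_with_cache(test_batch: pd.DataFrame) -> pd.DataFrame:
    global cache
    cache = pd.concat([cache, test_batch], ignore_index=True)
    cache = cache[cache["date_id"] > cache["date_id"].max() - 21]   # keep the last 21 days
    features = generate_features(cache)                             # hypothetical feature pipeline
    return features.iloc[-len(test_batch):]                         # rows for the current batch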
