Introduction

This folder contains the scripts for reproducing the results in my master theis "An empirical comparison of methods for the prediction of survival from many longitudinal covariates".
The aim of the thesis is to empirically compare statistical methods for predicting the survival outcome using a large number of longitudinal covariates.
Four methods were compared:

pCox-baseline - penalized Cox model using baseline measurements only
pCox-landmarking - penalized Cox model using last observations carried forward
PRC-LMM - penalized regression calibration (Signorelli et al., 2019, https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9178)
MFPCCox - multivariate functional principal component analysis and Cox regression (Li et al., 2019, https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8334)

Acknowledgement

The scripts for MFPCCox was based on the implementation of MFPCCox https://github.com/kan-li/MFPCCox and adapted for application to ADNI data.

Data availability

Data can be obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu/).

Files

There are various types of scripts in this repo, indicated by their prefixes:

function_: wrapper functions for model training, evaluation; utility functions for data preparation, processing result.
prepare_data: preparation of raw data into cleaned data used for model fitting
run_: running RCV to estimate the predictive performance of each method; result comparisons
IDA_: exploratory data analysis, to produce some figures and tables presented in the thesis

How to run

Step 1 - Set up

Install the R package ADNIMERGE (see Data availability above).
Check the folder structure:
Root
|_ data_cleaned
|_ figure
|_ model
|_ output

Step 2 - Prepare data

Run prepare_data.R to prepare two cleaned dataframes used for model fitting and testing:
- a surv data df.surv_preds: containg survival time, status, and baseline covariates (time-independent)
- a long data df.long_censored_transformed (df.long_censored if transformation is not needed)
The following steps are executed in the script:
- Data screening
- Remove data after survival time
- Reformat adnimerge dataframe to surv and long format respectively
- Data transformation
The resulting files are saved to /data_cleaned/adni_cleaned.RData for reuse.

Step 3 - Train and evaluate models

The following scripts will train and evaluate each method using repeated 10-fold CV, with stratified sampling.
Run the corresponding script for each method:
- pCox-baseline: run_repCV_glmnet.R, set method <- "pCox-bl" at the start of script
- pCox-landmarking: run_repCV_glmnet.R, set method <- "pCox-lm" at the start of script
- PRC-LMM: run_repCV_pencal.R
- MFPCCox: run_repCV_MFPCCox.R
Trained models are contained within a list folds of length 10 (for 10-fold CV).
Separately, evaluation results for corresponding model are contained within a list folds.eval.
For each CV in RCV, folds.eval will be saved to the a .RData in subfolder ./output/
- file naming is identifiable as "eval_ + model.name + _seed.RData", which model.name is an automatically generated string combining {method, hyperparameters, landmark time, data setting}
- e.g. eval_pCox-bl-min-ridge_lm2_scenario2_b5_transformed_scaled_seed721.RData
Trained model is not saved by default. Uncomment the code block # Save folds for training and future checking to save the trained models in similar fashion.
Check settings before running:
- n_RCV: number of repetitions for RCV
- T_LMs: landmark times
- Code block in # Set model hyperparam to change method-specific model hyperparameters
  - is_transformed: "transformed" (Transform covariates to reduce skewness)
  - is_scaled: "scaled" (scale covariates before model fitting)
- Code block in # Set data param to change the data setting i.e. candidate covariates. Options:
  - set_scenario accepts the following values:
    - "scenario2" (default) uses all logitudinal covariates that satistied missing_proportion_limit (default set at 0.1)
    - "scenario1": missing_proportion_limit set to 0. Every subject will have at least one observations in each longitudinal covariate selected
    - "scenario0": do not use any longitudinal covariates

Step 4 - Compare results

Open run_comparison_thesis.Rmd.
Evaluation files i.e. the folds.eavl saved in subfolder ./output/ will be loaded, aggregated, and compared.
The results are used to reproduce figures in the thesis.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
presentation		presentation
.gitattributes		.gitattributes
.gitignore		.gitignore
IDA_SurvLongData.Rmd		IDA_SurvLongData.Rmd
IDA_adnimerge.Rmd		IDA_adnimerge.Rmd
README.md		README.md
Thesis_dementia.Rproj		Thesis_dementia.Rproj
function_MFPCCox.R		function_MFPCCox.R
function_MFPCCox_exp.R		function_MFPCCox_exp.R
function_evaluation.R		function_evaluation.R
function_glmnet_exp.R		function_glmnet_exp.R
function_pencal_exp.R		function_pencal_exp.R
function_utility.R		function_utility.R
prepare_data.R		prepare_data.R
run_JM.R		run_JM.R
run_comparison_thesis.Rmd		run_comparison_thesis.Rmd
run_repCV_MFPCCox.R		run_repCV_MFPCCox.R
run_repCV_glmnet.R		run_repCV_glmnet.R
run_repCV_pencal.R		run_repCV_pencal.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Acknowledgement

Data availability

Files

How to run

Step 1 - Set up

Step 2 - Prepare data

Step 3 - Train and evaluate models

Step 4 - Compare results

About

Releases

Packages

Languages

freddy-feng/thesis_CompareSurvivalModels

Folders and files

Latest commit

History

Repository files navigation

Introduction

Acknowledgement

Data availability

Files

How to run

Step 1 - Set up

Step 2 - Prepare data

Step 3 - Train and evaluate models

Step 4 - Compare results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages