gbmSPM
contains the code used to study temporal patterns in tumor volumes and neurological exams to predict residual survival in a glioblastoma cohort, as described in the paper:
Smedley, Nova F., Benjamin M. Ellingson, Timothy F. Cloughesy, and William Hsu. "Longitudinal Patterns in Clinical and Imaging Measurements Predict Residual Survival in Glioblastoma Patients." Scientific reports 8, no. 1 (2018): 14429.
Methods for data preprocessing, sequential pattern mining (SPM) with `arules` and `arulesSequences`, and logit modeling with `glmnet` and `caret` were converted into an R package named `gbmSPM`.
This `gbmSPM` repository has:
- `R`: R library
- `man`: library documentation
- `vignettes`: vignettes for using the library
- `data`: dummy data
- `paper`: code used in the published paper
- `examples`: R scripts similar to how the paper used the library
Additional details can be found in Supplemental Materials.
To install just the R package, run the following in RStudio:
install.packages("devtools")
library(devtools)
install_github("novasmedley/gbmSPM")
Now you can run the examples in 'RStudio examples and vignettes.'
To get the full repository, open a terminal, `cd` to the directory where you want to place it, and enter:
git clone https://github.com/novasmedley/gbmSpm.git
Now you can run the examples in 'command line examples.'
As usual, function documentation can be reached in RStudio by running `?function_name`, e.g., `?getAge`.
In RStudio:
library(gbmSpm)
# create output directory
outputDir <- '~/gbm_spm_example'
spmDir <- file.path(outputDir,'spm')
logDir <- file.path(outputDir,'spm_logs')
lapply(c(outputDir, spmDir, logDir), function(i) ifelse(!dir.exists(i), dir.create(i, showWarnings = F, recursive = T), F) )
# load example data
data("fake_data")
# set SPM hyperparameters
minSupp <- 0.4
maxgap <- 60
maxlen <- 2
maxsize <- 2
tType <- 'rate'
suppList <- seq(minSupp, 0.4, .05)
# prep example data
fake_tumorInfo <- fake_data$events
fake_demo <- fake_data$demo
fake_data$events <- cleanData(fake_data$events, tType)
cat('...',nrow(fake_data$events), " events left for SPM after cleaning", '\n')
fake_data <- merge(fake_data$events, fake_data$person, by='iois', all.x=T)
fake_data <- prepDemographics(fake_data, fake_demo)
fake_data <- prepSurvivalLabels(fake_data)
fake_data <- getTumorLocation(fake_data, fake_tumorInfo)
# perform SPM to create data in output directories:
# 1) transactions, 2) patterns, and 3) feature vectors
runSPM(event = fake_data,
suppList = suppList,
maxgap = maxgap,
maxlen = maxlen,
maxsize = maxsize,
tType = tType,
outputDir = spmDir)
# load data back in for inspection
features <- readRDS(file.path(spmDir, 'sup0.4g60l2z2','featureVectors_rateChange.rds'))
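For a quick look at what was loaded, the usual base R inspection functions work; this sketch assumes the saved feature vectors form a data frame with one row per clinical visit, so adjust if the object is structured differently:

```r
# Inspect the loaded feature vectors
dim(features)                 # rows (visits) x columns (features)
str(features, max.level = 1)  # column types at a glance
head(features)                # first few rows
```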
The transactions file is used by `arules` to generate frequent sequences; see `?createTransactions`. The patterns file is a list of frequent sequences and the days on which each was observed for each patient; see `?findPatternDays`.
Feature vectors represent clinical visits (delineated by each pair of patient ID and event ID), indicate whether each frequent sequence was observed, and contain other clinical information for that event (e.g., patient age at the event). These are the inputs to the logit modeling.
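As a hedged illustration of that last step (not the package's actual modeling code, which goes through its own `caret`/`glmnet` wrappers; see the vignettes and `example_logit_spm.R`), a penalized logistic regression could be fit on such feature vectors with `glmnet`. The outcome column name `survivalLabel` below is hypothetical:

```r
library(glmnet)

# Hypothetical outcome column; replace with the actual survival label column
# present in your feature vectors.
y <- features$survivalLabel

# Build a numeric design matrix from the remaining columns
# (model.matrix expands any factor covariates such as ethnicity).
x <- model.matrix(~ . - survivalLabel, data = features)[, -1]

# L1-penalized logistic regression with cross-validated lambda
cvFit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvFit, s = "lambda.min")
```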
For step-by-step examples, see the vignettes: "Generate sequential patterns," "Predicting residual survival," and "Model selection and plotting performance."
Since the pipeline was repeated while searching for hyperparameters, two components were converted for use via the command line:
- Generating temporal features: to use the dummy data, run `example_spm.R` from the `examples` folder:

  $ Rscript /PATHTO/example_spm.R --tType 'rate' --minSupp 0.4 --minSuppList 'no' --maxgap 60 --maxlength 2 --maxsize 2 --outDir ~/gbm_spm_example
- Predicting residual survival: to use the dummy data, run `example_logit_spm.R` in the `examples` folder:

  $ Rscript example_logit_spm.R --tType 'rate' --maxgap 60 --maxlength 2 --lmax 0.2 --llength 100 --dataFolder sup0.25g60l2z2 --prefix logits --dir ~/gbm_spm_example_multi --saveLogit yes
Alternatively, the example scripts can be explored in RStudio. For more details, see the in-depth examples in the vignettes.
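From an R session, the installed vignettes can be listed and opened with base R tools, assuming they were built during installation (e.g., `install_github("novasmedley/gbmSPM", build_vignettes = TRUE)`):

```r
# Browse and list the vignettes shipped with the package
browseVignettes("gbmSpm")
vignette(package = "gbmSpm")   # list available vignette names
```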
The `paper` folder contains the R code used in the published paper. Specifically, it has:
- `spm`: creating temporal patterns
- `logit`: logistic regression modeling
- `vol_thresholding`: predicting residual survival using only tumor volume information, i.e., by thresholding volume
- `results_analysis`: extracting cross-validation results and selecting the best approach
- `stats_test`: testing for differences in ROC curves and univariate analysis of patterns
- `finalModel_retrain.R`: retraining the selected approach and applying it to the test partition
- misc. functions, e.g., creating paper figures
For example, to generate temporal features, `run_spm.R` in the `paper` folder was called from the terminal:
$ Rscript run_spm.R --tType 'rate' --minSupp 0.4 --minSuppList 'no' --maxgap 60 --maxlength 2 --outDir ~/spm_analysis
The script first queries a database, performs data preprocessing, mines for patterns, and then creates the feature vectors formatted for logit modeling. The feature vectors included temporal patterns identified by SPM, age, and static variables such as ethnicity.
`run_spm.R` returns nothing, but saves the days on which each pattern occurs (`patternDays`) and the corresponding feature vectors (`featureVectors`) as .RData objects. These .RData objects are later read by `run_logits_spm.R`.
Thus, bash scripts were used to execute SPM under different conditions; see `run_spm.sh` and `run_spm_all.sh`.
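The same kind of sweep can also be sketched directly in R by looping over settings and calling `runSPM`, reusing `fake_data` and `spmDir` from the RStudio example above; the values and output layout here are illustrative, not those used in the paper:

```r
# Illustrative hyperparameter sweep, analogous to what the bash scripts do
# by invoking run_spm.R repeatedly with different flags.
for (gap in c(30, 60, 90)) {
  gapDir <- file.path(spmDir, paste0('maxgap', gap))
  dir.create(gapDir, showWarnings = FALSE, recursive = TRUE)
  runSPM(event     = fake_data,
         suppList  = seq(0.3, 0.5, 0.05),
         maxgap    = gap,
         maxlen    = 2,
         maxsize   = 2,
         tType     = 'rate',
         outputDir = gapDir)
}
```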
If you want to cite this work, please cite the paper:
@article{smedley2018longitudinal,
title={Longitudinal Patterns in Clinical and Imaging Measurements Predict Residual Survival in Glioblastoma Patients},
author={Smedley, Nova F and Ellingson, Benjamin M and Cloughesy, Timothy F and Hsu, William},
journal={Scientific reports},
volume={8},
number={1},
pages={14429},
year={2018},
publisher={Nature Publishing Group}
}
and the repo:
@misc{smedleygbmspm,
title={gbmSPM},
author={Smedley, Nova F},
year={2018},
publisher={GitHub},
howpublished={\url{https://github.com/novasmedley/gbmSpm}},
}