Request for an independence test for censored variables #1839

Open
samblechman opened this issue Jan 6, 2025 · 12 comments

@samblechman

For datasets with a time-to-event (failure) variable, I do not think Tetrad has a conditional independence test for censored data (e.g., one based on the log-rank test or Cox proportional hazards regression). For example, I have a survival variable that is right-censored at 30 days, because deaths after 30 days are not pertinent to the problem. Would it be feasible to add an independence test to Tetrad that can handle censored data?

I have considered alternatives such as transforming the data (e.g., imputing censored times), but I believe I would lose statistical power that way.

Thanks!

@jdramsey
Collaborator

jdramsey commented Jan 6, 2025

Let me think about that for a bit. We have some ways to deal with missing data that do not lose statistical power. Testwise deletion is one option, though I need to think about whether that's appropriate in this context. Also, Kun Zhang came up with a version of PC that explicitly makes all of the adjustments needed in cases of missing data. Let me find that paper for you and send it.

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025

I thought about this for a bit. It's a fair request, though I don't believe the log-rank test or proportional hazards regression is a good way to incorporate this missing data into a causal search. As far as I understand, there are practical and theoretical problems with casting those as general conditional independence tests.

However, this is a kind of missing-not-at-random (MNAR) problem for causal search, which has been addressed in the literature. A couple of papers come to mind:

Mohan, K., Pearl, J., & Tian, J. (2021). Causal inference and missing data. Proceedings of the National Academy of Sciences (PNAS).

Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., & Zhang, K. (2019, April). Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1762-1770). PMLR.

Other papers deal with this problem as well. The Mohan et al. article is theoretical and discusses the basic strategy pursued in the Tu et al. article. However, the Tu et al. article (as I understand it) gives practical advice for searching for a mechanism for the MNAR missing value case. You may already know a mechanism, so you could bypass the search or guide it with background knowledge.

I had implemented MVPC at one point, but I must have deleted the implementation. I could implement it again or some algorithm like that. It involves expanding the dataset to include missingness indicators and then searching over the expanded dataset. If you know the missingness mechanism, you can guide the search over the expanded dataset with background knowledge.
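As a rough illustration of the "expand the dataset with missingness indicators" step described above, here is a minimal sketch in plain Java. It is not Tetrad code and not the MVPC implementation: the class name `MissingnessIndicators`, the use of `NaN` to mark missing values, and the convention that an indicator of 1 means "missing" are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Sketch: append one 0/1 missingness-indicator column per incomplete variable. */
final class MissingnessIndicators {

    /**
     * Returns a copy of the data with extra columns appended. For each original
     * column that contains at least one NaN, the new column holds 1.0 where the
     * value is missing and 0.0 where it is observed (assumed convention).
     */
    static double[][] expand(double[][] data) {
        int rows = data.length;
        int cols = rows == 0 ? 0 : data[0].length;

        // Columns that have at least one missing value.
        int[] incomplete = IntStream.range(0, cols)
            .filter(j -> Arrays.stream(data).anyMatch(row -> Double.isNaN(row[j])))
            .toArray();

        double[][] expanded = new double[rows][cols + incomplete.length];
        for (int i = 0; i < rows; i++) {
            // Copy the original values, then fill in the indicator columns.
            System.arraycopy(data[i], 0, expanded[i], 0, cols);
            for (int k = 0; k < incomplete.length; k++) {
                expanded[i][cols + k] = Double.isNaN(data[i][incomplete[k]]) ? 1.0 : 0.0;
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0},
            {Double.NaN, 3.0},
            {4.0, 5.0}
        };
        System.out.println(Arrays.deepToString(expand(data)));
    }
}
```

A search would then run over the expanded columns, with background knowledge constraining which variables may point into the indicator columns if the mechanism is known.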

That would be a sensible addition to Tetrad. I think I can remember how to do it, and it wouldn't take much effort.

What do you think?

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025

To be honest, I don't know whether I can get this done anytime soon, but I will think about it further.

@samblechman
Author

Totally understand, it's a big ask. Thank you for your response and for the papers linked.

I would like to clarify a few things. As far as I understand, censored data is different from missing data. I suppose censoring (right censoring, in my case) is most similar to MAR or MNAR, because healthier patients live longer and are therefore more likely to survive past the observation period. However, it is unlikely that all the variables that contribute to survival time, and therefore cause censoring/missingness, are observed in the dataset.

This feels fundamentally different from missing data because it's not that censored patients don't have a survival time; it's just that they don't have one that is less than the observed period. They either have a survival time longer than the observed period, or they are still alive.

Might there not be any general conditional independence test that can handle censored data? It's also worth mentioning that, in some cases, linear regression on the imputed survival data yields p-values similar to those from the log-rank test and coxph models. So if an algorithm uses linear regression as the conditional independence test (and if I encode the survival variable as continuous and impute the survival times for censored individuals as the censoring time), then that might work fine.

I would also like to ask exactly what "the missingness mechanism" refers to. Does that mean I need to know exactly which variables cause individuals to be censored?

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025

Peter Spirtes also sent along this relevant paper:

Strobl, E. V., Visweswaran, S., & Spirtes, P. L. (2018). Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, 6, 47-62.

Testwise deletion is already implemented in Tetrad, but I need to check some details. I found one issue last night.
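For readers unfamiliar with the term, testwise deletion means that each independence test drops only the rows that are incomplete for the variables in that particular test, rather than dropping every row with any missing value. The sketch below shows the row-selection step in plain Java. It is an illustration under assumptions (missing values encoded as `NaN`, a hypothetical helper named `rowsCompleteFor`), not Tetrad's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: select the rows usable for one conditional independence test. */
final class TestwiseDeletion {

    /** Keeps only the rows with no NaN in the columns involved in the current test. */
    static double[][] rowsCompleteFor(double[][] data, int... testColumns) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : data) {
            boolean complete = true;
            for (int c : testColumns) {
                if (Double.isNaN(row[c])) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0, Double.NaN},
            {Double.NaN, 1.5, 0.3},
            {0.7, 2.2, 0.9}
        };
        // A test over columns 0 and 1 can still use the row whose column 2 is missing.
        System.out.println(rowsCompleteFor(data, 0, 1).length);
        // A test that also conditions on column 2 loses that row.
        System.out.println(rowsCompleteFor(data, 0, 1, 2).length);
    }
}
```

The sample size therefore varies from test to test, which is what lets this approach keep more data than listwise deletion.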

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025 via email

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025

Oh wait, this:

This feels fundamentally different than missing data because it's not that censored patients don't have a 
survival time, its just that they don't have one that is less than the observed period. They either have a 
survival time longer than the observed period, or they are still alive.

Do you mean to say when you're censoring data, you don't have missing data? Do you have a marker indicating that a variable's value is maxed out beyond the scope of the study? Do you take that to be different from not recording a value for that variable for that patient? Maybe I misunderstand how you're handling it.

@samblechman
Author

I will read those papers, thank you.

Example data below to illustrate what I am describing, using a manually imposed 30-day censoring. Every patient is observed for at least 30 days after treatment (many for much, much longer), but we only care about the first 30 days, so we ignore any death beyond 30 days, give those patients a survival time of 30 days, and indicate that they were censored. If a patient was treated in 2021 and is still living at the point of data collection (e.g., patient 5 in the table below), they are treated the same as patients 3 and 4.

The censoring indicator is called status in some software (coxph() and survfit() functions in R).

| Patient | Survival time (raw, days) | Survival time (censored, days) | Censoring indicator |
|---|---|---|---|
| 1 | 5 | 5 | 0 |
| 2 | 19 | 19 | 0 |
| 3 | 104 | 30 | 1 |
| 4 | 788 | 30 | 1 |
| 5 | NA (still alive) | 30 | 1 |

In this instance, if a variable (e.g., drug treatment) causes patients to live longer, it has a causal influence on survival and on the censoring indicator (since the censoring indicator is a deterministic function of survival time).
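To make the "deterministic function of survival time" point concrete, here is a small sketch that reproduces the censored columns of the table from the raw survival times. The class and method names are hypothetical, `NaN` is used as a stand-in for "still alive at data collection", and treating a death on exactly day 30 as observed is an assumption.

```java
/** Sketch: derive the censored time and censoring indicator from a raw survival time. */
final class AdministrativeCensoring {

    /** True when no death was observed within the window (NaN = still alive; assumed encoding). */
    static boolean isCensored(double rawDays, double cutoffDays) {
        return Double.isNaN(rawDays) || rawDays > cutoffDays;
    }

    /** Observed time: the raw time for deaths within the window, otherwise the cutoff. */
    static double censoredTime(double rawDays, double cutoffDays) {
        return isCensored(rawDays, cutoffDays) ? cutoffDays : rawDays;
    }

    /** Indicator coded as in the table: 1 = censored, 0 = death observed. */
    static int censoringIndicator(double rawDays, double cutoffDays) {
        return isCensored(rawDays, cutoffDays) ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] raw = {5, 19, 104, 788, Double.NaN}; // patients 1-5 from the table
        for (int i = 0; i < raw.length; i++) {
            System.out.printf("patient %d: time=%.0f, censored=%d%n",
                i + 1, censoredTime(raw[i], 30), censoringIndicator(raw[i], 30));
        }
    }
}
```

One caveat on coding: R's `Surv()` conventionally takes a status of 1 for an observed event and 0 for a censored observation, so an indicator coded as in the table would need to be flipped (`1 - indicator`) before being passed there.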

As for a general test of conditional independence for censored data, I am not sure how that would work. But if only one variable is censored (e.g., survival time) and it is on the lowest temporal tier (so it cannot be a cause of any other variable), would it ever appear as anything other than X or Y in the formula below (that is, would it ever need to be in the conditioning set Z1, ..., Zn)?
X || Y | Z1, ..., Zn

I really appreciate you putting so much thought and work into this. I hope my questions/responses have made sense.

@jdramsey
Collaborator

jdramsey commented Jan 7, 2025 via email

@jdramsey
Collaborator

jdramsey commented Jan 8, 2025

I did the next step on the censored data project by writing an MNAR simulator and putting it in the Tetrad interface. The simulator generates a graph, augments it with a missingness variable for each node, adds extra influences from other variables to the missingness variables, produces linear Gaussian (LG) data over all variables, and thresholds the missingness variables to form indicator variables. A data value is then set to missing if its missingness indicator is 0. Finally, the missingness variables are removed from the data, and the data with missing values is returned.
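The steps described above can be sketched for a two-variable case as follows. This is a standalone illustration, not the simulator that was added to Tetrad: the graph (X -> Y), every coefficient, the threshold, and the class name `MnarSimulationSketch` are made up for the example, and `NaN` is used to represent a missing value.

```java
import java.util.Random;

/** Sketch of an MNAR simulation for two variables; all weights and the threshold are made up. */
final class MnarSimulationSketch {

    /**
     * Draws n rows of linear Gaussian data for X -> Y, builds a latent missingness
     * variable for each column that is influenced by the other column, thresholds it
     * into a 0/1 indicator, and sets a value to NaN when its indicator is 0.
     * The missingness variables themselves are not returned.
     */
    static double[][] simulate(int n, double threshold, long seed) {
        Random rng = new Random(seed);
        double[][] data = new double[n][2];
        for (int i = 0; i < n; i++) {
            double x = rng.nextGaussian();
            double y = 0.8 * x + rng.nextGaussian();

            // Latent missingness variables with cross-influences (hypothetical weights).
            double mx = 0.9 * y + rng.nextGaussian();
            double my = 0.9 * x + rng.nextGaussian();

            // Indicator is 1 when the latent variable exceeds the threshold; 0 means "missing".
            data[i][0] = mx > threshold ? x : Double.NaN;
            data[i][1] = my > threshold ? y : Double.NaN;
        }
        return data;
    }

    /** Counts the missing (NaN) entries in one column. */
    static int countMissing(double[][] data, int column) {
        int missing = 0;
        for (double[] row : data) {
            if (Double.isNaN(row[column])) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        double[][] data = simulate(1000, -0.5, 42L);
        System.out.println("missing X: " + countMissing(data, 0));
        System.out.println("missing Y: " + countMissing(data, 1));
    }
}
```

Because the missingness of each variable depends on another variable that is itself only partially observed, the resulting data is not missing completely at random, which is the situation the testwise-deletion and MVPC approaches are meant to address.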

This is all random. I could give the user more control. Also, as indicated, it's linear Gaussian; I suppose I could allow the user to pick a model class or add this option to any model class.

I added this to the Tetrad interface as a new simulator. As usual, the interface lets you pick a graph randomly (or create one), set the parameters for the simulation, and obtain MNAR data to analyze.

I tested it by running LV-Lite on the data with testwise deletion and got the right answer on the first example I tried. This uses the idea of the paper by Strobl et al., which recommends you run a latent variable algorithm on the data using testwise deletion.

I have not stress-tested it.

@cg09

cg09 commented Jan 8, 2025 via email

@jdramsey
Collaborator

Just to put this here as a note for Cox regression, a suggestion I saw was to try the 'smile' library. Here's a code example:

import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.data.vector.IntVector;
import smile.data.vector.DoubleVector;
import smile.regression.CoxPH;

public class CoxRegressionMultipleCovariates {

    public static void main(String[] args) {
        // Survival times
        double[] time = {5, 8, 12, 14, 18, 22, 25};

        // Event indicator (1 = event occurred, 0 = censored)
        int[] event = {1, 1, 1, 0, 1, 0, 1};

        // Predictor variables (covariates)
        double[] covariateX1 = {0, 1, 0, 1, 0, 1, 1}; // Covariate 1
        double[] covariateX2 = {3.2, 4.1, 2.8, 4.5, 3.3, 4.0, 4.2}; // Covariate 2

        // Create a DataFrame
        DataFrame data = DataFrame.of(
            IntVector.of("Event", event),          // Event indicator
            DoubleVector.of("Time", time),         // Survival time
            DoubleVector.of("X1", covariateX1),    // Covariate 1
            DoubleVector.of("X2", covariateX2)     // Covariate 2
        );

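        // Define the formula for the Cox model.
        // NOTE: as written, the formula never references "Event", so the censoring
        // information does not appear to reach the model.
        Formula formula = Formula.lhs("Time").rhs("X1 + X2");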

        // Fit the Cox regression model
        CoxPH coxModel = CoxPH.fit(formula, data);

        // Print the results
        System.out.println("Coefficients: " + coxModel.coefficients());
        System.out.println("Log-Likelihood: " + coxModel.logLikelihood());
        System.out.println("Concordance Index: " + coxModel.concordanceIndex());
    }
}
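For comparison, here is a library-free sketch of what a Cox fit computes, using the times, event indicators, and the X1 covariate from the example above. It does not rely on Smile at all, so it makes no assumption about whether a `CoxPH` class with the API shown exists there. It handles a single covariate only, uses the Breslow form of the partial likelihood, and fits by Newton-Raphson; the class and method names are hypothetical. The likelihood-ratio statistic it returns is for the marginal association between the covariate and survival; a conditional independence test would need the conditioning variables added as further covariates.

```java
/** Sketch: single-covariate Cox regression via the partial likelihood (Breslow form). */
final class CoxSingleCovariate {

    /** Returns {log partial likelihood, score, information} at the given beta. */
    static double[] evaluate(double[] time, int[] event, double[] x, double beta) {
        double logLik = 0, score = 0, info = 0;
        for (int i = 0; i < time.length; i++) {
            if (event[i] != 1) continue; // censored rows contribute only through risk sets
            double s0 = 0, s1 = 0, s2 = 0;
            for (int j = 0; j < time.length; j++) {
                if (time[j] >= time[i]) { // subject j is still at risk at time[i]
                    double w = Math.exp(beta * x[j]);
                    s0 += w;
                    s1 += w * x[j];
                    s2 += w * x[j] * x[j];
                }
            }
            double mean = s1 / s0; // risk-weighted mean of x over the risk set
            logLik += beta * x[i] - Math.log(s0);
            score += x[i] - mean;
            info += s2 / s0 - mean * mean;
        }
        return new double[] {logLik, score, info};
    }

    /** Newton-Raphson maximization of the partial likelihood, starting from beta = 0. */
    static double fit(double[] time, int[] event, double[] x) {
        double beta = 0;
        for (int iter = 0; iter < 50; iter++) {
            double[] e = evaluate(time, event, x, beta);
            double step = e[1] / e[2];
            beta += step;
            if (Math.abs(step) < 1e-10) break;
        }
        return beta;
    }

    /** Likelihood-ratio statistic for H0: beta = 0 (asymptotically chi-square, 1 df). */
    static double likelihoodRatio(double[] time, int[] event, double[] x) {
        double beta = fit(time, event, x);
        return 2 * (evaluate(time, event, x, beta)[0] - evaluate(time, event, x, 0)[0]);
    }

    public static void main(String[] args) {
        double[] time = {5, 8, 12, 14, 18, 22, 25};
        int[] event = {1, 1, 1, 0, 1, 0, 1}; // 1 = event occurred, 0 = censored
        double[] x1 = {0, 1, 0, 1, 0, 1, 1};
        System.out.println("beta: " + fit(time, event, x1));
        System.out.println("LR statistic: " + likelihoodRatio(time, event, x1));
    }
}
```

Note that the event indicator enters the computation directly: only rows with an observed event add a term to the likelihood, while censored rows still count in the risk sets up to their censoring time.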
