Request for an independence test for censored variables #1839

samblechman · 2025-01-06T14:30:18Z

For datasets with a time-to-event (failure) variable, I do not think Tetrad has any conditional independence test for censored data (e.g., the log-rank test or Cox proportional hazards regression). For example, I have a survival variable that is right-censored at 30 days, because death after 30 days is not pertinent to the problem. Would it be feasible to add an independence test to Tetrad that can handle censored data?

I have considered alternatives such as transforming the data (e.g., imputing censored times), but I believe I would lose statistical power that way.

Thanks!

jdramsey · 2025-01-06T17:35:41Z

Let me think about that for a bit. We have some ways to deal with missing data that do not lose statistical power. Testwise deletion is one option, though I need to think whether that's appropriate in this context. Also, Kun Zhang came up with a version of PC that explicitly made all of the adjustments in cases of missing data. Let me find that paper for you and send it.

jdramsey · 2025-01-07T09:03:12Z

I thought about this for a bit. It's a fair request, though I don't believe the log-rank test or proportional hazards regression is a good way to incorporate this missing data into a causal search. As far as I understand, there are practical and theoretical problems with casting those as general conditional independence tests.

However, this is a kind of missing-not-at-random (MNAR) problem for causal search, which has been addressed in the literature. A couple of papers come to mind:

Mohan, K., Pearl, J., & Tian, J. (2021). Causal inference and missing data. Proceedings of the National Academy of Sciences (PNAS).

Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., & Zhang, K. (2019, April). Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1762-1770). PMLR.

Other papers deal with this problem as well. The Mohan et al. article is theoretical and discusses the basic strategy pursued in the Tu et al. article. However, the Tu et al. article (as I understand it) gives practical advice for searching for a mechanism for the MNAR missing value case. You may already know a mechanism, so you could bypass the search or guide it with background knowledge.

I had implemented MVPC at one point, but I must have deleted the implementation. I could implement it again or some algorithm like that. It involves expanding the dataset to include missingness indicators and then searching over the expanded dataset. If you know the missingness mechanism, you can guide the search over the expanded dataset with background knowledge.
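The dataset expansion described above can be sketched in plain Java. This is only an illustration, not Tetrad's implementation: the class name `MissingnessIndicators`, the use of NaN as the missing-value marker, and the coding of the indicator (1 = observed, 0 = missing) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Tetrad): adds missingness-indicator columns. */
final class MissingnessIndicators {

    /**
     * Returns a copy of the data with one extra 0/1 column appended for every
     * original column that contains at least one NaN.
     * Assumed coding: 1 = value observed, 0 = value missing.
     */
    static double[][] expand(double[][] data) {
        int rows = data.length;
        int cols = rows == 0 ? 0 : data[0].length;

        // Collect the indices of columns that have at least one missing value.
        List<Integer> incomplete = new ArrayList<>();
        for (int j = 0; j < cols; j++) {
            for (double[] row : data) {
                if (Double.isNaN(row[j])) {
                    incomplete.add(j);
                    break;
                }
            }
        }

        // Copy the original columns and append one indicator per incomplete column.
        double[][] out = new double[rows][cols + incomplete.size()];
        for (int i = 0; i < rows; i++) {
            System.arraycopy(data[i], 0, out[i], 0, cols);
            for (int k = 0; k < incomplete.size(); k++) {
                boolean observed = !Double.isNaN(data[i][incomplete.get(k)]);
                out[i][cols + k] = observed ? 1.0 : 0.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 5.0},
            {2.0, Double.NaN},
            {3.0, 7.0}
        };
        System.out.println(Arrays.deepToString(expand(data)));
    }
}
```

Only columns with missing values get an indicator, since an indicator for a fully observed column would be constant. A search over the expanded data could then be restricted with background knowledge, for example by forbidding edges from indicators into the original variables.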

That would be a sensible addition to Tetrad. I think I can remember how to do it, and it wouldn't take much effort.

What do you think?

jdramsey · 2025-01-07T09:17:40Z

To be honest, I don't know whether I can get this done anytime soon, but I will think about it further.

samblechman · 2025-01-07T16:21:00Z

Totally understand, it's a big ask. Thank you for your response and for the papers linked.

I would like to clarify a few things. As far as I understand, censored data is different from missing data. I suppose censoring (right censoring, in my case) is most similar to MAR or MNAR because healthier patients live longer and are therefore more likely to survive past the observation period. However, it is unlikely that all the variables that contribute to survival time, and therefore cause censoring/missingness, are observed in the dataset.

This feels fundamentally different from missing data because it's not that censored patients don't have a survival time; it's just that they don't have one that is less than the observation period. They either have a survival time longer than the observation period, or they are still alive.

Might there not be any general conditional independence test that can handle censored data? It's also worth mentioning that in some cases, linear regression on the imputed survival data yields a p-value similar to that of the log-rank test and coxph models. So if an algorithm uses linear regression as the conditional independence test (and if I encode the survival variable as continuous and impute the survival times for censored individuals as the censoring time), then that might work fine.

I would also like to ask exactly what "the missingness mechanism" refers to. Does that mean I need to know exactly which variables cause individuals to be censored?

jdramsey · 2025-01-07T16:21:52Z

Peter Spirtes also sent along this relevant paper:

Strobl, E. V., Visweswaran, S., & Spirtes, P. L. (2018). Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, 6, 47-62.

Testwise deletion is already implemented in Tetrad, but I need to check some details. I found one issue last night.
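As a rough illustration of what testwise deletion means, the sketch below keeps, for a single test of X _||_ Y | Z, only the rows that are complete on the columns involved in that test. It is not Tetrad's code: the class name `TestwiseDeletion`, the NaN missing-value marker, and the row-major `double[][]` layout are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tetrad): row selection for one independence test. */
final class TestwiseDeletion {

    /**
     * Returns the rows that have no NaN in any of the given columns, i.e. the
     * columns of X, Y, and the conditioning set Z for one particular test.
     * Rows with missing values in other columns are kept.
     */
    static double[][] completeRows(double[][] data, int... testColumns) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : data) {
            boolean complete = true;
            for (int c : testColumns) {
                if (Double.isNaN(row[c])) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row);
            }
        }
        return kept.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        double[][] data = {
            {1, 2, nan},
            {nan, 3, 4},
            {5, nan, 6},
            {7, 8, 9}
        };
        // A test involving only columns 0 and 1 can still use row 0.
        System.out.println(completeRows(data, 0, 1).length);
        // Listwise deletion would keep only the rows complete on every column.
        System.out.println(completeRows(data, 0, 1, 2).length);
    }
}
```

The contrast with listwise deletion is the point of the example: each test discards only the rows it cannot use, so different tests may run on different subsets of the data.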

jdramsey · 2025-01-07T16:49:19Z

If you knew such information, it would be helpful. If not, the idea is to try to discover it from data, or at least to account for it.

You know, years ago I went to the trouble of implementing an independence test based on regression but have since deleted it since it was equivalent to (and gave the same answers as) zero conditional correlation (our Fisher Z test). Also, our score-based algorithms, especially BOSS and GRaSP (recent algorithms), are much more accurate than our test-based algorithms, so I'm motivated to find a solution based on those algorithms if possible.

Andrews, B., Ramsey, J., Sanchez Romero, R., Camchong, J., & Kummerfeld, E. (2023). Fast scalable and accurate discovery of dags using the best order score search and grow shrink trees. Advances in Neural Information Processing Systems, 36, 63945-63956.

What is your idea for a general conditional independence test using, say, log rank or coxph? When I thought about it, I couldn't think of how to make it work in a general way. It would have to work for X _||_ Y | Z1,...,Zn, where any subset of these could have censored data.

Also, if you get a chance, have a look at the papers I sent. They're modeling missingness using missingness indicators, the last paper implicitly. The idea is that if there is a way that missingness influences the causal results, these algorithms should account for it in a theoretically correct way. Whether that pans out in practice is something that would have to be explored; I've never explored it with censored data in particular.

My idea, when I get a chance, is first to build a simulation model for MNAR in Tetrad and try various methods on it. I should be able to simulate censored data. That way I'd have some sense of it, which I don't currently have. I don't know exactly when I'll do this of course, but as I suggested you posed a well-formed question so I'm interested.

Last night I did the first step, which was to write some code to produce the expanded dataset with the missingness indicators, which works well. Then I ran BOSS on the result to see what would happen, found a bug in the testwise deletion code, and fixed that. It's a first step; I would like to think of a simulation method. By the way, if you want to help out with any of this, feel free.
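For readers unfamiliar with the Fisher Z test mentioned in this comment, here is a minimal sketch of the standard textbook statistic for a (partial) correlation r computed from n samples with k conditioning variables. It is not Tetrad's implementation; the class name `FisherZSketch`, the fixed 1.96 cutoff (two-sided, alpha = 0.05), and the restriction of the correlation helper to the unconditional case are assumptions made to keep the example short.

```java
/** Hypothetical sketch (not part of Tetrad) of the Fisher Z statistic. */
final class FisherZSketch {

    /** Sample Pearson correlation of two equally long arrays. */
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /**
     * Fisher Z statistic: 0.5 * sqrt(n - k - 3) * ln((1 + r) / (1 - r)),
     * approximately standard normal when the true (partial) correlation is zero.
     */
    static double fisherZ(double r, int n, int k) {
        return 0.5 * Math.sqrt(n - k - 3) * Math.log((1.0 + r) / (1.0 - r));
    }

    /** Judges independence at the assumed two-sided 5% level. */
    static boolean judgedIndependent(double r, int n, int k) {
        return Math.abs(fisherZ(r, n, k)) <= 1.96;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {1, 3, 2, 4};
        double r = correlation(x, y);
        System.out.println("r = " + r + ", z = " + fisherZ(r, x.length, 0));
    }
}
```

For a conditional test, r would be the partial correlation of X and Y given Z rather than the plain correlation computed here.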

jdramsey · 2025-01-07T17:20:38Z

Oh wait, this:

> This feels fundamentally different than missing data because it's not that censored patients don't have a survival time, it's just that they don't have one that is less than the observed period. They either have a survival time longer than the observed period, or they are still alive.

Do you mean to say when you're censoring data, you don't have missing data? Do you have a marker indicating that a variable's value is maxed out beyond the scope of the study? Do you take that to be different from not recording a value for that variable for that patient? Maybe I misunderstand how you're handling it.

samblechman · 2025-01-07T18:02:06Z

I will read those papers, thank you.

Example data below illustrates what I am describing, using a manually imposed 30-day censoring. Every patient is observed for at least 30 days after treatment (many for much longer), but we only care about the first 30 days, so we ignore any death beyond 30 days, give those patients a survival time of 30 days, and indicate that they were censored. If a patient was treated in 2021 and is still living at the point of data collection (e.g., patient 5 in the table below), they are treated the same as patients 3 and 4.

The censoring indicator is called status in some software (coxph() and survfit() functions in R).

Patient	Survival time (raw)	Survival time (censored)	censoring indicator
1	5	5	0
2	19	19	0
3	104	30	1
4	788	30	1
5	NA (still alive)	30	1

In this instance, if a variable (e.g., drug treatment) causes patients to live longer, it has a causal influence on survival and on the censoring indicator (since the censoring indicator is a deterministic function of survival time).
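The deterministic relationship between the raw survival time and the two censored columns in the table above can be written out as a short sketch. The class and method names are made up for illustration, NaN stands in for "NA (still alive)", and a death on exactly day 30 is assumed to count as an observed event.

```java
/** Hypothetical sketch of the 30-day right-censoring rule shown in the table. */
final class RightCensoring {

    /** True if no death was observed within the cutoff (NaN = still alive). */
    static boolean isCensored(double rawTime, double cutoff) {
        return Double.isNaN(rawTime) || rawTime > cutoff;
    }

    /** Censored survival time: the raw time, capped at the cutoff. */
    static double censoredTime(double rawTime, double cutoff) {
        return isCensored(rawTime, cutoff) ? cutoff : rawTime;
    }

    /** Censoring indicator coded as in the table: 1 = censored, 0 = death observed. */
    static int censoringIndicator(double rawTime, double cutoff) {
        return isCensored(rawTime, cutoff) ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] raw = {5, 19, 104, 788, Double.NaN};
        for (int i = 0; i < raw.length; i++) {
            System.out.println((i + 1) + "\t" + raw[i] + "\t"
                    + censoredTime(raw[i], 30) + "\t" + censoringIndicator(raw[i], 30));
        }
    }
}
```

Note that R's `Surv()` conventionally codes `status` the other way round (1 = event observed, 0 = censored), so the indicator in the table would need to be flipped before being passed to `coxph()` or `survfit()`.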

As for a general test of conditional independence for censored data, I am not sure how that would work. But if only one variable is censored (e.g., survival time) and it is on the lowest temporal tier (it cannot be the cause of any other variable), would it ever appear anywhere other than as X or Y in the formula below?
X _||_ Y | Z1,...,Zn

I really appreciate you putting so much thought and work into this. I hope my questions/responses have made sense.

jdramsey · 2025-01-07T18:44:05Z

I see; let me think about it. On the face of it, the difference between the two approaches is that in the missing-data approach, instead of writing "30" for the censored values, you would indicate that you don't know what the values are by putting missing-value markers there (e.g., "*"). The censoring indicator is simply reversed from the missingness indicator in the Tu et al. paper, which would switch the 1's and 0's (I think). The raw survival time would be a latent variable, not available in the data. You are right about the temporal tiers. That could be made a condition for doing this sort of analysis. Of course, it would be difficult in the search to specify that different independence tests should be used for different conditions, so the test would need to handle more than just the final tier checks. I need to think about this more.
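The recoding described in this comment, from the censored representation to a missing-data representation, might look like the sketch below. The names are hypothetical, NaN plays the role of the "*" marker, and the direction of the flipped indicator (1 = observed, 0 = missing) is an assumption, given that the comment itself is tentative on that point.

```java
import java.util.Arrays;

/** Hypothetical sketch: recode censored survival data as data with missing values. */
final class CensoredToMissing {

    /** Replaces every censored time (indicator == 1) with NaN, the missing marker. */
    static double[] toMissing(double[] censoredTime, int[] censoringIndicator) {
        double[] out = new double[censoredTime.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = censoringIndicator[i] == 1 ? Double.NaN : censoredTime[i];
        }
        return out;
    }

    /** Flips the censoring indicator: assumed coding 1 = observed, 0 = missing. */
    static int[] observedIndicator(int[] censoringIndicator) {
        int[] out = new int[censoringIndicator.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = 1 - censoringIndicator[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Censored columns from the example table earlier in the thread.
        double[] time = {5, 19, 30, 30, 30};
        int[] censored = {0, 0, 1, 1, 1};
        System.out.println(Arrays.toString(toMissing(time, censored)));
        System.out.println(Arrays.toString(observedIndicator(censored)));
    }
}
```

This recoding discards the information that a censored patient survived at least 30 days, which is one way to see why censoring carries more information than ordinary missingness.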

jdramsey · 2025-01-08T08:51:09Z

I did the next step on the censored data project by writing an MNAR simulator and putting it in the Tetrad interface. This generates a graph, augments it with missingness variables for each node, adds extra missingness influences from other variables to the missingness variables, produces linear Gaussian (LG) data over all variables, and thresholds the missingness variables to form indicator variables. The data variables are then set to missing if their missingness indicators are 0. Then, the missingness variables are removed from the data, and the data with missing values is returned.
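A toy version of the procedure just described, reduced to a fixed two-variable graph, could look like the following. It is a sketch under stated assumptions and not the Tetrad simulator: the graph X -> Y, the coefficients, the class name `MnarSimulationSketch`, and the NaN missing marker are all invented for the example.

```java
import java.util.Random;

/** Hypothetical two-variable sketch of the simulation steps (not Tetrad's simulator). */
final class MnarSimulationSketch {

    /**
     * Simulates n rows of linear Gaussian data for X -> Y. Each variable has a
     * latent missingness variable influenced by the other variable; thresholding
     * it gives an indicator, and a value is set to NaN when its indicator is 0.
     * The missingness variables themselves are not returned.
     */
    static double[][] simulate(int n, double threshold, long seed) {
        Random rng = new Random(seed);
        double[][] data = new double[n][2];
        for (int i = 0; i < n; i++) {
            double x = rng.nextGaussian();
            double y = 0.8 * x + rng.nextGaussian();   // X -> Y
            double mX = y + rng.nextGaussian();        // Y influences missingness of X
            double mY = x + rng.nextGaussian();        // X influences missingness of Y
            int observedX = mX > threshold ? 1 : 0;    // thresholded indicators
            int observedY = mY > threshold ? 1 : 0;
            data[i][0] = observedX == 0 ? Double.NaN : x;
            data[i][1] = observedY == 0 ? Double.NaN : y;
        }
        return data;
    }

    /** Counts the missing (NaN) entries in one column. */
    static int countMissing(double[][] data, int column) {
        int count = 0;
        for (double[] row : data) {
            if (Double.isNaN(row[column])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double[][] data = simulate(1000, -1.0, 42L);
        System.out.println("missing X: " + countMissing(data, 0));
        System.out.println("missing Y: " + countMissing(data, 1));
    }
}
```

Because each variable's missingness depends on another variable that is itself only partially observed, the missingness is not ignorable in this toy setup. Lowering the threshold reduces the share of missing values.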

This is all random. I could give the user more control. Also, as indicated, it's linear Gaussian; I suppose I could allow the user to pick a model class or add this option to any model class.

I added this to the Tetrad interface as a new simulator. As usual, the interface lets you pick a graph randomly (or create one), set the parameters for the simulation, and obtain MNAR data to analyze.

I tested it by running LV-Lite on the data with testwise deletion and got the right answer on the first example I tried. This uses the idea of the paper by Strobl et al., which recommends you run a latent variable algorithm on the data using testwise deletion.

I have not stress-tested it.

cg09 · 2025-01-08T15:19:20Z

Impressive. Censoring often affects extreme values (high, low, or both) and applies to a region of values rather than a point value.

jdramsey · 2025-01-12T08:26:32Z

Just to put this here as a note for Cox regression, a suggestion I saw was to try the 'smile' library. Here's a code example:

import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.data.vector.IntVector;
import smile.data.vector.DoubleVector;
import smile.regression.CoxPH;

public class CoxRegressionMultipleCovariates {

    public static void main(String[] args) {
        // Survival times
        double[] time = {5, 8, 12, 14, 18, 22, 25};

        // Event indicator (1 = event occurred, 0 = censored)
        int[] event = {1, 1, 1, 0, 1, 0, 1};

        // Predictor variables (covariates)
        double[] covariateX1 = {0, 1, 0, 1, 0, 1, 1}; // Covariate 1
        double[] covariateX2 = {3.2, 4.1, 2.8, 4.5, 3.3, 4.0, 4.2}; // Covariate 2

        // Create a DataFrame
        DataFrame data = DataFrame.of(
            IntVector.of("Event", event),          // Event indicator
            DoubleVector.of("Time", time),         // Survival time
            DoubleVector.of("X1", covariateX1),    // Covariate 1
            DoubleVector.of("X2", covariateX2)     // Covariate 2
        );

        // Define the formula for the Cox model
        Formula formula = Formula.lhs("Time").rhs("X1 + X2");

        // Fit the Cox regression model
        CoxPH coxModel = CoxPH.fit(formula, data);

        // Print the results
        System.out.println("Coefficients: " + coxModel.coefficients());
        System.out.println("Log-Likelihood: " + coxModel.logLikelihood());
        System.out.println("Concordance Index: " + coxModel.concordanceIndex());
    }
}
