Request for an independence test for censored variables #1839
Let me think about that for a bit. We have some ways to deal with missing data that do not lose statistical power. Testwise deletion is one option, though I need to think about whether that's appropriate in this context. Also, Kun Zhang came up with a version of PC that explicitly makes all of the adjustments needed in cases of missing data. Let me find that paper for you and send it. |
I thought about this for a bit. It's a fair request, though I don't believe the log-rank test or proportional hazards regression is a good way to incorporate this missing data into a causal search. As far as I understand, there are practical and theoretical problems with casting those as general conditional independence tests. However, this is a kind of missing-not-at-random (MNAR) problem for causal search, which has been addressed in the literature. A couple of papers come to mind:
Mohan, K., Pearl, J., & Tian, J. (2021). Causal inference and missing data. Proceedings of the National Academy of Sciences (PNAS).
Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., & Zhang, K. (2019, April). Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1762-1770). PMLR.
Other papers deal with this problem as well. The Mohan et al. article is theoretical and discusses the basic strategy pursued in the Tu et al. article. However, the Tu et al. article (as I understand it) gives practical advice for searching for a mechanism in the MNAR missing value case. You may already know a mechanism, so you could bypass the search or guide it with background knowledge.
I had implemented MVPC at one point, but I must have deleted the implementation. I could implement it again, or some algorithm like it. It involves expanding the dataset to include missingness indicators and then searching over the expanded dataset. If you know the missingness mechanism, you can guide the search over the expanded dataset with background knowledge. That would be a sensible addition to Tetrad. I think I can remember how to do it, and it wouldn't take much effort. What do you think? |
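The dataset expansion mentioned above can be sketched as follows. This is a minimal illustration in Python with NumPy, not Tetrad's code and not the MVPC implementation; the function name `add_missingness_indicators`, the `R_` column prefix, and the convention that 1 means "missing" are all assumptions made for the example.

```python
import numpy as np

def add_missingness_indicators(data, names):
    """Append one 0/1 indicator column for every variable that has missing values.

    Assumed convention for this sketch: indicator 1 = value is missing (NaN).
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Only variables that actually have missing values get an indicator.
    cols = [j for j in range(data.shape[1]) if missing[:, j].any()]
    indicators = missing[:, cols].astype(float)
    expanded = np.hstack([data, indicators])
    expanded_names = list(names) + [f"R_{names[j]}" for j in cols]
    return expanded, expanded_names

# Tiny example: Y is missing for the first patient only.
expanded, expanded_names = add_missingness_indicators(
    [[1.0, float("nan")], [2.0, 3.0]], ["X", "Y"]
)
print(expanded_names)
```

A search would then run over the expanded columns, with background knowledge (if any) constraining which variables may point into the indicator columns.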
To be honest, I don't know whether I can get this done anytime soon, but I will think about it further. |
Totally understand, it's a big ask. Thank you for your response and for the papers linked.
I would like to clarify a few things. As far as I understand, censored data is different from missing data. I suppose censoring (right censoring, in my case) is most similar to MAR or MNAR because healthier patients live longer and are therefore more likely to survive past the observation period. However, it is unlikely that all the variables that contribute to survival time, and therefore cause censoring/missingness, are observed in the dataset.
This feels fundamentally different from missing data because it's not that censored patients don't have a survival time; it's just that they don't have one that is less than the observed period. They either have a survival time longer than the observed period, or they are still alive.
Might there not be any general conditional independence test that can handle censored data? It's also worth mentioning that in some cases, linear regression using the imputed survival data yields a p-value similar to that of the log-rank test and coxph models. So if an algorithm uses linear regression as the conditional independence test (and if I encode the survival variable as continuous and impute the survival times for censored individuals as the censoring time), then that might work fine.
I would also like to ask exactly what "the missingness mechanism" refers to. Does that mean I need to know exactly which variables cause individuals to be censored? |
Peter Spirtes also sent along this relevant paper:
Strobl, E. V., Visweswaran, S., & Spirtes, P. L. (2018). Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, 6, 47-62.
Testwise deletion is already implemented in Tetrad, but I need to check some details. I found one issue last night. |
If you knew such information, it would be helpful. If not, the idea is to
try to discover it from data, or at least to account for it.
You know, years ago I went to the trouble of implementing an independence
test based on regression, but I later deleted it because it was equivalent
to (and gave the same answers as) zero conditional correlation (our Fisher
Z test). Also, our score-based algorithms, especially BOSS and GRaSP
(recent algorithms), are much more accurate than our test-based algorithms,
so I'm motivated to find a solution based on those algorithms if possible.
Andrews, B., Ramsey, J., Sanchez Romero, R., Camchong, J., & Kummerfeld, E.
(2023). Fast scalable and accurate discovery of dags using the best order
score search and grow shrink trees. *Advances in Neural Information
Processing Systems*, *36*, 63945-63956.
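For readers unfamiliar with the Fisher Z test mentioned above, the following Python sketch shows the standard textbook form of the test for X _||_ Y | Z under a linear Gaussian assumption. It is not Tetrad's code; the function name `fisher_z_test` and the simulated chain X -> Z -> Y are assumptions chosen for illustration.

```python
import math
import numpy as np

def fisher_z_test(data, x, y, z=()):
    """Return (partial correlation, two-sided p-value) for X _||_ Y | Z."""
    data = np.asarray(data, dtype=float)
    idx = [x, y, *z]
    n = data.shape[0]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    # Partial correlation of the first two variables given the rest,
    # read off the inverse correlation (precision) matrix.
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    stat = math.sqrt(n - len(z) - 3) * 0.5 * math.log((1 + r) / (1 - r))
    p = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value
    return r, p

# Hypothetical chain X -> Z -> Y: X and Y are dependent, but independent given Z.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
z = x + rng.normal(size=n)
y = z + rng.normal(size=n)
data = np.column_stack([x, y, z])
print(fisher_z_test(data, 0, 1))        # marginal test
print(fisher_z_test(data, 0, 1, (2,)))  # conditional on Z
```

The test assumes fully observed continuous values, which is why a censored survival variable does not fit it directly.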
What is your idea for a general conditional independence test using, say,
log rank or coxph? When I thought about it, I couldn't think of how to make
it work in a general way. It would have to work for X _||_ Y | Z1,...Zn,
where any subset of these could have censored data.
Also, if you get a chance, have a look at the papers I sent. They're
modeling missingness using missingness indicators, the last paper
implicitly. The idea is that if there is a way that missingness influences
the causal results, these algorithms should account for it in a
theoretically correct way. Whether that pans out in practice is
something that would have to be explored; I've never explored it with
censored data in particular.
My idea, when I get a chance, is first to build a simulation model for MNAR
in Tetrad and try various methods on it. I should be able to simulate
censored data. That way I'd have some sense of it, which I don't currently
have.
I don't know exactly when I'll do this, of course, but as I suggested, you
posed a well-formed question, so I'm interested. Last night I did the first
step, which was to write some code to produce the expanded dataset with the
missingness indicators; that works well. Then I ran BOSS on the result to
see what would happen, found a bug in the testwise deletion code, and fixed
it. It's a first step; next I would like to think of a simulation method.
By the way, if you want to help out with any of this, feel free.
|
Oh wait, this:
Do you mean to say when you're censoring data, you don't have missing data? Do you have a marker indicating that a variable's value is maxed out beyond the scope of the study? Do you take that to be different from not recording a value for that variable for that patient? Maybe I misunderstand how you're handling it. |
I will read those papers, thank you. Example data below to illustrate what I am describing, using a manually imposed 30-day censoring. Every patient is observed for at least 30 days after treatment (many for much longer), but we only care about the first 30 days, so we ignore any death beyond 30 days, give those patients a survival time of 30 days, and indicate that they were censored. If a patient was treated in 2021 and is still living at the point of data collection (e.g., patient 5 in the table below), they are treated the same as patients 3 and 4. The censoring indicator is called status in some software (the coxph() and survfit() functions in R).
In this instance, if a variable (e.g., drug treatment) causes patients to live longer, it has a causal influence on survival and on the censoring indicator (since the censoring indicator is a deterministic function of survival time). As for a general test of conditional independence for censored data, I am not sure how that would work. But if only one variable were censored (e.g., survival time) and it is on the lowest temporal tier (cannot be the cause of any other variable), would it ever be any variable other than X or Y in the formula X _||_ Y | Z1,...,Zn? I really appreciate you putting so much thought and work into this. I hope my questions/responses have made sense. |
I see; let me think about it. On the face of it, the difference between the
two approaches is that instead of writing "30" for the censored values, you
indicate you don't know what the values are, by putting missing value
markers there (e.g., "*"). The censoring indicator is simply reversed from
the missingness indicator in the Tu et al. paper, which would switch the 1's
and 0's (I think). The raw survival time would be a latent variable, not
available in the data.
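The recoding described above can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `censoring_to_missingness` is made up, NaN stands in for the missing value marker, and the convention that the flipped indicator uses 1 for "observed" and 0 for "missing" is an assumption, since the thread itself is unsure which way the Tu et al. indicator runs.

```python
import math

def censoring_to_missingness(censored_times, censoring_indicator):
    """Replace censored values with NaN and flip the censoring indicator.

    Input convention (from the thread's example): indicator 1 = censored.
    Output convention (assumed here): indicator 1 = observed, 0 = missing.
    """
    values, observed = [], []
    for t, c in zip(censored_times, censoring_indicator):
        if c == 1:
            values.append(math.nan)  # true survival time is unknown
            observed.append(0)
        else:
            values.append(float(t))
            observed.append(1)
    return values, observed

# The five example patients with 30-day censoring.
values, observed = censoring_to_missingness([5, 19, 30, 30, 30], [0, 0, 1, 1, 1])
print(values)
print(observed)
```

Note that this recoding discards the information that a censored patient survived at least 30 days, which is part of why censoring and plain missingness are not the same thing.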
You are right about the temporal tiers. That could be made a condition for
doing this sort of analysis. Of course, it would be difficult in the search
to specify that different independence tests should be used for different
conditions, so the test would need to handle more than just the final tier
checks.
I need to think about this more.
…On Tue, Jan 7, 2025 at 1:02 PM Samuel Blechman ***@***.***> wrote:
The censoring indicator is called `status` in some software (the coxph() and
survfit() functions in R).

| Patient | Survival time (raw) | Survival time (censored) | Censoring indicator |
|---|---|---|---|
| 1 | 5 | 5 | 0 |
| 2 | 19 | 19 | 0 |
| 3 | 104 | 30 | 1 |
| 4 | 788 | 30 | 1 |
| 5 | NA (still alive) | 30 | 1 |
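The 30-day censoring rule behind the table above can be written as a short Python sketch. The function name `right_censor` is hypothetical, and `None` is used here to represent a patient who is still alive; the indicator follows the table's convention (1 = censored).

```python
def right_censor(raw_times, cutoff=30):
    """Apply administrative right-censoring at `cutoff` days.

    raw_times: survival times in days, or None if the patient is still alive.
    Returns (censored_times, censoring_indicator) with indicator 1 = censored.
    """
    censored_times, indicator = [], []
    for t in raw_times:
        if t is None or t > cutoff:
            # Death after the cutoff (or no death yet) is recorded as the cutoff.
            censored_times.append(cutoff)
            indicator.append(1)
        else:
            censored_times.append(t)
            indicator.append(0)
    return censored_times, indicator

# Patients 1 to 5 from the table.
times, status = right_censor([5, 19, 104, 788, None])
print(times)   # [5, 19, 30, 30, 30]
print(status)  # [0, 0, 1, 1, 1]
```

Patients 3, 4, and 5 end up indistinguishable after censoring, which is the point made in the comment: the indicator is a deterministic function of the underlying survival time.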
|
I did the next step on the censored data project by writing an MNAR simulator and putting it in the Tetrad interface. This generates a graph, augments it with missingness variables for each node, adds extra missingness influences from other variables to the missingness variables, produces linear Gaussian (LG) data over all variables, and thresholds the missingness variables to form indicator variables. The data variables are then set to missing if their missingness variables are set to 0. Then, the missingness variables are removed from the data, and the data with missing values is returned.
This is all random. I could give the user more control. Also, as indicated, it's linear Gaussian; I suppose I could allow the user to pick a model class or add this option to any model class.
I added this to the Tetrad interface as a new simulator. As usual, the interface lets you pick a graph randomly (or create one), set the parameters for the simulation, and obtain MNAR data to analyze.
I tested it by running LV-Lite on the data with testwise deletion and got the right answer on the first example I tried. This uses the idea of the paper by Strobl et al., which recommends running a latent variable algorithm on the data using testwise deletion.
I have not stress-tested it. |
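The simulation steps described above can be mimicked in a small Python sketch. This is not the Tetrad simulator: the fixed three-variable chain, the coefficients, the quantile-based threshold, and the function name `simulate_mnar` are all assumptions chosen to keep the example short. It follows the description's convention that a value is set to missing when its indicator is 0.

```python
import numpy as np

def simulate_mnar(n=1000, missing_fraction=0.25, seed=0):
    """Linear Gaussian chain X1 -> X2 -> X3 with MNAR missingness in X2 and X3."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)
    x3 = 0.8 * x2 + rng.normal(size=n)
    full = np.column_stack([x1, x2, x3])

    # Latent missingness variables, each influenced by a partially observed variable.
    m2 = 1.0 * x3 + rng.normal(size=n)  # drives missingness of X2
    m3 = 1.0 * x2 + rng.normal(size=n)  # drives missingness of X3

    data = full.copy()
    for col, m in ((1, m2), (2, m3)):
        threshold = np.quantile(m, missing_fraction)
        indicator = (m > threshold).astype(int)  # 1 = observed, 0 = missing
        data[indicator == 0, col] = np.nan
    # The missingness variables are dropped; only the data with NaNs is returned,
    # plus the complete data for comparison.
    return data, full

data, full = simulate_mnar()
print(np.isnan(data).mean(axis=0))               # fraction missing per column
print(np.nanmean(data[:, 2]), full[:, 2].mean())  # observed mean vs. true mean of X3
```

Because low values of X3 are more likely to be dropped, the observed mean of X3 is biased upward relative to the complete data, which is the kind of distortion a missingness-aware search is meant to account for.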
Impressive. Censoring often occurs at extreme values (high, low, or both) and
applies to a region of values rather than a point value.
|
Just to put this here as a note for Cox regression, a suggestion I saw was to try the 'smile' library. Here's a code example:
|
For datasets with a time-to-event/failure variable, I do not think Tetrad has any conditional independence test for censored data (e.g., the log-rank test or Cox proportional hazards regression). For example, I have a survival variable that is right-censored (at 30 days, because death after 30 days is not pertinent to the problem). Would it be possible/feasible to add an independence test to Tetrad that can handle censored data?
I have thought about alternatives, such as transforming the data (e.g., imputing censored times), but I believe I will lose statistical power in such cases.
Thanks!