A constraint-based model of CatVars #22
Replies: 16 comments 31 replies
-
This sketch shows 2 CatVar classes under the current model.
-
Under the constraint model, we have a set of constraints that are common to all CatVars. CatCNV and Protein Sequence Variants are then profiles, each specifying the core constraints a CatVar needs in order to be queried under that profile.
-
A Function CatVar, showing constraints with or without non-null data values. It is a sequence variant because it matches the required constraints of that profile, but we are also able to capture additional information that is more CNV-like, here a relative change in sensitivity (I'm deliberately construing "copyChange" as generalizable to relative quantity changes).
-
So, as a general model, the constraints are applied following this general pattern?
-
This seems like a very reasonable approach to take. My biggest concern is whether this approach should be pursued in a 2.x release path, because this is a breaking change with vrsatile, which our current 1.x model is based on. If we want this to be our 1.x direction I will need a very quick set of defined constraints for canonicalallele, categorical in, and variantdescriptor (text). Otherwise this will set the GKS pilot work back many months at a time when we have just finally gotten a 1.x pipeline set up after years of trying. That's a tough sell for me. It comes down to how quickly we can draft a viable constraint schema and demonstrate how we discern one subtype of constraints from another. Will there be some standard and naming of constraint profiles? Can we put some sample data together from real-world data really quickly? If not, then let's pursue this as 2.x and stick with 1.x as is.
-
I'd like to understand whether we would create subclasses of the abstract constraint class for the various concrete concepts needed, or whether we would leave it to users to inspect a given constraint object to determine what type of catvar it actually is.
-
Thank you for laying this out so thoughtfully! I think the 4 figures currently presented in this discussion will be really wonderful visual aids in the next call. This is a very nice framework. I wonder if some of Larry's concerns come from the Connect release being labeled as 1.0.0, instead of leaving the major version to be 0.
-
@larrybabb Here are some examples of what updating a
-
Thanks @DanielPuthawala for putting together this side-by-side example. While I agree with the comments above that this is a very flexible and extensible framework for defining all types and kinds of constraints to form various sets of variants into different I'd like to the various techniques that a receiving system would have to engineer in order to discern what type of categorical variant they are receiving. If they are all explicitly typed as "CategoricalVariation" then I believe there may be less value in sharing it other than to generalize what is being received. In VRS we wrestled with the idea of I suppose having a generalized schema is fine, but the specificity will come in terms of putting requirements on the constraints themselves. We are essentially pushing the issue of typing into the constraint nodes specification itself. I think this is a fine idea, but one that needs to be done as part of the definition of the various categorical variations that we want to define. Optionality in a specification is great as long as it does not redefine a new concept. Getting the set of concepts that are comparable and contrastable such that they can be reasonably implemented by a system is the key. I understand that it is feasible to create a generalized model and then claim that by applying the constraints of two differently defined sets of constraints you can still find the overlapping members, but I'm not sure that is what we need to do. What we need is the ability to figure out what the members are in a pragmatic way so that we can discern one instance of a categorical variant from another for the same set of constraints (aka the categorical variant subtype). Then we can assign identifiers for these subtypes where the subtype itself acts as a kind of namespace. I believe this is a more sensible approach than thinking about all categorical variation as one thing.
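To make the "subtype as namespace" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `catvar_identifier`, the field names, and the use of a truncated SHA-256 digest are hypothetical stand-ins, not the identifier scheme of VRS or Cat-VRS.

```python
import hashlib
import json


def catvar_identifier(subtype: str, constraints: dict) -> str:
    """Build an illustrative identifier of the form '<subtype>:<digest>'.

    The subtype acts as the namespace; the digest is computed only from the
    populated (non-null) constraints, serialized with sorted keys so that
    field order does not change the result. Hypothetical scheme.
    """
    populated = {k: v for k, v in constraints.items() if v is not None}
    canonical = json.dumps(populated, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{subtype}:{digest}"


# Two instances with the same constraints get the same identifier within a
# subtype, while the same constraints under another subtype do not collide.
a = catvar_identifier("CanonicalAllele", {"definingContext": "allele-1", "copyChange": None})
b = catvar_identifier("CanonicalAllele", {"copyChange": None, "definingContext": "allele-1"})
c = catvar_identifier("CategoricalCnv", {"definingContext": "allele-1"})
```

Under this sketch, a receiving system could route on the prefix before the colon and compare instances within a subtype by digest alone.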
-
One other observation. Maybe we need to simply define a "VariationConstraint" class that can be subclassed into the various forms, like CanonicalAllele (genomic defining context) or CanonicalLocation (genomic defining location) or ProteinExpression, etc. Then a system can be engineered to reference the Constraint type to know when it is dealing with one type of categorical variation versus another? Just a thought. I'm not sure this makes things more complex vs just putting subtypes on the CategoricalVariation itself (if we presume all different constraint types will share the same exact set of optional attributes (members, expressions, etc.)).
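A rough Python sketch of that idea, purely for discussion. The class and attribute names below (`CanonicalAlleleConstraint`, `defining_context`, and so on) are hypothetical and not taken from any published schema; the point is only that a receiver can discover the kind of categorical variation from the constraint types rather than from a subtype on the CategoricalVariation itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class VariationConstraint:
    """Hypothetical abstract base for all constraint types."""


@dataclass(frozen=True)
class CanonicalAlleleConstraint(VariationConstraint):
    defining_context: str  # placeholder for a genomic defining context


@dataclass(frozen=True)
class CanonicalLocationConstraint(VariationConstraint):
    defining_location: str  # placeholder for a genomic defining location


@dataclass
class CategoricalVariation:
    """A single generic class; the shared optional attributes live here."""
    label: str
    constraints: List[VariationConstraint] = field(default_factory=list)
    members: List[str] = field(default_factory=list)


def constraint_types(catvar: CategoricalVariation) -> List[str]:
    """Return the sorted names of the constraint types present on a CatVar."""
    return sorted({type(c).__name__ for c in catvar.constraints})


example = CategoricalVariation(
    label="illustrative variant",
    constraints=[CanonicalAlleleConstraint(defining_context="allele-placeholder")],
)
```

A receiving system would then branch on `constraint_types(example)` (or use `isinstance` checks) to decide how to process what it was sent.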
-
I have some time over the next week to work on some examples for the constraint model, as we discussed during last week's call. Can one of you point me to the current specifications for canonicalAllele and categoricalCNV? For sequence variants, the VRS documentation is robust plus there are examples in both the va-spec and gk-pilot repositories. @DanielPuthawala if this would be easier as a call, we can definitely do that too 🙈 !
-
Here are two more mock-ups that are a bit more exploratory.

This first demo is for a protein sequence variant, specifically for a variation in amino acid sequence, rather than being predicted from a nucleotide sequence variant. This example is adapted from variant

This second mockup is for a functional variant, here for "BRCA1 Loss-of-function", adapted from this record in CIViC. This mockup includes an additional
-
I've made a minor tweak to the Protein Sequence Consequence variant demo and some more major revisions to the Function Variant demo.

The PSC demo was just tweaked to include relations that make more sense for a variant label that denotes an amino acid sequence change. So rather than something like

The Function Variant demo has been tweaked more extensively. First, given that

Second, the

Does this seem reasonable? Is there a better typology of protein function? Is there a better-fitting ontology we should reference than SO? (Or should I talk to SO about getting codes added for

Third, since I have removed the

Hope to hear your thoughts. I'd like to discuss this at our meeting on 7/3/24.
-
As an FYI to myself (since I am still not entirely comfortable with these terms) and folks looking at these models, Daniel (I assume?) has drafted the information model within the readthedocs site for this project. Larry added that this schema, as used in the Genomic Knowledge pilot, builds off of the VRSATILE information model (VRSATILE catvars json, thank you, Larry!!).

Current description of the information model in the CatVar Read the Docs

Do people feel that these are fair descriptions of each field?
One question that I have is:
p.s. I can move this into a separate discussion, but I thought that it would be appropriate here to help interpretation with the screenshots of the constraint-based model.
-
Hey all, here's a long overdue update to the categoricalCNV demo, updated after conversation with Beth Pitel. Change notes:
-
The constraint model has been implemented in #50.
-
The general approach to schemata in VRS, and by extension, the aspects that Cat-VRS inherited through VRSatile, has been to model different types of variants as discrete top-level data classes. Hence, having one data class for canonical variants, another for protein consequence variants, a third for CNVs, etc. There are, I think, two potential issues with this approach in the categorical variation space: lots of significant (for clinical analysis and research) CatVars that fall into the cracks, and the need to compute entailments.
(The below rhetorical example is based on the CatCNV spec as per the 1.0.0.connect.2024-04.1 snapshot release)
CatVars in the cracks: If we consider just CNVs and sequence variants for a moment, we can create, and have created, two different classes for these. Sequence variants are in reference to a canonical allele and some alternate sequence, while CNVs may also refer to some reference sequence and require some copy number or copy change value. Other than the fact that they are both in reference to a standard genome build, there is little overlap in terms of the data types used in these two classes.
Now consider expression/function variants. These have a lot in common with sequence variants, and often contain information on some sort of sequence variation, such as a deletion, gene loss, stop gain, etc. But they also require a quantitative data field that has much in common with that from CNVs, though in this case it is typically something related to levels of gene product. We could make a third, discrete, and independent data class of variants that contains features from both sequence variants and CNVs, but I don’t like this for a couple of reasons. First, it’s not DRY. Second, if a CatVar entry is made, say, as a sequence variant, and then it turns out that there is a functional consequence of that sequence change, this would require re-configuring the CatVar as a function variant, not just adding the new functional information, which doesn’t seem efficient. And third, it doesn’t solve the problem. If we make that third class, there will still be variants falling into the cracks between them.
Entailments: Several of our use cases so far involve matching between categorical variants, and a subset of that involves determining if a given CatVar partially intersects or is a subset of another CatVar. If we have discrete, disparate variant classes, then we can explicitly code that CatVars in one class are subsets of those in another class, but making any other (e.g. orthogonal) comparisons becomes nontrivial given the apples-to-oranges nature of the data representations.
An alternative is to make a constraint-based model of categorical variation. In this model, instead of the top-level objects being classes of CatVars, the top-level objects are constraints on variation, with many of these constraints being optional. The top-level CatVar classes in the previous model instead correspond to profiles, which specify certain constraints that must be satisfied in order to belong in that class. Under this model, all CatVars are stored in a similar format, just with varying degrees of optional constraints having non-null values.
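As a minimal sketch of what this could look like in Python: every CatVar carries the same set of constraint fields, most of them null, and a profile is just the list of fields that must be non-null. The field names and the two profiles below are hypothetical placeholders chosen for illustration (and the CatCNV profile is simplified to require a copy change only), not the actual Cat-VRS constraint set.

```python
from typing import Any, Dict, List, Optional

# Hypothetical constraint fields shared by every CatVar.
CONSTRAINT_FIELDS = (
    "definingContext",
    "alternateSequence",
    "location",
    "copyChange",
)

# Hypothetical profiles: the constraints that must be non-null to belong.
PROFILES = {
    "SequenceVariant": {"definingContext", "alternateSequence"},
    "CatCNV": {"location", "copyChange"},
}


def make_catvar(**constraints: Any) -> Dict[str, Optional[Any]]:
    """Build a CatVar with every constraint field present; unset ones are None."""
    unknown = set(constraints) - set(CONSTRAINT_FIELDS)
    if unknown:
        raise ValueError(f"unknown constraint fields: {sorted(unknown)}")
    return {name: constraints.get(name) for name in CONSTRAINT_FIELDS}


def matching_profiles(catvar: Dict[str, Optional[Any]]) -> List[str]:
    """Return the names of all profiles whose required constraints are populated."""
    populated = {name for name, value in catvar.items() if value is not None}
    return sorted(name for name, required in PROFILES.items() if required <= populated)


# A function-style variant: satisfies the sequence variant profile while also
# carrying a CNV-like relative change that no single class would hold.
function_variant = make_catvar(
    definingContext="allele-placeholder",
    alternateSequence="T",
    copyChange="loss",
)
```

Here `matching_profiles(function_variant)` reports only the sequence variant profile, yet the extra `copyChange` value is still stored; a CatVar that matches no profile at all is represented in exactly the same shape.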
This could solve the gaps problem, because even if a variant doesn’t fit into any given profile, it can still be represented as thoroughly or precisely as any other variant, as long as the spec has been built out enough to include the relevant constraint fields.
This can also solve the entailment/match issue, because computing the relationship between two given CatVars becomes an exercise in simply comparing their constraints on an apples-to-apples basis, since they all have the same underlying set of possible constraint fields. I’ll sketch out a visual aid of these two approaches and add it below.
So, in summary, the proposal is to shift from modelling variant classes to modelling CatVar constraint classes, and under the new model, conventional CatVar classes would correspond to profiles: lists of constraints with particular values indicative of CatVars belonging to that class.
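For illustration, here is a small Python sketch of such a field-by-field comparison. It assumes a deliberately simplified semantics that is not part of the proposal: each constraint is either null (unconstrained) or a single value compared by equality. Real constraints such as locations or copy number ranges would need richer comparison logic, and the function name `relation` and the field names are hypothetical.

```python
from typing import Any, Dict, Optional

CatVar = Dict[str, Optional[Any]]


def relation(a: CatVar, b: CatVar) -> str:
    """Classify how the variant set described by `a` relates to that of `b`.

    A null constraint means "no restriction", so adding a non-null constraint
    can only narrow the set of matching variants.
    """
    fields = set(a) | set(b)

    # Two different non-null values on the same field can never both hold.
    for name in fields:
        va, vb = a.get(name), b.get(name)
        if va is not None and vb is not None and va != vb:
            return "disjoint"

    # `a` satisfies everything `b` demands, and vice versa.
    a_meets_b = all(b.get(n) is None or a.get(n) == b.get(n) for n in fields)
    b_meets_a = all(a.get(n) is None or b.get(n) == a.get(n) for n in fields)

    if a_meets_b and b_meets_a:
        return "equivalent"
    if a_meets_b:
        return "subset"       # a is at least as constrained as b
    if b_meets_a:
        return "superset"
    return "overlapping"      # no conflict; each constrains a different field


broad = {"definingContext": "allele-placeholder", "alternateSequence": None, "copyChange": None}
narrow = {"definingContext": "allele-placeholder", "alternateSequence": "T", "copyChange": None}
functional = {"definingContext": None, "alternateSequence": None, "copyChange": "loss"}
```

Because every CatVar exposes the same fields, the same function handles a sequence-style and a function-style CatVar without any class-specific code; "overlapping" here only means the constraints do not conflict, not that shared members are known to exist.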