Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement splitting of simplified networks #278

Open
wants to merge 18 commits into
base: dev
Choose a base branch
from

Conversation

maxloeffler
Copy link
Contributor

@maxloeffler maxloeffler commented Jan 29, 2025

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • I have checked whether I need to adjust the showcase file showcase.R with respect to my changes.
  • The pull request is opened against the branch dev.

Description

Changed:

  • Allow splitting of simplified networks (which was previously considered illegal)

Changelog

@maxloeffler maxloeffler marked this pull request as draft January 29, 2025 19:39
@maxloeffler
Copy link
Contributor Author

Hey, I implemented splitting of simplified networks / edges as discussed in our meeting(s). However, there is one case that we should discuss, and I figured that this is a good place to do so. I would be interested in all of your opinions so we can find the best solution.

How should we deal with numeric attributes on simplified edges?

Lets say we have the diff.size attribute set before simplifying two edges into one. The diff.size of the simplified edge will be the sum of each its partial edges. In our meeting we established, that there is no good way to separate such attributes again if the two partial edges are now supposed to end up in different bins when we split, so we just remove the attribute.

Here is the catch though, how do we remove the attribute exactly? If we split and have a bin that has some edges that did not have to be separated (i.e., because they were not simplified in the beginning or because all partial edges belong to the same bin) then these edges all have a valid value of the diff.size attribute. Now lets say we have to add an additional edge that is only partial. This partial cannot have a semantically valid diff.size. So now, do we remove the diff.size attribute from the entire bin, i.e., from all edges or do we keep it on all edges where it is valid and set it to NA where it isn't?

If we remove it entirely for the sake of not keeping a potentially really useless attribute around, do we also remove it from all other bins? Discussion with @hechtlC revealed that in practice it seems unlikely that even a single bin would only have complete (non-partial) edges, so when we go the step of removing it from all edges in a bin, its not so far fetched to say, that we remove it from all bins altogether.

And finally: What do we do with the weight? On first glance it is also a numeric attribute so it should be treated the same. However, we discussed ways to separate it in the last meeting. Having 2 edges simplified into one and the merged edge having a weight of 2 most likely means that both source edges have had weight 1. But this is only an assumption and could also lead to bullshit values as with other numeric attributes. So I personally think that we should treat it as other numeric attributes. However, @hechtlC raised concern that it may be critical to have a weight attribute on edges at any time.

Copy link

codecov bot commented Jan 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.89%. Comparing base (80b7a99) to head (50af8f4).

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #278      +/-   ##
==========================================
+ Coverage   81.69%   81.89%   +0.20%     
==========================================
  Files          16       16              
  Lines        5146     5226      +80     
==========================================
+ Hits         4204     4280      +76     
- Misses        942      946       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bockthom
Copy link
Collaborator

bockthom commented Feb 3, 2025

How should we deal with numeric attributes on simplified edges?

[...] So now, do we remove the diff.size attribute from the entire bin, i.e., from all edges or do we keep it on all edges where it is valid and set it to NA where it isn't?

Having partial information is as valuable as having no information at all. Thus, we should remove it from all edges. Consistency should take precedence over partial information. (Especially as we don't lose any information here, since the entire network still is present unless the user deliberately deletes the global network from memory.)

If we remove it entirely for the sake of not keeping a potentially really useless attribute around, do we also remove it from all other bins?

Yes, same reason as above: Consistency should take precedence over partial information. If we remove it from one of the networks that result from splitting, then I (as a user) would expect that all splits contain the same attributes, that is, that it is removed from all networks. I totally agree with @hechtlC here.

And finally: What do we do with the weight? On first glance it is also a numeric attribute so it should be treated the same. However, we discussed ways to separate it in the last meeting. Having 2 edges simplified into one and the merged edge having a weight of 2 most likely means that both source edges have had weight 1. But this is only an assumption and could also lead to bullshit values as with other numeric attributes.

Could you come up with an example where we would get bullshit values? I can't think of such an example right now. But if you can come up with any, we can discuss it.

So I personally think that we should treat it as other numeric attributes. However, @hechtlC raised concern that it may be critical to have a weight attribute on edges at any time.

According to our corresponding section in our README.md, weight is a mandatory edge attribute. Therefore, we cannot remove it if we stick to our own rules. Assuming that this attribute has been created according to our rules, it should be possible to determine the correct weight during splitting.

@maxloeffler
Copy link
Contributor Author

Okay, im fine with removing the attribute altogether 👍. Regarding the weight attribute: One could theoretically set the weight attribute arbitrarily. Let's say I set weight to 2 for edge A and to 0 for edge B. Now I simplify and we get edge AB with weight equal to 2. Splitting this network would result in two edges with weight equal to 1 which is not a true reflection of what the original edges were.

If weight has to be present at all times, we cannot remove it. That means that we have to figure out a strategy for separating in any case. We could for example say that edges always have a weight equal to the amount of partial edges that comprise them.

@bockthom
Copy link
Collaborator

@maxloeffler Would you like us to review this PR already, or do you plan to add additional commits or make changes to the existing commits first?
As soon as you would like us to review this PR, please mark this PR as "Ready for review" 😉
In addition, please don't forget to rebase onto dev

@maxloeffler maxloeffler marked this pull request as ready for review February 16, 2025 19:31
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this challenging PR @maxloeffler.
While I basically agree with all your implementation, I have a couple of questions here and there (mostly regarding comments, but also regarding improving code readability). See further details in the individual comments below.

In addition, I would like you double-check whether the function get.data.sources.from.relations still works in every conceivable situation, after the changes that we have made to the handling of the relation attribute.

## Track whether to remove numeric edge attributes after splitting.
## Numeric attributes cannot be correctly separated when they
## appear in multi-partial edges removing their semantic meaning
numeric.attrs = edge.attr.names[EDGE.ATTR.HANDLING[edge.attr.names] == "sum" & edge.attr.names != "weight"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I wonder whether there could also be other handling functions for numeric attributes than "sum".
Imagine that, in distant future, we would add a new numeric attribute that uses a different handling function. Nobody would think about updating this code here. Of course, we cannot anticipate any weird customized handling function one could come up with in the future, but we should make sure that this works, at least, for the default options provided by igraph. If we cannot come up with a general solution, it might be helpful to add a subtle comment to the definition of EDGE.ATTR.HANDLING that reminds one to check this condition when adding new weird stuff...

The attributes assigned to edges in simplified networks (potentially)
contain multiple values for a single edge. Previously, splitting such
networks was not supported as it is not obvious how to deal with such
edges.

Now, we support splitting simplified networks. An edge that comprises of
multiple source-edges before simplification should be partitioned again
and its partials distributed over the corresponding bins. Singular edges
can be handled as prior.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
When separating simplified edges during network splitting, it is not
possible to determine which portion of the a numeric edge attribute
belongs to which partial edge. Therefore, they lose their semantics in
this case and should be removed.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Keeping all values of the 'relation' edge attribute when simplifying
enables a proper separation of the attribute when splitting simplified
networks. This change is not in contradiction with se-sic#251 (in which the
previous simplification strategy was proposed).

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
As discussed in PR#278, removing numeric attributes from some splits
while keeping them for others is not benefitial. Hence, we remove them
from all splits as soon as one contains non-complete edges.

Furthermore, the 'weight' attribute is mandatory and can therefore not
be removed. Now, we assume 'weight' to always be equal to the amount of
partials that comprise an edge.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
When splifying a network without the 'simplify.multiple.relations'
options set, we must still consider that input networks may already
contain multi-relational edges (as shown in 'showcase.R' when the sample
network is simpified again).

In our recent meetings we concluded that edges with a 'mail' relation
can be simplified together with edges that have a list('mail', 'mail')
relation but not with edges that have a list('mail', 'cochange')
relation. Further, we must ensure that any multi-relational edges can
still be simplified together with multi-relational edges that have the
same relation mix.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
The previously used "concat" strategy does not work flawlessly with list
edge attributes. When simplifying already simplified edges, i.e., edges
that have list attributes longer than one, using the "concat" strategy,
the result is a top-level list containing sublists from the merged
attribute values. Instead, we want a single top-level list that contains
all attribute values from the edges that comprise it.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Comment on lines -406 to +412
date = I(as.list(get.date.from.string(c("2016-07-12 15:58:59 UTC", "2016-07-12 16:06:10 UTC",
"2016-07-12 16:00:45 UTC", "2016-07-12 16:05:41 UTC",
"2016-07-12 16:06:32 UTC", "2016-07-12 16:06:32 UTC",
"2013-05-05 21:46:30 UTC", "2013-05-05 21:49:21 UTC",
"2013-05-05 21:49:34 UTC", "2013-05-06 01:04:34 UTC",
"2013-05-25 03:48:41 UTC", "2013-05-25 04:08:07 UTC",
"2016-07-12 14:59:25 UTC", "2016-07-12 16:02:30 UTC",
"2016-07-12 16:06:01 UTC", "2016-07-15 19:55:39 UTC",
"2017-05-23 12:32:39 UTC", "2016-07-12 15:59:59 UTC",
"2016-07-15 20:07:47 UTC", "2016-07-27 20:12:08 UTC",
"2016-07-28 06:27:52 UTC", "2013-05-25 03:25:06 UTC",
"2013-05-25 06:06:53 UTC", "2013-05-25 06:22:23 UTC",
"2013-06-01 06:50:26 UTC", "2016-07-12 16:01:01 UTC",
"2016-07-12 16:02:02 UTC", "2013-04-21 23:52:09 UTC",
"2016-07-12 15:59:25 UTC", "2016-07-12 16:03:59 UTC")))),
artifact.type = I(c(as.list(rep("Feature", 6)), as.list(rep("IssueEvent", 24)))),
date = get.date.from.string(c("2016-07-12 15:58:59", "2016-07-12 16:06:10",
"2016-07-12 16:00:45", "2016-07-12 16:05:41",
"2016-07-12 16:06:32", "2016-07-12 16:06:32",
"2013-05-05 21:46:30", "2013-05-05 21:49:21",
"2013-05-05 21:49:34", "2013-05-06 01:04:34",
"2013-05-25 03:48:41", "2013-05-25 04:08:07",
"2016-07-12 14:59:25", "2016-07-12 16:02:30",
"2016-07-12 16:06:01", "2016-07-15 19:55:39",
"2017-05-23 12:32:39", "2016-07-12 15:59:59",
"2016-07-15 20:07:47", "2016-07-27 20:12:08",
"2016-07-28 06:27:52", "2013-05-25 03:25:06",
"2013-05-25 06:06:53", "2013-05-25 06:22:23",
"2013-06-01 06:50:26", "2016-07-12 16:01:01",
"2016-07-12 16:02:02", "2013-04-21 23:52:09",
"2016-07-12 15:59:25", "2016-07-12 16:03:59")),
artifact.type = c(rep("Feature", 6), rep("IssueEvent", 24)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea why the timezone was included before and why you did change that? I am just curious...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don‘t know why it was there. It was just a check for my curiosity to see whether it makes a difference (it doesn’t), and then I left it in the commit because why not. I can revert it though 😅

When splitting multi-partial edges, we only need to remove numeric
attributes (from all edges in all bins) when there is at least one edge
that has partials that belong to different bins. When for all edges all
partials belong to the same bin, we can keep the numeric attributes.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Nomenclature: An "edge attribute" refers to the values of an attribute
over all edges of a network, while an "edge attribute value" refers to
the value of an edge attribute of a single edge.

Previously we decided that edge attributes by default are of list type,
yet we did not specify of which type edge attribute values should be.
This is especially relevant when an edge attribute value consists of
multiple values, i.e., in simplified edges, and the interplay of
simplified edges and non-simplified edges when simplifying repeatedly.

By default, simplification with the "concat" strategy wraps the values
of all source edges into a list. Therefore, we decided that not only
edge attributes should be lists, but also edge attribute values.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8mloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
The 'date' edge attribute is of list type by default an additional check
is therefore not necessary.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
@maxloeffler
Copy link
Contributor Author

Regarding the last open points for discussion:

  1. Yes, there are already explicit tests for get.date.from.string that check different date formats and check if parsing to POSIXct works as expected. Therefore, it really does not matter if we keep the UTC in the test where I removed it from or if we add it back.

  2. I checked that the new option of passing dates to split.get.bins.time.based as a nested list does not lead to unforeseen consequences in calling contexts.

  3. I tried to find better ways of checking whether we have numeric attributes present on edges than to check if there are attributes that are simplified with the sum strategy:

  • We discussed potentially checking in the network configuration to see the expected types of edge attributes there, but turns out that we there specify all edge attributes as being character.
  • Next I checked if we can check if the actual edge attribute values are numeric using is.numeric. To make it short, yes we can, we would still need the case distinction for weight (and other hypothetical future attributes that are either mandatory or have custom simplification strategies that allow for separation) though.
  • Lastly, I also checked the default simplification strategies provided by igraph. There are 11 options that have been constant for at least 15 years and none of them are separable (other than concat). We can assume that no new options get added, yet I don't know how smart it is to distinguish numeric attributes solely by the simplification strategy if we want to be as precise as we can. We always add custom functions. Furthermore, a default simplification strategy of igraph is not limited to numerics even when it may seem so, e.g., you can apply mean to characters (the result will be NA or NULL in a list).
    In conclusion: yes we can do better than just check for "sum" but it would still requires explicit edge case handling in addition to still potentially requiring change when new attributes get introduced. Personally, I don't think it is likely that we will add new attributes that require a change in this implementation, and even so, the change would be minor. Therefore, I would just leave the current version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants