Implement splitting of simplified networks #278

maxloeffler · 2025-01-29T19:37:41Z

Prerequisites

I adhere to the coding conventions (described here) in my code.
I have updated the copyright headers of the files I have modified.
I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
I have put signed-off tags in all commits.
I have updated the changelog file NEWS.md appropriately.
I have checked whether I need to adjust the showcase file showcase.R with respect to my changes.
The pull request is opened against the branch dev.

Description

Changed:

Allow splitting of simplified networks (which was previously considered illegal)

Changelog

maxloeffler · 2025-01-29T19:57:13Z

Hey, I implemented splitting of simplified networks / edges as discussed in our meeting(s). However, there is one case that we should discuss, and I figured that this is a good place to do so. I would be interested in all of your opinions so we can find the best solution.

How should we deal with numeric attributes on simplified edges?

Lets say we have the diff.size attribute set before simplifying two edges into one. The diff.size of the simplified edge will be the sum of each its partial edges. In our meeting we established, that there is no good way to separate such attributes again if the two partial edges are now supposed to end up in different bins when we split, so we just remove the attribute.

Here is the catch though, how do we remove the attribute exactly? If we split and have a bin that has some edges that did not have to be separated (i.e., because they were not simplified in the beginning or because all partial edges belong to the same bin) then these edges all have a valid value of the diff.size attribute. Now lets say we have to add an additional edge that is only partial. This partial cannot have a semantically valid diff.size. So now, do we remove the diff.size attribute from the entire bin, i.e., from all edges or do we keep it on all edges where it is valid and set it to NA where it isn't?

If we remove it entirely for the sake of not keeping a potentially really useless attribute around, do we also remove it from all other bins? Discussion with @hechtlC revealed that in practice it seems unlikely that even a single bin would only have complete (non-partial) edges, so when we go the step of removing it from all edges in a bin, its not so far fetched to say, that we remove it from all bins altogether.

And finally: What do we do with the weight? On first glance it is also a numeric attribute so it should be treated the same. However, we discussed ways to separate it in the last meeting. Having 2 edges simplified into one and the merged edge having a weight of 2 most likely means that both source edges have had weight 1. But this is only an assumption and could also lead to bullshit values as with other numeric attributes. So I personally think that we should treat it as other numeric attributes. However, @hechtlC raised concern that it may be critical to have a weight attribute on edges at any time.

codecov · 2025-01-29T20:01:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.89%. Comparing base (80b7a99) to head (50af8f4).

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #278      +/-   ##
==========================================
+ Coverage   81.69%   81.89%   +0.20%     
==========================================
  Files          16       16              
  Lines        5146     5226      +80     
==========================================
+ Hits         4204     4280      +76     
- Misses        942      946       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

bockthom · 2025-02-03T01:10:11Z

How should we deal with numeric attributes on simplified edges?

[...] So now, do we remove the diff.size attribute from the entire bin, i.e., from all edges or do we keep it on all edges where it is valid and set it to NA where it isn't?

Having partial information is as valuable as having no information at all. Thus, we should remove it from all edges. Consistency should take precedence over partial information. (Especially as we don't lose any information here, since the entire network still is present unless the user deliberately deletes the global network from memory.)

If we remove it entirely for the sake of not keeping a potentially really useless attribute around, do we also remove it from all other bins?

Yes, same reason as above: Consistency should take precedence over partial information. If we remove it from one of the networks that result from splitting, then I (as a user) would expect that all splits contain the same attributes, that is, that it is removed from all networks. I totally agree with @hechtlC here.

And finally: What do we do with the weight? On first glance it is also a numeric attribute so it should be treated the same. However, we discussed ways to separate it in the last meeting. Having 2 edges simplified into one and the merged edge having a weight of 2 most likely means that both source edges have had weight 1. But this is only an assumption and could also lead to bullshit values as with other numeric attributes.

Could you come up with an example where we would get bullshit values? I can't think of such an example right now. But if you can come up with any, we can discuss it.

So I personally think that we should treat it as other numeric attributes. However, @hechtlC raised concern that it may be critical to have a weight attribute on edges at any time.

According to our corresponding section in our README.md, weight is a mandatory edge attribute. Therefore, we cannot remove it if we stick to our own rules. Assuming that this attribute has been created according to our rules, it should be possible to determine the correct weight during splitting.

maxloeffler · 2025-02-03T11:24:19Z

Okay, im fine with removing the attribute altogether 👍. Regarding the weight attribute: One could theoretically set the weight attribute arbitrarily. Let's say I set weight to 2 for edge A and to 0 for edge B. Now I simplify and we get edge AB with weight equal to 2. Splitting this network would result in two edges with weight equal to 1 which is not a true reflection of what the original edges were.

If weight has to be present at all times, we cannot remove it. That means that we have to figure out a strategy for separating in any case. We could for example say that edges always have a weight equal to the amount of partial edges that comprise them.

bockthom · 2025-02-14T22:30:26Z

@maxloeffler Would you like us to review this PR already, or do you plan to add additional commits or make changes to the existing commits first?
As soon as you would like us to review this PR, please mark this PR as "Ready for review" 😉
In addition, please don't forget to rebase onto dev

bockthom

Thanks for working on this challenging PR @maxloeffler.
While I basically agree with all your implementation, I have a couple of questions here and there (mostly regarding comments, but also regarding improving code readability). See further details in the individual comments below.

In addition, I would like you double-check whether the function get.data.sources.from.relations still works in every conceivable situation, after the changes that we have made to the handling of the relation attribute.

util-networks.R

util-split.R

bockthom · 2025-02-18T00:25:41Z

util-split.R

+    ## Track whether to remove numeric edge attributes after splitting.
+    ## Numeric attributes cannot be correctly separated when they
+    ## appear in multi-partial edges removing their semantic meaning
+    numeric.attrs = edge.attr.names[EDGE.ATTR.HANDLING[edge.attr.names] == "sum" & edge.attr.names != "weight"]


Hm, I wonder whether there could also be other handling functions for numeric attributes than "sum".
Imagine that, in distant future, we would add a new numeric attribute that uses a different handling function. Nobody would think about updating this code here. Of course, we cannot anticipate any weird customized handling function one could come up with in the future, but we should make sure that this works, at least, for the default options provided by igraph. If we cannot come up with a general solution, it might be helpful to add a subtle comment to the definition of EDGE.ATTR.HANDLING that reminds one to check this condition when adding new weird stuff...

util-split.R

tests/test-networks-author.R

tests/test-split-network-time-based.R

The attributes assigned to edges in simplified networks (potentially) contain multiple values for a single edge. Previously, splitting such networks was not supported as it is not obvious how to deal with such edges. Now, we support splitting simplified networks. An edge that comprises of multiple source-edges before simplification should be partitioned again and its partials distributed over the corresponding bins. Singular edges can be handled as prior. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

When separating simplified edges during network splitting, it is not possible to determine which portion of the a numeric edge attribute belongs to which partial edge. Therefore, they lose their semantics in this case and should be removed. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Keeping all values of the 'relation' edge attribute when simplifying enables a proper separation of the attribute when splitting simplified networks. This change is not in contradiction with se-sic#251 (in which the previous simplification strategy was proposed). Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

As discussed in PR#278, removing numeric attributes from some splits while keeping them for others is not benefitial. Hence, we remove them from all splits as soon as one contains non-complete edges. Furthermore, the 'weight' attribute is mandatory and can therefore not be removed. Now, we assume 'weight' to always be equal to the amount of partials that comprise an edge. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

When splifying a network without the 'simplify.multiple.relations' options set, we must still consider that input networks may already contain multi-relational edges (as shown in 'showcase.R' when the sample network is simpified again). In our recent meetings we concluded that edges with a 'mail' relation can be simplified together with edges that have a list('mail', 'mail') relation but not with edges that have a list('mail', 'cochange') relation. Further, we must ensure that any multi-relational edges can still be simplified together with multi-relational edges that have the same relation mix. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

The previously used "concat" strategy does not work flawlessly with list edge attributes. When simplifying already simplified edges, i.e., edges that have list attributes longer than one, using the "concat" strategy, the result is a top-level list containing sublists from the merged attribute values. Instead, we want a single top-level list that contains all attribute values from the edges that comprise it. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

util-split.R

bockthom · 2025-02-25T00:35:55Z

tests/test-networks-multi-relation.R

-        date = I(as.list(get.date.from.string(c("2016-07-12 15:58:59 UTC", "2016-07-12 16:06:10 UTC",
-                                                "2016-07-12 16:00:45 UTC", "2016-07-12 16:05:41 UTC",
-                                                "2016-07-12 16:06:32 UTC", "2016-07-12 16:06:32 UTC",
-                                                "2013-05-05 21:46:30 UTC", "2013-05-05 21:49:21 UTC",
-                                                "2013-05-05 21:49:34 UTC", "2013-05-06 01:04:34 UTC",
-                                                "2013-05-25 03:48:41 UTC", "2013-05-25 04:08:07 UTC",
-                                                "2016-07-12 14:59:25 UTC", "2016-07-12 16:02:30 UTC",
-                                                "2016-07-12 16:06:01 UTC", "2016-07-15 19:55:39 UTC",
-                                                "2017-05-23 12:32:39 UTC", "2016-07-12 15:59:59 UTC",
-                                                "2016-07-15 20:07:47 UTC", "2016-07-27 20:12:08 UTC",
-                                                "2016-07-28 06:27:52 UTC", "2013-05-25 03:25:06 UTC",
-                                                "2013-05-25 06:06:53 UTC", "2013-05-25 06:22:23 UTC",
-                                                "2013-06-01 06:50:26 UTC", "2016-07-12 16:01:01 UTC",
-                                                "2016-07-12 16:02:02 UTC", "2013-04-21 23:52:09 UTC",
-                                                "2016-07-12 15:59:25 UTC", "2016-07-12 16:03:59 UTC")))),
-        artifact.type = I(c(as.list(rep("Feature", 6)), as.list(rep("IssueEvent", 24)))),
+        date = get.date.from.string(c("2016-07-12 15:58:59", "2016-07-12 16:06:10",
+                                      "2016-07-12 16:00:45", "2016-07-12 16:05:41",
+                                      "2016-07-12 16:06:32", "2016-07-12 16:06:32",
+                                      "2013-05-05 21:46:30", "2013-05-05 21:49:21",
+                                      "2013-05-05 21:49:34", "2013-05-06 01:04:34",
+                                      "2013-05-25 03:48:41", "2013-05-25 04:08:07",
+                                      "2016-07-12 14:59:25", "2016-07-12 16:02:30",
+                                      "2016-07-12 16:06:01", "2016-07-15 19:55:39",
+                                      "2017-05-23 12:32:39", "2016-07-12 15:59:59",
+                                      "2016-07-15 20:07:47", "2016-07-27 20:12:08",
+                                      "2016-07-28 06:27:52", "2013-05-25 03:25:06",
+                                      "2013-05-25 06:06:53", "2013-05-25 06:22:23",
+                                      "2013-06-01 06:50:26", "2016-07-12 16:01:01",
+                                      "2016-07-12 16:02:02", "2013-04-21 23:52:09",
+                                      "2016-07-12 15:59:25", "2016-07-12 16:03:59")),
+        artifact.type = c(rep("Feature", 6), rep("IssueEvent", 24)),


Do you have an idea why the timezone was included before and why you did change that? I am just curious...

I don‘t know why it was there. It was just a check for my curiosity to see whether it makes a difference (it doesn’t), and then I left it in the commit because why not. I can revert it though 😅

tests/test-networks.R

When splitting multi-partial edges, we only need to remove numeric attributes (from all edges in all bins) when there is at least one edge that has partials that belong to different bins. When for all edges all partials belong to the same bin, we can keep the numeric attributes. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Nomenclature: An "edge attribute" refers to the values of an attribute over all edges of a network, while an "edge attribute value" refers to the value of an edge attribute of a single edge. Previously we decided that edge attributes by default are of list type, yet we did not specify of which type edge attribute values should be. This is especially relevant when an edge attribute value consists of multiple values, i.e., in simplified edges, and the interplay of simplified edges and non-simplified edges when simplifying repeatedly. By default, simplification with the "concat" strategy wraps the values of all source edges into a list. Therefore, we decided that not only edge attributes should be lists, but also edge attribute values. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8mloef@stud.uni-saarland.de>

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

The 'date' edge attribute is of list type by default an additional check is therefore not necessary. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler · 2025-03-03T16:13:10Z

Regarding the last open points for discussion:

Yes, there are already explicit tests for get.date.from.string that check different date formats and check if parsing to POSIXct works as expected. Therefore, it really does not matter if we keep the UTC in the test where I removed it from or if we add it back.
I checked that the new option of passing dates to split.get.bins.time.based as a nested list does not lead to unforeseen consequences in calling contexts.
I tried to find better ways of checking whether we have numeric attributes present on edges than to check if there are attributes that are simplified with the sum strategy:

We discussed potentially checking in the network configuration to see the expected types of edge attributes there, but turns out that we there specify all edge attributes as being character.
Next I checked if we can check if the actual edge attribute values are numeric using is.numeric. To make it short, yes we can, we would still need the case distinction for weight (and other hypothetical future attributes that are either mandatory or have custom simplification strategies that allow for separation) though.
Lastly, I also checked the default simplification strategies provided by igraph. There are 11 options that have been constant for at least 15 years and none of them are separable (other than concat). We can assume that no new options get added, yet I don't know how smart it is to distinguish numeric attributes solely by the simplification strategy if we want to be as precise as we can. We always add custom functions. Furthermore, a default simplification strategy of igraph is not limited to numerics even when it may seem so, e.g., you can apply mean to characters (the result will be NA or NULL in a list).
In conclusion: yes we can do better than just check for "sum" but it would still requires explicit edge case handling in addition to still potentially requiring change when new attributes get introduced. Personally, I don't think it is likely that we will add new attributes that require a change in this implementation, and even so, the change would be minor. Therefore, I would just leave the current version.

maxloeffler marked this pull request as draft January 29, 2025 19:39

maxloeffler force-pushed the dev branch from 6af26ad to 5d278ec Compare January 31, 2025 20:46

bockthom added enhancement help wanted labels Feb 2, 2025

bockthom added this to the v5.1 milestone Feb 3, 2025

maxloeffler marked this pull request as ready for review February 16, 2025 19:31

bockthom reviewed Feb 18, 2025

View reviewed changes

maxloeffler added 12 commits February 18, 2025 09:48

Optimize splitting simplified edges for performance

0ed437c

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Adjust tests to support splitting of simplified edges

98ef831

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Add tests for time-based splitting of simplified networks

7ec4d83

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Adjust tests to match new 'relation' attribute simplification strategy

2c70666

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Adjust tests to remove numeric attrs from all splits but 'weight' never

41788ff

Add test for simplification of complex multi-relational networks

36d23d6

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the dev branch from 5030881 to e2edb97 Compare February 18, 2025 16:04

Miscellaneous code consistency fixes

cf8f2fe

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the dev branch from e2edb97 to 22cb3bd Compare February 24, 2025 18:03

bockthom reviewed Feb 25, 2025

View reviewed changes

maxloeffler added 2 commits February 25, 2025 08:25

maxloeffler added 3 commits February 25, 2025 08:25

Adjust tests to work with edge attribute values of list type

416c817

Signed-off-by: Maximilian Löffler <s8mloef@stud.uni-saarland.de>

Improve documentation

754c21b

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

Remove 'is.list' check for dates retrieved from network edges

50af8f4

The 'date' edge attribute is of list type by default an additional check is therefore not necessary. Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>

maxloeffler force-pushed the dev branch from 22cb3bd to 50af8f4 Compare March 3, 2025 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement splitting of simplified networks #278

Implement splitting of simplified networks #278

maxloeffler commented Jan 29, 2025 •

edited

Loading

maxloeffler commented Jan 29, 2025

codecov bot commented Jan 29, 2025 •

edited

Loading

bockthom commented Feb 3, 2025

How should we deal with numeric attributes on simplified edges?

maxloeffler commented Feb 3, 2025

bockthom commented Feb 14, 2025

bockthom left a comment

bockthom Feb 18, 2025

bockthom Feb 25, 2025

maxloeffler Feb 25, 2025

maxloeffler commented Mar 3, 2025

Implement splitting of simplified networks #278

Are you sure you want to change the base?

Implement splitting of simplified networks #278

Conversation

maxloeffler commented Jan 29, 2025 • edited Loading

Prerequisites

Description

Changelog

maxloeffler commented Jan 29, 2025

How should we deal with numeric attributes on simplified edges?

codecov bot commented Jan 29, 2025 • edited Loading

Codecov Report

bockthom commented Feb 3, 2025

How should we deal with numeric attributes on simplified edges?

maxloeffler commented Feb 3, 2025

bockthom commented Feb 14, 2025

bockthom left a comment

Choose a reason for hiding this comment

bockthom Feb 18, 2025

Choose a reason for hiding this comment

bockthom Feb 25, 2025

Choose a reason for hiding this comment

maxloeffler Feb 25, 2025

Choose a reason for hiding this comment

maxloeffler commented Mar 3, 2025

maxloeffler commented Jan 29, 2025 •

edited

Loading

codecov bot commented Jan 29, 2025 •

edited

Loading