Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #11189 part 00 refactor citation relation tab logic #11845

Draft
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

alexandre-cremieux
Copy link
Contributor

@alexandre-cremieux alexandre-cremieux commented Sep 28, 2024

High level description

This contributions aims to provide a local storage solution for citations relations using MVStore. Please see #11189 for more information.

Implementation details

No new dependency added.

Caching strategy

The implemented caching logic is following a 2 levels side-caching strategy with:

  • an in memory caching level powered by an LRU MAP
  • a local storage powered by a an h2.mvstore.MVStore

The interaction between the two levels of cache is backed by a responsibility chain pattern. This helps to:

  • avoid searching on file if the citations/references are already available from RAM
  • avoid purely technical code to leak into the service layer that should only orchestrate fetching, storing, reading
  • ease future optimization for each of the levels of caching
  • should ease change of technologies for this use case in the future

Search caching high level logic

  1. Check if the fetch is possible: if the cache does not contain anything or if the local storage TTL ran out
  2. If the fetch is possible: fetch and insert into cache (in memory and locally)
  3. Read from cache (if the in-memory cache is empty then load from local storage and update in-memory cache)

Serialization

MVStoreDAO uses the canonical representation to serialize a BibEntry and the BibtexParser to deserialize it.

Configuration

  • the search service is a singleton injected in each new Citations Relations tab using the JabRef IoC provider
  • the store path is configured to target the citations folder under JabRef's app directory

Setting translation

The the local storage TTL setting message has been traduced in:

  • DE
  • FR
  • EN
  • PL
  • IT
  • BR

Refactoring

This contributions simplifies the citations/references fetching and caching logic by introducing two layers:

  • service
  • repository

This should help to make this feature more extendable without modifying orchestration logic following open/close principle.

Screen shots

Screenshot 2025-01-26 at 22 02 16 Screenshot 2025-01-26 at 22 03 02

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@alexandre-cremieux alexandre-cremieux marked this pull request as draft September 28, 2024 20:38
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
The tool reviewdog already placed comments on GitHub to indicate the places. See the tab "Files" in you PR.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

You can check review dog's comments at the tab "Files changed" of your pull request.

@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from 9fe8522 to cbe9e96 Compare September 28, 2024 21:15
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
The tool reviewdog already placed comments on GitHub to indicate the places. See the tab "Files" in you PR.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

You can check review dog's comments at the tab "Files changed" of your pull request.

@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from cbe9e96 to 8231340 Compare September 28, 2024 21:51
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from 8231340 to 33967c2 Compare September 28, 2024 22:20
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from 33967c2 to 592d4d7 Compare September 28, 2024 22:47
@koppor koppor changed the title Fix issue 11189 part 00 refactor citation relation tab logic Fix issue #11189 part 00 refactor citation relation tab logic Sep 29, 2024
@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch 2 times, most recently from d94f4d3 to 3155242 Compare September 29, 2024 13:24
Copy link
Contributor Author

@alexandre-cremieux alexandre-cremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add code explanations to the PR

@Siedlerchr
Copy link
Member

Please no force push if not needed. All commits will be squashed when merged

* Move repository, cache, and fetcher to logic package
* Move citations model to model/citations/semanticscholar package
* Introduce service layer
* Rename LRU cache implementation
* Add tests helpers for repository
* Move logic from repository to service
* Refactor repositories
* Update tab configuration
@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from 3155242 to 18db75e Compare September 29, 2024 15:01
@alexandre-cremieux
Copy link
Contributor Author

Please no force push if not needed. All commits will be squashed when merged

Sorry, just re-based main branch locally.

…lation-tab-logic

# Conflicts:
#	src/main/java/org/jabref/gui/entryeditor/citationrelationtab/CitationRelationsTab.java
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good. Some minor comments.

Sorry for delay. Please go ahead with everything.

@koppor koppor marked this pull request as ready for review October 10, 2024 20:29
@koppor
Copy link
Member

koppor commented Oct 10, 2024

Small other comments - IntelliJ proposed to extract a method

private static SearchCitationsRelationsService getSearchCitationsRelationsService(BibEntry cited, List<BibEntry> citationsToReturn, BibEntryRelationsRepository citationsToReturn1) {

in the tests - maybe you can also include that.

@koppor
Copy link
Member

koppor commented Oct 10, 2024

@alexandre-cremieux Please pull before you continue working on it - I merged main for you (and resolved conflicts).

@alexandre-cremieux
Copy link
Contributor Author

@alexandre-cremieux Please pull before you continue working on it - I merged main for you (and resolved conflicts).

Thanks for the review and the merge. I will resume the work on this branch and apply the changes.

@koppor
Copy link
Member

koppor commented Nov 8, 2024

@alexandre-cremieux Sorry for the merge conflicts - can you handle them? I was always happy with IntelliJ's "resolve merge conflicts" dialog. Hope, it works in this case, too.

@alexandre-cremieux
Copy link
Contributor Author

alexandre-cremieux commented Nov 8, 2024

@alexandre-cremieux Sorry for the merge conflicts - can you handle them? I was always happy with IntelliJ's "resolve merge conflicts" dialog. Hope, it works in this case, too.

Hello @koppor . Seems that we have new conflicts to resolve to be able to merge main. But I will do that when the feature will be fully developed. Was quite busy last month, I resumed the work this week. PR comments were addressed.

/**
* TODO: Make the method return a callable and let the calling method create the background task.
*/
private BackgroundTask<List<BibEntry>> createBackGroundTask(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: Background (without uppercase G)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The highlighted TODO: Which benefts are this - and why should it implemetned lateR?

Copy link
Contributor Author

@alexandre-cremieux alexandre-cremieux Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koppor

This TODO follows a discussion we had earlier that you can read here: #11845 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what stops you from returning a runnable?

If it's used in many places and that makes you rewrite a lot of code to include BackgroundTask constructor, then maybe returning a BackgroundTask right away was not a bad code decision 😃

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in theory, this could be needed to keep the level of abstration. Should be decided on where the method is actually used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Callable is expected by the BackgroundTask. Also, Callable has less overhead than a BackgroundTask, it is more convenient to use this type in order to:

  • Ensure that the same task is not already being executed (for same tab for instance) => why kill a task that is already running if we want to minimize the application's footprint ?
  • Insert it into a waiting queue if we want to provide a proper task management logic for this tab (see code of this class => we have one executor per tab instance but the tasks are shared between all those instances...)
  • Decouple the callable implementation from the task itself

As mentioned, this improvement could indeed imply several code changes (in this class):

  • refresh the screen properly => is the user focus on the tab for which a task was ended ?
  • set up a queue would need to override equals and hashCode and so provide a proper implementation of a callable
  • proper testing, etc.

At least, I think we agree that this should not be part of this PR. As I saw that the current implementation could be improved (stopping a task that maybe shouldn't be stopped, avoid same task but different executors for each instances of the tab, etc). I proposed to take a look to this after merging this PR. Reason why the TODO is there (after Oliver's accepted that at that time).

But I can remove completely the TODO also if it bother you (and I understand that cause I do not like TODO also). However, I already have another to issue opened for which I volunteered related to same tab...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just add a link your reasoning in your comment in the to-do?

Comment on lines 25 to 27
public SearchCitationsRelationsService(
CitationFetcher citationFetcher, BibEntryRelationsRepository repository
) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codestyle seems off

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check style test ran and is green (as reported bellow).

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was late here, and my energy was drawn by checking the CHANGELOG.md. Sorry for that. I will try to review properly the next days.

CHANGELOG.md Outdated
@@ -11,11 +11,14 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

### Added

- We introduced a settings parameters to manage citations' relations local storage time-to-live. [#11189](https://github.com/JabRef/jabref/issues/11189)
- We added a 'Copy to' context menu option with features for cross-reference inclusion/exclusion, as well as the ability to save user preferences. [#12374](https://github.com/JabRef/jabref/pull/12374)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really in this PR? - Think, this is from a merge conflict and needs to be removed.

(The next line is the feature)

The line at hand is from https://github.com/JabRef/jabref/pull/12374/files#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4ed - but really seems to be reworded to the next one.

"Proof": Same link.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be an update merge artifact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thank. It is indeed a merge artifact.

CHANGELOG.md Outdated
@@ -11,11 +11,14 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

### Added

- We introduced a settings parameters to manage citations' relations local storage time-to-live. [#11189](https://github.com/JabRef/jabref/issues/11189)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the default value? (in case users read the CHANGELOG, it should more-or-less be self contained)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the default value in the description.

CHANGELOG.md Outdated
@@ -25,6 +28,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
- We fixed an issue where a bib file with UFF-8 charset was wrongly loaded with a different charset [forum#5369](https://discourse.jabref.org/t/jabref-5-15-opens-bib-files-with-shift-jis-encoding-instead-of-utf-8/5369/)
- We fixed an issue where new entries were inserted in the middle of the table instead of at the end. [#12371](https://github.com/JabRef/jabref/pull/12371)
- We fixed an issue where removing the sort from the table did not restore the original order. [#12371](https://github.com/JabRef/jabref/pull/12371)
- We fixed the citations relations merge logic. [#11189](https://github.com/JabRef/jabref/issues/11189)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what that means; it is not in the issue, is it?

Either remove that (because we typically have one changelog entry per PR) - or link this PR directly.

If it is kept, it should be more user facing. I think, users cold not merge citations relations, could they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous LRU caching implementation's merging logic was in fact overwriting the cache instead of `cachingAndMerging'.

But, you are right: this is an implementation detail over which the user has no control. I will remove this line.

Here is the original code (that is now updated and covered by a test): #11845 (comment)

@@ -175,6 +176,9 @@ public void initialize() {
dialogService,
taskExecutor);
Injector.setModelOrService(AiService.class, aiService);

var searchRelationsService = new SearchCitationsRelationsService(preferences.getImporterPreferences());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name does not match the class name - i thought, this would be auto generated

(but no need to fix if everyhting else is OK)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I edited the CHANGE_LOG, I am fixing this small flavor of my own ;)


import org.jabref.model.entry.BibEntry;

public class BibEntryRelationsRepositoryHelpersForTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small JavaDoc could help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@alexandre-cremieux alexandre-cremieux force-pushed the fix-issue-11189-part-00-refactor-citation-relation-tab-logic branch from cbe5acc to eb3dfd8 Compare February 7, 2025 08:28
@koppor koppor marked this pull request as draft February 9, 2025 20:53
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to dive deeper. I rewrote code as soon as I found a "mistake".

The most huge mistakes are:

  • MVStore is opened on each read --> it has to be started upon open of the library tab
  • A caching layer before MVStore --> this is not necesasry. MVSTore can cache. See https://www.h2database.com/html/mvstore.html --> cacheSize: the cache size in MB.
  • The "ubitios langauge" is not used. At org.jabref.logic.importer.fetcher.CitationFetcher.SearchType, we "agreed" to use CITES and CITED_BY, but the code variables use reference and citation, which are confusing at some places.
  • The logic currently only supports DOI-based entries to be handled. There is no handling of entries without DOI.

The last point can be left to future work (as it involves storing citation information in the BibEntry itself and merging this)


Fix required:

  • One global MV store loaded on start of JabRef
  • the names tables in the mvstore are the path of .bib suffixed with -cites and -citing
  • Store closed on shutdown of JabRef

@koppor
Copy link
Member

koppor commented Feb 9, 2025

I also added JavaDoc comments. This is a good practicse in all cases.

@koppor
Copy link
Member

koppor commented Feb 10, 2025

I am thinking, if a global store of citation relations would be suffice somehow? The code assumes that citations are bound to a DOI (*). Maybe, we should store the citations of one DOI centrally and use that information in each library.

IMHO, the Postgres backend can be used to search for the BibEntry belonging to a DOI (in a library).

(*) This is a fine assumption as we do not incoroporate other services.

@alexandre-cremieux
Copy link
Contributor Author

Hello @koppor,

Many thanks for the review. However, I am a bit confused. Let me try to explain my choices based on your comments:

  • MVStore is opened on each read --> it has to be started upon open of the library tab
  • A caching layer before MVStore --> this is not necesasry. MVSTore can cache. See https://www.h2database.com/html/mvstore.html --> cacheSize: the cache size in MB.

Yes indeed, this is to prevent any error in case of interrupt that would lead to data loss and possibly to file corruption (see the documentation of the MVStore and reported error). By the way, it is impossible to handle gracefully a SIGKILL, so closing the file ASAP seems to be the safer way to avoid such kind of error without having to implement all the bells and whistle that come with a database engine.

The first caching layer is here to avoid accessing the disk when citations are already fetched from the store. This avoid losing memory space cause the MVStore cache is fixed. Let's say that we allocate enough RAM to the MVStore to store 10 average entries. If the cache is full and you need to access a new entry that is not cached but does have a size that corresponds to 1 and a half average entry. Then you will remove 2 average entries from the cache to add the new one. Then, instead of having 10 entries in the cache, you will have only 9 (it can be worst than that). From my point of view, interest of the JVM is that it handles and optimise this for you. By the way, the MVStore is using a LRU caching algorithm (LRIS), so there won't be much changes compared to the first layer of cache. By the way, in a database, the entries are often of a fixed size. We are not in this situation here.

Another thing with the MVStore is that it needs serialization/deserialization, so you will loose time with those operations even if it is cached. This is not to say that the MVStore cache is implemented for H2 database needs (reentrant locks, memory paging, thread pools in the background, etc.). Do we really need this level of complexity for this use case ?

All this reasons made me keep the LRU implementation as a first level cache, cause it is a simple solution.

  • The "ubitios langauge" is not used. At org.jabref.logic.importer.fetcher.CitationFetcher.SearchType, we "agreed" to use CITES and CITED_BY, but the code variables use reference and citation, which are confusing at some places.

This is true. However, the reference term was already there (see LRU cache implementation already on main). I just left it. I should have correct it.

By the way, talking about ubiquitous language, it should be in a bounded context. Clearly here we are not respecting the boundaries cause we are mixing citations searching and bibEntries management which are two different domains. We had a long conversation in this PR about it where I proposed to refactor completely the citations in order to clearly separate the contexts by providing an adapted model. You preferred to go with full BibEntries and I already removed the dedicated code for serializing only what was needed.

I am thinking, if a global store of citation relations would be suffice somehow? The code assumes that citations are bound to a DOI (*). Maybe, we should store the citations of one DOI centrally and use that information in each library.

IMHO, the Postgres backend can be used to search for the BibEntry belonging to a DOI (in a library).

(*) This is a fine assumption as we do not incoroporate other services.

Okay, does this means that you are not sure about the technology to use here ?

The most huge mistakes are:

Huge is a big word, so this means that you see other small mistakes. I think it is more related to the fact that you don't like this design. This is fine, it can happen.

But I did this on my free time to help the project. I will not get anything from it as a reward (I do not do that for this). But before I do the changes you asked, I want to be sure that you are not going to push me back. Could you please tell me exactly what you want and if this design is okay for you before I add more code ?

@koppor
Copy link
Member

koppor commented Feb 11, 2025

Another thing with the MVStore is that it needs serialization/deserialization, so you will loose time with those operations even if it is cached. This is not to say that the MVStore cache is implemented for H2 database needs (reentrant locks, memory paging, thread pools in the background, etc.).

Did you do measurements regarding performance? What is the real impact on the users side? Starting from which count of cites are we touching MV Stores limits so that we need to add a custom cache layer in front?

I have worked with terrabytes of data. I don't see that we touch this limit here.

As far as I understood, the citation information needs to be persisted.

@koppor
Copy link
Member

koppor commented Feb 11, 2025

But I did this on my free time to help the project. I will not get anything from it as a reward (I do not do that for this).

Its a free time project for all of us... The reward we get is food and sometimes the money for thr train tickets to our yearly convention. Where we work in our free time in the tool. 😅

Happy to see that you are also joining this club somehow.

But before I do the changes you asked, I want to be sure that you are not going to push me back. Could you please tell me exactly what you want and if this design is okay for you before I add more code ?

Sometimes one sees the implication of a decision when seeing the real code.

Maybe I just experiment myself, since I need that to really describe what I think is best 😅. Seeing something which causes to think: "That can be done better (whatever metrics for that is applied)" is simpler than describing the specification, the software design, and skeletons of methods...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants