Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

ximinez
Copy link
Collaborator

@ximinez ximinez commented Feb 4, 2025

High Level Overview of Change

Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

This rewrites the code so that the lock is not held during the callback. Instead it locks twice, once before, and once after. This is safe due to the structure of the code, but is checked after the second lock. This allows mutex_ to be changed back to a regular mutex.

Context of Change

From #4989:

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This is a problematic design that needs to be used very carefully. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, causing UB (a deadlock was observed). The caller was from SHAMapStoreImpl, and it called clearCaches. This clearCaches can potentially call fetchNodeObject, which tried to relock the mutex.

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex.

Type of Change

  • Refactor (non-breaking change that only restructures code)

Test Plan

Testing can be the same as that for #4989, plus ensure that there are no regressions.

- Follow-up to #4989, which stated "Ideally, the code should be
  rewritten so it doesn't hold the mutex during the callback and the
  mutex should be changed back to a regular mutex."
Copy link

codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.1%. Comparing base (02387fd) to head (9f564bc).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #5276   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          790     790           
  Lines        67607   67613    +6     
  Branches      8164    8163    -1     
=======================================
+ Hits         52828   52836    +8     
+ Misses       14779   14777    -2     
Files with missing lines Coverage Δ
src/xrpld/nodestore/DatabaseRotating.h 100.0% <ø> (ø)
src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp 62.9% <100.0%> (+2.2%) ⬆️
src/xrpld/nodestore/detail/DatabaseRotatingImp.h 66.7% <ø> (ø)

... and 1 file with indirect coverage changes

Impacted file tree graph

* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
@ximinez ximinez requested a review from Bronek February 6, 2025 00:01
Comment on lines 83 to 85
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this sounds like a typical single-write and one-or-more-read scenario, is it possible to use a single shared_mutex here instead of these two mutexes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out unsupported, but I rewrote the locking, and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.

But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.

Copy link
Collaborator

@vvysokikh1 vvysokikh1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your solution is not completely solving the issue. It's still technically possible to deadlock (calling rotateWithLock from inside of the callback, this will cause a deadlock on your new mutex).

If it's good enough for now, please leave some comments to rotateWithLock() to warn any user of calling rotateWithLock() directly or indirectly from callback.

* upstream/develop:
  Updates Conan dependencies (5256)
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use
  a shared_lock, and write a unit test to show (as much as possible)
  that it won't deadlock.
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants