refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276

ximinez · 2025-02-04T21:47:35Z

High Level Overview of Change

Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

This rewrites the code so that the lock is not held during the callback. Instead, it locks twice: once before the callback and once after. The structure of the code makes this safe, and that assumption is verified after the second lock is taken. This allows mutex_ to be changed back to a regular mutex.
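For reviewers who want a concrete picture, here is a minimal sketch of the lock-twice pattern, not the actual DatabaseRotatingImp code. The names `RotatingSketch`, `Backend`, `rotate`, `writableName`, and `archiveName` are hypothetical, and the check after the second lock is an assumption about what "checked" could look like.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a node store backend (illustration only).
struct Backend
{
    std::string name;
};

class RotatingSketch
{
    mutable std::mutex mutex_;  // plain mutex: never held across the callback
    std::shared_ptr<Backend> writable_;
    std::shared_ptr<Backend> archive_;

public:
    RotatingSketch(std::string writable, std::string archive)
        : writable_(std::make_shared<Backend>(Backend{std::move(writable)}))
        , archive_(std::make_shared<Backend>(Backend{std::move(archive)}))
    {
    }

    std::string writableName() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return writable_->name;
    }

    std::string archiveName() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return archive_->name;
    }

    // Returns false if the writable backend changed while the callback ran.
    bool rotate(
        std::string newName,
        std::function<void(std::string const&)> const& callback)
    {
        std::shared_ptr<Backend> expected;
        {
            // First lock: snapshot the state the callback needs.
            std::lock_guard<std::mutex> lock(mutex_);
            expected = writable_;
        }

        // No lock is held here, so the callback may call back into this
        // object (for example writableName()) without relocking mutex_.
        callback(expected->name);

        // Second lock: verify nothing rotated in the meantime, then swap.
        std::lock_guard<std::mutex> lock(mutex_);
        if (writable_ != expected)
            return false;
        archive_ = std::move(writable_);
        writable_ = std::make_shared<Backend>(Backend{std::move(newName)});
        return true;
    }
};
```

In this sketch a callback that reads `writableName()` during the rotation sees the old backend, and the swap only happens once the callback has returned.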

Context of Change

From #4989:

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This is a problematic design that needs to be used very carefully. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, causing UB (a deadlock was observed). The caller was from SHAMapStoreImpl, and it called clearCaches. This clearCaches can potentially call fetchNodeObject, which tried to relock the mutex.

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex.
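The situation described in #4989 can be illustrated with a small sketch. The class and member names here (`CallbackUnderLock`, `fetch`, `rotate`) are hypothetical and only mimic the shape of the problem: a callback invoked under the lock re-enters a method that locks the same mutex. With `std::recursive_mutex` the same thread may relock; with a plain `std::mutex` the same call sequence would be undefined behavior.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

class CallbackUnderLock
{
    // recursive_mutex lets the owning thread lock again. Swapping this for
    // std::mutex would make the re-entrant fetch() below undefined behavior.
    std::recursive_mutex mutex_;
    int value_ = 0;

public:
    int fetch()
    {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        return value_;
    }

    void rotate(std::function<void()> const& callback)
    {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        callback();  // the lock is still held while caller code runs
        ++value_;
    }
};
```

Holding a lock while running caller-supplied code is the fragile part: the class cannot know which of its own methods the callback will reach.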

Type of Change

Refactor (non-breaking change that only restructures code)

Test Plan

Testing can be the same as that for #4989, plus ensure that there are no regressions.

codecov · 2025-02-04T22:09:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.1%. Comparing base (02387fd) to head (9f564bc).

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #5276   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          790     790           
  Lines        67607   67613    +6     
  Branches      8164    8163    -1     
=======================================
+ Hits         52828   52836    +8     
+ Misses       14779   14777    -2

| Files with missing lines | Coverage Δ |
|---|---|
| src/xrpld/nodestore/DatabaseRotating.h | `100.0% <ø> (ø)` |
| src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp | `62.9% <100.0%> (+2.2%)` ⬆️ |
| src/xrpld/nodestore/detail/DatabaseRotatingImp.h | `66.7% <ø> (ø)` |

... and 1 file with indirect coverage changes

bthomee · 2025-02-06T15:51:24Z

src/xrpld/nodestore/detail/DatabaseRotatingImp.h

+    // backendMutex_ is only needed when the *Backend_ members are modified.
+    // Reads are protected by the general mutex_.
+    std::mutex backendMutex_;


As this sounds like a typical single-write and one-or-more-read scenario, is it possible to use a single shared_mutex here instead of these two mutexes?

ximinez · 2025-02-06

It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
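To make the upgrade concern concrete: `std::shared_mutex` has no atomic upgrade from shared to exclusive ownership, and locking it exclusively on a thread that already holds it in shared mode is undefined behavior. One common workaround, sketched below under the assumption of a simple generation counter, is to release the shared lock, take the exclusive lock, and re-validate. `SharedSketch`, `generation`, and `bumpIfUnchanged` are hypothetical names, not code from this PR.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

class SharedSketch
{
    mutable std::shared_mutex mutex_;
    int generation_ = 0;

public:
    int generation() const
    {
        // Many readers may hold the shared lock at once.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return generation_;
    }

    // Returns false if another writer got in between the two locks.
    bool bumpIfUnchanged()
    {
        int seen;
        {
            std::shared_lock<std::shared_mutex> readLock(mutex_);
            seen = generation_;
        }  // shared lock released here; there is no in-place upgrade

        std::unique_lock<std::shared_mutex> writeLock(mutex_);
        if (generation_ != seen)
            return false;  // state changed while no lock was held
        ++generation_;
        return true;
    }
};
```

The gap between the two locks is what forces the re-validation step; without it, two writers could both act on the same stale snapshot.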

ximinez · 2025-02-07

@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out unsupported, but I rewrote the locking, and changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.

But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.

vvysokikh1

I don't think your solution completely solves the issue. A deadlock is still technically possible: calling rotateWithLock from inside the callback will deadlock on your new mutex.

If it's good enough for now, please add a comment on rotateWithLock() warning users against calling rotateWithLock() directly or indirectly from the callback.

refactor: Change recursive_mutex to mutex in DatabaseRotatingImp

ce650ad

- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."

ximinez mentioned this pull request Feb 4, 2025

Periodically pause copying ledger nodes during online_delete #4907

Closed

2 tasks

ximinez added this to the 2.4.0 (2025) milestone Feb 4, 2025

Bronek reviewed Feb 5, 2025

src/xrpld/nodestore/DatabaseRotating.h (outdated)

Bronek reviewed Feb 5, 2025

src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp (outdated)

Bronek reviewed Feb 5, 2025

src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp (outdated)

Review feedback from @Bronek:

b8413ae

* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments

ximinez requested a review from Bronek February 6, 2025 00:01

Bronek approved these changes Feb 6, 2025

Merge branch 'develop' into ximinez/db-lock

063e881

bthomee reviewed Feb 6, 2025

vvysokikh1 reviewed Feb 6, 2025

Merge remote-tracking branch 'upstream/develop' into ximinez/db-lock

9f564bc

* upstream/develop: Updates Conan dependencies (5256)

ximinez force-pushed the ximinez/db-lock branch from 913df26 to 9f564bc Compare February 7, 2025 16:04

Review feedback from @bthomee and @vvysokikh1:

d912b50

- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use a shared_lock, and write a unit test to show (as much as possible) that it won't deadlock.
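A rough idea of how a test can show "as much as possible" that a rotation does not deadlock: run it on another thread and require it to finish within a timeout. This is only a sketch under assumptions; `RotatorSketch`, `rotations`, `rotate`, and `rotateCompletes` are hypothetical names, and the class is a simplified stand-in that does not hold any lock during the callback, which may differ from the real rotateWithLock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <mutex>
#include <shared_mutex>

class RotatorSketch
{
    mutable std::shared_mutex mutex_;
    int rotations_ = 0;

public:
    int rotations() const
    {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return rotations_;
    }

    void rotate(std::function<void()> const& callback)
    {
        callback();  // assumption: no lock is held while the callback runs
        std::unique_lock<std::shared_mutex> lock(mutex_);
        ++rotations_;
    }
};

// Returns true if rotate() finishes within the timeout, even though the
// callback re-enters the object through rotations().
// Caveat: if the rotation really deadlocked, the future's destructor would
// block on return, so a real test would need an external watchdog.
bool rotateCompletes(RotatorSketch& db, std::chrono::milliseconds timeout)
{
    std::future<void> done = std::async(std::launch::async, [&db] {
        db.rotate([&db] { (void)db.rotations(); });
    });
    return done.wait_for(timeout) == std::future_status::ready;
}
```

A timeout can only demonstrate that a particular interleaving did not hang; it cannot prove that no deadlock is possible, which matches the "as much as possible" wording above.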

ximinez force-pushed the ximinez/db-lock branch from 13fb47c to d912b50 Compare February 7, 2025 22:18

ximinez added 2 commits February 7, 2025 17:26

Update levelization tracking

3f7fb66

Merge remote-tracking branch 'upstream/develop' into ximinez/db-lock

4de9be2

* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
