DAOS-16878 pool: Reduce unexpected DER_NO_SERVICEs (#15665) #15775
+89
−38
pool_svc_step_up_cb has been observed to encounter -DER_NOTLEADER and pass it to ds_pool_failed_add. This is a replica error and may be transient; it does not indicate that the PS is unavailable. This patch addresses the observed scenario by replacing the ds_pool_failed_add call in pool_svc_step_up_cb with a special up-but-with-error mode for the PS, in which the PS can serve requests only by returning an error.
Add pool_svc.ps_error to indicate the special up-but-with-error mode. Check and return it in pool_svc_lookup_leader. Handle it specially in callers of pool_svc_lookup.
Use this new mode only for a conservative set of errors. Including an error by mistake is worse than missing an error.
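The up-but-with-error mode described above can be sketched roughly as follows. This is a simplified illustration, not the code in the patch: the `sk_` types and functions, the numeric error values, and the single-entry allowlist (`SK_PLACEHOLDER_FATAL`) are all hypothetical stand-ins. Only the idea of a per-service error field that the leader lookup checks and returns comes from the description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder error codes for this sketch; not the real DER_* values. */
enum {
	SK_OK                = 0,
	SK_NOTLEADER         = -1, /* transient replica error */
	SK_PLACEHOLDER_FATAL = -2  /* hypothetical error that warrants the new mode */
};

/* Hypothetical, stripped-down stand-in for struct pool_svc. */
struct sk_pool_svc {
	bool up;    /* this replica is serving as the PS leader */
	int  error; /* plays the role of ps_error; 0 means healthy */
};

/* Conservative allowlist: unknown errors are deliberately NOT included. */
static bool
sk_error_in_allowlist(int rc)
{
	return rc == SK_PLACEHOLDER_FATAL;
}

/* Step up with the result rc of the (omitted) step-up work. */
static int
sk_step_up(struct sk_pool_svc *svc, int rc)
{
	if (rc == SK_OK || sk_error_in_allowlist(rc)) {
		/* Up, possibly in up-but-with-error mode. */
		svc->up    = true;
		svc->error = rc;
		return SK_OK;
	}
	/* Any other error: fail the step-up so clients may try other replicas. */
	svc->up    = false;
	svc->error = SK_OK;
	return rc;
}

/* Leader lookup: return the stored error instead of a usable service. */
static int
sk_lookup_leader(struct sk_pool_svc *svc, struct sk_pool_svc **out)
{
	if (!svc->up)
		return SK_NOTLEADER;
	if (svc->error != SK_OK)
		return svc->error;
	*out = svc;
	return SK_OK;
}
```

In this sketch a transient error such as `SK_NOTLEADER` never enters the mode, which mirrors the stated preference for missing an error over including one by mistake.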
Add pool UUIDs to a few log messages to make future debugging easier.
The ds_pool_failed_add mechanism should be used for replica errors only, and such errors should not immediately stop PS clients from trying other replicas. This issue is relatively tricky and is not addressed by the current patch.
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: