-
Notifications
You must be signed in to change notification settings - Fork 1.8k
zdb: better handling for corrupt block pointers #17166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
When dumping indirect blocks, attempt to print corrupt block pointers rather than abort the program. Sponsored by: ConnectWise Signed-off-by: Alan Somers <asomers@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that assertions are not good for the data we read from disks and that might get corrupted, but I am not sure just complaining to stderr is a good solution either. Error messages printed to stderr might be difficult to correlate to the output printed to stdout. Also since we use zdb as a pool verification tool in some CI tests, I am not sure spamming stderr without returning actual error status would be noticed.
So what would you prefer? I could print the messages to stdout instead, and eventually exit with status 1. |
Exit code would definitely be good. How to better print it I am not sure though. I haven't looked close on that part of the code. But I already saw |
Signed-off-by: Alan Somers <asomers@gmail.com>
The latest commit prints all of the errors to stdout instead of stderr. It wasn't possible to print the L1 block's fill error on the same line as the block pointer itself, at least not without buffering potentially massive amounts of text in RAM, so it's printed afterwards. The output looks like this:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is quite verbose. I wonder what happen with some code that may try to parse the output (I think we have some in ZTS), and without error status returned still it might be difficult to handle properly.
I could change the exit status to 1 if any corruption is found. However, there is precedent for not doing that. zdb will already report corruption and yet exit 0 in the following conditions:
So for consistency's sake, I think I should leave the exit status unchanged. |
ASSERT3U(fill, ==, BP_GET_FILL(bp)); | ||
if (!err) { | ||
if (fill != BP_GET_FILL(bp)) { | ||
(void) printf("%16llx: Block pointer " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print to stderr instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what my first commit did. But I switched to stdout in response to @amotin's comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think printing to stderr makes sense only if we exit immediately, so it is the last message before we return error status. But if we continue, then using different stream makes it difficult to understand what's actually wrong. Om the other side, if we print it to stdout, then it has to be formatted in some way nice to possible parsers or readers, unless we expect them to ignore all the output after getting an errors.
As discussed in the OpenZFS leadership meeting today, we agreed to change zdb to always exit non-zero if it detects any corruption. I'll create a distinct error code for that. |
Signed-off-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
@amotin @allanjude with the latest commits, zdb will exit 3 if on-disk corruption was detected. It will also follow the usual convention of exiting 2 due to invalid CLI usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just randomly looking around the handled cases I found couple more.
@@ -439,6 +442,7 @@ metaslab_spacemap_validation_cb(space_map_entry_t *sme, void *arg) | |||
(u_longlong_t)txg, (u_longlong_t)offset, | |||
(u_longlong_t)size, (u_longlong_t)mv->mv_vdid, | |||
(u_longlong_t)mv->mv_msid); | |||
corruption_found = B_TRUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just 13 lines up is also a case of double alloc.
@@ -542,6 +546,7 @@ mv_populate_livelist_allocs(metaslab_verify_t *mv, sublivelist_verify_t *sv) | |||
(u_longlong_t)DVA_GET_VDEV(&svb->svb_dva), | |||
(u_longlong_t)DVA_GET_OFFSET(&svb->svb_dva), | |||
(u_longlong_t)DVA_GET_ASIZE(&svb->svb_dva)); | |||
corruption_found = B_TRUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar error 17 lines up.
Signed-off-by: Alan Somers <asomers@gmail.com>
When dumping indirect blocks, attempt to print corrupt block pointers rather than abort the program.
Sponsored by: ConnectWise
Signed-off-by: Alan Somers asomers@gmail.com
Motivation and Context
When trying to print a file's indirect blocks with "zfs -vvbbbb", if that file contains corrupt block pointers zdb will fail assertions and crash.
Description
Remove three assertions, replacing them with warnings printed to stderr. The rest of zdb does not require those block pointers to be intact.
How Has This Been Tested?
Tested with a dataset that had become corrupted, similarly to #17077 .
Types of changes
Checklist:
Signed-off-by
.