Release 8.0.35-30 #1525

Merged
merged 2 commits into from Dec 4, 2023

Conversation

@adivinho (Contributor) commented Dec 4, 2023

No description provided.

Fixed PXB-3168 - Under high write load, backup fails with "log block numbers mismatch" error

https://jira.percona.com/browse/PXB-3168

TL;DR - During the implementation of the new redo log design parser in
8.0.30, PXB missed the condition that log files can be reused, so a block
read into the buffer can be a stale block from the file's previous life.

Redo Log Design:

The new redo log design in 8.0.30 uses an ever-incrementing postfix ID
in the redo log file name. The server can have up to 32 active redo log
files. Once a file is recycled, it is renamed to the new corresponding ID
(its current number + 32) with a _tmp suffix. When the server makes it
active again, the _tmp suffix is removed from its name. This means a log
file can still contain data from before it was recycled.
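
As a hedged illustration of this naming scheme (the `#innodb_redo` directory and `#ib_redo<N>` format match the 8.0.30+ on-disk layout, but the helper function itself is hypothetical, not server code):

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper illustrating the 8.0.30+ redo file naming scheme.
// Active file:   #innodb_redo/#ib_redo<id>
// Recycled file: #innodb_redo/#ib_redo<id>_tmp (spare, waiting for reuse)
static std::string redo_file_name(unsigned long long id, bool spare) {
  char name[64];
  std::snprintf(name, sizeof(name), "#ib_redo%llu%s", id,
                spare ? "_tmp" : "");
  return std::string("#innodb_redo/") + name;
}

// Example: when file #5 is recycled it becomes redo_file_name(5 + 32, true)
// = "#innodb_redo/#ib_redo37_tmp"; once made active again it is renamed to
// "#innodb_redo/#ib_redo37" but may still hold stale blocks from its life
// as "#ib_redo5".
```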

PXB redo log copying works in three parts (a condensed code sketch follows the list):
1. Read a chunk of 64K (4 × page size) into the read buffer in `read_log_seg`
2. Based on the last known LSN, parse the new data blocks and check that
the block number in each block header is exactly 1 ahead of the last block
number. Keep doing this until a block mismatches. This is done in
`scan_log_recs`.
3. Write the blocks up to the position found at step 2 into `xtrabackup_logfile`
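
The sketch below mirrors that structure. `read_log_seg` and `scan_log_recs` are the PXB function names mentioned above, but the signatures and stub bodies here are simplified assumptions, so treat this as structural illustration only:

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t READ_CHUNK = 64 * 1024;  // 64K = 4 x 16K page size

// Simplified stand-ins for the real PXB routines (signatures assumed).
static size_t read_log_seg(uint8_t *buf, uint64_t from_lsn) {
  (void)buf; (void)from_lsn;
  return 0;  // the real code reads up to READ_CHUNK bytes of redo log
}
static size_t scan_log_recs(const uint8_t *buf, size_t len,
                            uint32_t *last_block_no) {
  (void)buf; (void)last_block_no;
  return len;  // the real code returns the offset of the first bad block
}
static void write_to_xtrabackup_logfile(const uint8_t *buf, size_t len) {
  (void)buf; (void)len;
}

static void copy_redo(uint64_t from_lsn) {
  uint8_t buf[READ_CHUNK];
  uint32_t last_block_no = 0;
  for (;;) {
    size_t got = read_log_seg(buf, from_lsn);             // part 1: read 64K
    size_t ok = scan_log_recs(buf, got, &last_block_no);  // part 2: validate
    write_to_xtrabackup_logfile(buf, ok);                 // part 3: persist
    if (got == 0 || ok < got) break;  // mismatch: caught up with the server
    from_lsn += ok;
  }
}
```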

There are two conditions that stop parsing the buffer and identify that we
are reading the stale tail of a recycled log. This happens when we catch up
with the server (reading the most up-to-date block written by the server):

** Condition 1 - The next block is lower than what we expected:

For simplicity, we will demonstrate with 2 logs instead of 32.
Once log 2 is full, we will re-use log 1 (renamed to log 3)
and write up to slot 4.

H - head of the log - last position written by the server
T - tail of the log - garbage left over from when the file was named log 1

```
slots | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
log 1 | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
log 2 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
log 3 | 19 | 20 | 21 | 22 | 5  | 6  | 7  | 8  | 9  |
      |    |    |    | H  | T  |    |    |    |    |
```

Here we read block nr 5, identify that it is lower than expected, and
consider parsing finished at block 22.
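
In code, condition 1 is a plain comparison. A minimal sketch, using the variable names defined further down in this message:

```cpp
#include <cstdint>

// Condition 1 (sketch): the header number is lower than the next number
// we expect, so the block must be the stale tail of a recycled log file.
static bool is_old_block_cond1(uint64_t expected_hdr_nr,
                               uint64_t read_block_number) {
  // In the example above: expected_hdr_nr = 23 (after block 22) and
  // read_block_number = 5, so parsing stops at block 22.
  return read_block_number < expected_hdr_nr;
}
```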

** Condition 2 - The next block is higher than what we expected:

Block numbers wrap around at 1073741824 (1G), meaning they restart from 1
when we reach an LSN of 512G (1G blocks, each block being 512 bytes -
OS_FILE_LOG_BLOCK_SIZE).
For simplicity, let's take the same 9-slot logs as above, 2 logs in total,
and wrap around after block 7.

```
slots | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
log 1 | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 1  | 2  |
log 2 | 3  | 4  | 5  | 6  | 7  | 1  | 2  | 3  | 4  |
log 3 | 5  | 6  | 7  | 1  | 5  | 6  | 7  | 1  | 2  |
      |    |    |    | H  | T  |    |    |    |    |
```

Wrap-around is identified with the formula below, where:
* continuos_block_capacity_in_redo - the number of 512-byte blocks the redo
logs can hold before we start to overwrite data.
* wrap_around_block_capacity - the block number at which block numbers wrap
around.
* expected_hdr_nr - last successfully read block number + 1.
* read_block_number - block number read from the block header.

Formula:

```
((expected_hdr_nr | (continuos_block_capacity_in_redo - 1)) - read_block_number) % wrap_around_block_capacity == 0
```
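
Transcribed directly into a helper function (a sketch; the identifier spellings follow the text above, and the function name is ours, not PXB's):

```cpp
#include <cstdint>

// Condition 2 (sketch): detect a stale block whose number is "ahead" only
// because block numbers wrapped around.
static bool is_old_block_after_wrap(uint64_t expected_hdr_nr,
                                    uint64_t read_block_number,
                                    uint64_t continuos_block_capacity_in_redo,
                                    uint64_t wrap_around_block_capacity) {
  // The OR rounds expected_hdr_nr up within one full pass over the redo
  // capacity; if the distance to the read number is a whole number of
  // wrap-arounds, the block must come from a previous pass.
  return ((expected_hdr_nr | (continuos_block_capacity_in_redo - 1)) -
          read_block_number) % wrap_around_block_capacity == 0;
}

// Toy numbers from the example below:
// is_old_block_after_wrap(2, 5, 18, 7)
//   -> ((2 | 17) - 5) % 7 == (19 - 5) % 7 == 14 % 7 == 0 -> true
```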

In the above example, when we read the 5 coming from the tail of the
previous log data, we have:
* continuos_block_capacity_in_redo - We have 9 slots per redo file and 2
redo files, giving a continuous capacity of 18 blocks; the formula's minus
1 gives 17.
* wrap_around_block_capacity - In this example we wrap at block 7; in the
server it is 1G.
* expected_hdr_nr - 2
* read_block_number - 5

Resulting in:

```
(gdb) p ((2 | (18-1)) - 5) % 7 == 0
$1 = true
```

This means block number 5 comes from the tail of previous data in the
recycled log.

The Problem:
Xtrabackup was missing the second condition and only considered a block
from the log buffer to be old when `read_block_number` was lower than
`expected_hdr_nr`.

Fix:

We use the upstream approach of considering the block as garbage if it
differs from what we expect, with the addition of also validating that the
checksum is correct.
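
Roughly, the fixed stop condition becomes the following. This is a sketch: `log_block_get_hdr_no` and `log_block_checksum_is_ok` are assumed to behave like the InnoDB helpers of the same names, and the control flow is illustrative rather than the literal patch:

```cpp
#include <cstdint>

// Assumed InnoDB-style helpers; the real ones live in the server sources.
uint32_t log_block_get_hdr_no(const unsigned char *block);
bool log_block_checksum_is_ok(const unsigned char *block);

// Sketch of the fixed check: any block whose header number differs from
// the expected one is treated as garbage (this covers both the lower and
// the wrapped-around higher case), and the checksum is validated as well.
static bool block_is_usable(const unsigned char *block,
                            uint32_t expected_hdr_nr) {
  if (log_block_get_hdr_no(block) != expected_hdr_nr) {
    return false;  // stale tail of a recycled file: stop copying here
  }
  return log_block_checksum_is_ok(block);  // extra validation added by PXB
}
```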

@altmannmarcelo (Contributor) left a comment

LGTM

@adivinho merged commit ef0916b into 8.0 Dec 4, 2023
4 checks passed