PXB-3269 : Reduce the time the Server is locked by xtrabackup #1620

satya-bodapati · 2024-11-15T13:58:09Z

Problem:

Currently, xtrabackup with lock-ddl=ON (the default) locks the
server using backup locks (it executes LOCK INSTANCE FOR BACKUP / LOCK
TABLES FOR BACKUP). This is done at the very start of the backup.

After this, a redo thread copies the redo log, and data-copying threads
copy all the data files (IBD files, .sdi files, MyISAM files, etc.).

The backup lock is released only at the end. This means DDLs are not
possible on the server for the duration of the backup.

Solution/Goal:

The goal of this feature is to reduce the time the server is locked by
xtrabackup. The new design during backup is as follows:

1. Copy all redo logs from the checkpoint up to the current LSN and start following new entries.
2. Start the redo log thread.
3. Track file operations from the redo log (i.e. parse MLOG_FILE_* records from the redo log).
4. Copy all .ibd files without taking any lock.
5. Acquire the backup lock (LOCK INSTANCE FOR BACKUP / LOCK TABLES FOR BACKUP). This step ensures that no new DDL operations, such as creating or altering tables, will occur.
6. Query log_status to discover the LSN reached after LTFB/LIFB was acquired.
7. Copy all non-InnoDB files.
8. Ensure the redo log copy has caught up to the LSN from step 6.
9. Check the file operations that were tracked and recopy the affected tablespaces.
10. Create additional meta files to perform the required actions (deletions or renames) on the already copied files. This approach ensures that the backup remains consistent and accurate without disrupting the streaming process. This step is required for taking streaming backups. The meta files are:
    a. .new -> for files that have to be recopied due to encryption or ADD INDEX (ADD INDEX skips the redo log, so a recopy is a must).
    b. .del -> the file was deleted. If we have copied it, create a space_id.del file.
    c. .ren -> the file was renamed after we copied it under a different name. Create a space_id.ren file.
    d. .crpt -> the file could not be copied fully because of encryption changes. It will be recopied, and for the existing file we have to create a .crpt file.
    e. .new.meta -> same as .new, but for incremental backups.
    f. .new.delta -> same as .new, but for incremental backups. Incremental backups create t1.ibd.meta and t1.ibd.delta (instead of t1.ibd).
11. Gather a sync point from all engines (InnoDB LSN, binlog, GTID, etc.) by querying log_status.
12. Stop the redo follow thread once it has copied at least up to the sync point from step 11.
13. Release LTFB/LIFB.
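
For illustration, the lock window (from acquiring the backup lock to releasing it) can be reproduced by hand with the statement sequence below. This is only a sketch: the helper name `print_lock_window_sql` is hypothetical, it merely prints SQL and does not connect to a server, and the exact queries xtrabackup issues may differ. The statements are standard MySQL 8.x syntax (LOCK TABLES FOR BACKUP would be the Percona Server variant of the lock).

```shell
#!/bin/sh
# Hypothetical helper: prints SQL that mirrors the lock window described above.
# It is an illustration only and is not taken from the xtrabackup sources.
print_lock_window_sql() {
  cat <<'EOF'
-- Acquire the backup lock: DDL is blocked from here on.
LOCK INSTANCE FOR BACKUP;
-- Discover the LSN reached after the lock was taken.
SELECT STORAGE_ENGINES->>'$.InnoDB.LSN' AS lsn FROM performance_schema.log_status;
-- (copy non-InnoDB files, catch up on redo, recopy tablespaces, write meta files)
-- Gather the sync point across engines (InnoDB LSN, binlog, GTID).
SELECT * FROM performance_schema.log_status;
-- Release the backup lock.
UNLOCK INSTANCE;
EOF
}

print_lock_window_sql
```

Everything before the first statement (the lock-free .ibd copy) happens outside this window, which is where the reduction in lock time comes from.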

Prepare phase changes to handle reduced lock:

Process the new metadata files during the --prepare phase, before crash recovery starts.

1. .crpt -> remove the file whose name matches after stripping the .crpt extension. It is important to do this before the IBD scan because these are incomplete files (they could be zero-sized too).
2. Do a scan to create the space_id to file_name mapping.
3. space_id.del -> delete the file matching the space_id. In case of an incremental backup, we delete the corresponding .new.meta and .new.delta files.
4. space_id.ren -> for the file matching the space_id, rename it to the name contained in the .ren file.
5. .new extension -> replace the file that matches the name without the .new extension.
6. .new.meta/.delta -> replace the meta and delta files matching the name without the ".new" in the name.

Then regular recovery starts.
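
The ordering of the prepare-time steps can be tried out on a toy directory with the sketch below. It is not xtrabackup's code and rests on several assumptions: the function name `apply_meta_files` is hypothetical, all files sit in one flat directory, a text file with `<space_id> <file name>` lines stands in for the IBD scan, a .ren file holds the new name on a single line, and the incremental .new.meta/.new.delta case is left out.

```shell
#!/bin/sh
# Toy model of the prepare-time pass over the meta files (illustration only).
# $1: directory holding the toy backup files (flat layout, an assumption)
# $2: stand-in for the IBD scan, lines of "<space_id> <file name>"
apply_meta_files() (
  map_data=$(cat "$2") || exit 1
  cd "$1" || exit 1

  # 1. .crpt: drop the incomplete copy (and the marker) before any lookup.
  for f in *.crpt; do
    [ -e "$f" ] || continue
    rm -f "${f%.crpt}" "$f"
  done

  # 2. "Scan": resolve a space_id to a file name using the stand-in map.
  name_of() {
    printf '%s\n' "$map_data" | awk -v id="$1" '$1 == id { print $2 }'
  }

  # 3. <space_id>.del: the tablespace was dropped, so delete our copy.
  for f in *.del; do
    [ -e "$f" ] || continue
    target=$(name_of "${f%.del}")
    if [ -n "$target" ]; then rm -f "$target"; fi
    rm -f "$f"
  done

  # 4. <space_id>.ren: rename our copy to the name stored in the marker.
  for f in *.ren; do
    [ -e "$f" ] || continue
    target=$(name_of "${f%.ren}")
    new_name=$(cat "$f")
    if [ -n "$target" ] && [ -e "$target" ]; then mv "$target" "$new_name"; fi
    rm -f "$f"
  done

  # 5. <name>.new: the recopied file replaces the earlier copy.
  for f in *.new; do
    [ -e "$f" ] || continue
    mv -f "$f" "${f%.new}"
  done
)
```

Calling `apply_meta_files /path/to/toy_backup /path/to/space_map` applies the markers in the same order as the list above. The .crpt cleanup comes first because an incomplete or zero-sized file cannot be scanned for its space_id.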

Limitations:

1. ALTER INSTANCE ROTATE MASTER KEY is not handled, so applications should block this.
2. The number of open file handles required is the same as the number of files in the datadir.

Other changes fixed as part of this feature:

PXB-3399 : PXB 84 Creating Backup on replica fails

During the 8.4 merge, we mistakenly assumed that all master/slave
terminology was replaced/removed in 8.4. There are still a few places
where the server uses this terminology:

relay_master_log_file
exec_master_log_position

We mistakenly used relay_source_log_file and exec_source_log_position instead of the above names.

Revert to the actual names (i.e. with master in the names).

PXB-3113 : Improve debug sync framework to allow PXB to pause and resume threads

https://perconadev.atlassian.net/browse/PXB-3113

The current debug-sync option in PXB completely suspends the PXB process, and the user can resume it by sending the SIGCONT signal.
This is useful for scenarios where PXB is paused, certain operations are done on the server, and then PXB is resumed to complete.

But many bugs we found during testing involve multiple threads in PXB. The goal of this work is to be able to
pause and resume an individual thread.

Since many tests use the existing debug-sync option, I don't want to disturb these tests. We can convert them to
the new mechanism later.

How to use?
-----------
The new mechanism is used with option --debug-sync-thread="sync_point_name"

In the code, place a debug_sync_thread("debug_point_1") call to stop the thread at this place.

You can pass the debug_sync point via the command line: --debug-sync-thread="debug_sync_point1"

PXB will create a file named after the debug_sync point in the backup directory. It is suffixed with a thread number.
Please ensure that no two debug_sync points use the same name (it doesn't make sense to have two sync points with the same name).

```
2024-03-28T15:58:23.310386-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec.  Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_before_file_copy_4860396430306702017
```
In the test, after activating the sync point, you can use wait_for_debug_sync_thread_point <syncpoint_name>

Do some stuff now. This thread is sleeping.

Once you are done and want the thread to resume, you can do so by deleting the file: `rm backup_dir/sync_point_name_*`
Please use resume_debug_sync_thread_point <syncpoint_name> <backup_dir>. It deletes the sync point file and additionally checks that the sync point has
indeed resumed.
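
The file-based handshake can be sketched with two small shell functions. These are hypothetical stand-ins, not the helpers shipped with the PXB test suite: the names `sketch_wait_for_sync_point` and `sketch_resume_sync_point` and the retry count are assumptions. The file pattern `xb_<syncpoint>_<thread number>` is taken from the log line above.

```shell
#!/bin/sh
# Hypothetical stand-ins for the wait/resume test helpers (illustration only).

# Wait until a file xb_<syncpoint>_<thread number> appears in the backup dir.
# $1: backup dir, $2: sync point name, $3: number of 1-second tries (default 30)
sketch_wait_for_sync_point() (
  backup_dir=$1 name=$2 tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    for f in "$backup_dir"/xb_"$name"_*; do
      # An unmatched glob stays literal, so test that the file really exists.
      [ -e "$f" ] && exit 0
    done
    tries=$((tries - 1))
    sleep 1
  done
  exit 1
)

# Resume the paused thread by deleting its sync file; per the log message,
# the thread sleeps in 1-second steps until the file is gone.
sketch_resume_sync_point() (
  backup_dir=$1 name=$2
  rm -f "$backup_dir"/xb_"$name"_*
)
```

A test would call the wait function right after starting xtrabackup in the background, do its work on the server while the thread sleeps, and then call the resume function.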

More common/complicated scenario:
----------------------------------
The scenario is to signal another thread to stop after reaching the first sync point. To achieve this, do steps 1 to 3 (above).

Echo the debug_sync point name into a file named "xb_debug_sync_thread". Example:

4. echo "xtrabackup_copy_logfile_pause" > backup/xb_debug_sync_thread

5. Send the SIGUSR1 signal to the PXB process: kill -SIGUSR1 496102

6. Wait for the sync point to be reached: wait_for_debug_sync_thread <syncpoint_name>

PXB acknowledges it
2024-03-28T16:05:07.849926-00:00 0 [Note] [MY-011825] [Xtrabackup] SIGUSR1 received. Reading debug_sync point from xb_debug_sync_thread file in backup directory
2024-03-28T16:05:07.850004-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: Deleting  file/home/satya/WORK/pxb/bld/backup//xb_debug_sync_thread

and then prints this once the sync point is reached.
2024-03-28T16:05:08.508830-00:00 1 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec.  Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_xtrabackup_copy_logfile_pause_10389933572825668634

At this point, we have two threads sleeping at two sync points. Either of them can be resumed by deleting the file named in the error log
(or by using resume_debug_sync_thread()).
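
Steps 4 and 5 can be wrapped in one small function, sketched below. The function name `request_sync_point` is hypothetical. It uses `kill -USR1`, which is the portable spelling of the `kill -SIGUSR1` shown above.

```shell
#!/bin/sh
# Sketch of steps 4 and 5: ask a running PXB process to arm another sync point.
# $1: backup dir, $2: sync point name, $3: PID of the PXB process
request_sync_point() (
  backup_dir=$1 name=$2 pxb_pid=$3
  # Step 4: write the sync point name where PXB will look for it.
  echo "$name" > "$backup_dir/xb_debug_sync_thread" || exit 1
  # Step 5: tell PXB to read the file.
  kill -USR1 "$pxb_pid"
)
```

For example, `request_sync_point backup xtrabackup_copy_logfile_pause 496102` corresponds to the two commands in steps 4 and 5.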

Contributions:

Because of the squash, some of the commits by other team members are not visible. The other developers are:

Aibek Bukabayev
Marcelo Altmann


satya-bodapati requested a review from aybek November 15, 2024 13:58

satya-bodapati self-assigned this Nov 15, 2024

satya-bodapati mentioned this pull request Nov 15, 2024

PXB-3269 : Reduce the time the Server is locked by xtrabackup #1616

Closed

satya-bodapati force-pushed the PXB-8.4-3269-reduced-lock branch from ce88322 to 240b4f6 Compare November 15, 2024 22:17

aybek approved these changes Nov 18, 2024

satya-bodapati merged commit 330ac17 into percona:8.4 Nov 18, 2024
5 checks passed