PXB-3269 : Reduce the time the Server is locked by xtrabackup #1620
Merged
satya-bodapati
merged 1 commit into
percona:8.4
from
satya-bodapati:PXB-8.4-3269-reduced-lock
Nov 18, 2024
Problem:
--------
Currently, xtrabackup, with lock-ddl=ON (the default), locks the server using backup locks (it executes LOCK INSTANCE FOR BACKUP / LOCK TABLES FOR BACKUP). This is done at the very start of the backup. After this, a redo thread copies the redo log and the data copying threads copy all the data files (IBD files, .sdi files, MyISAM files, etc.). The backup lock is released only at the end. This means DDLs are not possible on the server for the duration of the backup.

Solution/Goal:
-------------
The goal of this feature is to reduce the time the server is locked by xtrabackup. The new design during backup is as follows:

1. Copy all redo logs from the checkpoint up to the current LSN and start following new entries.
2. Start the redo log thread.
3. Track file operations from the redo log (i.e. parse MLOG_FILE_* records from the redo log).
4. Copy all .ibd files without taking any lock.
5. Acquire the backup lock (LOCK INSTANCE FOR BACKUP / LOCK TABLES FOR BACKUP). This step ensures no new DDL operations, such as creating or altering tables, will occur.
6. Query log_status to discover the LSN after LTFB/LIFB.
7. Copy all non-InnoDB files.
8. Ensure the redo log has caught up to the LSN from step 6.
9. Check the file operations that were tracked and recopy the affected tablespaces.
10. Create additional `meta` files to perform the required actions (deletions or renames) on the already copied files. This approach ensures that the backup remains consistent and accurate without disrupting the streaming process. This step is required for taking streaming backups. The meta files are:
    a. .new -> for files that have to be recopied due to encryption or ADD INDEX (ADD INDEX skips the redo log, so a recopy is a must)
    b. .del -> the file is deleted. If we have copied it, create a space_id.del file
    c. .ren -> the file is renamed after we copied it under a different name. Create a space_id.ren file
    d. .crpt -> the file cannot be copied fully because of encryption changes. It will be recopied, and for the existing file we have to create a .crpt file
    e. .new.meta -> same as .new, for incremental backups
    f. .new.delta -> same as .new, for incremental backups. Incremental backups create t1.ibd.meta and t1.ibd.delta (instead of t1.ibd)
11. Gather a sync point from all engines (InnoDB LSN, binlog, GTID, etc.) by querying `log_status`.
12. Stop the redo follow thread once it has copied at least up to the sync point from step 11.
13. Release LTFB/LIFB.

Prepare phase changes to handle reduced lock:
---------------------------------------------
Process the new metadata files during the `--prepare` phase, before crash recovery starts.

1. .crpt: The files matching the name after stripping the extension are removed. It is important to do this before the IBD scan because these are incomplete files (they could be zero size too).
2. Do a scan to create the space_id to file_name mapping.
3. space_id.del -> delete the file matching the space_id. In case of incremental backups, we delete the corresponding .new.meta and .new.delta files.
4. space_id.ren -> for the file matching the space_id, rename it to the name contained in the .ren file.
5. .new extension -> replace the file that matches the name without the .new extension.
6. .new.meta/.delta -> replace the meta and delta files matching the name without the ".new" in the name.

Then regular recovery starts.

Limitations:
------------
1. ALTER INSTANCE ROTATE MASTER KEY is not handled, so applications should block it.
2. The number of open file handles required is the same as the number of files in the datadir.

Other changes fixed as part of this feature:

PXB-3399 : PXB 84 Creating Backup on replica fails

During the 8.4 merge, we mistakenly assumed that all master/slave names were replaced or removed in 8.4. There are still a few places where the server uses this terminology:

relay_master_log_file
exec_master_log_position

We mistakenly used relay_source_log_file and exec_source_log_position instead of the above names.
Revert to the actual names (i.e. with master in the names).

PXB-3113 : Improve debug sync framework to allow PXB to pause and resume threads
https://perconadev.atlassian.net/browse/PXB-3113

The current debug-sync option in PXB completely suspends the PXB process, and the user can resume it by sending the SIGCONT signal. This is useful for scenarios where PXB is paused, certain operations are done on the server, and PXB is then resumed to complete. But many bugs we found during testing involve multiple threads in PXB. The goal of this work is to be able to pause and resume an individual thread.

Since many tests use the existing debug-sync option, I don't want to disturb these tests. We can convert them to the new mechanism later.

How to use?
-----------
The new mechanism is used with the option --debug-sync-thread="sync_point_name".

In the code, place a debug_sync_thread("debug_point_1") call to stop the thread at that place.

You can pass the debug_sync point via the command line: --debug-sync-thread="debug_sync_point1"

PXB will create a file named after the debug_sync point in the backup directory. It is suffixed with a thread number. Please ensure that no two debug_sync points use the same name (it doesn't make sense to have two sync points with the same name).

```
2024-03-28T15:58:23.310386-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec. Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_before_file_copy_4860396430306702017
```

In the test, after activating the syncpoint, you can use wait_for_debug_sync_thread_point <syncpoint_name>.

Do some stuff now. This thread is sleeping.

Once you are done, and if you want the thread to resume, you can do so by deleting the file: `rm backup_dir/sync_point_name_*`

Please use resume_debug_sync_thread_point <syncpoint_name> <backup_dir>. It deletes the syncpoint file and additionally checks that the syncpoint is indeed resumed.
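The pause-by-file idea above can be illustrated with a small Python sketch. This is not PXB's implementation (PXB is C++ and its test helpers are shell functions); the function names, the `active_point` parameter, and the `xb_<name>_<thread id>` marker naming are assumptions modelled on the log line above.

```python
import glob
import os
import threading
import time


def debug_sync_thread(name, backup_dir, active_point, poll_interval=1.0):
    """Pause the calling thread at sync point `name` if it is the active one.

    Hypothetical stand-in for the C++ call described above: it creates a
    marker file and sleeps until somebody deletes that file.
    """
    if name != active_point:
        return None
    # Marker naming (xb_<name>_<thread id>) is an assumption based on the log.
    marker = os.path.join(backup_dir, f"xb_{name}_{threading.get_ident()}")
    with open(marker, "w"):
        pass
    while os.path.exists(marker):
        time.sleep(poll_interval)
    return marker


def _markers(name, backup_dir):
    # All marker files that belong to one sync point, whatever the thread id.
    return glob.glob(os.path.join(glob.escape(backup_dir), f"xb_{name}_*"))


def wait_for_sync_point(name, backup_dir, timeout=10.0, poll_interval=0.05):
    """Return True once some thread is sleeping at `name`, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if _markers(name, backup_dir):
            return True
        time.sleep(poll_interval)
    return False


def resume_sync_point(name, backup_dir):
    """Delete the marker files for `name`; return how many were removed."""
    removed = 0
    for path in _markers(name, backup_dir):
        os.remove(path)
        removed += 1
    return removed
```

A test would start the worker thread, call `wait_for_sync_point`, perform its operations on the server while the thread sleeps, and then call `resume_sync_point`. Using the filesystem as the signalling channel keeps the mechanism usable from shell-based tests, which is presumably why the file-deletion approach was chosen.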
More common/complicated scenario:
----------------------------------
The scenario is to signal another thread to stop after reaching the first sync point. To achieve this:

Do steps 1 to 3 (above).

Echo the debug_sync point name into a file named "xb_debug_sync_thread". Example:

4. echo "xtrabackup_copy_logfile_pause" > backup/xb_debug_sync_thread
5. Send the SIGUSR1 signal to the PXB process: kill -SIGUSR1 496102
6. Wait for the syncpoint to be reached: wait_for_debug_sync_thread <syncpoint_name>

PXB acknowledges it:

    2024-03-28T16:05:07.849926-00:00 0 [Note] [MY-011825] [Xtrabackup] SIGUSR1 received. Reading debug_sync point from xb_debug_sync_thread file in backup directory
    2024-03-28T16:05:07.850004-00:00 0 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: Deleting file/home/satya/WORK/pxb/bld/backup//xb_debug_sync_thread

and then prints this once the sync point is reached:

    2024-03-28T16:05:08.508830-00:00 1 [Note] [MY-011825] [Xtrabackup] DEBUG_SYNC_THREAD: sleeping 1sec. Resume this thread by deleting file /home/satya/WORK/pxb/bld/backup//xb_xtrabackup_copy_logfile_pause_10389933572825668634

At this point, we have two threads sleeping at two sync points. Either of them can be resumed by deleting the file names mentioned in the error log (or by using resume_debug_sync_thread()).

Contributions:
--------------
Because of the squash, some of the commits by other team members are not visible. The other developers are:

1. Aibek Bukabayev
2. Marcelo Altmann
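As an illustration of the prepare-phase ordering described earlier in this commit message (.crpt first, then the scan, then .del, .ren and .new), here is a minimal Python sketch. It is not the xtrabackup code: `apply_meta_files` and the `read_space_id` callback are hypothetical, the callback stands in for reading the space id from the tablespace header, and the assumptions that marker files are removed after use, that space_id.del and space_id.ren sit in the backup root, and that a .ren file holds a path relative to that root are mine. Incremental .new.meta/.new.delta handling is omitted.

```python
import os
from pathlib import Path
from typing import Callable


def apply_meta_files(backup_dir: str, read_space_id: Callable[[Path], int]) -> None:
    """Apply the metadata files in the order given in the commit message."""
    root = Path(backup_dir)

    # 1. .crpt: drop the incomplete copy before scanning, so the scan never
    #    sees it. Removing the marker itself is an assumption.
    for marker in sorted(root.rglob("*.crpt")):
        marker.with_suffix("").unlink(missing_ok=True)  # t1.ibd.crpt -> t1.ibd
        marker.unlink()

    # 2. Scan: build the space_id -> file mapping (hypothetical callback).
    space_to_file = {read_space_id(p): p for p in sorted(root.rglob("*.ibd"))}

    # 3. space_id.del: delete the file that has this space id.
    for marker in sorted(root.glob("*.del")):
        target = space_to_file.pop(int(marker.stem), None)
        if target is not None:
            target.unlink()
        marker.unlink()

    # 4. space_id.ren: rename the file to the name stored in the marker.
    for marker in sorted(root.glob("*.ren")):
        space_id = int(marker.stem)
        target = space_to_file.get(space_id)
        if target is not None:
            dest = root / marker.read_text().strip()
            dest.parent.mkdir(parents=True, exist_ok=True)
            target.rename(dest)
            space_to_file[space_id] = dest
        marker.unlink()

    # 5. .new: the recopied file replaces the one without the extension.
    for fresh in sorted(root.rglob("*.new")):
        os.replace(fresh, fresh.with_suffix(""))
```

The ordering matters in this sketch for the same reason the message gives: if the .crpt cleanup ran after the scan, an incomplete (possibly zero-size) file would be handed to `read_space_id`.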
ce88322
to
240b4f6
Compare
aybek
approved these changes
Nov 18, 2024