MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

bnestere · 2025-04-11T21:34:39Z

One can have data loss in multi-master setups when 1) both masters update the same table, 2) ATLER TABLE is run on one master which re-arranges the column ordering, and 3) transactions are binlogged in ROW binlog_format. This is because a slave identifies columns to update using column index number. That is, if a transaction updates a table with columns on the master, the binary log ROW event will store its data column-by-column, from the first column, to the th column, in-order. When the slave applies this row to its table, it simply updates each column of its new row using these same values, in the same order. If the slave’s table has its columns in a different order (from some ALTER TABLE, which can have added, removed, or re-arranged) these columns, the data will still be stored in the order that it was done on the master table. This leads to data loss.

This patch adds the ability for a slave to lookup column name when applying ROW events, so if the ordering of columns on the master and slave differs, the slave can still apply the data changes to the correct column. This is limited to when a master binlogs events using option binlog_row_metadata=FULL, as this extends Table_map_log_event metadata to contain column names. When this column name metadata is missing, the lookup behavior is unchanged (i.e. it will use column index, as before).

Patch is currently still being drafted. Extensive MTR testing is still TODO, as well as addressing various TODOs in the code. The current state of MTR tests for this patch is simply reflective of the JIRA description, so it is very minimal, to just show the issue generally.

knielsen · 2025-04-16T15:20:54Z

Brandon Nesterenko ***@***.***> writes:

MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

Hi Brandon, Here is my initial review of the MDEV-36290 draft patch. Overall, it is clear that this is not something that's appropriate for 10.6. It implements a new way to apply row events, based on column names (if available) rather than column number. It is something for next dev version, 12.x.

commit de27bfb Author: Brandon Nesterenko ***@***.***> Date: Thu Apr 3 12:37:09 2025 -0600 MDEV-36290: ALTER TABLE with multi-master can cause data loss

Please do not use the term "data loss" in the description (I know it is not you who introduced the term ;-). The issue here is conflicting queries run in parallel against multiple masters in a multi-master setup. In general, the support of multi-master in asynchronous replication requires the user to ensure that no conflicting queries are run, or to ensure that such conflicts will not cause undesirable behaviour. The behaviour addressed here is just one of many ways that conflicting ALTER TABLE could cause different data on slave and master. Using the term "data loss" can cause misunderstanding about what the problem is and what the change does. Of course, every case that is handled more correctly is in principle better. In this case, we introduce the facility to configure (--binlog_row_metadata=FULL) replication so that it is possible to replicate between tables on master and slave with different order of the columns. Not sure how useful such an option is in practice, seems to be very few users that will really need this, and such need has to be weighed against the increase in complexity in the code and potential for bugs. With that said, the approach in the patch generally looks reasonable. A few initial detail comments below, as you mentioned this is still a draft patch. As you also mentioned to me earlier, the logic in the patch is not trivial. I did not yet try to understand in detail all parts of the logic and the different corner cases.

+ unsigned char *get_optional_metadata_str() + { + return m_optional_metadata; + } + + unsigned int get_optional_metadata_len() + { + return m_optional_metadata_len; + } +

Generally I do not like such do-nothing accessor functions. They just add abstraction and reduce grepability without doing anything useful. That's a style preference of course.

+ /* + Maps column index from master to slave. This is determined using the field + names (provied by optional metadata when the master is configured with + binlog_row_metadata=FULL). + */ + std::map<uint, uint> master_to_slave_index_map; + + /* + When using field names to map from master->slave columns, this keeps track + of column indices on the master which aren't present on the slave. + + It is used to skip columns when checking type-conversions and unpacking + row data. + */ + std::set<uint> master_unmatched_cols;

Is using std::map/std::set going to be expensive here, this is operations that will happen on every row-based event (when --binlog-row-metadata=FULL)? The std:: functions can imply a lot of memory allocations that are not very visible, and we usually try to avoid them in performance-relevant server code. master_to_slave_index_map is just a lookup table of numbers 0..#columns-1, right? So it doesn't need to be an associative array, a normal array is sufficient, and doesn't need to be re-allocated each event. And master_unmatched_cols is again just a set of numbers 0..#columns-1, so a bitmap could be used. - Kristian.

Test to show that the slave doesn't re-binlog correctly if it has extra columns. Note this is only broken in 10.6, 10.11 fixes it (as well as my MDEV-36290 patch). The problem is that the slave doesn't set table->rpl_write_set correctly. Also table->read_set and table->write_set are not correct, but I haven't # yet looked into the implications of that.. To show this, this test uses chain replication with slave-specific columns. The default value for the column is different on each slave. The middle-slave doesn't write its column to the binlog (despite being configured with binlog_row_image=FULL), and so when the final slave in the chain sees the event, it only sees one column being updated, and then uses its default for its extra column, which is different than the middle slaves default. Then, we use diff_tables.inc to show the difference between the tables.

One can have data loss in multi-master setups when 1) both masters update the same table, 2) ATLER TABLE is run on one master which re-arranges the column ordering, and 3) transactions are binlogged in ROW binlog_format. This is because a slave identifies columns to update using column index number. That is, if a transaction updates a table with <n> columns on the master, the binary log ROW event will store its data column-by-column, from the first column, to the <n>th column, in-order. When the slave applies this row to its table, it simply updates each column of its new row using these same values, in the same order. If the slave’s table has its columns in a different order (from some ALTER TABLE, which can have added, removed, or re-arranged) these columns, the data will still be stored in the order that it was done on the master table. This leads to data loss. This patch adds the ability for a slave to lookup column name when applying ROW events, so if the ordering of columns on the master and slave differs, the slave can still apply the data changes to the correct column. This is limited to when a master binlogs events using option binlog_row_metadata=FULL, as this extends Table_map_log_event metadata to contain column names. When this column name metadata is missing, the lookup behavior is unchanged (i.e. it will use column index, as before). General code changes that extend beyond just this patch: 1. In unpack_row(), column "cleanup" for fields that don't exist on the slave is now done as unpacking is done. Prior to this patch, this cleanup would happen only after unpacking the event data, as the only place that the event data could have extra rows would be after the slave fields had been processed. Now, the slave can be missing _any_ field that was logged on the master; and we have to account for that as we see it. That is, for fields in the event that are missing on the slave, we must move forward the both the pointer that tracks the next spot to unpack data, as well as account for our null_mask for tracking null columns. To try and help ensure this new logic is correct, a new assertion is added, assert_master_idx_matches_null_tracking() which verifies that our tracking of master indices aligns with our null field tracking (albeit this can only be done when all fields of a row event are binlogged, e.g when logged with binlog_row_image=FULL). Note that we can't meaningfully validate our master index is consistent with null bit tracking when we have missing columns, as they can be missing in arbitrary places, and if we were to track when a bit is written, we would have to refer to master_idx, which would invalidate the assertion in the first place. 2. table->{read_set, write_set, rpl_write_set} also needed to be updated so its bits would adjust for offset columns on slave. Note that rpl_write_set is set in TABLE::mark_columns_per_binlog_row_image(), which is pulled from 10.11. This fixes an existing bug in 10.6 where all slave columns wouldn't be written if the slave had additional columns (though the bug doesn't exits in 10.11, because it already calls mark_columns_per_binlog_row_image()). 3. There was an issue with the conversion table, where its columns weren't in sync with the table_def generated by the Table_map_log_event. I.e., when the slave didn't have the column, the conv_table generation logic would simply skip the rest of the loop and go on to the next column (without adding a field for the missing column). The compatible_with and unpack_row logic, however, assumed the conv_table was in alignment with the columns from the Rows_log_event. I thought of two ways to fix this: a. Add a dummy field to the conv_table (and later reset it to NULL). b. Have a separate index variable which tracks the conv_table fields. This patch implements the second (b), as it is more efficient. Patch is currently still being drafted. TODOs left: 1. Finish MTR testing current changes 2. Add error-checking logic & MTR tests 3. Optimize master_to_slave_index_map and master_unmatched_cols to use an array and bitmap (per Kristian's review suggestion) 4. Re-test rpl and multi_source MTR suites with binlog_row_metadata=FULL (temporarily change default value) to make sure nothing fails from that option alone 5. Cleanup patch (remove printfs for debugging) 6. Extensive QA testing Reviewed By: ============ <TODO>

Added NULL columns in testing, parameterizd tests by files so testing combinations of options (Add/drop/type conversion/NULL) together is easier. Some TODOs remain to be written.

bnestere added MariaDB Corporation Replication Patches involved in replication labels Apr 11, 2025

bnestere requested a review from knielsen April 11, 2025 21:34

bnestere self-assigned this Apr 11, 2025

bnestere added 3 commits April 17, 2025 16:43

MDEV-36290: Tests to show regression

91c358a

bnestere force-pushed the 10.6-MDEV-36290 branch from 66841d9 to 79aab08 Compare April 17, 2025 23:02

MDEV-36290: Test amendments (Add NULL, and parameterized tests better)

a34ab48

Added NULL columns in testing, parameterizd tests by files so testing combinations of options (Add/drop/type conversion/NULL) together is easier. Some TODOs remain to be written.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

bnestere commented Apr 11, 2025 •

edited

Loading

knielsen commented Apr 16, 2025 via email

MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

Are you sure you want to change the base?

MDEV-36290: ALTER TABLE with multi-master can cause data loss #3967

Conversation

bnestere commented Apr 11, 2025 • edited Loading

knielsen commented Apr 16, 2025 via email

bnestere commented Apr 11, 2025 •

edited

Loading