Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8850] insert merge mode #12735

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Davis-Zhang-Onehouse
Copy link
Contributor

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse commented Jan 29, 2025

Change Logs

MIT delete does not take effect. Run test class TestMergeModeCommitTimeOrdering.

COW, Override payload.
(id 1, ts 100)
incoming record
(id 1, ts 101)

MIT on match id column then delete

We end up with (id 1, ts 100).


Investigation findings

Investigation found that:
shouldIgnore(Schema, Properties):173, for the incoming record returns true. So it is completely not processed.

isDeleteRecord(GenericRecord):87, BaseAvroPayload (org.apache.hudi.common.model), BaseAvroPayload.java
getInsertValue(Schema, Properties):270, ExpressionPayload (org.apache.spark.sql.hudi.command.payload), ExpressionPayload.scala
shouldIgnore(Schema, Properties):173, HoodieAvroRecord (org.apache.hudi.common.model), HoodieAvroRecord.java
writeInsertRecord(HoodieRecord):303, HoodieMergeHandle (org.apache.hudi.io), HoodieMergeHandle.java
writeIncomingRecords():120, HoodieConcatHandle (org.apache.hudi.io), HoodieConcatHandle.java
close():452, HoodieMergeHandle (org.apache.hudi.io), HoodieMergeHandle.java
finish():59, BaseMergeHelper$UpdateHandler (org.apache.hudi.table.action.commit), BaseMergeHelper.java
finish():44, BaseMergeHelper$UpdateHandler (org.apache.hudi.table.action.commit), BaseMergeHelper.java
execute():72, SimpleExecutor (org.apache.hudi.common.util.queue), SimpleExecutor.java
runMerge(HoodieTable, HoodieMergeHandle):149, HoodieMergeHelper (org.apache.hudi.table.action.commit), HoodieMergeHelper.java
runMerge(HoodieMergeHandle, String, String):147, HoodieSparkTable (org.apache.hudi.table), HoodieSparkTable.java
handleUpdateInternal(HoodieMergeHandle, String):351, BaseSparkCommitActionExecutor (org.apache.hudi.table.action.commit), BaseSparkCommitActionExecutor.java
handleUpdate(String, String, Iterator):346, BaseSparkCommitActionExecutor (org.apache.hudi.table.action.commit), BaseSparkCommitActionExecutor.java
handleUpsertPartition(String, Integer, Iterator, Partitioner):312, BaseSparkCommitActionExecutor (org.apache.hudi.table.action.commit), BaseSparkCommitActionExecutor.java
handleInsertPartition(String, Integer, Iterator, Partitioner):325, BaseSparkCommitActionExecutor (org.apache.hudi.table.action.commit), BaseSparkCommitActionExecutor.java

Because "getInsertValue"


  override def getInsertValue(schema: Schema, properties: Properties): HOption[IndexedRecord] = {
    val recordSchema = getRecordSchema(properties)
    val incomingRecord = ConvertibleRecord(bytesToAvro(recordBytes, recordSchema))

    if (super.isDeleteRecord(incomingRecord.asAvro)) {
      HOption.empty[IndexedRecord]()
    } else if (isMORTable(properties)) {
      ...
    } else {
      // For COW table, only the not-matched record will step into the getInsertValue method, So just call
      // the processNotMatchedRecord() here.
      processNotMatchedRecord(incomingRecord, properties) <== returns HOption.of(HoodieRecord.SENTINEL), an indicator that it should be ignored.
    }
  }

Questions

I'm confused by the logic, here we have the incoming record matching with the existing record, yet we are calling getInsertValue, which is a branch intended for " the not-matched record ". It internally poke into the assignment clause and for MIT delete it is totally garbage info and we end up with HoodieRecord.SENTINEL.

I need to learn the right flow of MIT delete first to figure out the problem.

Impact

Risk level (write none, low medium or high below)

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jan 29, 2025
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants