[HUDI-8940] Fix Bloom Index Partitioner to distribute keys uniformly across partitions #12741

Open
wants to merge 1 commit into
base: master
from

vamsikarnika
Contributor

@vamsikarnika vamsikarnika commented Jan 30, 2025

Change Logs

The Bloom index causes data skew during the repartition and sort stage when bucketized partitioning is used and hollow buckets are created. This happens when one partition receives a lot of writes and the other partitions receive few.

This PR partitions by fileId + recordKey using a sort partitioner, which distributes the keys evenly while keeping entries for the same fileId together.
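To illustrate the idea, here is a minimal plain-Java sketch of range partitioning on a composite (fileId, recordKey) sort key. It is not the code in this PR and does not use Spark or any Hudi class; `FileIdKeySortSketch`, `Comparison`, and `rangePartition` are hypothetical names chosen for the example, and the 90/5/5 key distribution is made up to mimic one hot file group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FileIdKeySortSketch {

    // Hypothetical pair representing one (fileId, recordKey) comparison.
    public static final class Comparison {
        public final String fileId;
        public final String recordKey;

        public Comparison(String fileId, String recordKey) {
            this.fileId = fileId;
            this.recordKey = recordKey;
        }
    }

    // Sort by fileId, then recordKey, and cut the sorted list into
    // numPartitions contiguous ranges of near-equal size. Entries for the
    // same fileId stay adjacent, but a hot fileId may span several ranges.
    public static List<List<Comparison>> rangePartition(List<Comparison> input, int numPartitions) {
        List<Comparison> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Comparison c) -> c.fileId)
            .thenComparing(c -> c.recordKey));

        List<List<Comparison>> partitions = new ArrayList<>();
        int n = sorted.size();
        for (int p = 0; p < numPartitions; p++) {
            int start = (int) ((long) n * p / numPartitions);
            int end = (int) ((long) n * (p + 1) / numPartitions);
            partitions.add(new ArrayList<>(sorted.subList(start, end)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Comparison> comparisons = new ArrayList<>();
        // Made-up workload: one hot file group and two cold ones.
        for (int i = 0; i < 90; i++) {
            comparisons.add(new Comparison("file-hot", String.format("key-%03d", i)));
        }
        for (int i = 0; i < 5; i++) {
            comparisons.add(new Comparison("file-a", "key-" + i));
            comparisons.add(new Comparison("file-b", "key-" + i));
        }

        List<List<Comparison>> parts = rangePartition(comparisons, 4);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println("partition " + p + ": " + parts.get(p).size() + " comparisons");
        }
    }
}
```

With 100 comparisons and 4 partitions, each range holds 25 entries regardless of how unevenly the keys are spread across file groups, which is the property the description above relies on.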

Screenshot 2025-01-17 at 9 29 00 PM (1)

Impact

The Bloom index can now use a fileId and record key based partitioner to distribute comparisons evenly across partitions in cases where bucketized partitioning causes data skew.

Risk level (write none, low, medium or high below)

Medium

Added Functional Tests

Documentation Update

public static final ConfigProperty<String> BLOOM_INDEX_FILE_GROUP_ID_KEY_SORT_PARTITIONER = ConfigProperty
      .key("hoodie.bloom.index.fileId.key.sort.partitioner")
      .defaultValue("false")
      .markAdvanced()
      .withDocumentation("Only applies if index type is BLOOM. "
          + "When true, fileId and key sort based partitioning is enabled. "
          + "This reduces skew seen in bucket based bloom index lookup.");
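As a usage sketch, the option could be enabled by passing it alongside the index type in the writer properties. The key `hoodie.bloom.index.fileId.key.sort.partitioner` is the one proposed in this PR and only exists if the change is merged; `BloomIndexConfigSketch` and `bloomIndexProps` are hypothetical names, and how the properties reach the writer depends on your setup.

```java
import java.util.Properties;

public class BloomIndexConfigSketch {

    // Builds writer properties that select the BLOOM index and opt in to the
    // fileId + key sort partitioner proposed in this PR (default is "false").
    public static Properties bloomIndexProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.index.type", "BLOOM");
        props.setProperty("hoodie.bloom.index.fileId.key.sort.partitioner", "true");
        return props;
    }

    public static void main(String[] args) {
        bloomIndexProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```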

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@vamsikarnika vamsikarnika force-pushed the fix_bloom_index_partitioner_v3 branch from 81853e0 to 666815b Compare January 30, 2025 11:54
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Jan 30, 2025
@vamsikarnika vamsikarnika changed the title Implement FileId + RecordKey based sort partitioning to reduce skew i… [HUDI-8940] Fix Bloom Index Partitioner to distribute keys uniformly across partitions Jan 30, 2025
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
