Ensure GroupByHash does not error when trying to spill (calling try_resize where an error is not acceptable) #14851

Open · Tracked by #14077
Labels: bug (Something isn't working)

EmilyMatt opened this issue Feb 24, 2025 · 1 comment

EmilyMatt commented Feb 24, 2025

Describe the bug

In GroupedHashAggregateStream, update_memory_reservation() makes a try_resize() call. An error returned from this call is usually handled gracefully: we are typically in the middle of emitting and can simply emit again, or we are accumulating and can spill (unless not even a single batch fits in memory, in which case a panic can be justified).

However, when creating a merge stream from the spill data, there is no viable fallback: if the memory pool denies the reservation, the error propagates and the query aborts.
With a custom memory pool it should be possible to implement a "burst" allocation or something similar and avoid aborting the query, but that requires a "declaration of intentions" of sorts from the aggregate. Shouldn't resize() be used in such cases instead of try_resize()?

This is the problematic line:

self.update_memory_reservation()?;
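
For context, here is a minimal sketch of the difference between the two calls, using the MemoryPool API from the datafusion crate (the pool limit, consumer name, and sizes are arbitrary): try_resize() surfaces the pool's refusal as an error, while resize() records the allocation unconditionally, so the operator keeps running and the pool merely goes over budget.

use std::sync::Arc;

use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() {
    // A pool with a deliberately tiny 1 KiB limit.
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024));
    let mut reservation = MemoryConsumer::new("merge-stream").register(&pool);

    // try_resize() returns Err(ResourcesExhausted) when the pool denies it.
    assert!(reservation.try_resize(4096).is_err());

    // resize() records the allocation infallibly; the pool is now over
    // budget, but the query is not aborted.
    reservation.resize(4096);
    assert_eq!(reservation.size(), 4096);
}

With resize(), whether the over-allocation is acceptable becomes a policy decision of the pool implementation rather than a hard failure inside the operator.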

I believe the same concern applies around line 910:

match self.update_memory_reservation() {
    // Here we can ignore `insufficient_capacity_err` because we will
    // spill later, but at least one batch should fit in the memory
    Err(DataFusionError::ResourcesExhausted(_))
        if self.group_values.len() >= self.batch_size =>
    {
        Ok(())
    }
    other => other,
}

This call site should check preemptively whether spilling is still possible; if it is not, it should use resize() rather than the try_resize() that update_memory_reservation() performs. That said, this path at least has a chance of not aborting, so it seems more agreeable. A sketch of the proposed control flow follows.
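
A minimal sketch of that idea (the free function and the can_spill flag are hypothetical stand-ins for the stream's actual state, not DataFusion's current code):

use datafusion::common::Result;
use datafusion::execution::memory_pool::MemoryReservation;

// Hypothetical variant of update_memory_reservation(): the caller
// declares whether a spill fallback still exists for this allocation.
fn update_memory_reservation(
    reservation: &mut MemoryReservation,
    needed: usize,
    can_spill: bool,
) -> Result<()> {
    if can_spill {
        // Recoverable path: a ResourcesExhausted error here is caught
        // by the caller, which spills to disk and retries.
        reservation.try_resize(needed)
    } else {
        // No fallback (e.g. building the merge stream over spill files):
        // record the allocation infallibly instead of failing the query.
        reservation.resize(needed);
        Ok(())
    }
}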

It's possible that other design considerations were taken into account here, but this is my understanding at least.

To Reproduce

No response

Expected behavior

No response

Additional context

No response

EmilyMatt added the bug label Feb 24, 2025
alamb changed the title from "Calling try_resize where error is not acceptable" to "Ensure GroupByHash does not error when trying to spill (calling try_resize where an error is not acceptable)" Feb 24, 2025
alamb (Contributor) commented Feb 24, 2025

Thanks @EmilyMatt -- I agree with the goal of this ticket: when operating in a memory-constrained environment, it is best to avoid erroring during the spilling process itself.

> It's possible that other design considerations were taken into account here, but this is my understanding at least.

> With a custom memory pool it should be possible to implement a "burst" allocation or something similar and avoid aborting the query, but that requires a "declaration of intentions" of sorts from the aggregate. Shouldn't resize() be used in such cases instead of try_resize()?

I think another classic solution is to reduce the memory requirements (though at some cost to performance) using a multi-pass merge, as described in #14692 (comment)

There is some overlap with the discussions that @2010YOUY01 has been having on this ticket (though that one is related to sorting)

So it may be that your idea of improving the reservation calls can work and improve your use case.

However, solving the problem of "being able to merge an arbitrary amount of data / spill files given limited memory" will require a multi-level merge.
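
To illustrate, a schematic of such a multi-level merge (SpillFile and merge_files are placeholder stand-ins, not DataFusion's actual spill machinery): with memory for only fan_in open runs at a time, each intermediate pass trades extra disk I/O for bounded peak memory.

/// Placeholder for a sorted run spilled to disk.
struct SpillFile;

/// Placeholder: stream-merge the given sorted runs into one new run,
/// holding only one read buffer per input in memory.
fn merge_files(_runs: &[SpillFile]) -> SpillFile {
    SpillFile
}

/// Merge an arbitrary number of spill files with a bounded fan-in:
/// each pass needs memory for only `fan_in` read buffers, at the cost
/// of rereading and rewriting the data once per level.
fn multi_level_merge(mut files: Vec<SpillFile>, fan_in: usize) -> SpillFile {
    assert!(fan_in >= 2);
    while files.len() > fan_in {
        // Collapse the current level into ceil(len / fan_in) new runs.
        files = files.chunks(fan_in).map(merge_files).collect();
    }
    merge_files(&files)
}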
