Handling of overlapping samples and shuffling #30
Hi @tcchiao, sorry for the late response one year later! I'm just re-reading this issue in light of the new lazy batch generation feature added in #112 (released in xbatcher v0.2.0), and seeing if this can be resolved in xbatcher.
To be honest, I'm having a little trouble understanding this point 😅 Maybe some example data and code would help? I'm trying to think if xbatcher's lazy batch generation could help with this already, i.e. since lazy batch sampling is cheap, someone could just loop over the same dataset twice if needed? But maybe I'm confused about the original point.
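For reference, a minimal sketch of the "loop over the same dataset twice" idea, assuming the lazy `BatchGenerator` from xbatcher >= 0.2.0; the dataset, variable name, and sizes here are made up for illustration:

```python
# A toy illustration (not from the issue): lazily re-iterate the same
# BatchGenerator for multiple passes, since batch slicing happens on access.
import numpy as np
import xarray as xr
import xbatcher

ds = xr.Dataset({"foo": (["x"], np.random.rand(100))})
bg = xbatcher.BatchGenerator(ds, input_dims={"x": 10})

for epoch in range(2):           # two passes over the same generator
    for batch in bg:             # each batch is an xarray.Dataset sliced from `ds`
        _ = batch["foo"].values  # materialize only when the batch is used
```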
@weiji14 I've tried to address this in #132. Here's an MVCE and small diagram explaining the issue:

```python
import xarray as xr
import numpy as np
import xbatcher

size = 12
ds = xr.Dataset(
    {
        "foo": (["x"], np.random.rand(size)),
    },
    {"x": (["x"], np.arange(size))},
)
bg = xbatcher.BatchGenerator(
    ds,
    input_dims={'x': 3},
    input_overlap={'x': 1},
    batch_dims={'x': 6},
    concat_input_dims=True,
)
bg[0]['foo']
```

Output from the current behavior:

```
<xarray.DataArray 'foo' (input_batch: 2, x_input: 3)>
array([[0.20381835, 0.57481247, 0.25813155],
       [0.25813155, 0.20775249, 0.90049365]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4
Dimensions without coordinates: input_batch, x_input
```

Output from #132:

```
<xarray.DataArray 'foo' (input_batch: 3, x_input: 3)>
array([[0.38957171, 0.77933539, 0.69332417],
       [0.69332417, 0.173178  , 0.10916075],
       [0.10916075, 0.07016787, 0.7544322 ]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4 4 5 6
Dimensions without coordinates: input_batch, x_input
```
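To spell out the sample counts behind those two outputs, here is a minimal sketch of the window arithmetic (not part of xbatcher; `n_windows` is a hypothetical helper, and the per-batch vs. whole-dataset interpretation of the two outputs is my reading of the MVCE):

```python
# Hypothetical helper: how many sliding windows of length `window` with the
# given overlap fit inside `length` points.
def n_windows(length: int, window: int, overlap: int) -> int:
    stride = window - overlap
    return (length - window) // stride + 1

# Current behavior: samples are generated inside each batch of 6 points, so
# the window covering x = 4, 5, 6 (which straddles the two batches) is dropped.
print(n_windows(6, 3, 1))   # 2 samples per batch -> 4 samples in total
# With #132, the straddling window is kept, which would match a plain sliding
# window over the whole 12-point dataset.
print(n_windows(12, 3, 1))  # 5 samples in total
```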
Currently, the BatchGenerator first divides the input data into batches as specified, then iterates through the batches to generate all possible samples within each batch. This has implications for the output behavior:

Overlapping samples are supported through the `input_overlap` argument of `BatchGenerator`. However, since batches are generated before the possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50] and [42, 43, ..., 51] through [49, 50, ..., 58], and end up with 41 samples in each batch (82 samples total).

For the applications I've encountered, the "desired" behavior is different from the current behavior. However, enabling these desired behaviors would require generating the possible samples before grouping them into batches, which can be memory intensive since overlapping samples consist mostly of the same data as adjacent samples. This is especially difficult for datasets that are stored in chunks.
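For concreteness, the numbers in the toy example are consistent with 100 points split into two batches of 50, using windows of length 10 with an overlap of 9 (stride 1); a rough sketch of that bookkeeping, with these parameters assumed rather than taken from the issue text:

```python
# Assumed toy-example parameters (inferred from the sample indices quoted
# above, not stated explicitly): 100 points, two batches of 50, windows of
# length 10 with an overlap of 9, i.e. a stride of 1.
length, batch, window, overlap = 100, 50, 10, 9
stride = window - overlap

per_batch = (batch - window) // stride + 1       # 41 samples in each batch
current_total = 2 * per_batch                    # 82 samples total
all_windows = (length - window) // stride + 1    # 91 windows over the full series
missed = all_windows - current_total             # the 9 straddling samples discarded
print(per_batch, current_total, all_windows, missed)  # 41 82 91 9
```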
Curious to understand whether this is the commonly desired behavior for the target audience of xbatcher, and to brainstorm potential implementations!