Handling of overlapping samples and shuffling #30

Closed
tcchiao opened this issue Nov 2, 2021 · 2 comments · Fixed by #132
Labels: question (Further information is requested)

Comments

@tcchiao (Contributor) commented Nov 2, 2021

Currently, the BatchGenerator first divides the input data into batches as specified, then iterates through the batches to generate all possible samples within each batch. This has implications for the output behavior:

  1. For applications like generating samples for training machine learning models, overlapping samples are often desired. For example, consider a time series dataset with 100 time points. In order to predict the value at time point 101, we can build a forecast model using the 10 preceding values as the feature. Instead of having 10 non-overlapping samples ([0, 1, 2, ..., 9], [10, 11, 12, ..., 19], ..., [90, 91, 92, ..., 99]), we would often want to train the model with 91 overlapping samples ([0, 1, 2, ..., 9], [1, 2, 3, ..., 10], ..., [90, 91, 92, ..., 99]). This behavior can be controlled by the input_overlap argument of BatchGenerator (see the sketch after this list). However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).
  2. The samples within a batch are grouped along specific dimensions, controlled by how batches are generated, instead of being randomly grouped, as is often desired in a batched training scheme for machine learning models.
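
For illustration, here is a minimal sketch of the overlapping-sample setup described in point 1, assuming a 100-point time series and using BatchGenerator's input_dims and input_overlap arguments (the variable and dimension names are just placeholders):

import numpy as np
import xarray as xr
import xbatcher

# Toy time series: 100 points along dimension "time"
ds = xr.Dataset({"y": (["time"], np.random.rand(100))})

# Samples of length 10 that overlap by 9 points, i.e. consecutive samples
# are shifted by a single time step: [0..9], [1..10], ..., [90..99]
bg = xbatcher.BatchGenerator(
    ds,
    input_dims={"time": 10},
    input_overlap={"time": 9},
)

# Adding batch_dims={"time": 50} would split the data into two batches
# before samples are generated, dropping the 9 windows that straddle the
# boundary at time step 50.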

For the applications I've encountered, the "desired" behavior is different from the current behavior. However, to enable these desired behaviors, all possible samples would need to be generated before being grouped into batches, which can be memory intensive since overlapping samples consist largely of the same data as their adjacent samples. This is especially difficult for datasets that are stored in chunks.

I'm curious whether this is the commonly desired behavior for xbatcher's target audience, and would like to brainstorm potential implementations!

weiji14 added the question (Further information is requested) label on Nov 26, 2022
@weiji14 (Member) commented Nov 26, 2022

Hi @tcchiao, sorry for the late response one year later! I'm just re-reading this issue in light of the new lazy batch generation feature added in #112 (released for xbatcher v0.2.0), and seeing if this can be resolved in xbatcher.

> 1. However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).

To be honest, I'm having a little trouble understanding this point 😅 Maybe some example data and code would help? I'm trying to think if xbatcher's lazy batch generation could help with this already, i.e. since lazy batch sampling is cheap, someone could just loop over the same dataset twice if needed? But maybe I'm confused about the original point.

> 2. The samples within a batch are grouped along specific dimensions, controlled by how batches are generated, instead of being randomly grouped, as is often desired in a batched training scheme for machine learning models.

My interpretation of this is that xbatcher would be able to generate the slices of data, and a separate shuffling routine would then randomize the order of those slices of data. Whether that shuffling algorithm should be in xbatcher or somewhere else is up for debate. I know for example that torchdata has https://pytorch.org/data/0.5/generated/torchdata.datapipes.iter.Shuffler.html, and there might be a similar thing in tensorflow?
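
For illustration, here's a minimal sketch of how such a shuffling routine could sit outside xbatcher, assuming the generator supports len() and integer indexing (as the lazy generation from #112 should allow); iter_shuffled is a hypothetical helper, not an existing xbatcher API:

import random

def iter_shuffled(bgen, seed=None):
    # Yield batches from an xbatcher BatchGenerator in a random order;
    # the contents of each batch are left untouched.
    indices = list(range(len(bgen)))
    random.Random(seed).shuffle(indices)
    for i in indices:
        yield bgen[i]

# Usage, assuming `bg` is a BatchGenerator:
# for batch in iter_shuffled(bg, seed=42):
#     ...

Shuffling whole batches like this is the cheap option; shuffling individual samples across batches would still run into the memory concerns raised above.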

@maxrjones (Member) commented

> However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).

> To be honest, I'm having a little trouble understanding this point 😅 Maybe some example data and code would help? I'm trying to think if xbatcher's lazy batch generation could help with this already, i.e. since lazy batch sampling is cheap, someone could just loop over the same dataset twice if needed? But maybe I'm confused about the original point.

@weiji14 I've tried to address this in #132. Here's an MVCE and a small diagram explaining the issue:

import xarray as xr
import numpy as np
import xbatcher

# Toy dataset: one 1-D variable "foo" along dimension "x" with 12 points
size = 12
ds = xr.Dataset(
    {
        "foo": (["x"], np.random.rand(size)),
    },
    {"x": (["x"], np.arange(size))},
)

# Samples of length 3 that overlap by 1, grouped into batches of 6 points,
# with the samples in each batch concatenated along a new dimension
bg = xbatcher.BatchGenerator(
    ds,
    input_dims={'x': 3},
    input_overlap={'x': 1},
    batch_dims={'x': 6},
    concat_input_dims=True
)
bg[0]['foo']

Output from 4d8e2c84a2d405e237f60f1df5286dd766e06ff0:

<xarray.DataArray 'foo' (input_batch: 2, x_input: 3)>
array([[0.20381835, 0.57481247, 0.25813155],
       [0.25813155, 0.20775249, 0.90049365]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4
Dimensions without coordinates: input_batch, x_input

Output from #132:

<xarray.DataArray 'foo' (input_batch: 3, x_input: 3)>
array([[0.38957171, 0.77933539, 0.69332417],
       [0.69332417, 0.173178  , 0.10916075],
       [0.10916075, 0.07016787, 0.7544322 ]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4 4 5 6
Dimensions without coordinates: input_batch, x_input

Pictorial explanation: [diagram attached to the original comment]

maxrjones mentioned this issue on Jan 7, 2023