Handling of overlapping samples and shuffling #30

Closed
tcchiao opened this issue Nov 2, 2021 · 2 comments · Fixed by #132
Labels: question (Further information is requested)

Comments

@tcchiao (Contributor) commented Nov 2, 2021

Currently, the BatchGenerator first divides the input data into batches as specified, then iterates through the batches to generate all possible samples within each batch. This has implications for the output behavior:

  1. For applications like generating samples for training machine learning models, overlapping samples are often desired. For example, consider a time series dataset with 100 time points. In order to predict the value at time point 101, we can build a forecast model using the 10 preceding values as the feature. Instead of having 10 non-overlapping samples ([0, 1, 2, ..., 9], [10, 11, 12, ..., 19], ..., [90, 91, 92, ..., 99]), we would often want to train the model with 91 overlapping samples ([0, 1, 2, ..., 9], [1, 2, 3, ..., 10], ..., [90, 91, 92, ..., 99]). This behavior can be controlled by the input_overlap argument of BatchGenerator (see the sketch after this list). However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).
  2. The samples within a batch are grouped along specific dimensions, controlled by how batches are generated, instead of being randomly grouped, as is often desired in a batched training scheme for machine learning models.
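
For illustration, here is a minimal sketch of the overlapping-sample setup described in point 1, assuming a 100-point time series and using BatchGenerator's input_dims and input_overlap arguments (the variable and dimension names are just placeholders):

import numpy as np
import xarray as xr
import xbatcher

# Toy time series: 100 points along dimension "time"
ds = xr.Dataset({"y": (["time"], np.random.rand(100))})

# Samples of length 10 that overlap by 9 points, i.e. consecutive samples
# are shifted by a single time step: [0..9], [1..10], ..., [90..99]
bg = xbatcher.BatchGenerator(
    ds,
    input_dims={"time": 10},
    input_overlap={"time": 9},
)

# Adding batch_dims={"time": 50} would split the data into two batches
# before samples are generated, dropping the 9 windows that straddle the
# boundary at time step 50.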

For the applications I've encountered, the "desired" behavior is different from the current behavior. However, to enable these desired behaviors, all possible samples would need to be generated before being grouped into batches, which can be memory intensive since overlapping samples consist largely of the same data as their adjacent samples. This is especially difficult for datasets that are stored in chunks.

I'm curious whether this is the commonly desired behavior for xbatcher's target audience, and would like to brainstorm potential implementations!

weiji14 added the question (Further information is requested) label on Nov 26, 2022
@weiji14 (Member) commented Nov 26, 2022

Hi @tcchiao, sorry for the late response one year later! I'm just re-reading this issue in light of the new lazy batch generation feature added in #112 (released for xbatcher v0.2.0), and seeing if this can be resolved in xbatcher.

> 1. However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).

To be honest, I'm having a little trouble understanding this point 😅 Maybe some example data and code would help? I'm trying to think if xbatcher's lazy batch generation could help with this already, i.e. since lazy batch sampling is cheap, someone could just loop over the same dataset twice if needed? But maybe I'm confused about the original point.

> 2. The samples within a batch are grouped along specific dimensions, controlled by how batches are generated, instead of being randomly grouped, as is often desired in a batched training scheme for machine learning models.

My interpretation of this is that xbatcher would be able to generate the slices of data, and a separate shuffling routine would then randomize the order of those slices of data. Whether that shuffling algorithm should be in xbatcher or somewhere else is up for debate. I know for example that torchdata has https://pytorch.org/data/0.5/generated/torchdata.datapipes.iter.Shuffler.html, and there might be a similar thing in tensorflow?
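
For illustration, here's a minimal sketch of how such a shuffling routine could sit outside xbatcher, assuming the generator supports len() and integer indexing (as the lazy generation from #112 should allow); iter_shuffled is a hypothetical helper, not an existing xbatcher API:

import random

def iter_shuffled(bgen, seed=None):
    # Yield batches from an xbatcher BatchGenerator in a random order;
    # the contents of each batch are left untouched.
    indices = list(range(len(bgen)))
    random.Random(seed).shuffle(indices)
    for i in indices:
        yield bgen[i]

# Usage, assuming `bg` is a BatchGenerator:
# for batch in iter_shuffled(bg, seed=42):
#     ...

Shuffling whole batches like this is the cheap option; shuffling individual samples across batches would still run into the memory concerns raised above.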

@maxrjones (Member) commented

> However, since batches are generated before possible samples, samples that straddle two batches are discarded. If we want two batches in our toy example, we would miss the 9 samples from [41, 42, ..., 50], [42, 43, ..., 51] through [49, 50, ..., 58] and end up with 41 samples in each batch (82 samples in total).

> To be honest, I'm having a little trouble understanding this point 😅 Maybe some example data and code would help? I'm trying to think if xbatcher's lazy batch generation could help with this already, i.e. since lazy batch sampling is cheap, someone could just loop over the same dataset twice if needed? But maybe I'm confused about the original point.

@weiji14 I've tried to address this in #132. Here's an MVCE and a small diagram explaining the issue:

import xarray as xr
import numpy as np
import xbatcher

# Toy dataset: one 1-D variable "foo" along dimension "x" with 12 points
size = 12
ds = xr.Dataset(
    {
        "foo": (["x"], np.random.rand(size)),
    },
    {"x": (["x"], np.arange(size))},
)

# Samples of length 3 that overlap by 1, grouped into batches of 6 points,
# with the samples in each batch concatenated along a new dimension
bg = xbatcher.BatchGenerator(
    ds,
    input_dims={'x': 3},
    input_overlap={'x': 1},
    batch_dims={'x': 6},
    concat_input_dims=True
)
bg[0]['foo']

Output from 4d8e2c84a2d405e237f60f1df5286dd766e06ff0:

<xarray.DataArray 'foo' (input_batch: 2, x_input: 3)>
array([[0.20381835, 0.57481247, 0.25813155],
       [0.25813155, 0.20775249, 0.90049365]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4
Dimensions without coordinates: input_batch, x_input

Output from #132:

<xarray.DataArray 'foo' (input_batch: 3, x_input: 3)>
array([[0.38957171, 0.77933539, 0.69332417],
       [0.69332417, 0.173178  , 0.10916075],
       [0.10916075, 0.07016787, 0.7544322 ]])
Coordinates:
    x        (input_batch, x_input) int64 0 1 2 2 3 4 4 5 6
Dimensions without coordinates: input_batch, x_input

Pictorial explanation: [diagram attached to the original comment]

maxrjones mentioned this issue on Jan 7, 2023