Generic NextGenHDF writer and a job for RASR cache dumps #394

Open
wants to merge 1 commit into main

Conversation

Marvin84
Contributor

RETURNN's NextGenHDFDataset allows for the storage of multiple datasets in one HDF file. Given also some ongoing discussions here, I am proposing a generic NextGenHDFWriter that writes several datasets (features or labels).
Moreover, there is also one Sisyphus job that uses this specific writer, as an example.

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training, as @vieting experienced recently. The keys of the label_info_dict can be used directly in the softmax layer as target='my_key'.
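
As a rough sketch (not part of this PR; the stream names are placeholders and the exact NextGenHDFDataset options are my assumption), consuming such a file in a RETURNN config could look like this:

train = {
    "class": "NextGenHDFDataset",
    "files": ["data.hdf"],
    "input_stream_name": "features",  # assumed option name
}

network = {
    "output": {
        "class": "softmax",
        "from": "data",
        "target": "my_key",  # sparse label stream, used directly without a squeeze layer
        "loss": "ce",
    },
}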

Current use case and why this is needed in i6_core:
This is currently needed for the factored hybrid system for the estimation of context-dependent priors. Moreover, the use of several additional targets in the form of multi-task learning is also very common and can be handled easily.

@michelwi
Contributor

Let me briefly add some comments without having read the code:

Firstly concerning

NextGenHDFDataset allows for the storage of multiple datasets in one HDF file.
for the factored hybrid system for the estimation of context-dependent priors.

Yes, I agree that this is a very neat solution for this special case, since your FH multi-task targets always go hand in hand and would never be used separately. However, for general multi-task learning as mentioned

additional targets in the form of multi-task learning is also very common and can be handled easily.

I would rather suggest keeping the data in separate datasets which are combined on demand with one of the meta-datasets in RETURNN. This way we do not need to redo the dumping when you add an additional task, or load unnecessary ballast when you remove one.
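
As a rough sketch of what I mean (dataset and file names are just placeholders), one of the meta-datasets could be configured like this, so each task lives in its own HDF file and is only loaded when it is actually mapped:

train = {
    "class": "MetaDataset",
    "datasets": {
        "features": {"class": "HDFDataset", "files": ["features.hdf"]},
        "task_a": {"class": "HDFDataset", "files": ["task_a_targets.hdf"]},
    },
    "data_map": {
        "data": ("features", "data"),
        "task_a_labels": ("task_a", "data"),
    },
    "seq_order_control_dataset": "features",
}

Adding or removing a task then only touches the config, not the dumped files.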

Secondly concerning:

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training

Are you sure it is not possible? I think if one puts 2D data into the HDF file (i.e. [B,T]) then it could be defined as sparse and used directly in RETURNN. Let me try it:

import numpy as np
from returnn.datasets.hdf import SimpleHDFWriter

w = SimpleHDFWriter("./test.hdf", dim=60, ndim=1)
data = np.random.randint(0, 60, size=(5, 20))  # [B, T]: 5 seqs of 20 frames, values in [0, 60)
w.insert_batch(data, [20, 20, 20, 20, 20], ["1", "2", "3", "4", "5"])
w.close()

and

$ python3 tools/dump-dataset.py "{'class': 'HDFDataset', 'files': ['./test.hdf']}"
Dataset:
  input: 60 x 1
  output: {'data': [60, 1]}
  HDF dataset, sequences: 5, frames: 100
Epoch: 1
Dataset keys: ['data']
Dataset target keys: []
Dump to stdout
[...]

This seems to be a valid, sparse dataset (@vieting).

But I agree that a generic NextGenHDF writer would be a useful addition to our codebase. One additional question about your implementation: do you think it would make sense to have a generic DumpNextGenHDFJob from which the RasrDumpNextGenHDFJob inherits to set the exact layout for the alignment-based hybrid system training?
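
To illustrate the question, roughly like this (class and argument names below are only how I would imagine it, not taken from this PR):

from typing import Any, Dict

from sisyphus import Job, Task


class DumpNextGenHDFJob(Job):
    """Hypothetical generic base job: dumps arbitrary streams into a NextGenHDF file."""

    def __init__(self, streams: Dict[str, Any]):
        self.streams = streams
        self.out_hdf = self.output_path("data.hdf")

    def tasks(self):
        yield Task("run")

    def run(self):
        # generic writing of self.streams via the NextGenHDFWriter would go here
        raise NotImplementedError


class RasrDumpNextGenHDFJob(DumpNextGenHDFJob):
    """Hypothetical subclass fixing the layout to RASR feature/alignment caches
    for alignment-based hybrid training."""

    def __init__(self, feature_caches, alignment_caches, state_tying, label_info_dict):
        super().__init__(streams={})  # actual streams are derived from the RASR caches
        self.feature_caches = feature_caches
        self.alignment_caches = alignment_caches
        self.state_tying = state_tying
        self.label_info_dict = label_info_dict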

    def __init__(
        self,
        filename: str,
        label_info_dict: Dict,
@JackTemaki
Contributor
Apr 18, 2023

The arguments are non-generic: you are assuming there is a feature and a label stream, and you even require that the label stream exists. This means that in its current form the writer would be unusable for e.g. my TTS setups, or at least very unintuitive to handle. Following your design, I would imagine something like:

def __init__(
    self,
    filename: str,
    streams: List[Tuple[str, str, type]],
):
    """
    :param filename:
    :param streams: Tuple of (name, parser_name, dtype)
    """
    [...]

Unfortunately, this does not work, because the sparse stream needs to know the extra size information. This is why I thought explicit groups with their own init params might be nicer... I am a little out of ideas here...
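
For example, explicit stream descriptors where only the sparse one carries the extra size information (all names here are made up for illustration):

from dataclasses import dataclass


@dataclass
class FeatureStream:
    """Dense stream, e.g. acoustic features."""

    name: str
    parser_name: str
    dtype: str = "float32"


@dataclass
class SparseStream:
    """Sparse stream; only this one needs the extra dim information."""

    name: str
    parser_name: str
    dim: int  # label-space size, the extra info a generic tuple cannot carry
    dtype: str = "int32"


# writer = NextGenHDFWriter("out.hdf", streams=[FeatureStream(...), SparseStream(..., dim=42)])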


        return alignment_cache

    def _get_label_sequence(self, alignment_cache, file, state_tying):
Contributor

I wonder if it makes sense to move such functions to the FileArchive class itself at some point.


    def add_data_to_group(self, group_name, seq_name, data):
        if group_name in self.label_info_dict:
            data = np.array(data).astype(self.label_data_type)
Contributor

This should be left to the user to define from the outside. Then the label_data_type can also be removed.
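
E.g. a hypothetical call site (writer and stream names made up) where the caller fixes the dtype before handing the data over:

import numpy as np

label_sequence = [3, 3, 7, 12]  # some per-frame label ids
labels = np.asarray(label_sequence, dtype=np.int32)  # dtype chosen by the caller
writer.add_data_to_group("my_labels", "seq-001", labels)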

        # root
        self.root_group = self.out_hdf.create_group("streams")

        for label_name, label_dim in self.label_info_dict.items():
Contributor

Following the init restructuring from above, I think it would be better to just loop here over all streams and have a separate create_group function that the user can override to add their own custom streams.
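
Roughly like this (method names are only my guess at the structure; these would be methods of the writer class):

def _create_all_groups(self):
    self.root_group = self.out_hdf.create_group("streams")
    for stream_name, stream_config in self.streams.items():
        self._create_stream_group(stream_name, stream_config)

def _create_stream_group(self, stream_name, stream_config):
    # users/subclasses override this to add their own custom streams
    group = self.root_group.create_group(stream_name)
    # stream-specific attributes (parser, dim, dtype, ...) would be set here
    return group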

Contributor

Still, I do not know how to cleanly deal with the "label_dim" argument in that case...

@michelwi
Contributor
commented Aug 3, 2023

Secondly concerning:

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training

Are you sure it is not possible? I think if one puts 2D data into the HDF file (i.e. [B,T]) then it could be defined as sparse and used directly in RETURNN. Let me try it: [...]

This is now implemented in #434, FYI.
