Generic NextGenHDF writer and a job for RASR cache dumps #394

Open
wants to merge 1 commit into main

Conversation

Marvin84
Contributor

RETURNN's NextGenHDFDataset allows for the storage of multiple datasets in one HDF file. Given also some ongoing discussions here, I am proposing a generic NextGenHDFWriter that writes several datasets (features or labels).
Moreover, there is also one Sisyphus job that uses this specific writer, as an example.

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training, as @vieting experienced recently. The keys of the label_info_dict can be used directly in the softmax layer as target='my_key'.
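
As a rough sketch (not part of this PR; the stream names are placeholders and the exact NextGenHDFDataset options are my assumption), consuming such a file in a RETURNN config could look like this:

train = {
    "class": "NextGenHDFDataset",
    "files": ["data.hdf"],
    "input_stream_name": "features",  # assumed option name
}

network = {
    "output": {
        "class": "softmax",
        "from": "data",
        "target": "my_key",  # sparse label stream, used directly without a squeeze layer
        "loss": "ce",
    },
}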

Current use case and why this is needed in i6_core:
This is currently needed for the factored hybrid system for the estimation of context-dependent priors. Moreover, the use of several additional targets in the form of multi-task learning is also very common and can be handled easily.

@michelwi
Contributor

Let me briefly add some comments without having read the code:

Firstly concerning

NextGenHDFDataset allows for the storage of multiple datasets in one HDF file.
for the factored hybrid system for the estimation of context-dependent priors.

Yes, I agree that this is a very neat solution for this special case, since your FH multi-task targets always go hand in hand and would never be used separately. However, for general multi-task learning as mentioned

additional targets in the form of multi-task learning is also very common and can be handled easily.

I would rather suggest keeping the data in separate datasets which are combined on demand with one of the meta-datasets in RETURNN. This way we do not need to redo the dumping when you add an additional task, or load unnecessary ballast when you remove one.
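
As a rough sketch of what I mean (dataset and file names are just placeholders), one of the meta-datasets could be configured like this, so each task lives in its own HDF file and is only loaded when it is actually mapped:

train = {
    "class": "MetaDataset",
    "datasets": {
        "features": {"class": "HDFDataset", "files": ["features.hdf"]},
        "task_a": {"class": "HDFDataset", "files": ["task_a_targets.hdf"]},
    },
    "data_map": {
        "data": ("features", "data"),
        "task_a_labels": ("task_a", "data"),
    },
    "seq_order_control_dataset": "features",
}

Adding or removing a task then only touches the config, not the dumped files.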

Secondly concerning:

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training

Are you sure it is not possible? I think if one puts 2D data into the HDF file (i.e. [B,T]) then it could be defined as sparse and used directly in RETURNN. Let me try it:

import numpy as np
from returnn.datasets.hdf import SimpleHDFWriter

w = SimpleHDFWriter("./test.hdf", dim=60, ndim=1)
data = np.random.randint(0, 60, size=(5, 20))  # [B, T]: 5 seqs of 20 frames, values in [0, 60)
w.insert_batch(data, [20, 20, 20, 20, 20], ["1", "2", "3", "4", "5"])
w.close()

and

$ python3 tools/dump-dataset.py "{'class': 'HDFDataset', 'files': ['./test.hdf']}"
Dataset:
  input: 60 x 1
  output: {'data': [60, 1]}
  HDF dataset, sequences: 5, frames: 100
Epoch: 1
Dataset keys: ['data']
Dataset target keys: []
Dump to stdout
[...]

This seems to be a valid, sparse dataset (@vieting).

But I agree that a generic NextGenHDF writer would be a useful addition to our codebase. One additional question about your implementation: do you think it would make sense to have a generic DumpNextGenHDFJob from which the RasrDumpNextGenHDFJob inherits to set the exact layout for the alignment-based hybrid system training?
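
To illustrate the question, roughly like this (class and argument names below are only how I would imagine it, not taken from this PR):

from typing import Any, Dict

from sisyphus import Job, Task


class DumpNextGenHDFJob(Job):
    """Hypothetical generic base job: dumps arbitrary streams into a NextGenHDF file."""

    def __init__(self, streams: Dict[str, Any]):
        self.streams = streams
        self.out_hdf = self.output_path("data.hdf")

    def tasks(self):
        yield Task("run")

    def run(self):
        # generic writing of self.streams via the NextGenHDFWriter would go here
        raise NotImplementedError


class RasrDumpNextGenHDFJob(DumpNextGenHDFJob):
    """Hypothetical subclass fixing the layout to RASR feature/alignment caches
    for alignment-based hybrid training."""

    def __init__(self, feature_caches, alignment_caches, state_tying, label_info_dict):
        super().__init__(streams={})  # actual streams are derived from the RASR caches
        self.feature_caches = feature_caches
        self.alignment_caches = alignment_caches
        self.state_tying = state_tying
        self.label_info_dict = label_info_dict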

    def __init__(
        self,
        filename: str,
        label_info_dict: Dict,
@JackTemaki
Contributor
Apr 18, 2023

The arguments are non-generic: you are assuming there is a feature and a label stream, and you even require that the label stream exists. This means that in its current form the writer would be unusable for e.g. my TTS setups, or at least very unintuitive to handle. Following your design, I would imagine something like:

def __init__(
    self,
    filename: str,
    streams: List[Tuple[str, str, type]],
):
    """
    :param filename:
    :param streams: Tuple of (name, parser_name, dtype)
    """
    [...]

Unfortunately, this does not work, because the sparse stream needs to know the extra size information. This is why I thought explicit groups with their own init params might be nicer... I am a little out of ideas here...
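
For example, explicit stream descriptors where only the sparse one carries the extra size information (all names here are made up for illustration):

from dataclasses import dataclass


@dataclass
class FeatureStream:
    """Dense stream, e.g. acoustic features."""

    name: str
    parser_name: str
    dtype: str = "float32"


@dataclass
class SparseStream:
    """Sparse stream; only this one needs the extra dim information."""

    name: str
    parser_name: str
    dim: int  # label-space size, the extra info a generic tuple cannot carry
    dtype: str = "int32"


# writer = NextGenHDFWriter("out.hdf", streams=[FeatureStream(...), SparseStream(..., dim=42)])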


        return alignment_cache

    def _get_label_sequence(self, alignment_cache, file, state_tying):
Contributor

I wonder if it makes sense to move such functions to the FileArchive class itself at some point.


    def add_data_to_group(self, group_name, seq_name, data):
        if group_name in self.label_info_dict:
            data = np.array(data).astype(self.label_data_type)
Contributor

This should be left to the user to define from the outside. Then the label_data_type can also be removed.
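
E.g. a hypothetical call site (writer and stream names made up) where the caller fixes the dtype before handing the data over:

import numpy as np

label_sequence = [3, 3, 7, 12]  # some per-frame label ids
labels = np.asarray(label_sequence, dtype=np.int32)  # dtype chosen by the caller
writer.add_data_to_group("my_labels", "seq-001", labels)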

        # root
        self.root_group = self.out_hdf.create_group("streams")

        for label_name, label_dim in self.label_info_dict.items():
Contributor

Following the init restructuring from above, I think it would be better to just loop here over all streams and have a separate create_group function that the user can override to add their own custom streams.
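
Roughly like this (method names are only my guess at the structure; these would be methods of the writer class):

def _create_all_groups(self):
    self.root_group = self.out_hdf.create_group("streams")
    for stream_name, stream_config in self.streams.items():
        self._create_stream_group(stream_name, stream_config)

def _create_stream_group(self, stream_name, stream_config):
    # users/subclasses override this to add their own custom streams
    group = self.root_group.create_group(stream_name)
    # stream-specific attributes (parser, dim, dtype, ...) would be set here
    return group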

Contributor

Still, I do not know how to cleanly deal with the "label_dim" argument in that case...

@michelwi
Contributor
commented Aug 3, 2023

Secondly concerning:

Unlike with the SimpleHDFWriter, here one can write data in sparse form, so no additional squeeze layer is needed during training

Are you sure it is not possible? I think if one puts 2D data into the HDF file (i.e. [B,T]) then it could be defined as sparse and used directly in RETURNN. Let me try it: [...]

This is now implemented in #434, FYI.
