Generic NextGenHDF writer and a job for rasr caches dump #394
base: main
Conversation
…ptionally features from rasr caches
Let me briefly add some comments without having read the code. Firstly, concerning
Yes, I agree that this is a very neat solution for this special case, as your FH multi-task approach would always go hand in hand and never be used separately. However, for general multi-task learning as mentioned, I would rather suggest having it in separate datasets which are combined on demand with one of the meta-datasets in RETURNN. This way we do not need to redo the dumping when you add an additional task, or load unnecessary ballast when you remove one.
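To make the on-demand combination concrete, a rough sketch of such a setup with RETURNN's MetaDataset could look as follows; the file names and data keys are placeholders, not anything from this PR:

train = {
    "class": "MetaDataset",
    "datasets": {
        "features": {"class": "HDFDataset", "files": ["features.hdf"]},
        "extra_task": {"class": "HDFDataset", "files": ["extra_task.hdf"]},
    },
    "data_map": {
        # data key in the network -> (sub-dataset name, data key inside it)
        "data": ("features", "data"),
        "extra_labels": ("extra_task", "data"),
    },
    "seq_order_control_dataset": "features",
}

Adding or removing a task then only means editing this data_map, without re-dumping the existing HDF files.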
Secondly, concerning:
Are you sure it is not possible? I think if one puts 2D data into the HDF file (i.e. [T,B]), then this could be defined as sparse and directly used in RETURNN. Let me try it:
and
This seems to be a valid, sparse dataset. (@vieting) But I agree that a generic NextGenHDF writer would be a useful addition to our codebase. One additional question to your implementation: do you think it would make sense to have a generic …
def __init__(
    self,
    filename: str,
    label_info_dict: Dict,
The arguments are non-generic: you are assuming there is a feature and a label stream, and for the label stream you even require that it exists. This means that in its current form the Writer would be unusable for e.g. my TTS setups, or at least very unintuitive to handle. Following your design, I would imagine something like:
def __init__(
self,
filename: str,
streams: List[Tuple[str, str, type]]
):
"""
:param filename:
:param streams: Tuple of (name, parser_name, dtype)
"""
[...]
Unfortunately, this does not work, because the sparse stream needs to know the extra size information. This is why I thought explicit groups with their own init params might be nicer... I am a little out of ideas here...
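Purely to visualize what "explicit groups with their own init params" could look like, here is a minimal sketch; StreamSpec, its fields, and the example parser names are hypothetical assumptions, not part of this PR:

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamSpec:
    """Hypothetical per-stream description: a sparse stream carries its own dim."""
    name: str
    parser_name: str
    dtype: str
    dim: Optional[int] = None  # only needed for sparse streams


# example: one dense feature stream plus one sparse label stream
streams: List[StreamSpec] = [
    StreamSpec(name="features", parser_name="feature_sequence", dtype="float32"),
    StreamSpec(name="classes", parser_name="sparse", dtype="int32", dim=9001),
]

The writer's __init__ could then take only filename and streams, and the sparse size information travels with its stream instead of a global label_info_dict.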
    return alignment_cache

def _get_label_sequence(self, alignment_cache, file, state_tying):
I wonder if it makes sense to move such functions to the FileArchive class itself at some point.
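As a rough illustration of that idea (this is not the existing FileArchive API; the read call, its return layout, and the state-tying lookup below are all assumptions):

from i6_core.lib.rasr_cache import FileArchive


class FileArchiveWithLabels(FileArchive):
    """Hypothetical subclass that carries the label-sequence helper itself."""

    def get_label_sequence(self, file, state_tying):
        # assumed: read() yields alignment items of (time, allophone_id, hmm_state, weight)
        alignment = self.read(file, "align")
        # assumed: state_tying maps (allophone_id, hmm_state) to a class index
        return [state_tying[(allo, state)] for _time, allo, state, _weight in alignment]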
def add_data_to_group(self, group_name, seq_name, data):
    if group_name in self.label_info_dict:
        data = np.array(data).astype(self.label_data_type)
This should be left to the user to define from the outside. Then the label_data_type can also be removed.
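As a small sketch of what defining the dtype from the outside could look like (the writer object, group name, and sequence tag are placeholders):

import numpy as np

raw_labels = [3, 3, 7, 7, 12]  # toy label sequence
labels = np.asarray(raw_labels, dtype=np.int32)  # the caller chooses the dtype

# hypothetical call into the writer, which would then store the array as-is:
# writer.add_data_to_group("classes", "corpus/recording/seq-001", labels)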
# root
self.root_group = self.out_hdf.create_group("streams")

for label_name, label_dim in self.label_info_dict.items():
Following the init restructuring from above, I think it would be better to just loop here over all streams and have a separate create_group function that you can override as a user to add your own custom streams.
Still, I do not know how to cleanly deal with the "label_dim" argument in that case...
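Building on the hypothetical StreamSpec sketch above, the loop plus an overridable create_group could look roughly like this (again purely illustrative, not the PR's actual code):

def _create_all_streams(self):
    # one HDF group per declared stream
    for stream in self.streams:
        self.create_group(stream)

def create_group(self, stream):
    # default implementation; a user can override this for custom streams
    group = self.root_group.create_group(stream.name)
    group.attrs["parser"] = stream.parser_name
    if stream.dim is not None:  # a sparse stream brings its own dim
        group.attrs["dim"] = stream.dim
    return group

The dim would then come from the stream spec itself rather than from a separate label_info_dict, which might be one way around the label_dim question.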
This is implemented now in #434, FYI.
Returnn's NextGenHDFDataset allows for the storage of multiple datasets in one HDF file. Given also some ongoing discussions in here, I am proposing a generic NextGenHDFWriter that writes several datasets (features or labels). Moreover, there is also one Sisyphus job that uses this specific writer, as an example. Differently to the SimpleHDFWriter, here one can write data in sparse form, and therefore no additional squeeze layer is needed during training, as @vieting experienced recently. Here, the keys of the label_info_dict can be used directly in the softmax layer as target='my_key'.

Current use case and necessity of having this in i6_core:
This is currently needed for the factored hybrid system for the estimation of context-dependent priors. Moreover, the use of several additional targets in the form of multi-task learning is also very common and can be handled easily.
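To make the target='my_key' usage concrete, a minimal RETURNN network sketch could look like this; the layer names and the key "my_key" are placeholders:

network = {
    "encoder": {"class": "linear", "activation": "tanh", "n_out": 512, "from": "data"},
    "output": {
        "class": "softmax",
        "from": "encoder",
        "loss": "ce",
        "target": "my_key",  # key of the sparse stream written into the HDF file
    },
}

Because the stream is already stored sparsely, the target can be consumed directly by the cross-entropy loss without an extra squeeze layer.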