add timestamps for each word #113
Replies: 2 comments
-
For sure! I was planning on jumping on it once I finished the v1_0 integrations (the structure may change somewhat for those models anyhow). In the meantime, you can take a look at the stale branch I was using to experiment with it a bit. You can get `pred_dur` from the PyTorch versions (not sure how you'd do it with ONNX, tbh) and then map it back through the phonemes/tokens to words. It got a bit tricky with the sampling, scaling, etc., which is where I left it: https://github.com/remsky/Kokoro-FastAPI/tree/v0.1.2-pre-experimental-subs
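As a rough illustration of the "durations back to words" idea above, here is a minimal sketch. It is not the code from the experimental branch. It assumes you already have a per-token duration list (standing in for `pred_dur`, measured in frames), a count of how many tokens each word produced, and a known frame length in seconds. The names `word_timestamps`, `tokens_per_word`, and `seconds_per_frame` are hypothetical, and the sketch ignores pause/whitespace tokens and any sampling or scaling corrections.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def word_timestamps(
    words: Sequence[str],
    tokens_per_word: Sequence[int],
    pred_dur: Sequence[float],
    seconds_per_frame: float,
) -> List[WordTiming]:
    """Accumulate per-token durations into per-word start/end times.

    Assumption: every token belongs to exactly one word, in order,
    and pred_dur holds one duration (in frames) per token.
    """
    if len(words) != len(tokens_per_word):
        raise ValueError("words and tokens_per_word must have the same length")
    if sum(tokens_per_word) != len(pred_dur):
        raise ValueError("token counts must add up to len(pred_dur)")

    timings: List[WordTiming] = []
    frame_cursor = 0.0  # frames elapsed so far
    token_index = 0     # position in pred_dur
    for word, n_tokens in zip(words, tokens_per_word):
        word_frames = sum(pred_dur[token_index:token_index + n_tokens])
        start = frame_cursor * seconds_per_frame
        frame_cursor += word_frames
        end = frame_cursor * seconds_per_frame
        timings.append(WordTiming(word, start, end))
        token_index += n_tokens
    return timings
```

The frame length depends on the model's sample rate and hop size, so `seconds_per_frame` has to come from the actual model configuration rather than from this sketch.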
-
Word-level timestamps are currently supported through the dev API (/dev/captioned_speech). Pull request #173 adds support for streaming word-level timestamps; it uses a different format, so see the examples in readme.md.
-
I would like to have timestamps for each word in the generated text-to-speech output. This would improve the accuracy of syncing the audio with other media.
I could also submit this as a PR if I get some guidance.