add timestamps for each word #113

merouanezouaid · 2025-01-31T12:52:53Z

I would like to have timestamps for each word in the generated text-to-speech output. This would improve the accuracy of syncing the audio with other media.

I could also submit this as a PR if I get some guidance.

remsky (Maintainer) · 2025-01-31T14:00:05Z

For sure! I was planning to jump on it once I finish the v1_0 integrations (the structure may change somewhat for those models anyhow), but you can take a look at the stale branch I was using to experiment with it.

You can get pred_dur from the PyTorch versions (not sure how you'd do it with ONNX, tbh), and then map it back through the phonemes/tokens to words. It got a bit tricky with the sampling, scaling, etc., which is where I left it.

https://github.com/remsky/Kokoro-FastAPI/tree/v0.1.2-pre-experimental-subs
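For anyone picking this up, the mapping described above (per-token durations accumulated and grouped back into words) can be sketched roughly as follows. This is a minimal illustration, not the code from the branch: the function name, the `token_word_ids` alignment list, and the `seconds_per_frame` value are all assumptions, and it ignores the sampling and scaling issues mentioned above.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def word_timestamps(pred_dur, token_word_ids, words, seconds_per_frame):
    """Group per-token durations into per-word time spans.

    pred_dur[i]        -- predicted duration of token i, in frames
    token_word_ids[i]  -- index into `words` for token i, or None for tokens
                          that belong to no word (padding, spaces, punctuation)
    seconds_per_frame  -- hypothetical conversion factor; the real value
                          depends on the model's sample rate and hop size
    """
    if len(pred_dur) != len(token_word_ids):
        raise ValueError("pred_dur and token_word_ids must have equal length")

    spans = {}    # word index -> (start, end)
    cursor = 0.0  # running time position in seconds
    for frames, word_id in zip(pred_dur, token_word_ids):
        start = cursor
        cursor += frames * seconds_per_frame
        if word_id is None:
            continue  # time still advances, but no word claims it
        if word_id in spans:
            spans[word_id] = (spans[word_id][0], cursor)  # extend the word
        else:
            spans[word_id] = (start, cursor)

    return [WordTiming(words[i], *spans[i]) for i in sorted(spans)]


# Toy example: two tokens for "hi", one separator token, one token for "there".
timings = word_timestamps(
    pred_dur=[2, 3, 1, 4],
    token_word_ids=[0, 0, None, 1],
    words=["hi", "there"],
    seconds_per_frame=0.5,
)
```

The hard part in practice is producing `token_word_ids`, since phonemization does not keep a one-to-one link between tokens and the original words.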

fireblade2534 (Collaborator) · 2025-02-16T19:20:59Z

> I would like to have timestamps for each word in the generated text-to-speech output. This would improve the accuracy of syncing the audio with other media.
>
> I could also submit this as a PR if I get some guidance.

Word-level timestamps are currently supported through the dev API (/dev/captioned_speech). PR #173 adds support for streaming word-level timestamps (it uses a different format, so see the examples in readme.md).
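Once you have word-level timestamps, syncing with other media is mostly a formatting step. As a rough sketch, here is one way to turn a list of word timings into SRT captions. The `word`, `start_time`, and `end_time` field names are assumptions for illustration only; check the readme examples for the actual response format, which also differs for the streaming case.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timecode (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(timestamps):
    """Build one SRT cue per word.

    `timestamps` is assumed to be a list of dicts shaped like
    {"word": str, "start_time": float, "end_time": float}, in seconds.
    """
    blocks = []
    for index, item in enumerate(timestamps, start=1):
        start = srt_time(item["start_time"])
        end = srt_time(item["end_time"])
        blocks.append(f"{index}\n{start} --> {end}\n{item['word']}\n")
    return "\n".join(blocks)  # blank line between cues


# Hand-written sample data, not real API output.
sample = [
    {"word": "Hello", "start_time": 0.0, "end_time": 0.5},
    {"word": "world", "start_time": 0.5, "end_time": 1.25},
]
captions = to_srt(sample)
```

One cue per word is the simplest mapping; for readable subtitles you would probably group several words into each cue.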
