performance in similar parameter number? #1

Open
didadida-r opened this issue May 9, 2024 · 2 comments

Comments

@didadida-r

Hi,

I listened to the demo audio at 3.00 kbps, and the ESC result seems less satisfactory than the DAC result. Could you provide a fair comparison in which the two models have a similar number of parameters?

For example, instead of reducing the model size by a factor of nine, could you compare the results using the same model size?

@yzGuu830
Owner

yzGuu830 commented May 9, 2024

Hi, thanks for your comment!

ESC is indeed inferior to the original DAC model (Base-DAC) in terms of reconstruction quality. What our experiments demonstrate is that ESC is much more efficient than Base-DAC (model size, inference latency, etc.). It also reconstructs better than Tiny-DAC, a DAC model we reproduced with a similar number of parameters.

We didn't scale ESC up to match Base-DAC because we want a parameter-efficient codec. Base-DAC is actually very slow at inference on CPUs, which makes it a poor candidate for real applications. However, we do believe that scaling ESC up would yield better reconstruction performance, given the scaling capability of transformers.
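
For anyone who wants to check the size and CPU-latency side of this comparison on their own machine, here is a minimal sketch. It is not the evaluation code from this repository: `count_parameters` and `median_latency_ms` are hypothetical helper names, and the only assumption about the model is that it follows the PyTorch convention where `model.parameters()` yields tensors exposing `numel()`.

```python
import statistics
import time


def count_parameters(model):
    """Return the total number of scalar parameters in a model.

    Assumption: the model is PyTorch-style, i.e. `model.parameters()`
    yields tensors that each expose `numel()`.
    """
    return sum(p.numel() for p in model.parameters())


def median_latency_ms(fn, warmup=3, runs=20):
    """Return the median wall-clock time of calling `fn()`, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

With PyTorch, one would typically call something like `median_latency_ms(lambda: codec(audio))` inside a `torch.no_grad()` block with the model on CPU, where `codec` and `audio` stand in for a loaded model and an input tensor. Reporting the parameter count next to the latency makes it easier to tell whether two codecs are being compared at a similar size.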

We will add Tiny-DAC outputs to the demo page as well. We may also release an online speech coding interface to demonstrate additional features such as codec complexity.

@lifeiteng

@didadida-r Interesting point. Will you try it?
I'm also interested in the upper bound of the cross-scale VQ.
