Audio quality degradation in c5a3e13 "Converted the stream writer to use pyav" #206

Closed
garrow opened this issue Feb 27, 2025 · 8 comments

@garrow

garrow commented Feb 27, 2025

Describe the bug

Somewhere in this window of commits, the audio generated has noticeably degraded in quality.

MP3 files produced by the latest version, via either the web tool or the OpenAI-compatible API, sound noticeably muddier.

Both CPU and GPU generation are affected, on Apple Silicon as well as on an Nvidia GPU.

The test text is "The quick brown fox jumps over the lazy dog", generated with the af_sky voice.

On versions

  • b00c9ec produces a ~35 KB mp3 file with better quality
  • 7d73c3c produces a ~12 KB "mp3" file with worse, "muddier" sounding audio.

The 7d73c3c versions log the following when run through ffprobe.

[mp3 @ 0x137e05b90] Skipping 192 bytes of junk at 44.

Screenshots or console output

Comparing the generated files using ffprobe.

❯ ffprobe -hide_banner i7-mint-b00c9ec.mp3
[mp3 @ 0x139605b90] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'i7-mint-b00c9ec.mp3':
  Metadata:
    encoder         : Lavf60.16.100
  Duration: 00:00:02.70, start: 0.000000, bitrate: 104 kb/s
  Stream #0:0: Audio: mp3 (mp3float), 24000 Hz, mono, fltp, 104 kb/s


❯ ffprobe -hide_banner macos-at-b00c9ec.mp3
[mp3 @ 0x134605b90] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'macos-at-b00c9ec.mp3':
  Metadata:
    encoder         : Lavf61.7.100
  Duration: 00:00:02.74, start: 0.000000, bitrate: 101 kb/s
  Stream #0:0: Audio: mp3 (mp3float), 24000 Hz, mono, fltp, 101 kb/s


❯ ffprobe -hide_banner i7-mint-gpu-7d73c3c.mp3
[mp3 @ 0x137e05b90] Skipping 192 bytes of junk at 44.
[mp3 @ 0x137e05b90] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'i7-mint-gpu-7d73c3c.mp3':
  Metadata:
    encoder         : Lavf61.7.100
  Duration: 00:00:02.95, start: 0.022042, bitrate: 32 kb/s
  Stream #0:0: Audio: mp3 (mp3float), 24000 Hz, mono, fltp, 32 kb/s


❯ ffprobe -hide_banner macos-at-7d73c3c.mp3
[mp3 @ 0x13b805280] Skipping 192 bytes of junk at 44.
[mp3 @ 0x13b805280] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'macos-at-7d73c3c.mp3':
  Metadata:
    encoder         : Lavf61.7.100
  Duration: 00:00:02.95, start: 0.022042, bitrate: 32 kb/s
  Stream #0:0: Audio: mp3 (mp3float), 24000 Hz, mono, fltp, 32 kb/s
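
As a rough cross-check, the file sizes reported at the top of this issue line up with the bitrates and durations that ffprobe prints. The sketch below is plain arithmetic (bitrate times duration, divided by 8 bits per byte). The function name is made up for illustration and container overhead is ignored. Because ffprobe itself estimates the duration from the bitrate here, this is a consistency check rather than an independent measurement.

```python
def estimated_size_bytes(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size of a constant-bitrate stream, ignoring container overhead."""
    # ffprobe reports kilobits per second: 1 kb = 1000 bits, 8 bits per byte.
    return bitrate_kbps * 1000 * duration_s / 8


# Figures copied from the ffprobe output above.
good = estimated_size_bytes(104, 2.70)  # i7-mint-b00c9ec.mp3
bad = estimated_size_bytes(32, 2.95)    # i7-mint-gpu-7d73c3c.mp3

print(f"b00c9ec: ~{good / 1000:.1f} kB")
print(f"7d73c3c: ~{bad / 1000:.1f} kB")
```

The two estimates come out at roughly 35 kB and 12 kB, which suggests the smaller file is explained by the lower bitrate alone.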

Example output

INFO:     127.0.0.1:64767 - "POST /v1/audio/speech HTTP/1.1" 200 OK
07:13:53 PM | DEBUG    | paths:153 | Scanning for voices in path: ~/projects/code/Kokoro-FastAPI/api/src/voices/v1_0
07:13:53 PM | INFO     | openai_compatible:146 | Starting audio generation with lang_code: None
07:13:53 PM | DEBUG    | paths:131 | Searching for voice in path: ~/projects/code/Kokoro-FastAPI/api/src/voices/v1_0
07:13:53 PM | DEBUG    | tts_service:235 | Using single voice path: ~projects/code/Kokoro-FastAPI/api/src/voices/v1_0/af_sky.pt
07:13:53 PM | DEBUG    | tts_service:261 | Using voice path: ~/projects/code/Kokoro-FastAPI/api/src/voices/v1_0/af_sky.pt
07:13:53 PM | INFO     | tts_service:265 | Using lang_code 'a' for voice 'af_sky' in audio stream
07:13:53 PM | INFO     | text_processor:115 | Starting smart split for 43 chars
07:13:53 PM | DEBUG    | text_processor:51 | Total processing took 615.90ms for chunk: 'The quick brown fox jumps over the lazy dog'
07:13:53 PM | INFO     | text_processor:240 | Yielding final chunk 1: 'The quick brown fox jumps over the lazy dog' (52 tokens)
07:13:53 PM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'The quick brown fox jumps over the lazy dog'
07:13:54 PM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([81000])
07:13:54 PM | INFO     | text_processor:246 | Split completed in 1006.91ms, produced 1 chunks
INFO:     127.0.0.1:64768 - "GET /v1/download/tmpji9imjtp.mp3 HTTP/1.1" 200 OK

Branch / Deployment used

Tested on master, between two commits.

  • b00c9ec GOOD
  • 7d73c3c BAD

Git log over period

7d73c3c (HEAD -> master, origin/master, origin/HEAD) 4 days ago remsky Merge pull request #173 from fireblade2534/streaming-word-timestamps
e6feea7 4 days ago Fireblade Testing error
5de3cac 4 days ago Fireblade Fix some tests and allow running the docker container offline
c1207f0 4 days ago Fireblade Merge remote-tracking branch 'upstream/master' into streaming-word-timestamps
39cc056 5 days ago remsky Merge pull request #179 from fireblade2534/normalization-changes
41598eb 11 days ago Fireblade better parsing for times and phone numbers
3290bad 11 days ago Fireblade changes to how money and numbers are handled
3fd37b8 5 days ago remsky Merge pull request #186 from fireblade2534/Add-.gitattribues-file
7f15ba8 8 days ago Fireblade Add a .gitattributes
a6defbf 5 days ago remsky Merge pull request #171 from randombk/pr-no-reload
2b99334 13 days ago David Li Disable --reload on unicorn/fastapi to avoid pegging a CPU core
c5a3e13 7 days ago Fireblade Converted the stream writer to use pyav
4ee4d36 8 days ago Fireblade Fixes a couple of issues with audio triming and prevents errors with single voice weights
f2b2f41 10 days ago Fireblade fixed wrong varible name bug
cb22aab 10 days ago Fireblade Fix streaming a wav file with captions not reaturning any captions (This is only a problem because wav streaming does not acually work)
e3dc959 10 days ago Fireblade Simplify code so erverything uses AudioChunks
9c0e328 10 days ago Fireblade made it skip text normalization when using other languages as it only supports english
4802128 11 days ago Fireblade Replaced default voice with af_heart as af doesn't exist
8c457c3 11 days ago Fireblade fixed final test
1a6e7ab 11 days ago Fireblade fixed a bunch of tests
1a03ac7 12 days ago Fireblade Fixed some tests
353fe79 12 days ago Fireblade fix small error
842d056 12 days ago Fireblade Merge branch 'streaming-word-timestamps' of https://github.com/fireblade2534/Kokoro-FastAPI into streaming-word-timestamps
b71bab4 12 days ago Fireblade2534 Merge branch 'master' into streaming-word-timestamps
b00c9ec (tag: MACOS_HIGH_QUALITY) 13 days ago remsky Update README.md

Let us know if it's the master branch, or the stable branch indicated in the readme, as well as if you're running it locally, in the cloud, via the docker compose (cpu or gpu), or direct docker run commands. Please include the exact commands used to run in the latter cases.

Tested with both start-cpu and start-gpu scripts, using uv and with all dependencies installed.

Operating System
Include the platform, version numbers of your docker, etc. Whether its GPU (Nvidia or other) or CPU, Mac, Linux, Windows, etc.

Reproduced on macOS and Linux Mint.

macOS on Apple Silicon
Linux Mint running on Core i7 8700k, Nvidia GTX 1080 8GB

With both CPU and GPU, the resulting files are identical.

Additional context

Initially I noticed this difference in quality between the Mac CPU outputs and the GPU outputs on the Linux box.
When I updated the Mac to the latest commit, 7d73c3c, it started producing the same muddy output files.

I originally noticed this while using the SillyTavern OpenAI-compatible TTS extension, but all reproduction steps have used the Kokoro-FastAPI :8880/web/ interface.

Confirmed not GPU/CPU or OS related.

@garrow
Author

garrow commented Feb 27, 2025

Using git-bisect I found that commit c5a3e136708c28f8118cf8555d6fcd3c173f4407 introduces the lower quality output.

c5a3e136708c28f8118cf8555d6fcd3c173f4407 is the first bad commit
commit c5a3e136708c28f8118cf8555d6fcd3c173f4407
Author: Fireblade <fireblade5234@gmail.com>
Date:   Wed Feb 19 23:10:51 2025 -0500

    Converted the stream writer to use pyav

 README.md                                  |   2 +-
 Test copy.py                               |  88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Test.py                                    |  85 +++++++++++++++++++++++++++++++++++++++++++++----------------
 api/src/routers/development.py             |   2 +-
 api/src/services/audio.py                  |   3 ++-
 api/src/services/streaming_audio_writer.py | 225 +++++++++++++++++++++++++---------------------------------------------------------------------------------------------------------------------------------------
 api/src/services/tts_service.py            |   8 +++---
 api/tests/test_audio_service.py            |   6 ++---
 output.mp3                                 |   0
 pyproject.toml                             |   1 +
 10 files changed, 198 insertions(+), 222 deletions(-)
 create mode 100644 Test copy.py
 delete mode 100644 output.mp3

Method

I used this curl command to fetch an mp3 file against each commit.

curl 'http://0.0.0.0:8880/v1/audio/speech' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Connection: keep-alive' \
  -H 'Content-Type: application/json' \
  --data-raw '{"input":"The quick brown fox jumps over the lazy dog","voice":"af_sky","response_format":"mp3","download_format":"mp3","stream":true,"speed":1,"return_download_link":true}' \
  --insecure --output  "${HOME}/Downloads/kokoro-comparison/git-bisect/$(git rev-parse HEAD)-macos-cpu.mp3"

If you want to repro, change the curl output path to something that exists on your system.
--output "${HOME}/Downloads/kokoro-comparison/git-bisect/$(git rev-parse HEAD)-macos-cpu.mp3"
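
For anyone scripting the comparison, the same request can be built in Python with only the standard library. This is a sketch, not code from the project. The endpoint and JSON fields are copied from the curl command above. The helper names build_speech_request and fetch_mp3 are hypothetical, and fetch_mp3 assumes a server is already listening on port 8880.

```python
import json
import urllib.request

# Base URL taken from the curl command above.
BASE_URL = "http://0.0.0.0:8880"


def build_speech_request(text: str, voice: str = "af_sky",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl command."""
    payload = {
        "input": text,
        "voice": voice,
        "response_format": "mp3",
        "download_format": "mp3",
        "stream": True,
        "speed": 1,
        "return_download_link": True,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_mp3(out_path: str,
              text: str = "The quick brown fox jumps over the lazy dog") -> int:
    """Send the request to a running server, save the body, return bytes written."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        return f.write(resp.read())
```

Calling fetch_mp3 once per checked-out commit, with the commit hash in the output filename, would mirror the manual procedure described here.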

The size difference is visible in the generated files: only the c5a3e13 output is 12K.

Downloads/kokoro-comparison/git-bisect on ☁️
❯ l
total 312
-rw-r--r--  1 garrow    34K Feb 26 20:06 4ee4d36822d216f2d8fd1f6be5faf26a6c4014e3-macos-cpu.mp3
-rw-r--r--  1 garrow    34K Feb 26 20:02 8c457c3292f5c1b2828fda334ec98d3fa648f220-macos-cpu.mp3
-rw-r--r--  1 garrow    12K Feb 26 20:04 c5a3e136708c28f8118cf8555d6fcd3c173f4407-macos-cpu.mp3
-rw-r--r--  1 garrow    34K Feb 26 20:04 e3dc9597757489e48323e221cc48cfdae2df794e-macos-cpu.mp3
-rw-r--r--  1 garrow    34K Feb 26 20:05 f2b2f41412dcf36d852378c2af794f360c38dd79-macos-cpu.mp3

Full bisect output

Kokoro-FastAPI on ⑂ HEAD (8c457c3) (BISECTING) [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ git bisect good
Bisecting: 10 revisions left to test after this (roughly 3 steps)
[c5a3e136708c28f8118cf8555d6fcd3c173f4407] Converted the stream writer to use pyav

Kokoro-FastAPI on ⑂ HEAD (c5a3e13) (BISECTING) [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ git bisect bad
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[e3dc9597757489e48323e221cc48cfdae2df794e] Simplify code so erverything uses AudioChunks

Kokoro-FastAPI on ⑂ HEAD (e3dc959) (BISECTING) [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[f2b2f41412dcf36d852378c2af794f360c38dd79] fixed wrong varible name bug

Kokoro-FastAPI on ⑂ HEAD (f2b2f41) (BISECTING) [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[4ee4d36822d216f2d8fd1f6be5faf26a6c4014e3] Fixes a couple of issues with audio triming and prevents errors with single voice weights

Kokoro-FastAPI on ⑂ HEAD (4ee4d36) (BISECTING) [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ git bisect good
c5a3e136708c28f8118cf8555d6fcd3c173f4407 is the first bad commit
commit c5a3e136708c28f8118cf8555d6fcd3c173f4407
Author: Fireblade <fireblade5234@gmail.com>
Date:   Wed Feb 19 23:10:51 2025 -0500

    Converted the stream writer to use pyav

 README.md                                  |   2 +-
 Test copy.py                               |  88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Test.py                                    |  85 +++++++++++++++++++++++++++++++++++++++++++++----------------
 api/src/routers/development.py             |   2 +-
 api/src/services/audio.py                  |   3 ++-
 api/src/services/streaming_audio_writer.py | 225 +++++++++++++++++++++++++---------------------------------------------------------------------------------------------------------------------------------------
 api/src/services/tts_service.py            |   8 +++---
 api/tests/test_audio_service.py            |   6 ++---
 output.mp3                                 |   0
 pyproject.toml                             |   1 +
 10 files changed, 198 insertions(+), 222 deletions(-)
 create mode 100644 Test copy.py
 delete mode 100644 output.mp3
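
The manual good/bad steps above could in principle be driven by git bisect run, which treats exit code 0 as good, 1 as bad, and 125 as "skip this commit". Below is a minimal sketch of such a check. It assumes the MP3 for the current commit has already been generated (restarting the server and fetching the file are not shown). The 20 kB cutoff is an assumption chosen to sit between the 34K and 12K sizes listed earlier, and the function name is hypothetical.

```python
import os
import sys

# Assumed cutoff: good files above are ~34K and the bad one is ~12K.
SIZE_THRESHOLD_BYTES = 20_000


def bisect_exit_code(path: str, threshold: int = SIZE_THRESHOLD_BYTES) -> int:
    """Map a generated MP3 to an exit code understood by `git bisect run`."""
    if not os.path.isfile(path):
        return 125  # file missing: ask git bisect to skip this commit
    # 0 marks the commit good, 1 marks it bad.
    return 0 if os.path.getsize(path) >= threshold else 1


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(bisect_exit_code(sys.argv[1]))
```

A size threshold only works here because the test text and voice are fixed. Comparing the reported bitrate would be a more direct signal.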

@garrow garrow changed the title Audio quality degradation since b00c9ec Audio quality degradation in c5a3e136708c28f8118cf8555d6fcd3c173f4407 (since b00c9ec) Feb 27, 2025
@garrow garrow changed the title Audio quality degradation in c5a3e136708c28f8118cf8555d6fcd3c173f4407 (since b00c9ec) Audio quality degradation in c5a3e13 (since b00c9ec) Feb 27, 2025
@garrow garrow changed the title Audio quality degradation in c5a3e13 (since b00c9ec) Audio quality degradation in c5a3e13 "Converted the stream writer to use pyav" Feb 27, 2025
@fireblade2534
Collaborator

Yeah, I have been investigating this issue, though I'm not fully sure why it's happening.

@garrow
Author

garrow commented Feb 27, 2025

Some further digging shows it might just be forcing a lower bitrate on the mp3.

Pre change
Audio: 96 KB/s, 24KHz (mono)

Post change
Audio: 32 KB/s, 24KHz (mono)

> ls -1S | xargs -n 1 mp3info -Fx -r m


File: i7-mint-b00c9ec.mp3
Media Type:  MPEG 2.0 Layer III
Audio:       96 KB/s, 24KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

File: macos-at-b00c9ec.mp3
Media Type:  MPEG 2.0 Layer III
Audio:       96 KB/s, 24KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

File: i7-mint-gpu-7d73c3c.mp3
Media Type:  MPEG 2.0 Layer III
Audio:       32 KB/s, 24KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

File: macos-at-7d73c3c.mp3
Media Type:  MPEG 2.0 Layer III
Audio:       32 KB/s, 24KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

https://www.ibiblio.org/mp3info/mp3info.html

mp3info

-x
Display technical attributes of the MP3 file

-r a|m|v
Report bit rate of Variable Bit Rate (VBR) files as one of the following (See the section below entitled Bit Rates for more information):
m - Median bit rate [integer]
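
For context on where the "Media Type" and "Audio" fields come from: every MPEG audio frame starts with a 4-byte header that encodes the MPEG version, a bitrate index, and a sample-rate index. The sketch below decodes those fields for Layer III frames using the standard lookup tables (bitrates in kilobits per second, which appears to be what mp3info labels KB/s). The function name is hypothetical, and the example headers are constructed by hand to match the values reported above rather than read from the actual files. Locating the first frame inside a file is not shown.

```python
# Layer III lookup tables from the MPEG audio frame header layout.
BITRATES_KBPS = {
    "MPEG 1.0": [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320],
    "MPEG 2.0": [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160],
}
SAMPLE_RATES_HZ = {
    "MPEG 1.0": [44100, 48000, 32000],
    "MPEG 2.0": [22050, 24000, 16000],
}
VERSIONS = {0b11: "MPEG 1.0", 0b10: "MPEG 2.0"}


def parse_layer3_header(header: bytes) -> dict:
    """Decode version, bitrate and sample rate from a 4-byte frame header."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    b0, b1, b2, b3 = header[:4]
    # The first 11 bits of a frame header are all ones (frame sync).
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        raise ValueError("no frame sync")
    version = VERSIONS.get((b1 >> 3) & 0b11)
    if version is None or (b1 >> 1) & 0b11 != 0b01:
        raise ValueError("not an MPEG 1.0/2.0 Layer III frame")
    bitrate_index = b2 >> 4
    rate_index = (b2 >> 2) & 0b11
    if bitrate_index == 15 or rate_index == 3:
        raise ValueError("reserved bitrate or sample-rate index")
    return {
        "version": version,
        # Index 0 means "free format"; the table reports it as 0.
        "bitrate_kbps": BITRATES_KBPS[version][bitrate_index],
        "sample_rate_hz": SAMPLE_RATES_HZ[version][rate_index],
        "mono": (b3 >> 6) == 0b11,
    }


# Hand-built headers matching the pre-change and post-change reports above.
pre_change = parse_layer3_header(bytes([0xFF, 0xF3, 0xA4, 0xC0]))
post_change = parse_layer3_header(bytes([0xFF, 0xF3, 0x44, 0xC0]))
print(pre_change)
print(post_change)
```

The two example headers differ only in the bitrate index (10 versus 4), which matches the observation that the sample rate and channel count are unchanged and only the bitrate dropped.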

@fireblade2534
Collaborator

Do WAV files behave the same way? Is their bitrate also lower?

@fireblade2534
Collaborator

Does #207 fix it?

@garrow
Author

garrow commented Feb 27, 2025

Sure does!

Kokoro-FastAPI on ⑂ fireblade2534/master:master [$] is 📦 v0.1.4 via 🐍 v3.13.2 on ☁️    @ garrow
❯ mp3info -x 2025.02.26.21:33-macos-cpu-overridebitrate.mp3
2025.02.26.21:33-macos-cpu-overridebitrate.mp3 does not have an ID3 1.x tag.
File: 2025.02.26.21:33-macos-cpu-overridebitrate.mp3
Media Type:  MPEG 2.0 Layer III
Audio:       96 KB/s, 24KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

@garrow
Author

garrow commented Feb 27, 2025

I also found that simply removing the rate=self.sample_rate, sample_rate=self.sample_rate arguments lets the encoder pick its own defaults, and it then produces a 64 KB/s file automatically (at 48KHz rather than 24KHz, per the mp3info output below).

❯ mp3info -x 2025.02.26.21:25-macos-cpu-norate.mp3
2025.02.26.21:25-macos-cpu-norate.mp3 does not have an ID3 1.x tag.
File: 2025.02.26.21:25-macos-cpu-norate.mp3
Media Type:  MPEG 1.0 Layer III
Audio:       64 KB/s, 48KHz (mono)
Emphasis:    none
CRC:         No
Copyright:   No
Original:    Yes
Padding:     No
Length:      0:03

@fireblade2534
Collaborator

thanks for telling me. I didn't realize that the rate arg was bitrate xD
