Huggingface example is broken #234

Open
pbarker opened this issue Feb 25, 2025 · 0 comments
Labels
bug Something isn't working

Comments

pbarker commented Feb 25, 2025

Backend impacted

The PyTorch implementation

Operating system

Linux

Hardware

GPU with CUDA

Description

Following the Hugging Face example at https://huggingface.co/docs/transformers/en/model_doc/moshi:

from datasets import load_dataset, Audio
import torch, math
from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer


librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
device = "cuda"
dtype = torch.bfloat16

# NOTE: `model` is never defined in the snippet as pasted; presumably something like this is intended:
model = MoshiForConditionalGeneration.from_pretrained("kyutai/moshiko-pytorch-bf16", torch_dtype=dtype).to(device)

# prepare user input audio 
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
user_input_values = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(device=device, dtype=dtype)

# prepare moshi input values - we suppose moshi didn't say anything while the user spoke
moshi_input_values = torch.zeros_like(user_input_values.input_values)

# prepare moshi input ids - we suppose moshi didn't say anything while the user spoke
# NOTE: `waveform_to_token_ratio` is also never defined in the snippet; Mimi encodes 24 kHz
# audio at 12.5 tokens per second, so roughly (my assumption):
waveform_to_token_ratio = 12.5 / feature_extractor.sampling_rate
num_tokens = math.ceil(moshi_input_values.shape[-1] * waveform_to_token_ratio)
input_ids = torch.ones((1, num_tokens), device=device, dtype=torch.int64) * tokenizer.encode("<pad>")[0]

# generate 25 new tokens (around 2s of audio)
output = model.generate(input_ids=input_ids, user_input_values=user_input_values.input_values, moshi_input_values=moshi_input_values, max_new_tokens=25)

text_tokens = output.sequences
audio_waveforms = output.audio_sequences

This line:

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")

Fails with:

OSError: kyutai/moshiko-pytorch-bf16 does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/kyutai/moshiko-pytorch-bf16/tree/main' for available files.
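
For reference, a quick way to confirm the missing file, plus a possible workaround (assuming the standard Mimi feature extractor from the kyutai/mimi repo is appropriate for Moshi preprocessing, which I'm not certain of):

from huggingface_hub import list_repo_files
from transformers import AutoFeatureExtractor

# List the files the checkpoint actually ships; preprocessor_config.json is not among them.
print(list_repo_files("kyutai/moshiko-pytorch-bf16"))

# Possible workaround (assumption): Moshi uses the Mimi codec for audio, and the
# kyutai/mimi repo ships a preprocessor_config.json, so a feature extractor can be
# loaded from there instead.
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
print(feature_extractor.sampling_rate)  # Mimi operates at 24 kHz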

Any idea how to use this checkpoint with Hugging Face Transformers?

Extra information

NA

Environment

Ubuntu 22.04, L40s GPU

pbarker added the bug label on Feb 25, 2025