Listen, Chat, And Edit on Edge: Text-Guided Soundscape Modification for Real-Time Auditory Experience
Listen, Chat, and Edit (LCE) is a cutting-edge multimodal sound mixture editor designed to modify each sound source in a mixture based on user-provided text instructions. The system features a user-friendly chat interface and the unique ability to edit multiple sound sources simultaneously within a mixture without the need for separation. Using open-vocabulary text prompts interpreted by a large language model, LCE creates a semantic filter to edit sound mixtures, which are then decomposed, filtered, and reassembled into the desired output.
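At a high level, the edit step can be pictured as a separation network conditioned on a text embedding. The sketch below is a minimal toy illustration, not the actual LCE architecture: the module names, dimensions, and the one-gain-per-source semantic filter are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a text embedding from the LLM conditions a
# Conv-TasNet-style separator that decomposes, filters, and re-mixes audio.
EMB_DIM, N_SRC, N_SAMPLES = 512, 4, 16000

class ConditionedEditor(nn.Module):
    """Toy stand-in for the LCE pipeline: separate -> weight -> re-mix."""
    def __init__(self):
        super().__init__()
        # Placeholder separator: maps a mono mixture to N_SRC source estimates.
        self.separator = nn.Conv1d(1, N_SRC, kernel_size=16, padding="same")
        # Maps the LLM text embedding to one gain per source (the "semantic filter").
        self.filter_head = nn.Linear(EMB_DIM, N_SRC)

    def forward(self, mixture, text_emb):
        sources = self.separator(mixture.unsqueeze(1))       # (B, N_SRC, T)
        gains = torch.sigmoid(self.filter_head(text_emb))    # (B, N_SRC)
        edited = (sources * gains.unsqueeze(-1)).sum(dim=1)  # re-mix -> (B, T)
        return edited

editor = ConditionedEditor()
mixture = torch.randn(1, N_SAMPLES)  # dummy 1-second mixture at 16 kHz
text_emb = torch.randn(1, EMB_DIM)   # dummy LLM embedding of the instruction
edited = editor(mixture, text_emb)
print(edited.shape)                  # torch.Size([1, 16000])
```

In the real system, the semantic filter comes from the LLM's interpretation of an open-vocabulary instruction, and the separator is a Conv-TasNet model deployed on the edge device (see the workflow steps below).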
- data/datasets: Contains the scripts used to process the datasets and prompts.
- demonstration: A demonstration of an input mixture and the edited version.
- embeddings: The .pkl files received from the LLM are stored in this folder (see the loading sketch after this list).
- hparams: Hyperparameter settings for the models.
- llm_cloud: Configuration and scripts for cloud-based language model interactions.
- modules: Core modules and utilities for the project.
- prompts: Handling and processing of text prompts.
- pubsub: Setup for publish-subscribe messaging patterns.
- utils: Utility scripts for general purposes.
- E6692.2022Spring.LCEE.ss6928.pkk2125.presentationFinal.pptx: Final presentation file detailing project overview and results.
- profiling.ipynb: Jupyter notebook for profiling the modules' inference speed and GPU memory usage.
- run_lce.ipynb: Main executable notebook for the LCE system.
- run_prompt_reader.ipynb: Notebook for reading and processing prompts.
- run_prompt_reader_profiling.ipynb: Profiling for the prompt reader.
- run_sound_editor_nosb.ipynb: Notebook for the sound editor module without SpeechBrain.
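Since the embeddings arrive as pickled files, they can be inspected with the standard library. A minimal loading sketch, assuming a hypothetical filename:

```python
import pickle

# Hypothetical filename; the actual files in embeddings/ follow the project's naming.
with open("embeddings/prompt_embedding.pkl", "rb") as f:
    embedding = pickle.load(f)

print(type(embedding), getattr(embedding, "shape", None))
```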
- Clone the repository:
git clone https://github.com/SiavashShams/Listen-Chat-Edit-on-Edge.git
- Install required dependencies:
pip install -r requirements.txt
To run the main LCE application, open and execute the run_lce.ipynb notebook.
For a demonstration of the system's capabilities, refer to the demonstration folder.
- Deploy Conv-TasNet on the Jetson Nano.
- Deploy LLAMA 2 on a GCP server.
- Send a prompt to the server. Communication is handled in two ways: through SSH or through the Pub/Sub service (see the sketch after this list).
- The LLM computes the embedding and publishes it back; this embedding is the input to the Conv-TasNet model.
- The resulting audio mixture is ready to be played!
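For the Pub/Sub path, a round trip might look like the sketch below, using the google-cloud-pubsub client. The project, topic, and subscription names are placeholders, and pickling the embedding into the message body is an assumption about the wire format, not the project's confirmed protocol.

```python
import pickle
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

PROJECT = "my-gcp-project"       # placeholder GCP project ID
TOPIC = "prompt-embeddings"      # placeholder topic the LLM publishes to
SUBSCRIPTION = "embeddings-sub"  # placeholder subscription on the Jetson side

embedding = [0.0] * 512          # dummy embedding standing in for the LLM output

# Server (GCP) side: publish the pickled embedding computed by the LLM.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)
publisher.publish(topic_path, data=pickle.dumps(embedding)).result()

# Edge (Jetson Nano) side: pull the embedding and hand it to Conv-TasNet.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

def callback(message):
    text_embedding = pickle.loads(message.data)
    message.ack()
    # ... feed text_embedding into the Conv-TasNet editor here ...

streaming_future = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_future.result(timeout=30)  # listen briefly for incoming messages
except TimeoutError:
    streaming_future.cancel()
```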
Thanks to the authors of Listen, Chat, and Edit for their amazing work.