Synchronous and Asynchronous Mic #402
Conversation
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
The Mic class is a child class of i18n.GettextMixin but was not using that parent class correctly, so I edited the __init__ method.
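Roughly, the fix looks like this (a sketch only; the mixin's real signature is in Naomi's i18n module, and the `translations` argument here is a placeholder):

```python
from naomi import i18n  # assumed import path, matching the module named above


class Mic(i18n.GettextMixin):
    def __init__(self, translations, *args, **kwargs):
        # The bug: the mixin's __init__ was never invoked, so the translation
        # machinery it provides was not being set up on Mic instances.
        i18n.GettextMixin.__init__(self, translations)
        # ... the rest of the Mic initialization ...
```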
One thing I'm not entirely happy with is having the listen() and active_listen() methods return both the transcription and the audio itself. This would be a breaking change, although it probably needs to happen, since I also want to add the speaker's identity and may come up with additional needs moving forward. I am planning to create a new Utterance class that will contain additional meta-information. I'll define a default property so that referencing the utterance object directly will return the transcription, which should make it work with plugins that call listen() expecting a string.
I added an Utterance class that returns information from the mic listen() and active_listen() methods. This object returns the transcription when called without any parameters, so it is backwards compatible with plugins that expect to get a string or list of strings back from those methods.
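For anyone following along, here is a rough sketch of the shape such a backwards-compatible wrapper might take; the attribute names are illustrative, and the real class is in the diff:

```python
class Utterance:
    """Bundles the transcription with extra metadata from listen()."""

    def __init__(self, transcription, audio=None, speaker=None):
        self.transcription = transcription  # str or list of str from the STT engine
        self.audio = audio                  # the raw captured audio, if kept
        self.speaker = speaker              # room for future speaker identification

    def __call__(self):
        # Invoking the object with no parameters returns the transcription,
        # so plugins expecting a plain string (or list of strings) keep working.
        return self.transcription
```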
I think this is ready to go now. If anyone is interested, please try it. Let me know if you encounter any issues. If not, I will merge it in a week.
Description
I have turned the Mic class into an abstract class and used it to create two new classes, MicSynchronous and MicAsynchronous. I'm hoping to extend this to all the other mic classes, including the local (text) mic and the batch mic.
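The class split looks roughly like this (a sketch under assumed method bodies, not the exact code in the diff):

```python
from abc import ABC, abstractmethod


class Mic(ABC):
    """Shared mic behavior; concrete subclasses decide how audio is captured."""

    @abstractmethod
    def listen(self):
        """Return the next utterance heard from the user."""


class MicSynchronous(Mic):
    def listen(self):
        # Record and transcribe inline; Naomi does nothing else meanwhile.
        raise NotImplementedError("sketch only")


class MicAsynchronous(Mic):
    def listen(self):
        # Pull the next result from a queue fed by a background capture thread,
        # so recording continues while Naomi is busy processing.
        raise NotImplementedError("sketch only")
```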
I'm attempting to support both active listen mode (where the computer only starts listening for a command after hearing its wake word, like Siri) and passive listen mode (where the computer records blocks of audio, checks each block for the wake word, and then checks the same block for a command, like an Echo).
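In pseudocode, the two modes differ like this (helper names such as record_block, contains_wakeword, and transcribe are placeholders, not Naomi's actual API):

```python
def passive_listen_loop(mic):
    # Echo-like: the wake word and the command may share the same audio block.
    while True:
        block = mic.record_block()
        if contains_wakeword(block):
            handle(transcribe(block))


def active_listen_loop(mic):
    # Siri-like: only start listening for a command after the wake word.
    while True:
        if contains_wakeword(mic.record_block()):
            handle(transcribe(mic.record_command()))
```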
Right now, I am having trouble with the expect function when using passive listen mode with the asynchronous listener. This is caused by the pyaudio device's play_file, which returns once it has finished writing to the queue, but before the audio has finished playing. This leads to situations where the next audio starts getting queued before the last audio finishes playing. If the audio files have different frame sizes, this causes a segmentation fault.
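One conceivable workaround, sketched with pyaudio's blocking API (this is an assumption on my part, not the fix in this PR; the real play_file lives in the audioengine plugin):

```python
import threading
import wave

import pyaudio

_playback_lock = threading.Lock()


def play_file_blocking(path, chunk=1024):
    """Play a wav file and do not return until the stream has drained."""
    with _playback_lock:  # serialize playback so queued audio never overlaps
        pa = pyaudio.PyAudio()
        wf = wave.open(path, 'rb')
        try:
            stream = pa.open(
                format=pa.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True,
            )
            data = wf.readframes(chunk)
            while data:
                stream.write(data)  # blocking write: waits on the device buffer
                data = wf.readframes(chunk)
            stream.stop_stream()
            stream.close()
        finally:
            wf.close()
            pa.terminate()
```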
I have been testing with the "knock knock joke" and "time" speechhandler plugins; the knock-knock joke uses expect quite a bit. I have been using Pocketsphinx_KWS as my passive STT engine, Pocketsphinx as my special STT engine, and VOSK (available here: https://github.com/aaronchantrill/Naomi_VOSK_STT) as my active STT engine.

VOSK works well, at least in English, but requires some additional training if you have non-standard words in your vocabulary. I'd like to make VOSK officially available through NPE, but the last time I trained VOSK to recognize some additional words, it required a computer with 32 GiB of RAM. I will test on my Raspberry Pi 5 with 8 GiB and see if it can handle it, but I have low expectations. I would like to add an option to export the Naomi vocabulary so VOSK can be trained on another computer, as VOSK does run well on the Raspberry Pi 4 and 5.
Related Issue
Naomi does not listen while thinking #340
Motivation and Context
The microphone does not currently continue to collect audio while Naomi is processing. This is especially a problem when entering a room, as the VAD often captures the noise of your entrance as audio to process. If you walk into the room and then address Naomi while it is still processing the audio of you walking in, it will miss your request.
How Has This Been Tested?
I have tested with both "listen while talking=True" (asynchronous) and "listen while talking=False" (synchronous) modes.
I have tested with both "passive_listen=True" (passive listening) and "passive_listen=False" (active listening) modes.
I have been testing by asking Naomi to tell me a knock-knock joke (which uses the "expect" method) and then either allowing it to finish the joke, or asking it to tell me the time before it completes the joke:
User: Tell me a knock knock joke
Naomi: Knock knock
User: Naomi, what time is it?
Naomi: It is 12:15 PM right now
Screenshots (if appropriate):
Types of changes
Checklist: