
Synchronous and Asynchronous Mic #402

Merged

Conversation

aaronchantrill
Contributor

Description

I have turned the Mic class into an abstract class and used it to create two new classes, MicSynchronous and MicAsynchronous. I'm hoping to expand it to all the different mic classes, including the local (text) mic and the batch mic.
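Roughly, the new layout looks like the sketch below (simplified; the method bodies and exact signatures in naomi/mic.py differ):

```python
from abc import ABC, abstractmethod


class Mic(ABC):
    """Shared setup (audio device, VAD, STT engines) lives in the base class."""

    @abstractmethod
    def listen(self):
        """Wait for an utterance and return its transcription."""

    @abstractmethod
    def active_listen(self):
        """Listen for a command after the wake word has been heard."""


class MicSynchronous(Mic):
    """Records only when asked to; nothing is captured while Naomi is busy."""

    def listen(self):
        ...

    def active_listen(self):
        ...


class MicAsynchronous(Mic):
    """Keeps capturing audio in the background so speech during processing is not lost."""

    def listen(self):
        ...

    def active_listen(self):
        ...
```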

I'm attempting to support both active listen mode (where the computer only starts listening for a command after hearing its wakeword - Siri-like mode) and passive listen mode (where the computer records blocks of audio, then checks for the wake word and then checks the same block of audio for a command - Echo-like).
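In rough pseudocode, the two cycles differ like this (a sketch only; record_block, record_utterance, transcribe_passive, transcribe_active, and contains_wakeword are placeholder names, not Naomi's actual API):

```python
def active_listen_cycle(mic):
    # Siri-like: react to the wake word first, then record the command.
    block = mic.record_block()
    if mic.contains_wakeword(mic.transcribe_passive(block)):
        return mic.transcribe_active(mic.record_utterance())
    return None


def passive_listen_cycle(mic):
    # Echo-like: the command may already be inside the same block as the
    # wake word, so re-check that block with the active STT engine.
    block = mic.record_block()
    if mic.contains_wakeword(mic.transcribe_passive(block)):
        return mic.transcribe_active(block)
    return None
```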

Right now, I am having trouble with the expect function when using passive listen mode with the asynchronous listener. This comes down to the pyaudio device's play_file, which returns once it has finished writing to the queue, but before the audio has finished playing. This leads to situations where the next audio clip starts getting queued before the previous one finishes playing, and if the clips have different frame sizes, the result is a segmentation fault.
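One possible direction (not the fix in this PR) would be a playback helper that only returns once the output stream has drained, using the standard wave and pyaudio APIs:

```python
import wave

import pyaudio


def play_file_blocking(path, chunk=1024):
    """Play a WAV file and only return once playback has finished."""
    with wave.open(path, "rb") as wf:
        pa = pyaudio.PyAudio()
        stream = pa.open(
            format=pa.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
        )
        data = wf.readframes(chunk)
        while data:
            stream.write(data)  # blocking write: waits for buffer space
            data = wf.readframes(chunk)
        stream.stop_stream()    # stop only after all frames have been written
        stream.close()
        pa.terminate()
```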

I have been testing with the "knock knock joke" and "time" speechhandler plugins; the knock-knock joke plugin uses expect quite a bit. I have been using Pocketsphinx_KWS as my passive STT engine, Pocketsphinx as my special STT engine, and VOSK (available here: https://github.com/aaronchantrill/Naomi_VOSK_STT) as my active STT engine. VOSK works well, at least in English, but requires some additional training if you have non-standard words in your vocabulary. I'd like to make VOSK officially available through NPE, but the last time I trained VOSK to recognize some additional words, it required a computer with 32 GiB of RAM. I will test on my Raspberry Pi 5 with 8 GiB and see if it can handle it, but I have low expectations. I would also like to add an option to export the Naomi vocabulary so VOSK can be trained on another computer, since it does run well on the Raspberry Pi 4 and 5.

Related Issue

Naomi does not listen while thinking #340

Motivation and Context

The microphone does not currently continue to collect audio while Naomi is processing. This is especially a problem when entering a room, as the VAD still often captures noises as audio to process. If you walk into the room and then address Naomi while it is processing the audio of you walking into the room, it will miss your request.

How Has This Been Tested?

I have tested with both "listen while talking=True" (asynchronous) and "listen while talking=False" (synchronous) modes.
I have tested with both "passive_listen=True" (passive listening) and "passive_listen=False" (active listen) modes
I have been testing by asking Naomi to tell me a knock-knock joke (which uses the "expect" method) and then either allowing it to finish the joke, or asking it to tell me the time before it completes the joke:
User: Tel me a knock knock joke
Naomi: Knock knock
User: Naomi, what time is it?
Naomi: It is 12:15 PM right now

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.


@github-advanced-security bot left a comment


CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Mic class is a child class of i18n.GettextMixin but was not using
that parent class correctly. Edited the __init__ method.
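
For context, the usual shape of this kind of fix is to forward to the mixin's __init__ via super() rather than bypass it; the sketch below is a generic illustration, and the real GettextMixin in naomi/i18n.py may take different arguments.

```python
# Generic illustration of correct mixin initialisation, not Naomi's actual code.
class GettextMixin:
    def __init__(self, translations):
        self._translations = translations

    def gettext(self, message):
        return self._translations.get(message, message)


class Mic(GettextMixin):
    def __init__(self, translations, *args, **kwargs):
        # Forward to the mixin's __init__ instead of bypassing it,
        # so that self.gettext() works on Mic instances.
        super().__init__(translations)
```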
@aaronchantrill aaronchantrill marked this pull request as draft March 3, 2024 17:25
@aaronchantrill
Contributor Author

aaronchantrill commented Mar 3, 2024

One thing I'm not really happy with is having the listen() and active_listen() methods return both the transcription and the audio itself. This would be a breaking change, although it probably needs to happen, since I also want to add the speaker's identity and may come up with additional needs moving forward. I am planning to create a new Utterance class that will carry this additional meta-information. I'll define a default property so that referencing the utterance object directly returns the transcription, which should keep it working with plugins that call listen() expecting a string.
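
The rough idea is something like the following sketch (illustrative only; the attribute and method names are placeholders rather than the final interface):

```python
class Utterance:
    """Carries the transcription plus room for extra metadata."""

    def __init__(self, transcription, audio=None, speaker=None):
        self.transcription = transcription  # str or list of candidate strings
        self.audio = audio                  # raw audio frames, if retained
        self.speaker = speaker              # placeholder for speaker identity

    def __call__(self):
        # Calling the object with no arguments yields the transcription,
        # keeping older plugins that expect listen()'s previous return
        # value working with minimal changes.
        return self.transcription


utterance = Utterance(["what time is it"], audio=b"\x00\x01")
print(utterance())  # ['what time is it']
```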

I added an Utterance class that returns information from the mic
listen() and active_listen() methods. This object returns the
transcription when called without any parameters, so it is
backwards compatible with plugins that expect to get a string or
list of strings back from those methods.
@aaronchantrill aaronchantrill marked this pull request as ready for review March 3, 2024 21:17
@aaronchantrill
Contributor Author

I think this is ready to go now. If anyone is interested, please try it and let me know if you encounter any issues. If none come up, I will merge it in a week.

@aaronchantrill aaronchantrill merged commit 579a8b3 into NaomiProject:naomi-dev Mar 10, 2024
4 checks passed