+---+   +----------------+   +---+   +---+   +---+
|Mic|-->|Audio Processing|-->|KWS|-->|STT|-->|NLU|
+---+   +----------------+   +---+   +---+   +-+-+
                                               |
+-------+   +---+   +----------------------+   |
|Speaker|<--|TTS|<--|Knowledge/Skill/Action|<--+
+-------+   +---+   +----------------------+
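- Audio Processing is the speech front end. It includes Acoustic Echo Cancellation (AEC), beamforming, Noise Suppression (NS), and similar steps. Microphone arrays are now the mainstream way to achieve far-field voice interaction, and research has begun on combining neural networks with classical speech processing algorithms.
- Keyword Spotting (KWS) detects keywords such as "OK Google" or "Hey Siri" and then starts a conversation. It runs on the end device, so audio is not continuously uploaded to the cloud.
- Speech To Text (STT) converts speech into text. It has now switched entirely from HMM+GMM to neural networks.
- Natural Language Understanding (NLU) extracts meaning from the recognized text. Neural network implementations are already mainstream.
- Knowledge/Skill/Action is the knowledge base and its extensions.
- Text To Speech (TTS) converts text into speech. Google's neural-network-based WaveNet TTS is already in use in Google Assistant.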
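The flow through the stages listed above can be sketched as a chain of callables. This is only an illustration of the architecture: `VoicePipeline`, its field names, and the toy lambdas are hypothetical placeholders, not the API of any project listed in this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the pipeline stages; every field is a stub."""
    front_end: Callable[[bytes], bytes]  # AEC / beamforming / NS
    kws: Callable[[bytes], bool]         # True when the keyword is heard
    stt: Callable[[bytes], str]          # audio -> text
    nlu: Callable[[str], dict]           # text -> intent
    skill: Callable[[dict], str]         # intent -> reply text
    tts: Callable[[str], bytes]          # reply text -> audio

    def handle(self, wake_audio: bytes, command_audio: bytes) -> Optional[bytes]:
        """Return synthesized reply audio, or None if no keyword was detected."""
        if not self.kws(self.front_end(wake_audio)):
            return None  # nothing is passed on to STT without the keyword
        text = self.stt(self.front_end(command_audio))
        intent = self.nlu(text)
        reply = self.skill(intent)
        return self.tts(reply)


# Toy stand-ins so the sketch runs; real engines would replace each lambda.
pipeline = VoicePipeline(
    front_end=lambda pcm: pcm,
    kws=lambda pcm: pcm == b"hey",
    stt=lambda pcm: pcm.decode(),
    nlu=lambda text: {"intent": "echo", "text": text},
    skill=lambda intent: "you said " + intent["text"],
    tts=lambda text: text.encode(),
)
```

Gating everything behind `kws` mirrors the point made in the list: audio only moves past the device once the keyword has been detected.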
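In speech, natural language processing, and image processing, neural-network-based machine learning is very effective. Companies behind initiatives such as Google's AI First and Baidu's All In AI are able to put AI into real products rather than building castles in the air.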
- Mycroft ⭐ -
- dingdang robot - a 🇨🇳 voice interaction robot based on Jasper and built with Raspberry Pi
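Amazon Alexa Voice Service - has the most users, developers, and third-party partners.
Google has the best technology. Google Assistant can be extended quickly through Dialogflow (formerly API.AI), and its Device Actions integrate well with local devices.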
- Snowboy - DNN based hotword and wake word detection toolkit
- Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting
- ML-KWS-For-MCU - possibly the most promising option for resource-constrained devices such as ARM Cortex-M7 microcontrollers
- Mozilla DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture
- Kaldi
- PocketSphinx - a lightweight speech recognition engine using HMM + GMM
- Mimic - Mycroft's TTS engine, based on CMU's Flite (Festival Lite)
- MaryTTS - an open-source, multilingual text-to-speech synthesis system written in pure Java
- espeak-ng - an open source speech synthesizer that supports 99 languages and accents.
- ekho - Chinese text-to-speech engine
- WaveNet, Tacotron 2
Acoustic Echo Cancellation
Direction Of Arrival (DOA) - the most widely used DOA algorithm is GCC-PHAT
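As a rough illustration of GCC-PHAT, the sketch below estimates the time delay between two microphone signals: it takes the cross-spectrum, keeps only its phase (the PHAT weighting), and looks for the peak of the resulting cross-correlation. It is a dependency-free teaching sketch, not taken from any project listed here. The helper names `dft` and `gcc_phat` are made up for this example, and the naive O(n²) DFT is an assumption chosen for self-containment. A real implementation would use an FFT library.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for short frames only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, fs=1):
    """Estimate how much `sig` lags `ref`, in seconds (samples when fs=1)."""
    n = len(sig) + len(ref)  # zero-pad to avoid circular wrap-around
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    # PHAT weighting: discard magnitude, keep phase only
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [v.real for v in dft(cross, inverse=True)]
    peak = max(range(n), key=lambda i: cc[i])
    lag = peak if peak < n // 2 else peak - n  # upper half holds negative lags
    return lag / fs
```

Given the delay between two microphones and their spacing, the arrival angle follows from simple geometry. That step is omitted here.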
- BeamformIt - filter-and-sum beamforming
- CGMM Beamforming - a reference implementation
- MVDR Beamforming
- GSC Beamforming
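The simplest relative of the beamformers listed above is delay-and-sum: shift each microphone channel so the target source lines up, then average. The sketch below shows only that basic idea. It is not the algorithm used by BeamformIt or by the CGMM, MVDR, or GSC implementations. The function name `delay_and_sum` and the restriction to non-negative integer sample delays are assumptions made for illustration.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average them.

    channels: list of equal-length sample lists, one per microphone
    delays:   per-channel arrival delay in samples (non-negative integers),
              for example estimated with a TDOA method
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n - d):
            # sample t of the source appears at index t + d in this channel
            out[t] += samples[t + d]
    return [v / len(channels) for v in out]
```

Samples from the target direction add coherently after alignment, while noise arriving with other delays tends to average out.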
Noise Suppression
- NS of WebRTC audio processing
- PortAudio
- libsoundio
- PulseAudio