`scribe` is a speech recognition tool that provides real-time transcription using cutting-edge AI models, with the goal of serving as a virtual keyboard on a computer.
It features local, downloadable models with the `vosk` and `whisper` backends, as well as a client to OpenAI via the `openaiapi` backend (API key required).
The package was initially developed for python 3.12 on Ubuntu 24.04 with GNOME + Wayland, but it should work on other platforms as well (feedback welcome). Check the pages of the dependencies for more info (e.g. pynput for the keyboard, pystray for the app).
- python 3.13: at the time of writing, `openai-whisper` does not install.
- Ubuntu: see the caveats on keyboard use under Wayland in the keyboard section below.
- MacOS: tested on a MacBook Air M1 with 8 GB RAM and python 3.12. It runs, but poorly, presumably because of the low memory: prefer the `openaiapi` backend on such machines. I expect the local models to run fine with better memory specs.
- Windows: not tested yet.
Install the PortAudio library (required by `sounddevice`) and the xclip library (required by `pyperclip`). E.g. on Ubuntu:
```bash
sudo apt-get install portaudio19-dev xclip
```
(`portaudio19-dev` becomes `portaudio` with Homebrew)
See the additional requirements below for the icon tray and keyboard options. The python dependencies should be dealt with automatically:
```bash
pip install scribe-cli[all]
```
(note the `-cli` suffix for the client)
or for local development:
```bash
git clone https://github.com/perrette/scribe.git
cd scribe
pip install -e .[all]
```
You can leave out the optional dependencies (drop `[all]`), but you must install at least one of the `vosk`, `openai-whisper` or `openai` packages (see Usage below). At the time of writing `openai-whisper` does not install on python 3.13. In that case you can install the other packages manually and skip that one, which makes the `whisper` backend unavailable.
```bash
# language models (at least one must be installed!)
pip install vosk
pip install openai soundfile  # openaiapi
pip install openai-whisper    # FAILS on python 3.13 on Ubuntu at the time of writing
# PortAudio (sounddevice)
pip install sounddevice  # automatically installed as a required dependency
sudo apt-get install portaudio19-dev
# MacOS: brew install portaudio
# clipboard
pip install pyperclip  # automatically installed as a required dependency
sudo apt-get install xclip
# keyboard
pip install pynput
# app mode
sudo apt install libcairo-dev libgirepository1.0-dev gir1.2-appindicator3-0.1  # Ubuntu only (not needed on MacOS)
pip install PyGObject  # Ubuntu only (not needed on MacOS)
pip install pystray
# and finally
pip install scribe-cli
```
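To confirm everything is in place, a quick sanity check (my suggestion, not part of the original instructions) is to import one of the backends and call the entry point:
```bash
python -c "import vosk"  # or: import whisper / import openai
scribe --help            # confirms the scribe entry point is installed
```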
The language models for the local backends `vosk` and `whisper` are downloaded on the fly. The default download folder is `$XDG_CACHE_HOME/{backend}`, where `$XDG_CACHE_HOME` defaults to `$HOME/.cache`.
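For instance, to inspect the downloaded models (assuming the cache layout described above):
```bash
# The documented default: $XDG_CACHE_HOME (or ~/.cache as fallback) + backend name
ls "${XDG_CACHE_HOME:-$HOME/.cache}/whisper"   # e.g. for the whisper backend
```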
Just type in the terminal:
```bash
scribe
```
and the script will guide you through the choice of backend (`whisper`, `vosk` or `openaiapi`) and the specific language model.
After this, you will be prompted to start recording your microphone; the transcribed text is printed either in real time (`vosk`) or once the recording is complete (`whisper`).
You can interrupt the recording via Ctrl + C and start again or change the model.
The default (`whisper`) is excellent at transcribing full-length audio sequences in many languages. It is really impressive, but it cannot do real-time transcription, and depending on the model it can have a relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster; it is selected as the default in scribe for that reason.
With the whisper model (`whisper` and `openaiapi` backends), the recording continues for up to 2 minutes, or until you stop it manually to trigger the transcription (Stop in the app, Ctrl + C in the terminal).
These parameters can be changed. There is also the possibility to interrupt after a silence is detected, as in the sketch below: pass `--silence -40 --silence-duration 2` to interrupt the recording when a silence (below -40 dB) lasts for more than 2 seconds. This is experimental, and the default is an exceedingly low silence threshold of -200 dB with a silence duration of 120 s, which effectively disables that feature and keeps full manual control.
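For example (the `--silence-duration` flag name is my best reading of the garbled original text; check `scribe --help` for the exact spelling):
```bash
# Stop recording automatically after 2 s of silence below -40 dB, then transcribe
scribe --backend whisper --silence -40 --silence-duration 2
```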
The `vosk` backend is much faster and very good at real-time transcription for one language, but it tended to make more mistakes in my tests and it does not do punctuation.
It becomes really powerful for longer or interactive typing sessions with the keyboard option, e.g. to make notes or chat with an AI.
There are many vosk models available; here a few are associated with a handful of languages: `en`, `fr`, `it`, `de` (so far).
The `openaiapi` backend uses the `whisper-1` model at the time of writing. It requires an API key, best passed as an environment variable, e.g. in bash:
```bash
export OPENAI_API_KEY=YOURAPIKEY
scribe --backend openaiapi
```
The `openaiapi` backend is lightweight and handy if you have an API key (you can create one for free for testing) and a low-spec computer (and don't care too much about privacy, obviously).
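To avoid exporting the key in every new terminal, you can persist it in your shell profile (a common bash pattern, not specific to scribe):
```bash
# Append the export to ~/.bashrc so new shells pick it up automatically
echo 'export OPENAI_API_KEY=YOURAPIKEY' >> ~/.bashrc
```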
By default the transcription is printed to the terminal, but other output media are supported.
The most straightforward is the clipboard:
```bash
scribe --clipboard
```
The full transcription is then copied to the clipboard, and it is up to the user to paste it (e.g. Ctrl + V).
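You can verify the clipboard content from the terminal, e.g. with xclip on Linux (this check is my addition, not from the original instructions):
```bash
xclip -selection clipboard -o   # print the current clipboard content
```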
Alternatively, an output file can be specified:
```bash
scribe -o transcription.txt
```
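If the file is written incrementally (an assumption on my part), you can follow the transcription from another terminal:
```bash
tail -f transcription.txt   # watch the transcription as it grows
```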
With the `--keyboard` option, scribe will attempt to simulate a keyboard and send the transcribed characters to the application in focus:
```bash
scribe --keyboard
```
This can be extremely useful with the `vosk` backend and its real-time transcription, or alternatively with the `--restart` option with the `whisper` backend, as in the examples below.
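For instance (combining flags that are all documented above):
```bash
scribe --backend vosk --keyboard               # real-time dictation into the focused window
scribe --backend whisper --keyboard --restart  # repeated whisper sessions
```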
The `--keyboard` option relies on the optional `pynput` dependency (installed together with scribe if you used the `[all]` or `[keyboard]` option).
Depending on your operating system, `pynput` may require additional configuration to work around its limitations.
On my Ubuntu 24.04 + Wayland system, the keyboard simulation works out of the box in Chromium-based applications (including vscode), but not in Firefox, Sublime Text, or anything else (not even in a terminal!). I am told this is because Chromium runs an X server emulator and so is compatible with the default pynput backend.
One workaround is to use the Xorg version of GNOME: in `/etc/gdm3/custom.conf`, uncomment the line `# WaylandEnable=false` and restart your computer.
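This can be done from the command line, e.g. with sed (my suggestion; editing the file by hand works just as well):
```bash
# Uncomment WaylandEnable=false in the GDM config, then reboot
sudo sed -i 's/^#\s*WaylandEnable=false/WaylandEnable=false/' /etc/gdm3/custom.conf
```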
Another workaround, while staying with Wayland, is to use the low-level `uinput` backend of `pynput`, but that requires scribe to be run as root (sudo), and likely other configuration such as activating the `uinput` kernel module (`sudo modprobe uinput` for a one-time test, or adding `uinput` to `/etc/modules-load.d/modules.conf` to make that persistent, as sketched below).
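For example (the tee pattern is my suggestion for appending to a root-owned file):
```bash
sudo modprobe uinput                                        # one-time test
echo uinput | sudo tee -a /etc/modules-load.d/modules.conf  # persistent across reboots
```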
Moreover, the keyboard must be set to an appropriate layout: for example, to get the letter é you'd want a French or Italian layout, otherwise the English layout will drop it or replace it with something else. Another caveat I encountered is that special characters (é) were inserted at the wrong place; adding a small delay via the additional parameter `--latency 0.01` was enough to fix that.
Finally, if you run as sudo you may need to reset some environment variables so that the list of audio devices (`XDG_RUNTIME_DIR`) and the download folder (`HOME`) remain the same. To sum up, that gives something like:
```bash
sudo modprobe uinput
sudo HOME=$HOME XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR PYNPUT_BACKEND_KEYBOARD=uinput $(which scribe) --latency 0.01
```
You're on the right path :)

To avoid switching back and forth with the terminal, it's possible to interact with the program via a tray icon. To activate it, start with:
```bash
scribe --app ...
```
or toggle the app option in the interactive menu. The scribe icon will show, with Record and other options, and will change based on what the app is doing. It is possible to choose from a set of predefined models (controlled by `--vosk-models` and `--whisper-models`) and options, or to Quit and choose from the terminal before pressing Enter again.
For the vosk model there are only two states: recording + transcribing, or idle. For the whisper model there are three states visible from the icon: recording/waiting, transcribing, and idle.
That option requires `pystray` to be installed. This is included with the `pip install ...[all]` option.
The `--vosk-models` and `--whisper-models` options allow predefining the set of available models to choose from in the app menu. E.g.:
```bash
scribe --app --vosk-models vosk-model-fr-0.22 --whisper-models small turbo ...
```
On Ubuntu, the following dependencies were required to make the menus appear:
```bash
sudo apt install libcairo-dev libgirepository1.0-dev gir1.2-appindicator3-0.1
pip install PyGObject
```
If you run Ubuntu (or possibly another distribution) with GNOME, the script `scribe-install [...]` will create a `scribe.desktop` file and place it under `$HOME/.local/share/applications` to make it available from the quick launch menu. Any option will be passed on to scribe, with the additional options `--name` and `--no-terminal`.
`--no-terminal` means no terminal will show up, and it also implies the options `--app --no-prompt`.
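After installation you can inspect the generated launcher (the exact file name is an assumption based on the `scribe.desktop` name above; `--name` may change it):
```bash
cat $HOME/.local/share/applications/scribe.desktop
```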
Consider the following two flavors:
```bash
scribe-install --name "Scribe Terminal" --clipboard ...
scribe-install --name "Scribe" --no-terminal --clipboard ...
```
The first will create an app named Scribe Terminal that simply opens a terminal and executes the command `scribe --clipboard ...`.
The second will create an app named Scribe that executes `scribe --no-prompt --app --clipboard ...` in a hidden terminal, thus leaving the tray icon as the only mode of interaction.
There are a number of options to control the silence threshold, duration and more. The best is to check the available options in the online help:
```bash
scribe --help
```