Kaldi engine back-end

The Kaldi engine implementation uses the free, open source, Kaldi speech recognition toolkit. You can read more about the Kaldi project on the Kaldi project site.

This backend relies greatly on the kaldi-active-grammar library, which extends Kaldi's standard decoding for use in a Dragonfly-style environment, allowing many dynamic grammars to be combined and set active/inactive based on contexts in real time. It also provides basic infrastructure for compiling, recognizing, and parsing grammars with Kaldi, plus a compatible model. For more information, see its project page.

Both this backend and kaldi-active-grammar have been developed by David Zurow (@daanzu). Kaldi-backend-specific issues, suggestions, and feature requests are welcome & encouraged, but are probably better sent to the kaldi-active-grammar repository. If you value this work and want to encourage development of a free, open source engine for Dragonfly as a competitive alternative to commercial offerings, kaldi-active-grammar accepts donations (not affiliated with the Dragonfly project itself).

The Kaldi backend may be used on Windows, macOS, and Linux. Dragonfly has good support for desktop environments on each of these operating systems. As other sections of the documentation mention, Dragonfly does not support Wayland.

Please note that Kaldi Active Grammar is licensed under the GNU Affero General Public License v3 (AGPL-3.0-or-later). This has an effect on what Dragonfly can be used to do when used in conjunction with the Kaldi engine back-end.

Setup

Want to get started quickly & easily on Windows? A self-contained, portable, batteries-included (python & libraries & model) distribution of kaldi-active-grammar + Dragonfly is available at the kaldi-active-grammar project releases page. Otherwise…

Requirements:

  • Python 3.6+; 64-bit required!

  • OS: Windows/Linux/macOS all supported

  • Only supports Kaldi left-biphone models, specifically nnet3 chain models, with specific modifications

  • ~1GB+ disk space for model plus temporary storage and cache, depending on your grammar complexity

  • ~500MB+ RAM for model and grammars, depending on your model and grammar complexity

  • Python package dependencies, which should be installed automatically by following the instructions below.

Note for Linux: You may need the portaudio headers to be installed in order to install/compile the sounddevice Python package. Under apt-based distributions, you can get them by running sudo apt install portaudio19-dev. You may also need to make your user account a member of the audio group to be able to access your microphone. Do this by running sudo usermod -a -G audio <account_name>.

Installing the correct versions of the Python dependencies can be most easily done by installing the kaldi sub-package of dragonfly using:

pip install 'dragonfly[kaldi]'

If you are installing to develop Dragonfly, use the following instead (from your Dragonfly git repository):

pip install -e '.[kaldi]'

Note: If you have errors installing the kaldi-active-grammar package, make sure you’re using a 64-bit Python, and update your pip by executing pip install --upgrade pip.

You will also need a model to use. You can download a compatible general English Kaldi nnet3 chain model from kaldi-active-grammar. Unzip it into a directory within the directory containing your grammar modules.

Note for Linux: Before proceeding, you’ll need to install the wmctrl, xdotool and xsel programs. Under apt-based distributions, you can get them by running:

sudo apt install wmctrl xdotool xsel

You may also need to manually set the DISPLAY environment variable.

Once the dependencies and model are installed, you’re ready to go!

Getting Started

A simple, single-file, standalone demo/example can be found in the dragonfly/examples/kaldi_demo.py script. Simply run it from the directory containing the above model (or modify the configuration paths in the file) using:

python path/to/kaldi_demo.py

For more structured and long-term use, you’ll want to use a module loader. Copy the dragonfly/examples/kaldi_module_loader_plus.py script into the folder with your grammar modules and run it using:

python kaldi_module_loader_plus.py

This file is equivalent to the 'core' directory that NatLink uses to load grammar modules. When run, it will scan the directory it's in for files beginning with _ and ending with .py, then try to load them as command modules.

This file also includes a basic sleep/wake grammar to control recognition (simply say “start listening” or “halt listening”).

A more basic loader is in dragonfly/examples/kaldi_module_loader.py.
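
Putting this together, a grammar-module directory might look like the following (file and directory names are illustrative):

my_grammars/
    kaldi_module_loader_plus.py   # the module loader, run with python
    kaldi_model/                  # the unzipped model directory
    _notepad_commands.py          # loaded: starts with "_" and ends with ".py"
    _browser_commands.py          # loaded
    helpers.py                    # ignored: no leading "_"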

Updating To A New Version

When updating to a new version of Dragonfly, you should always rerun pip install 'dragonfly[kaldi]' (or pip install '.[kaldi]', etc.) to make sure you get the required version of kaldi_active_grammar.

Engine Configuration

This engine can be configured by passing (optional) keyword arguments to the get_engine() function, which passes them to the KaldiEngine constructor (documented below). For example:

engine = get_engine("kaldi",
  model_dir='kaldi_model',
  tmp_dir=None,
  audio_input_device=None,
  audio_self_threaded=True,
  audio_auto_reconnect=True,
  audio_reconnect_callback=None,
  retain_dir=None,
  retain_audio=None,
  retain_metadata=None,
  retain_approval_func=None,
  vad_aggressiveness=3,
  vad_padding_start_ms=150,
  vad_padding_end_ms=200,
  vad_complex_padding_end_ms=600,
  auto_add_to_user_lexicon=True,
  allow_online_pronunciations=False,
  lazy_compilation=True,
  invalidate_cache=False,
  expected_error_rate_threshold=None,
  alternative_dictation=None,
)

The engine can also be configured via the command-line interface:

# Initialize the Kaldi engine backend with custom arguments, then load
# command modules and recognize speech.
python -m dragonfly load _*.py --engine kaldi --engine-options " \
    model_dir=kaldi_model_daanzu \
    vad_padding_end_ms=300"

class KaldiEngine(model_dir=None, tmp_dir=None, input_device_index=None, audio_input_device=None, audio_self_threaded=True, audio_auto_reconnect=True, audio_reconnect_callback=None, retain_dir=None, retain_audio=None, retain_metadata=None, retain_approval_func=None, vad_aggressiveness=3, vad_padding_start_ms=150, vad_padding_end_ms=200, vad_complex_padding_end_ms=600, auto_add_to_user_lexicon=True, allow_online_pronunciations=False, lazy_compilation=True, invalidate_cache=False, expected_error_rate_threshold=None, alternative_dictation=None, compiler_init_config=None, decoder_init_config=None)[source]

Speech recognition engine back-end for Kaldi recognizer.

Arguments (all optional):

  • model_dir (str|None) – Directory containing model.

  • tmp_dir (str|None) – Directory to use for temporary storage and cache (used for caching during and between executions but safe to delete).

  • audio_input_device (int|str|None|False) – Microphone PortAudio input device: the default of None chooses the default input device, or False disables microphone input. To see a list of available input devices and their corresponding indexes and names, call get_engine('kaldi').print_mic_list(). To select a specific device, pass an int representing the index number of the device, or pass a str representing (part of) the name of the device. If a string is given, the device is selected whose name contains all of the space-separated parts, in the right order. Each device string includes the name of the corresponding host API at the end. The string comparison is case-insensitive, and the match must be unique. See the example after this argument list.

  • audio_auto_reconnect (bool) – Whether to automatically reconnect the audio device if it appears to have stopped (by not returning any audio data for some period of time).

  • audio_reconnect_callback (callable|None) – Callable to be called every time the audio system attempts to reconnect (automatically or manually). It must take exactly one positional argument, which is the MicAudio object.

  • retain_dir (str|None) – Retains recognized audio and/or metadata in the given directory, saving audio to a retain_[timestamp].wav file and metadata to retain.tsv. What is automatically retained is determined by retain_audio and retain_metadata. If both are False but this is set, you can still actively choose to retain a given recognition. See below for more information.

  • retain_audio (bool|None) – Whether to retain audio data for all recognitions. If True, requires retain_dir to be set. If None, defaults to True if retain_dir is set. See below for more information.

  • retain_metadata (bool|None) – Whether to retain metadata for all recognitions. If True, requires retain_dir to be set. If None, defaults to True if retain_dir is set. See below for more information.

  • retain_approval_func (Callable) – If retaining is enabled, this is called upon each recognition, to determine whether or not to retain it. It must accept as a parameter the dragonfly.engines.backend_kaldi.audio.AudioStoreEntry under consideration, and return a bool (True to retain). This is useful for ignoring recognitions that tend to be noise, perhaps contain sensitive content, etc.

  • vad_aggressiveness (int) – Aggressiveness of the Voice Activity Detector: an integer between 0 and 3, where 0 is the least aggressive about filtering out non-speech, and 3 is the most aggressive.

  • vad_padding_start_ms (int) – Approximate length of padding/debouncing (in milliseconds) at beginning of each utterance for the Voice Activity Detector. Smaller values result in lower latency recognition (faster reactions), but possibly higher likelihood of false positives at beginning of utterances, and more importantly higher possibility of not capturing the entire beginning of utterances.

  • vad_padding_end_ms (int) – Approximate length of silence (in milliseconds) at ending of each utterance for the Voice Activity Detector. Smaller values result in lower latency recognition (faster reactions), but possibly higher likelihood of false negatives at ending of utterances.

  • vad_complex_padding_end_ms (int|None) – If not None, the Voice Activity Detector behaves differently for utterances that are complex (usually meaning inside dictation), using this value instead of vad_padding_end_ms, so you can attain longer utterances to take advantage of context to improve recognition quality.

  • auto_add_to_user_lexicon (bool) – Enables automatically adding unknown words to the User Lexicon. This will only work if you have additional required packages installed. This will only work locally, unless you also enable allow_online_pronunciations.

  • allow_online_pronunciations (bool) – Enables online pronunciation generation for unknown words, if you have also enabled auto_add_to_user_lexicon, and you have the required packages installed.

  • lazy_compilation (bool) – Enables deferred grammar/rule compilation, which then allows parallel compilation (up to your number of cores), for a large speed-up when loading uncached grammars.

  • invalidate_cache (bool) – Enables invalidating the engine’s cache prior to initialization, possibly for debugging.

  • expected_error_rate_threshold (float|None) – Threshold of “confidence” in the recognition, as measured in estimated error rate (between 0 and ~1 where 0 is perfect), above which the recognition is ignored. Setting this may be helpful for ignoring “bad” recognitions, possibly around 0.1 depending on personal preference.

  • alternative_dictation (callable|None) – Enables an alternative dictation model/engine and chooses the provider. Possible values are described in the Alternative Dictation section below.
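
For example, you can select the microphone device as follows (a sketch; the device name "Blue Yeti" and index 7 are illustrative values):

from dragonfly import get_engine

# In one session, print the list of devices and their indexes/names:
#   get_engine("kaldi").print_mic_list()
# Then, in your loader, select the device by (partial) name or by index:
engine = get_engine("kaldi", audio_input_device="Blue Yeti")
# engine = get_engine("kaldi", audio_input_device=7)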

User Lexicon

Kaldi uses pronunciation dictionaries to look up phonetic representations for words in grammars & language models in order to recognize them. The default model comes with a large dictionary, but obviously cannot include all possible words. There are multiple ways of handling this.

Ignoring unknown words: If you use words in your grammars that are not in the dictionary, a message similar to the following will be printed:

Word not in lexicon (will not be recognized): 'notaword'

These messages are only warnings, and the engine will continue to load your grammars and run. However, the unknown words will effectively be impossible to recognize, so the rules using them will not function as intended. To fix this, try changing the words in your grammars, either by splitting them up or by using similar words, e.g. changing “natlink” to “nat link”.

Automatically adding words to User Lexicon: Set the engine parameter auto_add_to_user_lexicon=True to enable. If an unknown word is encountered while loading a grammar, its pronunciation is predicted based on its spelling. This uses either a local library, or a free cloud service if the library is not installed.

The local library (g2p_en) can be installed by running the following on the command line:

pip install g2p_en==2.0.0

Note that the dependencies for this library can be difficult to install, in which case it is recommended to use the cloud service instead. Set the engine parameter allow_online_pronunciations=True to enable it.

Manually editing User Lexicon: You can add a word without specifying a pronunciation, and let it be predicted as above, by running at the command line:

python -m kaldi_active_grammar add_word cromulent

Or you can add a word with a specified pronunciation:

python -m kaldi_active_grammar add_word cromulent "K R OW M Y UW L AH N T"

You can also directly edit your user_lexicon.txt file, which is located in the model directory. You may add words (with pronunciation!) or modify or remove words that you have already added. The format is simple and whitespace-based:

cromulent k r A m j V l V n t
embiggen I m b I g V n

Note on Phones: Currently, adding words only accepts pronunciations using the “CMU”/“ARPABET” phone set (with or without stress), but the model and user_lexicon.txt file store pronunciations using the “X-SAMPA” phone set.

When hand-crafting pronunciations, you can look online for examples. Also, for X-SAMPA pronunciations, you can look in the model’s lexicon.txt file, which lists all of its words and their pronunciations (in X-SAMPA). Look for words with similar sounds to what you are speaking.

To empty your user lexicon, you can simply delete user_lexicon.txt, or run:

python -m kaldi_active_grammar reset_user_lexicon

Preserving Your User Lexicon: When changing models, you can (and probably should) copy your user_lexicon.txt file from your old model directory to the new one. This will let you keep your additions.

Also, if there is a user_lexicon.txt file in the current working directory of your initial loader script, its contents will be automatically added to the user_lexicon.txt in the active model when it is loaded.

User Lexicon and Dictation

New pronunciations for existing words that are already in the dictation language model will not be recognized within dictation elements until the dictation model is recompiled. Recompilation is quite time consuming (on the order of 15 minutes), but can be performed by running:

python -m kaldi_active_grammar compile_dictation_graph -m kaldi_model

Currently, entirely new words added to the user lexicon will not be recognized within dictation elements at all.

However, you can achieve a similar result for both of these weaknesses as follows: create a rule that recognizes a repetition of alternatives between normal dictation and a special rule that recognizes all of your special terminology. An example of this can be seen in this dictation grammar, and a minimal sketch follows below. This technique can also help mitigate dictation recognizing the wrong one of several similar-sounding words, by emphasizing the word you want to be recognized, possibly with the addition of a weight parameter.
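
A minimal sketch of this pattern (the rule name and terminology entries are illustrative, and this is not the grammar referenced above):

from dragonfly import (Grammar, CompoundRule, Alternative, Repetition,
                       Dictation, Choice, Text)

# Special terminology: spoken form -> written form.
terminology = Choice("term", {
    "kaldi": "kaldi",
    "open F S T": "openFST",
})

class MixedDictationRule(CompoundRule):
    spec = "dictate <chunks>"
    extras = [
        # Each chunk is either free dictation or one terminology item.
        Repetition(Alternative([Dictation(), terminology]),
                   min=1, max=8, name="chunks"),
    ]

    def _process_recognition(self, node, extras):
        # Join the recognized chunks back into one string and type it out.
        text = " ".join(str(chunk) for chunk in extras["chunks"])
        Text(text).execute()

grammar = Grammar("mixed dictation example")
grammar.add_rule(MixedDictationRule())
grammar.load()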

Experimental: You can avoid the above issues by using this engine's "user dictation" feature. This also allows you to have separate "spoken" and "written" forms of terms in dictation. Do so by adding any words you want added/modified to the user dictation list (identical spoken and written form) or dict (different spoken and written forms), and using the UserDictation element in your grammars (in place of the standard dragonfly Dictation element):

from dragonfly import get_engine, MappingRule, Function
from dragonfly.engines.backend_kaldi.dictation import UserDictation as Dictation
get_engine().add_word_list_to_user_dictation(['kaldi'])
get_engine().add_word_dict_to_user_dictation({'open F S T': 'openFST'})
class TestUserDictationRule(MappingRule):
    mapping = { "dictate <text>": Function(lambda text: print("text: %s" % text)), }
    extras = [ Dictation("text"), ]

Grammar/Rule/Element Weights

Grammars, rules, and/or elements can have a weight specified, where those with a higher weight value are more likely to be recognized, compared to their peers, for an ambiguous recognition. This can be used to adjust the probability of them being recognized.

The default weight value for everything is 1.0. The exact meaning of the weight number is somewhat inscrutable, but you can treat larger values as more likely to be recognized, and smaller values as less likely. Note: you may need to use much larger or smaller numbers than you might expect to achieve your desired results, possibly orders of magnitude (base 10).

An example:

from dragonfly import ActionBase, MappingRule

class WeightExample1Rule(MappingRule):
    mapping = { "kiss this guy": ActionBase() }
class WeightExample2Rule(MappingRule):
    mapping = { "kiss the sky": ActionBase() }
    weight = 2
class WeightExample3Rule(MappingRule):
    mapping = {
      "start listening {weight=0.01}": ActionBase(),  # Be less eager to wake up!
      "halt listening": ActionBase(),
      "go (north | nowhere {w=0.01} | south)": ActionBase(),
    }

The weight of a grammar is effectively propagated equally to its child rules, on top of their own weights. Similarly for rules propagating weights to child elements.

Retaining Audio and/or Recognition Metadata

You can optionally enable retention of the audio and metadata about the recognition, using the retain_dir engine parameter.

Note: This feature is completely optional and disabled by default!

The metadata is saved in TSV format, with fields in the following order:

  • audio_data: file name of the audio file for the recognition

  • grammar_name: name of the recognized grammar

  • rule_name: name of the recognized rule

  • text: text of the recognition

  • likelihood: the engine’s estimated confidence of the recognition (not very reliable)

  • tag: a single text tag, described below

  • has_dictation: whether the recognition contained (in part) a dictation element

Tag: You can mark the previous recognition with a single text tag to be stored in the metadata. For example, mark it as incorrect with a rule containing:

"action whoops": Function(lambda: engines.get_engine().audio_store[0].set('tag', 'misrecognition'))

Or, you can mark it specifically to be saved, even if retain_audio is False and recognitions are not normally saved, as long as retain_dir is set. This also demonstrates that .set() can be chained to tag it at the same time:

"action corrected": Function(lambda: engines.get_engine().audio_store[0].set('tag', 'corrected').set('force_save', True))

This is useful for retaining only known-correct data for later training.
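
For example, a retention setup with an approval function might look like the following sketch (the directory name and filtering criterion are illustrative, and this assumes the AudioStoreEntry constructor parameters listed below are available as attributes):

from dragonfly import get_engine

def approve_retention(entry):
    # Keep only recognitions of at least two words; very short ones tend to be noise.
    return len(entry.text.split()) >= 2

engine = get_engine("kaldi",
    retain_dir="retained_audio",
    retain_audio=True,
    retain_metadata=True,
    retain_approval_func=approve_retention,
)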

Alternative Dictation

This backend supports optionally using an alternative method of recognizing (some or all) dictation, rather than the default Kaldi model, which is always used for command recognition. You may want to do this for higher dictation accuracy (at the possible cost of higher latency, or of trade-offs that would otherwise lower command accuracy), for dictating in another language, or for some other reason. You can use one of:

  • an alternative Kaldi model

  • an alternative local speech recognition engine

  • a cloud speech recognition engine

Note: This feature is completely optional and disabled by default!

You can enable this by setting the alternative_dictation engine option. Valid options:

  • A callable object: Any external engine. The callable must accept at least one argument (for the audio data) and any keyword arguments. The audio data is passed in standard Linear16 (int) PCM encoding. The callable should return the recognized text.
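
A minimal sketch of such a callable (the returned transcript is a placeholder; a real implementation would send the audio to another recognizer and return its transcript):

from dragonfly import get_engine

def alternative_dictation_stub(audio_data, **kwargs):
    # audio_data arrives as Linear16 (16-bit int) PCM.
    return "placeholder transcript"

engine = get_engine("kaldi", alternative_dictation=alternative_dictation_stub)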

Using Alternative Dictation

To use alternative dictation, you must both pass the alternative_dictation option and use a specialized Dictation element. The standard dragonfly Dictation does not support alternative dictation. Instead, this backend provides two subclasses of it: AlternativeDictation and DefaultDictation. These two subclasses both support alternative dictation; they differ only in whether they do alternative dictation by default.

AlternativeDictation and DefaultDictation can be used as follows. Assume we are defining a variable element that is used by the code:

from dragonfly import MappingRule, Text

class TestDictationRule(MappingRule):
  mapping = { "dictate <text>": Text("%(text)s") }
  extras = [ element ]

Examples:

from dragonfly.engines.backend_kaldi.dictation import AlternativeDictation, DefaultDictation

element = AlternativeDictation("text")                    # alternative dictation
element = DefaultDictation("text")                        # no alternative dictation
element = AlternativeDictation("text", alternative=False) # no alternative dictation
element = DefaultDictation("text", alternative=True)      # alternative dictation

# all AlternativeDictation instances instantiated after this (in any file!) will default to alternative=False
AlternativeDictation.alternative_default = False
element = AlternativeDictation("text")                    # no alternative dictation
element = AlternativeDictation("text", alternative=True)  # alternative dictation

# all DefaultDictation instances instantiated after this (in any file!) will default to alternative=True
DefaultDictation.alternative_default = True
element = DefaultDictation("text")                        # alternative dictation
element = DefaultDictation("text", alternative=False)     # no alternative dictation

AlternativeDictation.alternative_default = True
DefaultDictation.alternative_default = False
# all AlternativeDictation and DefaultDictation instances instantiated after this are back to normal

If you want to replace all uses of standard Dictation in a file:

from dragonfly.engines.backend_kaldi.dictation import AlternativeDictation as Dictation
# OR
from dragonfly.engines.backend_kaldi.dictation import DefaultDictation as Dictation

Limitations & Future Work

Please let me know if anything is a significant problem for you.

Known Issues

  • Currently, entirely new words added to the user lexicon will not be recognized within dictation elements at all. You can get around this by constructing a rule that alternates between a dictation element and a mapping rule containing your new words, as demonstrated here.

  • Dragonfly Lists and DictLists function as normal. Upon updating a dragonfly list or dictionary, the rules they are part of will be recompiled & reloaded. This will add some delay, which I hope to optimize.

Dictation Formatting & Punctuation

The native dictation only provides recognitions as unformatted lowercase text without punctuation. Improving this generally is multifaceted and complex. However, the alternative dictation feature can avoid this problem by using the formatting & punctuation applied by a cloud provider.

Models: Other Languages, Other Sizes, & Training

The kaldi-active-grammar library currently only supplies a single general English model. Many standard Kaldi models (of varying quality) are available online for various languages. Although such standard Kaldi models must be first modified to work with this framework, the process is not difficult and could be automated (future work).

There are also various sizes of Kaldi model, with a trade-off between size/speed and accuracy. Generally, the smaller and faster the model, the lower the accuracy. The included model is relatively large. Let me know if you need a smaller one.

Training (personalizing) Kaldi models is possible but complicated. In addition to requiring many steps using a specialized software environment, training these models currently requires using a GPU for an extended period. This may be a case where providing a service for training is more feasible.

Engine API

class KaldiEngine(model_dir=None, tmp_dir=None, input_device_index=None, audio_input_device=None, audio_self_threaded=True, audio_auto_reconnect=True, audio_reconnect_callback=None, retain_dir=None, retain_audio=None, retain_metadata=None, retain_approval_func=None, vad_aggressiveness=3, vad_padding_start_ms=150, vad_padding_end_ms=200, vad_complex_padding_end_ms=600, auto_add_to_user_lexicon=True, allow_online_pronunciations=False, lazy_compilation=True, invalidate_cache=False, expected_error_rate_threshold=None, alternative_dictation=None, compiler_init_config=None, decoder_init_config=None)[source]

Speech recognition engine back-end for Kaldi recognizer.

activate_grammar(grammar)[source]

Activate the given grammar.

activate_rule(rule, grammar)[source]

Activate the given rule.

add_word_dict_to_user_dictation(word_dict)[source]

Make UserDictation elements able to recognize each item of given dict of strings word_dict. The key is the “spoken form” (which is recognized), and the value is the “written form” (which is returned as the text in the UserDictation element). Note: all characters in the keys will be converted to lowercase, but the values are returned as text verbatim.

add_word_list_to_user_dictation(word_list)[source]

Make UserDictation elements able to recognize each item of given list of strings word_list. Note: all characters will be converted to lowercase, and recognized as such.

connect()[source]

Connect to back-end SR engine.

deactivate_grammar(grammar)[source]

Deactivate the given grammar.

deactivate_rule(rule, grammar)[source]

Deactivate the given rule.

disconnect()[source]

Disconnect from back-end SR engine. Exits from do_recognition().

ignore_current_phrase()[source]

Marks the current phrase’s recognition to be ignored when it completes, or does nothing if there is none. Returns bool indicating whether or not there was a current phrase being heard.

property in_phrase

Whether or not the engine is currently in the middle of hearing a phrase from the user.

mimic(words)[source]

Mimic a recognition of the given words.

prepare_for_recognition()[source]

Can be called optionally before do_recognition() to speed up its starting of active recognition.

recognize_wave_file(filename, realtime=False, **kwargs)[source]

Does recognition on given wave file, treating it as a single utterance (without VAD), then returns.

recognize_wave_file_as_stream(filename, realtime=False, **kwargs)[source]

Does recognition on given wave file, treating it as a stream and processing it with VAD to break it into multiple utterances (as with normal microphone audio input), then returns.
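
For example (a sketch; the file names are illustrative, and grammars must be loaded before recognition):

from dragonfly import get_engine

engine = get_engine("kaldi", audio_input_device=False)  # no microphone needed
engine.connect()
# ... load your grammars here ...
engine.recognize_wave_file("utterance.wav")             # one utterance, no VAD
engine.recognize_wave_file_as_stream("session.wav")     # VAD splits into utterances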

property saving_adaptation_state

Whether or not the engine is currently automatically saving adaptation state between utterances.

set_exclusiveness(grammar, exclusive)[source]

Set the exclusiveness of a grammar.

speak(text)[source]

Speak the given text using text-to-speech.

start_saving_adaptation_state()[source]

Enable automatic saving of adaptation state between utterances, which may improve recognition accuracy in the short term, but is not stored between runs.

stop_saving_adaptation_state()[source]

Disables automatic saving of adaptation state between utterances, which you might want to do when you expect there to be noise and don’t want it to pollute your current adaptation state.

class UserDictation(name=None, format=True, default=None)[source]

Imitates the standard Dictation element class, using individual chunks of Dictation or the user’s added terminology.

Constructor arguments:
  • name (str, default: None) – the name of this element; can be used when interpreting complex recognition for retrieving elements by name.

  • default (object, default: None) – the default value used if this element is optional and wasn’t spoken

value(node)[source]

Determine the semantic value of this element given the recognition results stored in the node.

Argument:
  • node – a dragonfly.grammar.state.Node instance representing this element within the recognition parse tree

The default behavior of this method is to return an iterable containing the recognized words matched by this element (i.e. node.words()).

class AlternativeDictation(*args, **kwargs)[source]
Constructor arguments:
  • name (str, default: None) – the name of this element; can be used when interpreting complex recognition for retrieving elements by name.

  • default (object, default: None) – the default value used if this element is optional and wasn’t spoken

class DefaultDictation(*args, **kwargs)[source]
Constructor arguments:
  • name (str, default: None) – the name of this element; can be used when interpreting complex recognition for retrieving elements by name.

  • default (object, default: None) – the default value used if this element is optional and wasn’t spoken

Kaldi Recognition Results Class

class Recognition(engine, kaldi_rule, words, words_are_dictation_mask=None)[source]

Kaldi recognition results class.

Kaldi Audio

class AudioStoreEntry(audio_data, grammar_name, rule_name, text, likelihood, tag, has_dictation, force_save=False)[source]

set(key, value)[source]

Sets given key (as str) to value, returning the AudioStoreEntry for chaining; usable in lambda functions.