Word formatting for DNS v11 and higher

The dragonfly.engines.backend_natlink.dictation_format.WordParserDns11 class converts DNS v11 (and higher) recognition results into dragonfly.engines.backend_natlink.dictation_format.Word objects, which hold the written and spoken forms of each word together with its formatting information. For example:

>>> from dragonfly.engines.backend_natlink.dictation_format import WordParserDns11
>>> parser_dns11 = WordParserDns11()
>>> print(parser_dns11.parse_input("hello"))
Word('hello')
>>> print(parser_dns11.parse_input(".\\period\\full-stop"))
Word('.', 'full-stop', no_space_before, two_spaces_after, cap_next, not_after_period)

Nonexistent words can be parsed, but don’t have any formatting info:

>>> print(parser_dns11.parse_input("nonexistent-word"))
Word('nonexistent-word')
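
The input strings in the examples above appear to follow a written\category\spoken layout. The following is a minimal sketch of how such a string could be split into its fields. It is not dragonfly's implementation: split_dns11_word is a hypothetical helper, and falling back to the written form when no spoken form is given is an assumption. It also does not handle words whose written form is itself a backslash.

```python
def split_dns11_word(text):
    """Split a 'written\\category\\spoken' string into its three fields.

    Hypothetical helper for illustration only; it is not part of dragonfly.
    """
    parts = text.split("\\")
    written = parts[0]
    # A missing category is reported as an empty string.
    category = parts[1] if len(parts) > 1 else ""
    # Assumption: without an explicit spoken form, reuse the written form.
    spoken = parts[2] if len(parts) > 2 else written
    return written, category, spoken


print(split_dns11_word("hello"))
print(split_dns11_word(".\\period\\full-stop"))
print(split_dns11_word("etc.\\\\et cetera"))
```

The real parser additionally looks up formatting flags such as no_space_before or cap_next, which this sketch leaves out.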

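The following helper function allows input strings to be tested in a compact notation. It splits a string argument on whitespace, so words that contain spaces must be passed as a list: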

>>> from dragonfly.engines.backend_natlink.dictation_format import WordFormatter
>>> from six import string_types
>>> def format_dictation_dns11(input, spoken_form=False):
...     if isinstance(input, string_types):
...         input = input.split()
...     wf = WordFormatter(parser=WordParserDns11())
...     return wf.format_dictation(input, spoken_form)
>>> format_dictation_dns11("hello world")
u'hello world'

The following tests cover in-line capitalization:

>>> format_dictation_dns11("\\all-caps\\All-Caps hello world")
u'HELLO world'
>>> format_dictation_dns11("\\caps-on\\Caps-On hello world")
u'Hello World'
>>> format_dictation_dns11("\\caps-on\\Caps-On hello of the world")
u'Hello Of The World'
>>> # Note: output above should probably be u'Hello of the World'
>>> format_dictation_dns11("hello \\caps-on\\Caps-On of the world")
u'hello Of The World'
>>> # Note: output above should probably be u'hello Of the World'
>>> format_dictation_dns11("\\caps-on\\Caps-On hello world \\caps-off\\Caps-Off goodbye universe")
u'Hello World goodbye universe'
>>> format_dictation_dns11("hello \\all-caps-on\\All-Caps-On world goodbye \\all-caps-off\\All-Caps-Off universe")
u'hello WORLD GOODBYE universe'
>>> format_dictation_dns11("hello \\all-caps-on\\All-Caps-On world \\caps-on\\Caps-On goodbye universe")
u'hello WORLD Goodbye Universe'
>>> format_dictation_dns11("hello \\all-caps-on\\All-Caps-On world goodbye \\caps-off\\Caps-Off universe")
u'hello WORLD GOODBYE universe'
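
The behavior shown by these tests can be modeled as a small state machine with a one-shot transform and a persistent transform. The sketch below is a toy model under stated assumptions, not dragonfly's WordFormatter: apply_caps is a hypothetical function, commands are reduced to short tokens such as "\\caps-on", and either "off" command is assumed to clear whichever persistent mode is active.

```python
# Toy command tables; the token names are simplified for illustration.
ONE_SHOT = {"\\all-caps": str.upper}
PERSISTENT_ON = {
    "\\caps-on": lambda w: w[:1].upper() + w[1:],
    "\\all-caps-on": str.upper,
}
PERSISTENT_OFF = {"\\caps-off", "\\all-caps-off"}


def apply_caps(tokens):
    """Apply one-shot and persistent capitalization commands to words."""
    one_shot = None     # affects only the next ordinary word
    persistent = None   # stays active until replaced or switched off
    out = []
    for token in tokens:
        if token in ONE_SHOT:
            one_shot = ONE_SHOT[token]
        elif token in PERSISTENT_ON:
            persistent = PERSISTENT_ON[token]
        elif token in PERSISTENT_OFF:
            persistent = None
        else:
            # A pending one-shot command takes priority over the mode.
            transform = one_shot or persistent
            out.append(transform(token) if transform else token)
            one_shot = None
    return " ".join(out)


print(apply_caps("hello \\all-caps-on world \\caps-on goodbye universe".split()))
```

Like the tests above, this model capitalizes every word while caps-on is active, including short words such as "of" and "the".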

The following tests cover in-line spacing:

>>> format_dictation_dns11("\\no-space\\No-Space hello world")
u'hello world'
>>> format_dictation_dns11("hello \\no-space\\No-Space world")
u'helloworld'
>>> format_dictation_dns11("\\no-space-on\\No-Space-On hello world")
u'helloworld'
>>> format_dictation_dns11("\\no-space-on\\No-Space-On hello world goodbye \\no-space-off\\No-Space-Off universe")
u'helloworldgoodbye universe'
>>> format_dictation_dns11("\\no-space-on\\No-Space-On hello world \\no-space-off\\No-Space-Off goodbye universe")
u'helloworld goodbye universe'
>>> format_dictation_dns11("\\no-space-on\\No-Space-On hello world \\space-bar\\space-bar goodbye universe")
u'helloworld goodbyeuniverse'

>>> # The following are different from some DNS installations!
>>> format_dictation_dns11("hello \\no-space-on\\No-Space-On world goodbye universe")
u'helloworldgoodbyeuniverse'
>>> format_dictation_dns11("hello \\no-space-on\\No-Space-On world \\space-bar\\space-bar goodbye universe")
u'helloworld goodbyeuniverse'

The following tests cover punctuation and other symbols that influence spacing and capitalization of surrounding words:

>>> format_dictation_dns11("hello \\new-line\\New-Line world")
u'hello\nworld'
>>> format_dictation_dns11("hello \\new-paragraph\\New-Paragraph world")
u'hello\n\nWorld'
>>> format_dictation_dns11("hello .\\period\\full-stop world")
u'hello. World'
>>> format_dictation_dns11("hello ,\\comma\\comma world")
u'hello, world'
>>> format_dictation_dns11("hello .\\period\\full-stop \\new-line\\New-Line world")
u'hello.\nWorld'
>>> format_dictation_dns11("hello -\\hyphen\\hyphen world")
u'hello-world'
>>> format_dictation_dns11("hello (\\left-paren\\left-paren world")
u'hello (world'
>>> format_dictation_dns11("hello )\\right-paren\\right-paren world")
u'hello) world'
>>> format_dictation_dns11("hello \\\\hyphen\\backslash world")
u'hello\\world'

The “.” character at the end of certain words is “swallowed” by following words that start with that same character:

>>> format_dictation_dns11(["hello", "etc.\\\\et cetera", "world"])
u'hello etc. world'
>>> format_dictation_dns11(["hello", "etc.\\\\et cetera", ".\\period\\full-stop", "world"])
u'hello etc. World'
>>> format_dictation_dns11(["hello", "etc.\\\\et cetera", "...\\ellipsis\\ellipsis", "world"])
u'hello etc... world'
>>> # The following are different from some DNS installations!
>>> format_dictation_dns11("hello .\\period\\full-stop ...\\ellipsis\\ellipsis world")
u'hello... world'
>>> format_dictation_dns11("hello ...\\ellipsis\\ellipsis .\\period\\full-stop world")
u'hello... World'
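
One rule that is consistent with the four results above is to drop a single leading "." from a word whenever the text produced so far already ends in ".". The sketch below illustrates only that rule. join_swallowing_periods is a hypothetical helper, it ignores spacing and capitalization, and it is not taken from dragonfly's source.

```python
def join_swallowing_periods(text, written):
    """Append `written` to `text`, dropping one duplicated leading period."""
    if text.endswith(".") and written.startswith("."):
        # The period already present in `text` absorbs the first one.
        written = written[1:]
    return text + written


print(join_swallowing_periods("hello etc.", "."))
print(join_swallowing_periods("hello etc.", "..."))
print(join_swallowing_periods("hello.", "..."))
```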

Letters and numbers spoken in line within dictation allow efficient spelling of, for example, words that are not present in the dictionary:

>>> format_dictation_dns11(["A\\letter\\alpha", "B\\letter\\bravo",
...                         "C\\letter\\Charlie", "D\\letter\\delta",
...                         "E\\letter\\echo", "F\\letter\\foxtrot",
...                         "X\\letter\\X ray", "Z\\letter\\Zulu",
...                         "Y\\letter\\Yankee"])
u'ABCDEFXZY'
>>> format_dictation_dns11("D\\letter E\\letter F\\letter")
u'DEF'
>>> format_dictation_dns11(["J\\uppercase-letter\\capital J",
...                         "O\\letter", "H\\letter", "N\\letter"])
u'JOHN'
>>> format_dictation_dns11("J\\letter\\Juliett O\\letter\\Oscar"
...                        " H\\letter\\hotel N\\letter\\November")
u'JOHN'

Letters spoken in line within dictation are separated from adjacent words by spaces:

>>> format_dictation_dns11(["hello", "A\\letter\\alpha",
...                         "B\\letter\\bravo", "C\\letter\\Charlie",
...                         "world"])
u'hello ABC world'
>>> format_dictation_dns11(["hello", "A\\uppercase-letter\\capital A",
...                         "B\\uppercase-letter\\capital B",
...                         "C\\uppercase-letter\\capital C", "world"])
u'hello ABC world'

Words may contain spaces in their written and/or spoken forms. For example, a custom word added by the user might have a space in both its written and spoken forms:

>>> format_dictation_dns11(["custom written\\\\custom spoken"])
u'custom written'
>>> format_dictation_dns11(["custom written\\\\custom spoken",
...                         "\\all-caps\\all caps",
...                         "custom written\\\\custom spoken",
...                         "\\cap\\cap",
...                         "custom written\\\\custom spoken"])
u'custom written CUSTOM written Custom written'
>>> format_dictation_dns11(["custom written\\\\custom spoken",
...                         "\\caps-on\\caps on",
...                         "custom written\\\\custom spoken",
...                         "\\all-caps-on\\all caps on",
...                         "custom written\\\\custom spoken",
...                         "\\all-caps-off\\all caps off",
...                         "custom written\\\\custom spoken"])
u'custom written Custom Written CUSTOM WRITTEN custom written'

Certain words, such as numbers, are not formatted according to the same rules as “normal” words, i.e. those with specified written and spoken forms and formatting info. The tests for these words are currently commented out:

#    >>> format_dictation_dns11("one two three")
#    u'123'
#    >>> format_dictation_dns11("one\\number two three four")
#    u'1234'
#    >>> format_dictation_dns11("thirty four")
#    u'34'

Spoken words with multiple, context-dependent written forms, such as “point” and “.”, are formatted correctly:

>>> format_dictation_dns11("\\cap\\cap what is the point of that "
...                        "?\\question-mark\\question-mark")
u'What is the point of that?'
>>> format_dictation_dns11(".\\point\\point")
u'.'

When spoken forms are requested (second argument True), the spoken form of each word is returned instead of its written form:

>>> format_dictation_dns11(["custom written\\\\custom spoken"], True)
u'custom spoken'
>>> format_dictation_dns11(["custom written\\\\custom spoken",
...                         "\\all-caps\\all caps",
...                         "custom written\\\\custom spoken",
...                         "\\cap\\cap",
...                         "custom written\\\\custom spoken"], True)
u'custom spoken all caps custom spoken cap custom spoken'
>>> format_dictation_dns11(["custom written\\\\custom spoken",
...                         "\\caps-on\\caps on",
...                         "custom written\\\\custom spoken",
...                         "\\all-caps-on\\all caps on",
...                         "custom written\\\\custom spoken",
...                         "\\all-caps-off\\all caps off",
...                         "custom written\\\\custom spoken"], True)
u'custom spoken caps on custom spoken all caps on custom spoken all caps off custom spoken'