Improve suggestions/corrections based on context

mayura · March 20, 2020, 8:31pm

When suggestions are provided for non-Latin languages, the lexical model doesn’t know which letters are commonly mistyped for one another, so it can’t suggest the correct word.

Here is an example of the hunspell “REP” flag for Kannada.

In the example below, the hunspell dictionary definition specifies how to search for words: these are alternate letters that hunspell will substitute when searching the word list.

This is a very important feature for corrections and also for suggesting words.

We need to be able to pass this list in the dictionary, and the lexical model should be aware of letters that are likely to be mistyped.

SET UTF-8
FLAG num
TRY ಅ
REP 54
REP ಕ ಖ
REP ಖ ಕ
REP ಚ ಛ
REP ಛ ಚ
REP ಟ ಠ
REP ಠ ಟ
REP ಡ ಢ
REP ಢ ಡ
REP ತ ಥ
REP ಥ ತ
REP ದ ಧ
REP ಧ ದ
REP ಿ ೀ
REP ೀ ಿ
REP ು ೂ
REP ೂ ು
REP ೆ ೇ
REP ೇ ೆ
REP ೊ ೋ
REP ೋ ೊ

Marc · March 22, 2020, 9:58pm

Hi @mayura, thank you for the suggestion. I believe we have something like this on our roadmap; not sure if it will be in 14.0 or a future release.

mayura · March 23, 2020, 2:10am

github.com

keymanapp/keyman/blob/master/common/predictive-text/worker/models/trie-model.ts

Line 452: const PARTIAL_NFD_LOOKUP = {…}

Can we use this constant to achieve the above?

Marc · March 24, 2020, 4:48am

No, that object is there for normalization, not correction.

mayura · August 2, 2020, 10:23pm

Hello Marc,

Could we revisit this suggestion, please?

Marc · August 3, 2020, 9:58pm

We are doing considerable work in 14.0 on improving suggestions and corrections, but I think the functionality you are requesting will have to wait for a future release.

joshua_horton · May 11, 2021, 8:44am

You should be able to achieve your goal by using existing functionality. Granted, it could probably be better documented, or at least better highlighted.

From @keymanapp/models-types, index.d.ts:

/**
 * Indicates a mapping function used by the model to simplify lookup operations
 * within the lexicon. This is expected to result in a many-to-one mapping, transforming
 * the input text into a common, simplified ‘index’/‘key’ form shared by all
 * text forms that a person might reasonably interpret as “the same”.
 *
 * Example usages:
 * - converting any upper-case characters into lowercase.
 *   - For English, ‘CAT’ and ‘Cat’ might be keyed as ‘cat’, since users expect all
 *     three to be treated as the same word.
 * - removing accent marks that may be difficult to type on standard keyboard layouts
 *   - For French, users may wish to type “jeune” instead of “jeûne” when lazy or
 *     if accent marks cannot be easily input.
 *
 * Providing a function targetted for your language can greatly improve a user’s experience
 * using your dictionary.
 *
 * @param text The original input text.
 * @returns The ‘keyed’ form of that text.
 */
toKey?(text: USVString): USVString;

A bit later:

… When possible, it is recommended to accomplish this by defining a toKey (searchTermToKey in model source) instead.

If each of your REP entries above indicates that the two characters should be able to freely replace each other, you’ll want to define a custom method called searchTermToKey in your model’s source that matches the toKey type signature found above.

You can find our default implementations here: https://github.com/keymanapp/keyman/blob/master/developer/js/source/lexical-model-compiler/model-defaults.ts.

Append extra replace statements in your custom implementation in order to accomplish your goals. Turn lines like this:

REP ಛ ಚ

into something like this:

.replace(/ಛ/g, 'ಚ')

Pick only one entry from each pair of lines for this.

This ensures that typing either letter of a pair will look up words that use either letter, even when the word actually uses the other one of the pair.
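As a rough illustration of the steps above, here is a minimal sketch of what such a function might look like for the REP pairs in this thread. The REP_PAIRS table and its name are assumptions made for the example, not part of Keyman, and the loop is just a compact stand-in for a chain of .replace(/…/g, '…') calls. It deliberately leaves out whatever the default implementation does (see the link above); if you also normalize the text, check that the characters in the table are in the same normalization form as the normalized input.

```typescript
// Hypothetical table built from the REP lines in this thread: one entry per
// pair, mapping the second letter of the pair onto the first.
const REP_PAIRS: ReadonlyArray<[string, string]> = [
  ['ಖ', 'ಕ'],
  ['ಛ', 'ಚ'],
  ['ಠ', 'ಟ'],
  ['ಢ', 'ಡ'],
  ['ಥ', 'ತ'],
  ['ಧ', 'ದ'],
  ['\u0CC0', '\u0CBF'], // vowel sign II -> vowel sign I
  ['\u0CC2', '\u0CC1'], // vowel sign UU -> vowel sign U
  ['\u0CC7', '\u0CC6'], // vowel sign EE -> vowel sign E
  ['\u0CCB', '\u0CCA'], // vowel sign OO -> vowel sign O
];

// Sketch of a custom searchTermToKey: collapses each pair of letters into a
// single key form, so that either letter finds words spelled with the other.
function searchTermToKey(term: string): string {
  let key = term;
  for (const [from, to] of REP_PAIRS) {
    // Same effect as key.replace(/from/g, to): replaces every occurrence.
    key = key.split(from).join(to);
  }
  return key;
}
```

In a model's source, a function like this would be supplied as the searchTermToKey method mentioned above, so that both the word list entries and the typed text are keyed the same way.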

mayura · May 17, 2021, 5:07am

Thank you. I will try this.