Standard roman letters in search term, multiple variants in key

I’ve been experimenting with a Lexical Model for Tłı̨chǫ and other Dene languages. I would like a standard “L/l” keystroke in the search term to call up both the standard “L/l” (U+004C and U+006C) and the L with stroke “Ł/ł” (U+0141 and U+0142). I suspect that searchTermToKey won’t work for this unless the key can be expanded by some argument.

Given that the dotless form of the i (U+0131) is used in Tłı̨chǫ, I was able to search the wordlist correctly by simply replacing the typed standard “i” with the dotless “ı” in the search term and returning it. However, you can’t do this with multiple forms, so another strategy needs to be used.
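
For reference, that earlier single-variant approach looked something like this (a minimal sketch of what I described, not the final code):

    searchTermToKey: function (term) {
      // Map every typed standard "i" to the dotless "ı" (U+0131) so the
      // search term matches wordlist entries spelled with the dotless form.
      return term.replace(/i/g, 'ı');
    }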

Any help would be greatly appreciated.

EUREKA: I may actually have misunderstood this function: it applies to both the search term and the wordlist. Below, I’ve adapted the standard case-insensitive decomposition to include ı and ł. This seems to work.

    searchTermToKey: function (term) {
      const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;
      // Lowercase character by character, as in the standard template.
      let lowercasedTerm = Array.from(term).map(c => c.toLowerCase()).join('');
      // Decompose precomposed letters so their diacritics can be stripped.
      let normalizedTerm = lowercasedTerm.normalize('NFKD');
      let termWithoutDiacritics = normalizedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');
      // Fold the letters that survive decomposition: ı (U+0131) and ł (U+0142).
      // Ł has already been lowercased to ł by this point, so no separate rule is needed.
      return termWithoutDiacritics.replace(/ı/g, 'i').replace(/ł/g, 'l');
    }
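
As a quick sanity check, here is the same logic as a standalone function (a sketch for testing in Node, not part of the model file):

    function toKey(term: string): string {
      const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;
      return Array.from(term).map(c => c.toLowerCase()).join('')
        .normalize('NFKD')
        .replace(COMBINING_DIACRITICAL_MARKS, '')
        .replace(/ı/g, 'i')
        .replace(/ł/g, 'l');
    }

    console.log(toKey('Tłı̨chǫ')); // "tlicho": ł, ı and both ogoneks all fold away
    console.log(toKey('Łe'));     // "le" (Ł lowercases to ł, then folds to l)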

Given the use of apostrophes (the modifier letter apostrophe, U+02BC) as glottals, I also have to add that exception to the wordBreaker parameters. I think this is the final version:

    const source: LexicalModelSource = {
      format: 'trie-1.0',
      wordBreaker: {
        use: 'default',  // we want to use the default word breaker, BUT!
        // CUSTOMIZE THIS:
        // Join words at hyphens and apostrophes; the modifier letter
        // apostrophe (U+02BC) is transformed into its standard form below.
        joinWordsAt: ['-', '\''],
      },
      sources: ['wordlist.tsv'],
      searchTermToKey: function (term) {
        const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;
        let lowercasedTerm = Array.from(term).map(c => c.toLowerCase()).join('');
        let normalizedTerm = lowercasedTerm.normalize('NFKD');
        let termWithoutDiacritics = normalizedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');
        // Fold ı and ł as before, and map the modifier letter apostrophe
        // (U+02BC) to the ASCII apostrophe listed in joinWordsAt above.
        return termWithoutDiacritics
          .replace(/ı/g, 'i')
          .replace(/ł/g, 'l')
          .replace(/ʼ/g, '\'');
      },
    };
    export default source;
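
For example (a sketch with a made-up form, purely to show the character handling), a glottal typed with U+02BC keys to the same ASCII apostrophe that joinWordsAt treats as word-internal:

    // "kʼa" is a hypothetical form, illustrative only:
    console.log('kʼa'.replace(/ʼ/g, '\'')); // "k'a", matching the '\'' in joinWordsAt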

Yep. We apply it to the wordlist whenever your model is compiled and to any input received when using your model. Applying it both ways provides a nice, efficient ‘diacritic insensitivity’ and allows us to optimize related parts of the predictive-text engine.
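
In case it helps to see the round trip concretely, reusing the toKey sketch from earlier in the thread (a sketch of the idea, not the engine’s actual internals):

    console.log(toKey('Tłı̨chǫ')); // "tlicho": the key stored for the wordlist entry
    console.log(toKey('tlicho')); // "tlicho": the key computed for the typed input
    // Equal keys, so typing plain letters retrieves the full Tłı̨chǫ form.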
