I think this can be solved entirely within the lexical model.
The simplest way to solve this is to add a function that ignores the question mark when it is found in the text, using the searchTermToKey function. searchTermToKey allows us to massage the text to make it easier to find matching strings in the lexical model's wordlist. This function works both ways: it applies to the input text, but also to the words in the wordlist. It does not modify the output or displayed text; it just simplifies the search key.
With this approach, you would type the question mark as normal – immediately after the «ո» letter – and the lexical model will simply ignore it.
The default lexical model searchTermToKey function ignores diacritic marks. So we can start with that default function, and modify it to meet our needs.
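For reference, the default behaviour is roughly equivalent to the standalone sketch below. This is an approximation for illustration; the function name is mine, and the exact default shipped with the model compiler may differ, so consult the Keyman documentation for the authoritative version.

```typescript
// A sketch of (roughly) the default key function: lowercase the term,
// decompose it (NFD), then strip combining diacritical marks.
// NOTE: illustrative approximation, not the exact shipped default.
function defaultSearchTermToKey(term: string): string {
  const lowercased = term.toLowerCase();
  const decomposed = lowercased.normalize('NFD');
  // U+0300–U+036F is the Combining Diacritical Marks block,
  // so e.g. "Café" becomes "cafe"
  return decomposed.replace(/[\u0300-\u036f]/g, '');
}
```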
I’ve used a Wikipedia article to select marks which we will ignore, just for the purposes of example. You will of course know better which ones are important!
The following Armenian punctuation marks are placed above and slightly to the right of the vowel whose tone is modified, in order to reflect intonation:
- [ ՜ ] The yerkaratsman nshan, the Armenian exclamation mark.
- [ ՛ ] The shesht, the Armenian emphasis mark.
- [ ՞ ] The hartsakan nshan, the Armenian question mark.
Armenian punctuation marks used inside a word:
- [ ֊ ] The yent’amna is used as the ordinary Armenian hyphen.
- [ ՟ ] The pativ was used as an Armenian abbreviation mark, and was placed on top of an abbreviated word to indicate that it was abbreviated. It is now obsolete.
- [ ՚ ] The apat’arts is used as a spacing apostrophe (which looks either like a vertical stick or wedge pointing down, or as an elevated 9-shaped comma, or as a small superscript left-to-right closing parenthesis or half ring), only in Western Armenian, to indicate elision of a vowel, usually /ə/.
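The raw marks can be hard to tell apart in a source file, so you may prefer to write the character class used in the code below with explicit Unicode escapes instead. The code points here correspond to the six marks discussed above:

```typescript
// The six Armenian marks, written with explicit code points:
//   U+055C ՜ (exclamation), U+055B ՛ (emphasis), U+055E ՞ (question),
//   U+058A ֊ (hyphen), U+055F ՟ (abbreviation), U+055A ՚ (apostrophe)
const ARMENIAN_MARKS = /[\u055C\u055B\u055E\u058A\u055F\u055A]/g;

// Stripping these marks from a word removes the intonation marking
// while leaving the letters intact:
const stripped = 'ինչո՞ւ'.replace(ARMENIAN_MARKS, ''); // 'ինչու'
```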
The function I have made here is only lightly modified from the default. You may find that you can skip the normalization step – I have not assessed the Armenian Unicode block for its normalization rules. This function should be placed into your .model.ts file; see the documentation for more detail.
```typescript
searchTermToKey: function (term: string): string {
  // Use this pattern to remove various Armenian marks.
  // See: https://www.compart.com/en/unicode/block/U+0530
  const ARMENIAN_MARKS = /[՜՛՞֊՟՚]/g;

  // Lowercase each letter in the string INDIVIDUALLY.
  // Why individually? Some languages have context-sensitive lowercasing
  // rules (e.g., Greek), which we would like to avoid.
  // So we convert the string into an array of code points (Array.from(term)),
  // convert each individual code point to lowercase (.map(c => c.toLowerCase())),
  // and join the pieces back together again (.join(''))
  let lowercasedTerm = Array.from(term).map(c => c.toLowerCase()).join('');

  // Once it's lowercased, we convert it to NFKD normalization form.
  // This does many things, such as:
  //
  //  - separating characters from their accents/diacritics
  //    e.g., "ï" -> "i" + "◌̈"
  //  - converting lookalike characters to a canonical ("regular") form
  //    e.g., ";" -> ";" (yes, those are two completely different characters!)
  //  - converting "compatible" characters to their canonical ("regular") form
  //    e.g., "𝔥𝔢𝔩𝔩𝔬" -> "hello"
  let normalizedTerm = lowercasedTerm.normalize('NFKD');

  // Now, using the pattern defined above, replace each mark with the
  // empty string. This effectively removes all the marks!
  //
  // e.g., "ինչո՞ւ" -> "ինչու"
  let termWithoutMarks = normalizedTerm.replace(ARMENIAN_MARKS, '');

  // The resultant key is lowercased, and has no additional marks.
  return termWithoutMarks.normalize('NFC');
},
```
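To check the "works both ways" behaviour described earlier, you can exercise the same logic as a plain function outside the model. The helper name toKey is mine; the body is the same as the function above. Both the typed form (with the question mark) and the wordlist form (without it) should reduce to the identical key:

```typescript
const ARMENIAN_MARKS = /[՜՛՞֊՟՚]/g;

// Same logic as the searchTermToKey function above, as a standalone
// function so it is easy to test in isolation.
function toKey(term: string): string {
  const lowercased = Array.from(term).map(c => c.toLowerCase()).join('');
  return lowercased.normalize('NFKD').replace(ARMENIAN_MARKS, '').normalize('NFC');
}

// The typed text (with ՞) and the wordlist entry (without it)
// produce the same search key, so the lookup matches:
toKey('ինչո՞ւ'); // -> 'ինչու'
toKey('ինչու');  // -> 'ինչու'
```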
As you read this function and the corresponding documentation, you may well think of other ways in which you want to play with the search keys!
You may also find the discussions on word breakers and punctuation in the same Advanced Lexical Model Topics area helpful.