The check is in the maize

Did I get your attention?

I am working on a new keyboard for our language and this will be the first generation with predictive suggestions. We are now in late 2023, and out there AI is running wild. I see the concept of “lexical model” and get excited, but the documentation for Keyman and KAB seems to be all about word-lists only.

There is one page on Keyman help, where they are giving the example of “on my w…” and talk about options like “way”, “website” and “whole”. My brain brings up phrases like “on my watch”. I guess that Keyman would bring up “on my waffles”, purely on one-word-frequencies, which would be nice in a part of the world, where people write a lot about waffle toppings.

We have a certain text corpus (not world record level but substantial) and our language is using a lot of such “set phrases” where an “unlikely candidate” (by total word count) would come up as “the best candidate” if a tool would consider even a humble context of two previous words, like in the example of “on my way”.

This would give us some “educated list” of theoretically, say 3.375.000.000.000 possible three-word-combinations. Sounds frightening at first glance, but we would cull that list to keep only entries that show 10 or more actual occurrences in real-world-texts. Should take my notebook a few hours to prepare such a list? I believe there would be no need to ever actually generate the entire list or have it in memory, just crawl through the text-corpus and grab what is actually there.
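To make the "just crawl through the corpus" idea concrete, here is a minimal sketch in Python. It only counts three-word sequences that actually occur and then applies the threshold. The function name `count_trigrams`, the naive tokenizer, and the tiny sample corpus are my own assumptions for illustration. This is not something Keyman or Fieldworks provides.

```python
from collections import Counter
import re

def count_trigrams(lines, min_count=10):
    """Count three-word sequences that occur in the text and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        # Naive tokenizer (assumption): lowercase, split on runs of non-word characters.
        words = [w for w in re.split(r"\W+", line.lower()) if w]
        # Slide a window of three words over the line; windows never cross line breaks.
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return {trigram: n for trigram, n in counts.items() if n >= min_count}

# Tiny made-up corpus; a real run would use min_count=10 as described above.
corpus = ["On my way home.", "I am on my way.", "Not on my watch!"]
frequent = count_trigrams(corpus, min_count=2)
print(frequent)
```

Only sequences seen in the text ever enter the counter, so memory grows with the corpus and not with the theoretical number of combinations. A real orthography would probably need a more careful tokenizer than this one.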

Those who know me, have now guessed that this is a bait for discussion, and ultimately a feature request. In some other thread today, I had written that I do not like “automatic” so much, because it often gets it wrong. But in the context of a keyboard, the more we developers provide good data, the better the output or response would feel to the users. I guess this needs a plug-in for Fieldworks and maybe some student could have a go at the coding and do at least a feasibility stud? (pun indented)

For those who made it this far:
We have got one specific question for the present state of things (using Keyman Developer 16.0.144): In our word-list as exported (and cleaned-up) from Fieldworks, we got a few handfuls of multi-word-phrases with spaces. This must be because one of our team-members with know-how has tagged certain common phrases (which might make more or different “sense” semantically than the individual words used).
So how do we handle those when preparing a lexical model for Keyman? Can the present system handle multi-word-entries in the wordlist? Do we have to split or delete such entries?

At this stage, we don’t yet have support for multi-word phrases within Keyman predictive text. Even handling bi-grams - two-word sets - would add notable complexity and memory requirements.

As a result, the best way to handle those cases would be to remove such entries. Ideally, you’d add the phrase’s frequency count to that of each of the phrase’s words. (Assuming that they’re not included in the original counts.)
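A rough sketch of that folding step, in Python, might look like the following. The function name `fold_phrases` and the sample counts are made up for illustration, and the sketch assumes the phrase occurrences are not already included in the single-word counts.

```python
def fold_phrases(entries):
    """Drop multi-word entries and add their counts to each of their words.

    entries: list of (text, count) pairs from a word list.
    Returns a dict mapping single words to their adjusted counts.
    """
    totals = {}
    phrases = []
    for text, count in entries:
        if " " in text:
            phrases.append((text, count))  # handled after all single words are in
        else:
            totals[text] = totals.get(text, 0) + count
    for text, count in phrases:
        for word in text.split():
            # Words that only occur inside a phrase get a new entry.
            totals[word] = totals.get(word, 0) + count
    return totals

# Made-up counts, for illustration only.
entries = [("on", 40), ("way", 12), ("on my way", 5)]
folded = fold_phrases(entries)
print(folded)
```

If the export already counted the phrase's words individually, this would double-count them, so it is worth checking how the list was produced first.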

We would like to consider adding some form of support for multi-word phrases in the future, but we have many other things at a higher priority to address at this time.

It’s on our internal long-term roadmap, but we have not yet decided when to do this work, because we just don’t have the resources.

Thank you @joshua_horton and @Marc for this clear answer. Doing a first ever keyboard with word-suggestions I did not want to miss anything potentially amazing for the users. Like in carpentry, where we say “measure twice, before you cut once”.

I have a simple regex to find all list entries with any “space” in the word-part, and it can wipe out the entire line, including the tab and the word-count. So this is easy; I have cleared my lists already.
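For anyone who wants to do the same, here is one way such a regex could look, sketched in Python. This is not necessarily the exact pattern I use in my editor. It assumes one entry per line in the form word, tab, count. The names `PHRASE_LINE` and `drop_phrase_lines` are just for this example.

```python
import re

# Matches a whole line whose word part (the text before the first tab)
# contains at least one space, including the tab, the count and the newline.
PHRASE_LINE = re.compile(r"^[^\t\n]* [^\t\n]*\t[^\n]*\n?", re.MULTILINE)

def drop_phrase_lines(tsv_text):
    """Remove every word-list line whose word part contains a space."""
    return PHRASE_LINE.sub("", tsv_text)

# Counts for the single words are made up for illustration.
sample = "aŋɔrɔ\t40\naŋɔrɔ asǝbaka\t2\nasǝbaka\t7\n"
cleaned = drop_phrase_lines(sample)
print(cleaned)
```

A space that appears only after the tab does not trigger a match, because the pattern requires the space to come before the first tab on the line.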

Interesting news (inventing a new microwave oven by accident):

I am presently building a new keyboard for this one language. Our team is working on the clean-up of the wordlist, while I do technology and graphics.

Through a small error, I missed one step in my work-flow and forgot to clear out all “words with spaces” from the last list that came over from the team.

So when testing just now on my phone, I noticed that the local names for the months are working as multi-word units (they literally mean “first month”, “second month”, etc.):

|aŋɔrɔ akolontaja|3|
|aŋɔrɔ anantaja|2|
|aŋɔrɔ anʊntaja|2|
|aŋɔrɔ anyɩʊtaja|6|
|aŋɔrɔ ariutaja|5|
|aŋɔrɔ asǝbaka|2|

Of course, those entries will not remain in our Lexical model, because only the user can know what he wants to type. There are just too many months for this case to be helpful.

Never mind the months: I have now learnt from this thread that Keyman does not provide extra-clever suggestions, based on multi-word entries in the word-list. But I discovered today that Keyman can still handle multi-word entries as units.

This means for our language, that the keyboard can suggest certain expressions that “belong together” in 99%+ of use cases, like “unbeknownst to”. It is for our local team to identify such beauties, if they want to, while reading through the word-lists anyway.

(Any export from our Fieldworks database is just producing so much “unsuitables”, like phone-numbers, or too-special-proper-names because of the richness of texts in our corpus. This is why we clean up the first-ever-word-list by hand. Later we will only need to check “all new entries since last version”.)
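The "all new entries since last version" check could be as simple as the following Python sketch. The function name `new_entries` and the sample words are hypothetical, and it assumes both versions are available as plain lists of words.

```python
def new_entries(previous_words, current_words):
    """Return the words in the current list that the previous version lacked."""
    seen = set(previous_words)
    # Keep the order of the current export so reviewers can follow it.
    return [word for word in current_words if word not in seen]

# Made-up lists, for illustration only.
previous = ["aŋɔrɔ", "asǝbaka"]
current = ["aŋɔrɔ", "anantaja", "asǝbaka", "ariutaja"]
to_review = new_entries(previous, current)
print(to_review)
```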

Since we are not using auto-correct for the keyboard in this phase of language-development, there should be no danger, just a little “extra help” from the keyboard with a few such “stable multi-word phrases”.

And I hope there will be readers who will write one or more examples where my “unbeknownst to” will not work. English is not my first language. And I guess from life experience that there will always be exceptions. But we are talking about suggestions, not auto-correct. I was just excited that multi-word entries in word-lists do not break Keyman; they make it through into the actual app.