Did I get your attention?
I am working on a new keyboard for our language, and this will be the first generation with predictive suggestions. We are now in late 2023, and out there AI is running wild. I see the concept of “lexical model” and get excited, but the documentation for Keyman and KAB seems to be all about word lists only.
There is one page in the Keyman help that gives the example of “on my w…” and talks about options like “way”, “website” and “whole”. My brain brings up phrases like “on my watch”. I guess that Keyman would bring up “on my waffles”, purely based on single-word frequencies, which would be nice in a part of the world where people write a lot about waffle toppings.
We have a certain text corpus (not world-record level, but substantial), and our language uses a lot of such “set phrases”, where an “unlikely candidate” (by total word count) would come up as “the best candidate” if a tool considered even a humble context of the two previous words, as in the example of “on my way”.
This would give us an “educated list” drawn from, theoretically, say 3,375,000,000,000 possible three-word combinations (that is 15,000³, for a vocabulary of 15,000 words). Sounds frightening at first glance, but we would cull that list to keep only entries that show 10 or more actual occurrences in real-world texts. Should take my notebook a few hours to prepare such a list? I believe there would be no need to ever actually generate the entire list or have it in memory, just crawl through the text corpus and grab what is actually there.
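To make the idea concrete, here is a minimal Python sketch of the crawl-and-cull step. It is not anything Keyman or FieldWorks offers today. The function names (`count_trigrams`, `cull`, `suggest`), the `\w+` tokenizer and the lower-casing are my own assumptions for illustration; a real tool would need tokenization that suits our orthography, and would probably want to stop at sentence boundaries.

```python
import re
from collections import Counter, deque

MIN_COUNT = 10  # cull threshold mentioned above


def iter_words(lines):
    """Yield lower-cased tokens. The regex is a placeholder tokenizer (assumption)."""
    for line in lines:
        for word in re.findall(r"\w+", line.lower()):
            yield word


def count_trigrams(lines):
    """Count only the three-word sequences that actually occur in the text.

    The sliding window runs across line breaks, so wrapped text is handled,
    but sequences also span sentence boundaries in this sketch.
    """
    window = deque(maxlen=3)
    counts = Counter()
    for word in iter_words(lines):
        window.append(word)
        if len(window) == 3:
            counts[tuple(window)] += 1
    return counts


def cull(counts, min_count=MIN_COUNT):
    """Keep only trigrams seen at least min_count times."""
    return {tri: n for tri, n in counts.items() if n >= min_count}


def suggest(trigrams, prev2, prev1, prefix="", limit=3):
    """Rank third words by trigram count, given the two previous words."""
    candidates = [
        (n, w3)
        for (w1, w2, w3), n in trigrams.items()
        if (w1, w2) == (prev2, prev1) and w3.startswith(prefix)
    ]
    # Highest count first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [w3 for _, w3 in candidates[:limit]]
```

For a real corpus, `count_trigrams(open("corpus.txt", encoding="utf-8"))` would read the file line by line (the file name is hypothetical). Memory use then grows with the number of distinct trigrams actually seen, not with the theoretical maximum.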
Those who know me have now guessed that this is bait for discussion, and ultimately a feature request. In some other thread today, I had written that I do not like “automatic” so much, because it often gets it wrong. But in the context of a keyboard, the more we developers provide good data, the better the output or response would feel to the users. I guess this needs a plug-in for FieldWorks, and maybe some student could have a go at the coding and do at least a feasibility stud? (pun indented)
For those who made it this far:
We have one specific question about the present state of things (using Keyman Developer 16.0.144): in our word list as exported (and cleaned up) from FieldWorks, we have a few handfuls of multi-word phrases containing spaces. This is presumably because one of our team members with know-how tagged certain common phrases (which may carry more or different meaning, semantically, than the individual words used).
So how do we handle those when preparing a lexical model for Keyman? Can the present system handle multi-word entries in the word list? Do we have to split or delete such entries?