Lexical model not working as expected

sapradhan · March 8, 2020, 6:14am

I am trying to build a model for Nepal Bhasa. I followed the instructions, collected about 21k words (I don't have word frequencies), and built one: https://github.com/sapradhan/lexical-models/tree/newa

However I am getting unexpected predictions.

After typing two characters, I was expecting words beginning with those two characters. There are such words in the wordlist (on the right); however, I get something else.
The typed in letters are
U+1140e U+11429
The suggested words are

U+1140e U+11400 U+11443
U+1140e U+11402 U+11410 U+11438
U+1140e U+11411 U+11435 U+11411

Also, how does the model suggest words when nothing is typed? This is what I get currently:
[screenshot: cold-start]

Maybe my understanding is wrong; please let me know.

Thanks

Marc · March 8, 2020, 6:43am

The key thing you are seeing here is that the predictive engine has a hard time when there are no word frequencies: every word is as common as every other word.

When you see unexpected suggestions, you are seeing suggested corrections: the keyboard has judged that a nearby key was about as likely to be intended as the key you actually pressed. You can turn off corrections in the predictive text settings in-app, although there is no interface for doing so in the test window.

The choices you see when nothing has been typed are essentially random – they will be whatever the compiler places ‘first’ in the compiled model; without frequencies this will not necessarily be useful.

If you can obtain word frequencies – even just a small sample – then your model will become much more useful. If you have text in Newa, you can use PrimerPrep to import this text, analyse it, and export a word frequency list. You can see more about predictive models in workshop videos at https://help.keyman.com/developer/videos and read tutorials at https://help.keyman.com/developer/current-version/guides/lexical-models/
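
If PrimerPrep is not an option, a rough frequency list can also be produced with a few lines of Python. This is only a minimal sketch, not what PrimerPrep does: it assumes words are separated by whitespace, that the punctuation set below (my own guess, including the Devanagari danda marks) covers your text, and that the wordlist format is one `word<TAB>count` pair per line. The function names are made up for illustration.

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Count whitespace-separated tokens after trimming punctuation."""
    # Assumption: this punctuation set is illustrative; extend it for your script.
    punctuation = ".,;:!?\"'()[]{}\u0964\u0965"
    counts = Counter()
    for token in text.split():
        word = token.strip(punctuation)
        if word:
            counts[word] += 1
    return counts


def to_tsv(counts: Counter) -> str:
    """Render 'word<TAB>count' lines, most frequent word first."""
    return "".join(f"{word}\t{n}\n" for word, n in counts.most_common())
```

To use it, read your corpus as UTF-8 text, pass it to `count_words`, and write the result of `to_tsv` to a `.tsv` file. Even a small corpus should give the model something better than equal weights.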

sapradhan · March 11, 2020, 5:19pm

Thank you for the insight.
Can I collect some text, compute word frequencies, and append these to this word list (without frequencies), or do I need to have a frequency for every word in the list?

Marc · March 11, 2020, 7:56pm

You should be able to add frequency data to a subset of words in the list. I am unsure as to whether the compiler currently merges duplicates though.
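
Since it is unclear whether the compiler merges duplicates, one cautious approach is to merge the counts into the existing list before compiling, so that each word appears only once. The sketch below is an illustration under assumptions, not a description of the compiler: `merge_frequencies` is a hypothetical helper, and giving words without corpus data a default count of 1 is my own choice.

```python
def merge_frequencies(words, counts, default=1):
    """Return (word, count) pairs with one entry per word.

    words:   the existing wordlist (no frequencies, may contain repeats)
    counts:  mapping of word -> count computed from a corpus
    default: count given to words the corpus never saw (an assumption)
    """
    merged = {}
    for word in words:
        # setdefault keeps the first occurrence and ignores repeats.
        merged.setdefault(word, default)
    for word, n in counts.items():
        # Corpus counts override the default and add any new words.
        merged[word] = n
    # Most frequent first, ties broken by code point order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Each resulting pair can then be written out as a `word<TAB>count` line.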

Hebi · March 13, 2020, 7:56pm

Hi Marc,

Is there a value in the lexical model that a keyboard developer can set to disable the corrections you mentioned?

Marc · March 18, 2020, 9:56am

Hi @Hebi, welcome to the community!

There is no setting within the model to disable this, only on the end user device at this time.

bennylin · April 2, 2020, 10:57pm

About word duplicates: I got an answer from Joshua for my model here: [lex-jv] Initial version 1.0 by bennylin · Pull Request #74 · keymanapp/lexical-models · GitHub. He said:

For repeating the same word in the wordlist: as far as I can tell, this will probably cause issues with the frequency counts, and I’m not sure what determines which frequency wins. We use that to determine what suggestions are more likely than others, so it may affect suggestion quality.

So I’m planning to delete the duplicates, but I have yet to find the right tool to do so.

darcy · April 3, 2020, 4:35am

So I’m planning to delete the duplicates, but I have yet to find the right tool to do so.

I suggest using a spreadsheet (Excel or https://sheets.google.com) to open your wordlist as a tab-separated file. Sort by the first column and then you can delete the duplicates.
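
If the list is too long to clean up comfortably by hand, the same thing can be scripted. Here is a minimal sketch in Python, assuming each line is `word` or `word<TAB>count` with no comment lines. Summing the counts of repeated words is my own choice, since the thread leaves open which frequency should win, and `dedupe_wordlist` is a hypothetical name.

```python
def dedupe_wordlist(lines):
    """Collapse repeated words, summing their counts (an assumed policy)."""
    totals = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        word = fields[0].strip()
        has_count = len(fields) > 1 and fields[1].strip().isdigit()
        count = int(fields[1]) if has_count else 0
        # dicts preserve insertion order, so first-seen order is kept.
        totals[word] = totals.get(word, 0) + count
    # Words that never had a count are written back without one.
    return [f"{word}\t{n}" if n else word for word, n in totals.items()]
```

Pass it the lines of your wordlist file (opened as UTF-8) and write the returned lines to a new file, keeping the original as a backup.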