Optimal lexical model wordlist requirements

Vasyl · April 3, 2024, 7:08pm

Dear Sirs,

When the keyboard launches, my 350k lexical model takes several seconds to load before anything appears in the box of predicted words. The SIL Euro Latin English language model loads with much less delay.

Please publish, or link to, the source wordlist TSV file for SIL Euro Latin English so that I can see what an optimal wordlist should look like. I could not find it here: github.com/keymanapp/lexical-models/tree/master/release/sil
Please also describe the requirements for an optimal lexical model wordlist, and explain why it should or should not contain:
a) abbreviations like NBA, CIA, FBI, …
b) first and last names like Joe, Biden, …
c) trademarks like Pepsi, Ford, …
d) city names like London, Paris, …
e) capitalized words, or whether everything must be lowercase
f) the size of an optimal wordlist, such as that of SIL English.tsv

Knowing these requirements would let users trim their wordlists for better performance and quicker loading.

Thanks

drowe · April 3, 2024, 8:09pm

The word list for the English lexical model that the SIL Euro Latin keyboard uses is lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv, on the master branch of keymanapp/lexical-models on GitHub. Note that lexical models are separate from keyboards and are linked by the language tag (the “.en.” in the middle of “nrc.en.mtnt” in this case).
For (a) to (e), it is certainly possible to include names and abbreviations. Since the lexical model format allows for multiple input files, you may want to consider putting these in one or more separate files (for example, a file with geographic names, another with proper names, etc.)
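As a rough illustration of the separate-files idea, here is a minimal Python sketch that splits one wordlist into several files by a simple capitalization heuristic. It assumes a tab-separated file with the word in the first column and comment lines starting with "#". The function names, category names, and output file names are hypothetical, and the heuristic cannot tell a city from a surname, so treat it only as a starting point for manual sorting.

```python
from pathlib import Path


def classify(word: str) -> str:
    """Very rough heuristic (an assumption for illustration only)."""
    if len(word) > 1 and word.isupper():
        return "abbreviations"   # e.g. NBA, CIA, FBI
    if word[:1].isupper():
        return "capitalized"     # names, cities, trademarks, ...
    return "common"


def split_wordlist(src: Path, out_dir: Path) -> dict[str, int]:
    """Write one TSV per category into out_dir; return entries per category."""
    out_dir.mkdir(parents=True, exist_ok=True)
    buckets: dict[str, list[str]] = {}
    # utf-8-sig tolerates a leading byte order mark if the file has one.
    with src.open(encoding="utf-8-sig") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            word = line.split("\t", 1)[0]
            # Keep the whole line so any count column is preserved.
            buckets.setdefault(classify(word), []).append(line)
    for name, lines in buckets.items():
        (out_dir / f"{name}.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return {name: len(lines) for name, lines in buckets.items()}
```

Calling split_wordlist(Path("wordlist.tsv"), Path("split")) would write files such as split/common.tsv and split/capitalized.tsv, each of which could then be listed as its own input file for the model.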
I will let others respond to (f), the question of how the size of the file affects speed.
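Until someone answers (f), one way to experiment is to build smaller variants of the same wordlist and compare load times yourself. The sketch below keeps only the highest-count entries. It assumes each line is "word<TAB>count" with the count optional, and that comment lines start with "#". The function name top_entries and the default count of 1 for entries without a count are choices made for this example, and nothing here says what size is actually optimal.

```python
def top_entries(lines, limit):
    """Return the `limit` highest-count (word, count) pairs from wordlist lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        word = fields[0]
        try:
            # Assumption: a missing or non-numeric count is treated as 1.
            count = int(fields[1]) if len(fields) > 1 else 1
        except ValueError:
            count = 1
        entries.append((word, count))
    # sort() is stable, so entries with equal counts keep their original order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries[:limit]
```

For example, writing out top_entries(open("wordlist.tsv", encoding="utf-8"), 50000) as a new TSV gives a 50k-entry variant to time against the full list.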
