Can a lexical model be too big?

Dan_Em · May 6, 2020, 12:05am

Based on a 1.7 million-word corpus (excluding words shorter than 4 characters), my lexical model for a certain language could contain 38,000 unique words (many of which are doubtlessly non-standard spellings, or transliterations of names of people, movies, bands, etc.). But should it contain so much? Is there something of a practical limit to the size of the lexical model? Would it bog down a slower device? Are words ignored if they aren’t in the top 10,000 or whatever? I did notice that the predictive algorithm did not suggest a rare word that I tried to get it to suggest, preferring to suggest higher-frequency words that did not match the letters I’d keyed so far.

Marc · May 6, 2020, 12:57am

Great questions!

At present the algorithms for selecting words are still fairly rudimentary, so there is a practical limit to the set of words that will be suggested. I lean towards including more in a model for the long term, but perhaps whittling the list down for initial releases.

The primary concern for performance is size. Because of the way the model works (a trie), the size of the wordlist should not impact runtime performance (apart from load time, which may be noticeable on very low-end devices). The English model has 24,400 words; we removed any words with fewer than 5 instances in the source data and manually curated it to remove offensive words and obviously wrong items. The more work that is done on data quality, the better the predictions offered!
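To make the trie point concrete, here is a minimal Python sketch. It is not the actual model code: `build_wordlist`, `Trie`, and `completions` are hypothetical names, and the only detail taken from the post is the "fewer than 5 instances" cutoff. It shows that walking to the node for a typed prefix takes one step per typed character, however many words are stored.

```python
from collections import Counter

MIN_COUNT = 5  # cutoff mentioned above for the English model


def build_wordlist(tokens, min_count=MIN_COUNT):
    """Count tokens and keep only words seen at least min_count times."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # > 0 only where a complete word ends


class Trie:
    """Hypothetical illustration, not the real lexical model format."""

    def __init__(self, wordlist):
        self.root = TrieNode()
        for word, count in wordlist.items():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.count = count

    def find_prefix_node(self, prefix):
        # One dictionary lookup per typed character, regardless of
        # how many words the trie holds.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def completions(self, prefix, limit=3):
        """Return up to `limit` stored words starting with prefix, most frequent first."""
        start = self.find_prefix_node(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.count > 0:
                found.append((text, node.count))
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        found.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in found[:limit]]
```

Note that this sketch still visits the whole subtree under the prefix to collect completions, so a very short prefix over a large wordlist does more work. How a real model bounds that step is not described in this thread.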

Proper names probably should stay in a wordlist – they’re going to be used in messaging, etc. Non-standard spellings … well that probably depends on the language use (how standardised is the spelling?) and your own ideology on spelling!

When you saw higher-frequency words that didn’t exactly match, it was because the algorithm calculated that a mistype of the more frequent word was more probable than the rare word. If you turn off corrections in the settings for the language (in the app; not available in the Developer test at present), you’ll see it start to suggest rarer words. The weightings between corrections and predictions may not be optimal at this point, and that’s certainly something we can look at tweaking over time.
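As a rough illustration of why a frequent word can outrank a rare exact match, here is a toy scoring sketch in Python. Everything in it is an assumption made for the example: the `rank` function, the `correction_weight` of 0.1, the one-error limit, and the word frequencies are all invented, and the real weightings are not given in this thread.

```python
def mismatches(typed, word):
    """Count positions where the typed keys differ from the start of word."""
    return sum(1 for a, b in zip(typed, word) if a != b)


def rank(wordlist, typed, correction_weight=0.1, allow_corrections=True,
         max_errors=1):
    """Rank candidate words for the keys typed so far.

    Hypothetical scoring: frequency multiplied by correction_weight once
    per mistyped key. The weight and error limit are illustrative values.
    """
    scored = []
    for word, freq in wordlist.items():
        if len(word) < len(typed):
            continue  # cannot be a completion of what was typed
        errors = mismatches(typed, word)
        if errors > max_errors:
            continue
        if errors > 0 and not allow_corrections:
            continue  # corrections switched off: exact prefixes only
        scored.append((freq * correction_weight ** errors, word))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored]


# Made-up frequencies for illustration only.
words = {"there": 5000, "thorn": 12, "apple": 900}
print(rank(words, "tho"))                           # ['there', 'thorn']
print(rank(words, "tho", allow_corrections=False))  # ['thorn']
```

With corrections on, "there" scores 5000 × 0.1 = 500 despite the mistyped key, which beats the exact match "thorn" at 12. With corrections off, only "thorn" remains.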

Dan_Em · May 6, 2020, 8:18pm

Thanks, Marc! That’s helpful to know!

joshua_horton · May 8, 2020, 6:48am

Dan_Em wrote: “Based on a 1.7 million-word corpus (excluding words shorter than 4 characters), my lexical model for a certain language could contain 38,000 unique words (many of which are doubtlessly non-standard spellings, or transliterations of names of people, movies, bands, etc.). But should it contain so much? Is there something of a practical limit to the size of the lexical model? Would it bog down a slower device?”

The approaches we’ve taken so far should scale fairly well, though we’ll probably need to better optimize the algorithms for performance and/or better memory use with larger backing wordlists. I have a few ideas for this once we can afford the necessary development time in the future, but that’s not one of our current priorities.

For comparison to something you can experience right now, the default model we’re providing for English has 24,400 words and seems to be working perfectly fine. If you’re talking about 38,000 unique words, I’d guess that it might be up to 4 times slower… but since we’ve yet to note significant lag with predictions using the English model, I’d say that even “4 times slower” than something that feels instant would likely still feel acceptable or nearly so. Also, I feel a more accurate prediction would be “about twice as slow”; the “4 times” guess is a hedge.
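For a sense of scale, the quick arithmetic below compares the two wordlist sizes. The linear and quadratic scaling rules are assumptions chosen only to bracket the guess; the thread does not say how prediction cost actually grows with wordlist size.

```python
ENGLISH_WORDS = 24_400   # default English model, per the post
PROPOSED_WORDS = 38_000  # Dan_Em's candidate wordlist

ratio = PROPOSED_WORDS / ENGLISH_WORDS

# Two assumed growth rules, purely for comparison with the guesses above.
linear_slowdown = ratio          # cost grows in step with word count
quadratic_slowdown = ratio ** 2  # cost grows with the square of word count

print(f"size ratio: {ratio:.2f}x")                    # size ratio: 1.56x
print(f"quadratic estimate: {quadratic_slowdown:.2f}x")  # quadratic estimate: 2.43x
```

Under either assumption the estimate stays below the “4 times” hedge, and the “about twice as slow” figure falls between the two.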