Lexical Model Rule

Luke · April 30, 2020, 12:13am

Is there currently a way to set a rule in the lexical model so that it suggests a capitalized word after sentence-ending punctuation such as . ! or ? Conversely, my lexical model currently sometimes suggests capitalized words in the middle of a sentence, which doesn’t make sense. I believe I can fix this by working with my word analysis in FieldWorks or by deleting all the capitalized words in my tsv file, but since we are dealing with about 19,000 word forms, that would be a large task. Please let me know if there is something that can already be done.

Thanks,
Luke

Luke · April 30, 2020, 1:51pm

I now assume this is what the latest roadmap refers to where it says:

Predictive Text

Improved corrections
Common optimisations (e.g. Capital first letter, etc)

When is Keyman 14 coming out? I assume that the span of February to December means it might come out anytime in that range. I will include a few other comments on the roadmap post. Thanks.

Marc · April 30, 2020, 9:53pm

The optimisations around capital letters at start of sentence will impact both the lexical model and the keyboard, so it’s something we want to look at holistically.

In the .tsv file currently case is important. This is a tricky one – we can’t ignore case entirely; Sydney should always be capitalised for English. But the casing rules are not the same for all languages – some do not capitalise at start of sentence. So we need to do more research and design on how we support this. We hope to utilise resources such as CLDR.

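You may find a text editor more useful for cleaning up a .tsv file. For example, in a decent text editor such as Visual Studio Code, you can find all words that start with a capital letter by running a regular expression search for ^[A-Z] with Match Case turned on (otherwise the search ignores case). Note that this pattern only matches the unaccented letters A to Z at the start of a line.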
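If the list is too long to review in an editor, the same search can be scripted. This is only a minimal sketch in Python, assuming the wordlist is tab-separated with the word form in the first column and that lines starting with # are comments; the function name and the sample entries are made up for illustration. Unlike [A-Z], str.isupper() also catches capitals such as Ñ.

```python
def find_capitalised(lines):
    """Return (line_number, word) pairs for entries whose first letter is uppercase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: the word form is the first tab-separated column.
        word = line.split("\t")[0]
        if word[:1].isupper():
            hits.append((number, word))
    return hits


# Hypothetical sample data standing in for the real .tsv file.
sample = [
    "# word<TAB>count",
    "Ang\t12",
    "ang\t340",
    "Manila\t25",
    "ñu\t3",
    "Ñu\t1",
]

for number, word in find_capitalised(sample):
    print(number, word)
```

For a real file, pass the open file object instead of the sample list, for example find_capitalised(open("wordlist.tsv", encoding="utf-8")), where the file name is a placeholder.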

The current plan is to release 14.0 towards the end of this year – it’s in development right now. Per the roadmap, we are doing some significant work under the hood to make future improvements possible, and as our team is very small and time constrained, it takes a lot longer than we’d like.

Currently, the work on optimisations is scheduled for August – plans subject to change!

Luke · May 1, 2020, 2:13am

For languages where the first word of a sentence is capitalized, the first word suggested and selected in predictive text needs to come out capitalized. Otherwise suggesting the first word is useless, because you will always have to go back and fix the first letter, or type out the entire word yourself, since the program doesn’t know to capitalize it. I can’t speak for every language in the world, but every Filipino language written in a Latin script follows that rule, and I would say an overwhelming majority of languages do too, at least those written in a Latin script. I have no problem searching my tsv file for capital letters, except that the file is currently over 20,000 words long, and it is hard to write a rule for bulk changes that decides which words should really be capitalized and which appear that way in the corpus only because they start a sentence.
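One heuristic for that bulk decision, sketched in Python as an assumption and not as anything Keyman provides: treat a capitalized form as sentence-initial only when the same word also appears in lowercase in the list, and fold its count into the lowercase entry. A form that only ever appears capitalized is kept, since it is more likely a proper noun. The sketch assumes (word, count) pairs already read from the tsv file; the function name and sample data are hypothetical.

```python
from collections import Counter


def fold_sentence_initial(entries):
    """Merge capitalized forms into their lowercase twin when one exists."""
    # Sum the counts per exact word form first, in case of duplicate rows.
    counts = Counter()
    for word, count in entries:
        counts[word] += count

    folded = Counter()
    for word, count in counts.items():
        # Lowercase only the first letter, so "USA" would become "uSA" and stay unmatched.
        lower = word[:1].lower() + word[1:]
        if word != lower and lower in counts:
            # A lowercase twin exists: assume the capital came from sentence position.
            folded[lower] += count
        else:
            # No lowercase twin (or already lowercase): keep the form as it is.
            folded[word] += count
    return dict(folded)


# Hypothetical sample data.
sample_entries = [("Ang", 12), ("ang", 340), ("Manila", 25), ("Bahay", 4), ("bahay", 60)]
print(fold_sentence_initial(sample_entries))
```

Words that are both a name and a common word would still need checking by hand, so this narrows the review down but does not replace it.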

Marc · May 1, 2020, 9:09pm

Yes, this is certainly on our agenda.