Thanks for the advice! That should be sufficient to get me started.
Thoughts on precomposed characters?
The presentation characters for Hebrew should typically only be supported for legacy situations dealing with (very) old data. We would not normally advise including them in a keyboard created today.
can you suggest an existing keyboard that does recursive reordering, whose code I can view?
I haven’t been able to locate any good examples, sorry.
any(b) any(a) > context(2) context(1)
any(c) any(b) > context(2) context(1)
any(c) any(a) > context(2) context(1)
if I enter ab, all is good.
if I enter ba, reorder to ab.
if I enter ca, reorder to ac.
if I enter cb, reorder to bc.
if after I enter cb and it reorders to bc, I then enter a, will it reorder the ca to ac, resulting in bac?
if not, how do I make it go back to the start and reorder ba to ab, resulting in abc?
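To make the question concrete, here is a small Python model of the three pairwise rules. It is not Keyman code. It assumes, as a simplification, that a rule only ever looks at the characters immediately before the cursor and that the group runs once per keystroke. The names PAIR_RULES and type_keys are invented for this sketch.

```python
# Toy model (an assumption, not Keyman itself): each rule swaps the last two
# characters of the context when they appear in the "wrong" order.
PAIR_RULES = {"ba": "ab", "cb": "bc", "ca": "ac"}


def type_keys(keys: str) -> str:
    """Type keys one at a time, applying at most one pairwise swap per key."""
    context = ""
    for key in keys:
        context += key
        tail = context[-2:]
        if tail in PAIR_RULES:
            context = context[:-2] + PAIR_RULES[tail]
    return context


print(type_keys("ba"))   # ab
print(type_keys("cb"))   # bc
print(type_keys("cba"))  # bac
```

Under these assumptions, typing c b a stops at bac: the last two characters, ac, are already in order, so no pairwise rule ever sees the b a to their left. Getting to abc needs a rule with a longer context.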
Yes, you’ve got the right approach. Just to illustrate: imagine we are swapping the a from the end to the beginning. This takes 3 steps from beginning to end:
b c d a
b c a d
b a c d
a b c d
The trick here is just to extend what you’ve already done, and make the group recursive with a match rule:

c convenience stores
store(bcd) outs(b) outs(c) outs(d)
store(bc) outs(b) outs(c)
store(cd) outs(c) outs(d)

group(reorder)

c pairwise swaps
any(d) any(c) > context(2) context(1)
any(cd) any(b) > context(2) context(1)
any(bcd) any(a) > context(2) context(1)

c swaps 2/3
any(bc) any(a) any(d) > context(2) context(1) context(3)

c swap 3/4
any(b) any(a) any(c) any(d) > context(2) context(1) context(3) context(4)

match > use(reorder)
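The way these rules cooperate can be traced with a rough Python model. This is only a sketch, not Keyman code. It assumes rules match at the end of the context, that longer contexts are tried first, and that the match rule re-runs the group until nothing matches. RULES, apply_once and reorder are names made up for the illustration.

```python
# Toy model of the reorder group above (assumptions: tail-anchored matching,
# longest context first, recursion until no rule matches).
A, B, C, D = "a", "b", "c", "d"
RULES = [
    # (allowed characters for each context slot, 1-based output order)
    ([B, A, C, D], [2, 1, 3, 4]),   # swap 3/4
    ([B + C, A, D], [2, 1, 3]),     # swaps 2/3
    ([D, C], [2, 1]),               # pairwise swaps
    ([C + D, B], [2, 1]),
    ([B + C + D, A], [2, 1]),
]


def apply_once(context):
    """Apply the first rule whose slots match the end of the context."""
    for slots, order in RULES:
        n = len(slots)
        tail = context[-n:]
        if len(tail) == n and all(ch in s for ch, s in zip(tail, slots)):
            return context[:-n] + "".join(tail[i - 1] for i in order)
    return None


def reorder(context):
    """Re-run the group until no rule matches, recording each step."""
    steps = [context]
    while (nxt := apply_once(context)) is not None:
        context = nxt
        steps.append(context)
    return steps


print(" -> ".join(reorder("bcda")))  # bcda -> bcad -> bacd -> abcd
```

Each pass moves the a one position to the left, which is why every extra position needs its own rule with one more trailing context slot.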
So the longest possible series has only one rule, with an additional rule needed for each subsequent level. I realised it’ll be more than one rule per character class but still hopefully not insurmountable!
- ? this can’t be reduced to:
any(b) any(a) any(cd) > context(2) context(1) context(3)
- ? I don’t need also:
any(b) any(cd) any(a) > context(3) context(1) context(2)
I was giving an example with 4 different characters to be sorted. The convenience store cd is a set of both c and d, but any(cd) still only matches a single character in the context.
You won’t need the any(b) any(cd) any(a) rule because that is covered by the any(bcd) any(a) > context(2) context(1) rule; then the recursion we implement with match ensures that the resulting b a c/d text gets resorted to a b c/d with any(bc) any(a) any(d) > context(2) context(1) context(3).
But I did miss a rule in the swaps 2/3 section:
any(c) any(b) any(d) > context(2) context(1) context(3)
I would love to add a function to the compiler to generate these rules automatically, for example with something like the following hypothetical construct:
group(reorder) defines order reorder > any(a) any(b) any(c) any(d)
That would expand out to the rules shown above, including the match rule. I suspect this is insufficient for a comprehensive solution. Would anyone want to take a stab at defining this more completely and specifying how it transforms into a set of rules in order to add it to the compiler? As a compiler-only change, it would be easy to test and would not have runtime impacts.
I think what may be confusing me is that any of the characters except any(a) can be repeated randomly in the sequence (although admittedly there are constraints that make some permutations less likely in reality). Therefore, besides a b c d, we need to account for
a c b c
a c c b
a d b b
a c b b
a d d b
a d d c
b a b b
b a b c
b a b d
b b d c
b c b d
b b b a
c c c a
d d d a
does that change the number of rules necessary?
Sadly, yes, that makes the problem significantly more difficult. A completely generalized solution would not be possible because the length of the string could not be determined. In practice, though, how common are the repeated characters – what are the constraints there?
How does the keyboard know to start processing a string? If I type letter vowel vowel vowel letter, how does it know that I intend the first 2 vowels to go with the preceding letter and the third to go with the following letter?
I think we need to say that everything else can be entered randomly, but the letter must precede everything else. Then when the next letter is typed, we can process the entire string. Would that work?
I’ll need to get more details on the constraints. Rule of thumb is that 2 marks of the same class occur not infrequently on a letter, but my guess is that 3 or more are rare.
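The idea of holding the marks until the next letter arrives and then processing the whole run amounts to a stable sort by mark class, which handles repeated marks of any count. A minimal Python sketch, not Keyman code: lowercase a to d stand for the mark classes used in the examples above, any other character is treated as a base letter, and MARK_ORDER and reorder_marks are hypothetical names.

```python
# Hypothetical class ranks: lower rank sorts earlier after the base letter.
MARK_ORDER = {"a": 0, "b": 1, "c": 2, "d": 3}


def reorder_marks(keys: str) -> str:
    """Sort each run of marks when the next base letter (or the end) arrives."""
    out = []
    pending = []  # marks typed since the last base letter
    for key in keys:
        if key in MARK_ORDER:
            pending.append(key)
        else:
            # A base letter closes the previous cluster: flush its marks.
            out.extend(sorted(pending, key=MARK_ORDER.__getitem__))
            pending = []
            out.append(key)
    # End of input stands in for whatever event would close the last cluster.
    out.extend(sorted(pending, key=MARK_ORDER.__getitem__))
    return "".join(out)


print(reorder_marks("XbbdcY"))    # XbbcdY
print(reorder_marks("XdddaYcb"))  # XadddYbc
```

Python's sorted is stable, so repeated marks of the same class keep the order in which they were typed. A rule-based keyboard cannot loop like this, which is why the rule count grows with the longest run it has to support.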
How it works depends on the keyboard developer.
For some scripts it is safe to assume that all diacritics combine with the base character that precedes them, although for some language-specific keyboards I deliberately do not do that. For example, a sequence like ää can be typed a + diaeresis + a + diaeresis. But I also allow a + a + diaeresis to generate ää as a typing shortcut, since the sequence aä is meaningless and cannot exist in the orthography the keyboard layout was designed for.
Alternatively, some scripts have combining characters that appear visually before the base character but are stored after the base character. For such scripts there is a distinction between visual and logical keyboards, where a certain class of combining characters is typed before the base (in visually ordered input) but is reordered by keyboard rules to an appropriate position after the base character.
So as I said, it’s really up to the keyboard developer. But care needs to be taken when the complexity of the rules increases.
Do I understand correctly that the keyboard must be either visual or logical? I.e., on one the user must enter the combining character before the base character and in the other the user must enter the combining character after the base character?
Since I’m aiming for a comprehensive keyboard to allow entry of all Unicode Hebrew characters, I’d like to make it possible to enter the precomposed characters with the keyboard, but control that with an option, so that entering letter + dagesh is output as letter + dagesh, but if they enter the precomposed (letter+dagesh),
- if opt1, precomposed(letter+dagesh) > precomposed(letter+dagesh)
- else, precomposed(letter+dagesh) > letter + dagesh
would that work?
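The option logic described here can be sketched in Python. This is an illustration only: opt1 is modelled as a boolean parameter, and Unicode NFD normalization stands in for the keyboard's own decomposition rules, which is my substitution rather than anything proposed in this thread. The function name emit is hypothetical.

```python
import unicodedata


def emit(ch: str, keep_precomposed: bool) -> str:
    """Return the output for one entered character under the option."""
    if keep_precomposed:
        return ch  # opt1 set: pass the precomposed form through unchanged
    # opt1 not set: output the canonical decomposition (letter + mark)
    return unicodedata.normalize("NFD", ch)


bet_dagesh = "\uFB31"  # HEBREW LETTER BET WITH DAGESH
print([hex(ord(c)) for c in emit(bet_dagesh, keep_precomposed=True)])   # ['0xfb31']
print([hex(ord(c)) for c in emit(bet_dagesh, keep_precomposed=False)])  # ['0x5d1', '0x5bc']
```

A plain letter followed by a separately typed dagesh is unaffected by either branch, since it is already decomposed.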
You really do need to choose a paradigm for a keyboard – either logical, or visual. Otherwise, there will inevitably be irreconcilable ambiguity. For example, if I type base combining base, should the combining mark be attached to the first or second base character?
Selecting a paradigm like this also helps your keyboard users to know how to get started. There are plenty of rules they already know for writing their language and the idea of having structure won’t be a problem. The problem comes when there are somewhat arbitrary and complex ordering requirements based on technical limitations, which is the challenge you’ve already been trying to solve with the diacritic order :).
I would urge you to consider leaving the precomposed letters out of a generalised keyboard. They should only be used in very exceptional circumstances and increasingly less often. If you really do want to support the precomposed forms, then I think that putting them in a separate keyboard will create less confusion for your users, and I would direct users away from it with fairly strong phrasing. For example, “unless you know you need these precomposed letters, you should be using instead my generalised Hebrew keyboard <here>” or something of the like!
This should also simplify the logic of the keyboard because the models for character encoding are so different.
I like that! thanks
exactly. I want to always attach combining characters to the preceding base character. It’s only the various combining characters that should be reordered as necessary.
A keyboard developer working with combining characters has two options:
- Reorder combining characters as necessary. This creates much more complex rules in the keyboard, but generally makes input easier for the user.
- Constrain or force the user to type in the order required by Unicode. Generally this approach requires more detailed knowledge from the user and has a much higher learning curve.
Personally I prefer the first approach, although more and more I am leaning towards developing orthographically responsive keyboard layouts.
on a string like “dcdc”, does a rule like this handle each pair in the string (the 1st 2 letters are processed to result in cddc, then the 2nd pair which results in cddc, then the 3rd pair which results in cdcd)? (and then after match() it does it again recursively)
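One way to reason about this is a small Python model of a group containing only any(d) any(c) > context(2) context(1) plus the match rule. It is a sketch, not Keyman code, and it assumes a rule only matches the characters immediately before the cursor, with the group run after every keystroke. The helper names run_group and type_keys are invented here.

```python
def run_group(context: str) -> str:
    """Model: swap a trailing 'dc' to 'cd', re-running while a rule matches."""
    while context.endswith("dc"):
        context = context[:-2] + "cd"
    return context


def type_keys(keys: str) -> list:
    """Type keys one at a time and record the context after each key."""
    context = ""
    trace = []
    for key in keys:
        context = run_group(context + key)
        trace.append(context)
    return trace


print(type_keys("dcdc"))  # ['d', 'cd', 'cdd', 'cdcd']
```

Under that assumption the rule never scans pair by pair through the string. It only sees the last two characters at each keystroke, so the d c left in the middle of cdcd is not revisited without a rule that has a longer context.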