Order of typing Hebrew characters

Many years ago Hebrew keyboards required that characters be keyed in a very specific order (e.g., holem before low cantillation marks before pre-positive cantillation marks before high cantillation marks [yes, I know there are rarely 2 cantillation marks on the same letter]).
Is that still the case? Or is it okay to just order as letter - dagesh/raphe - shin/sin dot - vowels in any order - cantillation marks in any order?
TIA from the newbie.

I suspect it will depend on the keyboard. For example, this keyboard clearly indicates an order that is required:
https://help.keyman.com/keyboard/sil_hebrew

This keyboard seems to be smarter and indicates you can use any order:
https://help.keyman.com/keyboard/galaxie_hebrew_mnemonic

Thanks, Lorna. I’m wondering if the SIL Hebrew keyboard says that because it was created for a much older Unicode standard. Is there any way to find out the creation date for keyboards?

Yes, each keyboard has its own history file which you can see on GitHub.

If the keyboard mentioned here is this one (https://keyman.com/keyboards/sil_hebrew), its history is as follows:

Hebrew (SIL) Keyboard Change History

1.7.1 (14 Sept 2018)

  • Rename keyboard
  • Add support for linux as target

1.7 (17 May 2018)

  • Updated language codes adding script subtag for Windows 10

1.6 (23 Apr 2018)

  • Migrated to GitHub
  • Removed patch rules for FieldWorks
  • added macOS as a target

1.5 (23 Feb 2007)

  • All PUA characters above have been replaced by the following, according to an agreement with other font designers and because PUA characters are not really usable in commercial software at this time.
    • d78 Hebrew reversed nun with dot - use Nun Hafukha(05C6)CGJ 0307
    • d170 Hebrew Mark Lower Dot - now defined as 05C5
    • Right Meteg - use 05BD CGJ before the low vowel or other low marks
    • Left Meteg with Hataf - use Hataf ZWNJ 05BD
  • Note that U+05BD Hebrew Point Meteg is coded in the Ezra SIL fonts to always fall to the left of the vowel, except for with hatafs, it falls medially. These are the most common positions of meteg in BHS.

1.4 (22 Sep 2003)

  • Added: U+20AA New Sheqel, U+20AC Euro, U+0024 Dollar, U+00A0 no-break space
  • PUA: PUA F300 d78 Hebrew Reversed Nun, PUA F301 d170 Hebrew Mark Lower Dot, PUA F302 Hebrew Accent Right Meteg (convert d149 to this when it occurs before a vowel), PUA F303 Hebrew Accent Left Meteg (for use to left of hatafs only)

1.3 (25 Jul 2003)

1.2 (30 Sep 2002)

1.1 (7 Aug 2002)

1.0 (19 Dec 2001)

  • Unicode version adapted from legacy SIL Ezra keyboard layout.

Thank you, Makara.
Very interesting to see that there was such a long hiatus from 2007 to 2018.
The current documentation requires Consonant - Dagesh - Vowel - Low Marks - Pre-positive Marks - High Marks - Post-Positive Marks. It still seems to me that Unicode should be able to manage with less rigidity, but I’m no expert.

The issue with character order is not whether Unicode can manage it (Unicode is just a standard, not an implementation). The issue is that every application that supports Unicode at all would have to understand every part of the Unicode specification to cope with inconsistent character ordering; otherwise searching, sorting, and even display can be inconsistent.

The most reliable solution is to enforce a character order in the input method, which is what we encourage for modern keyboards: dynamically reorder marks as the user types them so that they are always stored in the correct order. The SIL Hebrew keyboard is quite old and this improvement has not been implemented in that particular keyboard, but it would be possible to update it to do so.
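
As a rough illustration of what enforcing an order means, here is a minimal Python sketch (not Keyman code) that stable-sorts the marks following each base letter. The RANK table and the name reorder_marks are assumptions made up for this example. The ranks follow part of the order quoted from the SIL Hebrew documentation earlier in this thread (Dagesh - Vowel - Low Marks - High Marks), and only a handful of characters are listed.

```python
# Hypothetical rank table for a few Hebrew marks (illustration only, far from complete).
RANK = {
    "\u05BC": 1,  # dagesh
    "\u05B4": 2,  # hiriq (vowel)
    "\u05B8": 2,  # qamats (vowel)
    "\u0591": 3,  # etnahta (low cantillation mark)
    "\u0594": 5,  # zaqef qatan (high cantillation mark)
}

def reorder_marks(text):
    """Sort each run of known marks by rank; every other character stays where it is."""
    out, marks = [], []
    for ch in text:
        if ch in RANK:
            marks.append(ch)
        else:
            out.extend(sorted(marks, key=RANK.get))  # sorted() is stable
            marks = []
            out.append(ch)
    out.extend(sorted(marks, key=RANK.get))
    return "".join(out)

typed = "\u05D1\u0594\u05B8\u05BC"  # bet, zaqef qatan, qamats, dagesh
fixed = reorder_marks(typed)          # bet, dagesh, qamats, zaqef qatan
```

Because the sort is stable, two marks of the same class keep the order in which they were typed.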

Ah. So if a user of my Hebrew keyboard enters vowel - letter the keyboard should input letter - vowel? And I imagine that’s what that store() and group() are for.

That sounds doable, but it’s going to get ugly re-ordering vowel - high cantillation - letter - masora circle - low cantillation - shin dot - dagesh - pre-positive mark into letter - dagesh - shin dot - vowel - low cantillation - pre-positive mark - high cantillation - masora circle.

Do you have recommendations about precomposed alphabetic presentation characters? I had planned to include them as an option, both to be comprehensive and to take advantage of new fonts that are including them.

Yes, use store and group to do this. I recommend doing this work in a secondary context-only group. For example,


group(main) using keys
...
match > use(reorder)

group(reorder)
any(high-cantillation) any(vowel) > context(2) context(1) 
...

You can even use recursive group processing to iteratively improve the order of the characters one swap at a time to reduce the rule set. It could be an interesting solution!

Note that you can simplify the logic somewhat by assuming that only the most recently entered character is out of order – trusting the keyboard to have reordered previous combinations at the time they were entered. This means that with judicious use of stores you may be able to reduce this to one rule per character class.
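
That simplification amounts to one insertion step per keystroke. A minimal Python sketch of the idea (not Keyman code), assuming a hypothetical base letter L and four mark classes a < b < c < d; the names ORDER and type_mark are invented for this example:

```python
ORDER = {"a": 0, "b": 1, "c": 2, "d": 3}  # hypothetical mark classes, lowest rank first

def type_mark(cluster, mark):
    """Insert a newly typed mark into a cluster whose existing marks are already sorted.

    cluster[0] is the base letter; cluster[1:] are marks in rank order.
    """
    i = len(cluster)
    # Walk left past every mark that ranks higher than the new one.
    while i > 1 and ORDER[cluster[i - 1]] > ORDER[mark]:
        i -= 1
    return cluster[:i] + mark + cluster[i:]

cluster = "L"
for key in "dbca":  # marks typed in the wrong order
    cluster = type_mark(cluster, key)
# cluster is now "Labcd"
```

Only the new mark ever moves, which is why the earlier marks can be trusted to be in order already.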

Thanks for the advice! That should be sufficient to get me started.
Thoughts on precomposed characters?

The presentation characters for Hebrew should typically only be supported for legacy situations dealing with (very) old data. We would not normally advise including them in a keyboard created today.

Can you suggest an existing keyboard that does recursive reordering, whose code I can view?

I haven’t been able to locate any good examples, sorry.

any(b) any(a) > context(2) context(1)
any(c) any(b) > context(2) context(1)
any(c) any(a) > context(2) context(1)

If I enter ab, all is good.
If I enter ba, reorder to ab.
If I enter ca, reorder to ac.
If I enter cb, reorder to bc.
If, after I enter cb and it reorders to bc, I then enter a, will it reorder the ca to ac, resulting in bac?
If not, how do I make it go back to the start and reorder ba to ab, resulting in abc?

Yes, you’ve got the right approach. To illustrate, imagine we are moving the a from the end to the beginning. This takes 3 steps:

b c d a
b c a d
b a c d
a b c d

The trick here is just to extend what you’ve already done, and make the group recursive with a match rule:

c convenience stores
store(bcd) outs(b) outs(c) outs(d)
store(bc) outs(b) outs(c)
store(cd) outs(c) outs(d)

group(reorder)
c pairwise swaps
any(d) any(c) > context(2) context(1)
any(cd) any(b) > context(2) context(1)
any(bcd) any(a) > context(2) context(1)

c swaps 2/3
any(bc) any(a) any(d) > context(2) context(1) context(3)

c swap 3/4
any(b) any(a) any(c) any(d) > context(2) context(1) context(3) context(4)

match > use(reorder)

So the longest possible series has only one rule, with an additional rule needed for each subsequent level. I realised it’ll be more than one rule per character class but still hopefully not insurmountable!
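
The recursion can be checked away from Keyman with a small Python model. It is only a sketch: RULES transcribes the rules above as (context pattern, output order) pairs, a rule is assumed to match only at the end of the context, and longer contexts are assumed to be tried first. The names RULES, apply_once and reorder are made up for this example.

```python
# Each rule: (allowed characters per context position, output order as 1-based context indexes).
RULES = [
    (["b", "a", "c", "d"], [2, 1, 3, 4]),  # swap 3/4
    (["bc", "a", "d"], [2, 1, 3]),         # swaps 2/3
    (["d", "c"], [2, 1]),                  # pairwise swaps
    (["cd", "b"], [2, 1]),
    (["bcd", "a"], [2, 1]),
]

def apply_once(text):
    """Apply the first rule whose pattern matches the end of text; return None if none match."""
    for pattern, order in RULES:
        n = len(pattern)
        tail = text[-n:]
        if len(text) >= n and all(ch in allowed for ch, allowed in zip(tail, pattern)):
            return text[:-n] + "".join(tail[i - 1] for i in order)
    return None

def reorder(text):
    """Model of 'match > use(reorder)': keep going until no rule matches."""
    steps = [text]
    while (nxt := apply_once(steps[-1])) is not None:
        steps.append(nxt)
    return steps

steps = reorder("bcda")
# steps == ["bcda", "bcad", "bacd", "abcd"]
```

Each rule only swaps a pair that is out of order, so every pass removes one inversion and the loop has to stop.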

  1. Can’t this be reduced to:
    any(b) any(a) any(cd) > context(2) context(1) context(3)
  2. Don’t I also need:
    any(b) any(cd) any(a) > context(3) context(1) context(2)

I was giving an example with 4 different characters to be sorted. The convenience store cd is a set of both c and d – but any(cd) still only matches a single character in the context.

You won’t need the any(b) any(cd) any(a) rule because that case is covered by the rule any(bcd) any(a) > context(2) context(1); the recursion we implement with match then ensures that the resulting b a c/d text gets re-sorted to a b c/d with any(bc) any(a) any(d) > context(2) context(1) context(3).

But I did miss a rule in the swaps 2/3 section:

any(c) any(b) any(d) > context(2) context(1) context(3)

I would love to add a function to the compiler to generate these rules automatically, for example with something like the following hypothetical construct:

group(reorder) defines order

reorder > any(a) any(b) any(c) any(d)

That would expand out to the rules shown above, including the match rule. I suspect this is insufficient for a comprehensive solution. Would anyone want to take a stab at defining this more completely and specifying how it transforms into a set of rules in order to add it to the compiler? As a compiler-only change, it would be easy to test and would not have runtime impacts.
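
As one possible starting point, here is a rough Python sketch of such an expansion. It is not part of Keyman: expand_reorder and its store-naming scheme (concatenating class names, as in the convenience stores earlier in this thread) are invented for this example. It only covers the case discussed so far, where each class occurs once and only the newest character is out of place, so it is not a comprehensive solution either.

```python
def expand_reorder(classes):
    """Expand an ordered list of class names into reorder rules (as text)."""
    n = len(classes)
    rules = []
    for k in range(n - 1):              # k = already-sorted characters right of the swapped pair
        tail = classes[n - k:]          # those are always the k highest classes
        for y in range(n - k - 1):      # classes[y] is the character being moved left
            higher = "".join(classes[y + 1 : n - k])  # store of classes it may need to pass
            lhs = [f"any({higher})", f"any({classes[y]})"] + [f"any({t})" for t in tail]
            rhs = ["context(2)", "context(1)"] + [f"context({i})" for i in range(3, 3 + k)]
            rules.append(" ".join(lhs) + " > " + " ".join(rhs))
    rules.append("match > use(reorder)")
    return rules

for rule in expand_reorder(["a", "b", "c", "d"]):
    print(rule)
```

For a, b, c, d this prints the three pairwise swaps, the two swaps 2/3 rules (including the one that was missed), the swap 3/4 rule, and the match rule.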

I think what may be confusing me is that any of the characters except any(a) can be repeated randomly in the sequence (although admittedly there are constraints that make some permutations less likely in reality). Therefore, besides a b c d, we need to account for
a c b c
a c c b
a d b b
a c b b
a d d b
a d d c

b a b b
b a b c
b a b d
b b d c
b c b d

b b b a
c c c a
d d d a
etc.
Does that change the number of rules necessary?

Sadly, yes, that makes the problem significantly more difficult. A completely generalized solution would not be possible because the length of the string could not be determined. In practice, though, how common are the repeated characters – what are the constraints there?

How does the keyboard know to start processing a string? If I type letter vowel vowel vowel letter, how does it know that I intend the first 2 vowels to go with the preceding letter and the third to go with the following letter?
I think we need to say that everything else can be entered randomly, but the letter must precede everything else. Then when the next letter is typed, we can process the entire string. Would that work?
I’ll need to get more details on the constraints. Rule of thumb is that 2 marks of the same class occur not infrequently on a letter, but my guess is that 3 or more are rare.

How it works depends on the keyboard developer.

For some scripts it is safe to assume that all diacritics combine with the base character that precedes them, although for some language-specific keyboards I deliberately do not do that. For example, a sequence like ää can be typed a + diaeresis + a + diaeresis. But I also allow a + a + diaeresis to generate ää as a typing shortcut, since the sequence aä is meaningless and cannot exist in the orthography the keyboard was designed for.

Alternatively, some scripts have combining characters that appear visually before the base character but are stored after the base character. For such scripts there is a distinction between visual and logical keyboards, where a certain class of combining characters is typed before the base (in visually ordered input) but is reordered by keyboard rules to an appropriate position after the base character.
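
A small Python sketch of the visual-order case (not Keyman code). Devanagari is used here purely as an illustration: vowel sign i (U+093F) is drawn to the left of its consonant but stored after it. PREBASE, is_base and type_key are hypothetical names, and is_base is deliberately crude.

```python
PREBASE = {"\u093F"}  # Devanagari vowel sign i: shown before the consonant, stored after it

def is_base(ch):
    """Very rough test for a Devanagari consonant (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"

def type_key(text, key):
    """Append key; if a pre-base vowel was typed just before a consonant, store it after."""
    if text and text[-1] in PREBASE and is_base(key):
        return text[:-1] + key + text[-1]
    return text + key

text = ""
for key in "\u093F\u0915":  # typed in visual order: vowel sign i, then ka
    text = type_key(text, key)
# text now holds ka followed by vowel sign i, which is the logical (stored) order
```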

So, as I said, it’s really up to the keyboard developer. But care needs to be taken as the complexity of the rules increases.