Can you decompose Arabic characters? Arabic words with diacritics not found

Hi there

I have created an app using a LIFT file from FLEx in DAB 5.4, translating from Tudaga to Arabic, French and English.
Using the search function, Arabic translations with diacritics (e.g. ko بَاب for “door”) are not found, whereas the same Arabic word is found if it has no diacritics (e.g. the translation of another Tudaga word meaning “door”: kege باب). See screenshots.
This happens even though I have the “accents and tones” box unchecked.

I read in the release notes for 3.1 that this issue should be fixed (composed characters are decomposed), and it does work for me in French, but not in Arabic.

Thank you for your help!

Simon

I’m not a FLEx user, but I can add that, unlike the case for Latin script, Unicode doesn’t have composite characters that use the Arabic vowel diacritics such as in the above text, so any FLEx fix related to “composed characters are decomposed” isn’t going to affect such text.

Hi Bob
Not sure what you mean by “I can add that”. Do you mean the Arabic characters can be decomposed in DAB but not in FLEx, so that the Arabic word WITH diacritics would be found when I type it in the search box WITHOUT diacritics? That would be great for me.

I think what Bob means is that Unicode has never defined composed characters for an Arabic letter plus a vowel diacritic, like it has for accented letters in Roman script. Such Arabic text is always stored decomposed. And I don’t know if unchecking the box “accents and tones” applies to the fatha character in Arabic, since I believe it is a vowel, not a tone (at least in the description I’ve read of the Arabic alphabet).
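
To illustrate why decomposition alone would not help here, below is a minimal sketch in Python using the standard `unicodedata` module. It says nothing about how DAB or FLEx work internally; the variable names are only for illustration. Decomposing (NFD) splits the French “é” into a base letter plus a combining accent, but it leaves بَاب unchanged, because the fatha is already a separate code point.

```python
import unicodedata

# Example strings from this thread, written as explicit code points.
with_marks = "\u0628\u064E\u0627\u0628"  # بَاب (beh, fatha, alef, beh)
plain = "\u0628\u0627\u0628"             # باب  (beh, alef, beh)

# Canonical decomposition of the Arabic word: nothing to split.
nfd = unicodedata.normalize("NFD", with_marks)
fatha_survives = "\u064E" in nfd   # the fatha is still present
matches_plain = nfd == plain       # so it still differs from the plain word

# For comparison, a precomposed Latin letter does get split.
french = unicodedata.normalize("NFD", "\u00E9")  # é becomes e + U+0301

print(fatha_survives, matches_plain, len(french))
```

So after decomposition the word with diacritics still does not equal the word without them; the marks would have to be ignored or removed as a separate step.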

Thank you for your comments.
So it’s not a matter of decomposing the Arabic letter and its diacritic (which is indeed not an accent, but a sign for a vowel, the absence of a vowel, etc.); rather, the solution would be to ignore diacritics in the search function.
Here they are:


064B ARABIC FATHATAN
064C ARABIC DAMMATAN
064D ARABIC KASRATAN
064E ARABIC FATHA
064F ARABIC DAMMA
0650 ARABIC KASRA
0651 ARABIC SHADDA
0652 ARABIC SUKUN
0653 ARABIC MADDAH ABOVE
0654 ARABIC HAMZA ABOVE
0655 ARABIC HAMZA BELOW
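
As a rough sketch of what “ignoring” these characters in a search could look like (in Python, and not based on DAB’s actual code; the function names are hypothetical), both the search text and the entry can have the code points U+064B to U+0655 removed before they are compared:

```python
# The code points listed above, U+064B through U+0655 inclusive.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0655 + 1)}


def strip_listed_diacritics(text: str) -> str:
    """Return text with the listed Arabic diacritics removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def search_matches(query: str, entry: str) -> bool:
    """Hypothetical search test: compare after stripping both sides."""
    return strip_listed_diacritics(query) in strip_listed_diacritics(entry)


# Searching for باب (no diacritics) against an entry containing بَاب.
print(search_matches("\u0628\u0627\u0628", "ko \u0628\u064E\u0627\u0628"))
```

Stripping both sides means a search typed with diacritics would also match an entry written without them.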

There is actually a much longer list of marks that should be ignored, including a number that I’m sure you don’t use. This issue is discussed more completely in the following thread:


My proposal there is to use the standard Unicode “non-spacing mark” category as an indication of what to ignore, which is easier to implement, and may help in languages other than Arabic as well.
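
A minimal sketch of that idea in Python, using the standard `unicodedata` module (this is only an illustration of the proposal, not how DAB implements search; the function name is made up): decompose the text first, then drop every character whose Unicode general category is `Mn` (non-spacing mark).

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Decompose text, then drop all characters in category Mn."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


arabic = strip_nonspacing_marks("\u0628\u064E\u0627\u0628")  # بَاب
french = strip_nonspacing_marks("caf\u00E9")                 # café

print(arabic == "\u0628\u0627\u0628", french)
```

The same function handles the Arabic vowel marks and the French accent, with no per-script list to maintain. The trade-off is that it removes every non-spacing mark, including ones that may be significant in some writing systems, so an app would probably want this to stay optional.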
