Understanding Unicode I missing section 5 + question

mmerc · February 10, 2022, 6:44am

Hello,

I’m not sure if I’m in the right place, but my question pertains to the Understanding Unicode I article on the SIL website. I tried posting in the comments section of the article but got an error, and was referred to this forum by the Contact Us page.

According to the article subheading, there should be a Section 5 but it seems to have been left out. There is another version of the article that includes the section here.

I also have a question about the last paragraph of Section 5.5 - Characters, not graphemes, which says:

Note that the definition of a text element depends both upon a given process and a given writing system. The key point in relation to Unicode is that Unicode assumes that the mapping between characters and text elements is, in general, many-to-many.

If, in general, there is a many-to-many relationship between characters and text elements, then what might be an example of a one-character-to-many-text-elements mapping? I think the examples provided only illustrate many-characters-to-one-text-element mapping.

Lastly, thank you for writing these articles and making them readily available. I am new to Unicode and find them very helpful in supplementing my readings of the standard.

bobh · February 11, 2022, 11:49pm

Thanks for pointing out the missing material! Not sure where it went but we’ll see what we can do to correct it. (and thanks for the link to http://www.cs.unibo.it/)

mmerc · February 12, 2022, 4:32am

Awesome, thanks for getting back so quickly @bobh.

bobh · March 2, 2022, 8:41pm

I no longer seem to be able to get to that PDF, but until we fix the current page you can get to the original text via the Wayback Machine.

bobh · March 2, 2022, 9:28pm

A simple case is the decomposition of composite characters encoded in Unicode. For example U+00C0 LATIN CAPITAL LETTER A WITH GRAVE can be represented by the canonically equivalent sequence U+0041 LATIN CAPITAL LETTER A followed by U+0300 COMBINING GRAVE ACCENT.
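As a rough illustration of the equivalence described above, here is a minimal Python sketch using the standard library's `unicodedata` module. The variable names are only illustrative; the point is that the one-character and two-character spellings are different code point sequences that normalize to each other.

```python
import unicodedata

# One character: U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
composed = "\u00C0"
# Two characters: U+0041 LATIN CAPITAL LETTER A + U+0300 COMBINING GRAVE ACCENT
decomposed = "\u0041\u0300"

# The raw strings differ: one code point versus two.
same_code_points = composed == decomposed

# NFD applies canonical decomposition; NFC recomposes where possible.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(same_code_points, len(composed), len(decomposed))
print(nfd == decomposed, nfc == composed)
```

A reader would typically see both spellings as the same single text element, even though a plain string comparison treats them as different.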

More complex cases can be seen in Arabic Presentation Forms-A, for example U+FD51 ARABIC LIGATURE TEH WITH HAH WITH JEEM FINAL FORM, which might be perceived as a single text element but is compatibility-equivalent to the sequence <062A, 062D, 062C>.
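The Arabic ligature case can be checked the same way. This is a small sketch, again assuming Python's standard `unicodedata` module. Note that U+FD51 carries a compatibility decomposition rather than a canonical one, so the sketch uses NFKD; NFD is expected to leave the character as it is.

```python
import unicodedata

# U+FD51 ARABIC LIGATURE TEH WITH HAH WITH JEEM FINAL FORM
ligature = "\uFD51"

# Canonical decomposition does not touch a compatibility character.
nfd_form = unicodedata.normalize("NFD", ligature)

# Compatibility decomposition expands it into its component letters.
nfkd_form = unicodedata.normalize("NFKD", ligature)
code_points = [f"{ord(ch):04X}" for ch in nfkd_form]

print(unicodedata.name(ligature))
print(code_points)
```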

mmerc · March 5, 2022, 1:59am

I no longer seem to be able to get to that PDF, but until we fix the current page you can get to the original text via the Wayback Machine.

The link still works on my end. I’m unable to upload the PDF here, but the text seems to be identical to the original on the WayBack machine link.

A simple case is the decomposition of composite characters encoded in Unicode. For example U+00C0 LATIN CAPITAL LETTER A WITH GRAVE can be represented by the canonically equivalent sequence U+0041 LATIN CAPITAL LETTER A followed by U+0300 COMBINING GRAVE ACCENT.

That makes sense! Thank you very much for keeping my question in mind @bobh

Peter_Martin · March 10, 2022, 3:13pm

We have now been able to locate and restore the missing content on that page and two others. Thank you for reporting this, @mmerc !

mmerc · March 13, 2022, 9:37am

That’s awesome, thank you @Peter_Martin