Strange PDF output (for copy/paste) using SIL fonts

sm30 · June 24, 2019, 9:03pm

A Bloom user recently complained about not being able to copy/paste from a PDF file generated by Bloom. Certain composed characters (U+00EE / LATIN SMALL LETTER I WITH CIRCUMFLEX in particular) were coming out garbled when copied from the PDF file even though the displayed text looked just fine. Depending on the PDF viewer, the user saw either U+FFFD (REPLACEMENT CHARACTER) or U+100020 (undefined character from upper PUA set) plus an extraneous space character and U+0302 (COMBINING CIRCUMFLEX ACCENT). This happened whenever the font used for the text was Andika New Basic, Andika, or Charis SIL. If the font chosen by the user was a standard system font such as Arial, Tahoma, or Times New Roman, the copied text came out looking okay. Is this a known “feature” of these fonts?
Again, the PDF page looks okay in all these fonts. It’s just the underlying data that is mangled if the user tries to copy from the PDF and paste elsewhere.
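
For anyone who wants to check pasted text for the symptoms described above, here is a minimal Python sketch. It is only an illustration: the function name `suspicious_characters` is made up for this example, and the sample string assumes the paste came out as U+FFFD, a space, and U+0302.

```python
import unicodedata

def suspicious_characters(text):
    """Return (index, reason) pairs for characters that suggest a mangled paste."""
    hits = []
    for i, ch in enumerate(text):
        if ch == "\ufffd":
            hits.append((i, "replacement character"))
        elif unicodedata.category(ch) == "Co":
            # General category Co covers all private-use code points.
            hits.append((i, "private-use character"))
        elif unicodedata.combining(ch) and i > 0 and text[i - 1].isspace():
            hits.append((i, "combining mark after a space"))
    return hits

# Assumed sample: what one viewer reportedly produced in place of U+00EE.
pasted = "\ufffd \u0302"
for index, reason in suspicious_characters(pasted):
    print(index, f"U+{ord(pasted[index]):04X}", reason)
```

A clean string such as "\u00ee" yields an empty list, so the check can be run over a whole pasted paragraph to see whether anything needs attention.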

Lorna · June 24, 2019, 9:42pm

The default a- and g-based characters in Andika have funny PostScript names, so I would expect copy and paste to have some problems with those. I wouldn’t expect problems with U+00EE unless the user turned on the i-tail alternates. Then I would expect copy and paste to be a problem. I wouldn’t expect the same issues with Charis SIL because Charis has normal PostScript names for a- and g-based characters. However, if alternate glyphs were turned on, then we would again expect problems. We probably can’t give a certain answer without seeing the PDF.

sm30 · June 24, 2019, 10:10pm

Lorna, I can’t upload a PDF file to this forum, so I’ll email you one. Assuming our email system allows that…

Lorna · June 25, 2019, 7:05pm

I’m not seeing any obvious reason this would be happening. Do you know how the user input the character? Is it possible they used Insert/Symbol or something? Sometimes that causes odd behavior.

sm30 · June 25, 2019, 8:01pm

I don’t know how the characters were input. They look fine in the original HTML file, and examining the bytes shows standard Unicode characters encoded as UTF-8. I’ll mail a simple HTML file with the corresponding PDF file produced by Firefox with its “Print to File” mechanism. (That’s essentially what Bloom uses to produce PDF files for printing.)
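
A check of this kind can be sketched in a few lines of Python. This is not the exact procedure used on the HTML file, only an illustration; the `describe` helper and the one-character sample are assumptions made for the example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

sample = "\u00ee"                  # precomposed i with circumflex
utf8 = sample.encode("utf-8")      # the bytes a UTF-8 HTML file would contain

print(utf8.hex(" "))               # c3 ae
print(describe(sample))
# True when the text is already in composed (NFC) form.
print(unicodedata.is_normalized("NFC", sample))
```

If the source text is precomposed and normalized like this, any decomposition seen after copying has to come from the PDF creation or extraction step rather than from the input.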

bobh · June 25, 2019, 9:26pm

I just tried your HTML file, printing to PDF from Firefox 67.0.4 on Windows 10. I have Acrobat installed. Like you say, it looks OK. However, when copied:

Some characters are decomposed (including U+00EE).
The sequence of characters is incorrect, and includes added spaces.

One of the potential reasons for this is that “print to pdf” processes don’t [necessarily] get the underlying character stream into the PDF. All they get is a set of positioned glyphs. So the PDF viewer has to guess both the character codes and their sequence from the glyph names and their positions. This can be problematic.
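
To make the guessing step concrete, here is a rough Python sketch of how a viewer might derive text from glyph names, loosely following the published Adobe Glyph List naming conventions (suffix after a period dropped, `_` separating ligature components, `uniXXXX` and `uXXXX` names). It is an assumption-laden illustration, not what any particular viewer does: `AGL_EXCERPT` is a three-entry stand-in for the full glyph list, and `i.alt` and `glyph123` are hypothetical glyph names, not names taken from our fonts.

```python
import re

# Tiny excerpt standing in for the Adobe Glyph List; a real viewer uses the full table.
AGL_EXCERPT = {"i": "i", "icircumflex": "\u00ee", "dotlessi": "\u0131"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")   # uniXXXX, possibly several groups
_U = re.compile(r"u([0-9A-F]{4,6})")          # uXXXX to uXXXXXX

def guess_text(glyph_name):
    """Guess the Unicode string for a glyph name; '' means no guess is possible."""
    base = glyph_name.split(".", 1)[0]         # drop a suffix such as ".alt"
    out = []
    for part in base.split("_"):               # ligature components
        if part in AGL_EXCERPT:
            out.append(AGL_EXCERPT[part])
        elif (m := _UNI.fullmatch(part)):
            digits = m.group(1)
            out.extend(chr(int(digits[k:k + 4], 16)) for k in range(0, len(digits), 4))
        elif (m := _U.fullmatch(part)) and int(m.group(1), 16) <= 0x10FFFF:
            out.append(chr(int(m.group(1), 16)))
    return "".join(out)

for name in ["icircumflex", "uni00EE", "dotlessi_uni0302", "i.alt", "glyph123"]:
    print(name, [f"U+{ord(c):04X}" for c in guess_text(name)])
```

Note that a glyph built from a dotless i plus a combining circumflex would come back as U+0131 U+0302, which does not recompose to U+00EE, and a name the viewer cannot interpret gives nothing at all. Either outcome would leave the viewer to fall back on something else, which is one way the copied text could end up wrong even though the page looks right.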

(Still need to research why our fonts give different results)
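
Where the base letter survives and the only damage is decomposition plus a stray space, the pasted text can sometimes be repaired after the fact. The following Python sketch shows one heuristic; `recompose` is a name invented for this example, and it assumes that a single space directly before a combining mark is an extraction artifact, which will not always be true.

```python
import unicodedata

def recompose(pasted):
    """Drop a space that directly precedes a combining mark, then apply NFC."""
    kept = []
    for ch in pasted:
        if unicodedata.combining(ch) and kept and kept[-1] == " ":
            kept.pop()          # assume the space was inserted by the viewer
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))

print(recompose("i \u0302") == "\u00ee")   # True
```

This does nothing for characters that arrive in the wrong order or as U+FFFD, so it is a partial clean-up at best.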

bobh · June 25, 2019, 11:22pm

I can’t find any significant reason in our fonts why this would be different.

But in any case, I guess I should have said at first that in general, given the vagaries of PDF creation, text extraction from PDF files is often unreliable. What aspect of your workflow is requiring this capability?

paul_frank · June 26, 2019, 1:59pm

Hi Bob–this started with me. I was combining two Bloom books. Since you can’t run two instances of Bloom at the same time, I created a PDF of the standalone (new language) book and opened the master book in Bloom. Then I added the new language in Bloom, giving me empty text boxes ready for pasting in the text for the new language. My plan had been to copy and paste the text from the PDF that Bloom had just created, but that’s when this problem arose.

On the one hand, I’m sure there’s another way to achieve my task, but on the other hand, what happens is unusual, unexpected, and non-standard. Unicode characters in the input should give the corresponding Unicode characters in the output, but that’s not happening.

bobh · June 26, 2019, 6:46pm

paul_frank wrote: “but on the other hand, what happens is unusual, unexpected, and non-standard.”

Hi Paul,

I humbly disagree with your assessment. It is often the case that copying text from a PDF doesn’t work, and we shouldn’t be surprised – especially when non-ASCII text is involved. This is true even with PDFs created with Adobe products, but all the more so with those created with non-Adobe products.

Perhaps this is the first time you’ve encountered this, but I and my colleagues have seen this problem many times. In my opinion, it is unwise to create workflows that depend on reliably extracting text from PDF files.

I hope you can find an alternate method to merge two Bloom books.

Bob

paul_frank · June 26, 2019, 7:13pm

Well, good points. On the other hand, we’re talking about our software (Bloom) and our fonts, not some third party over whom we have no control. (Although I’m sure the PDF creation tool Bloom uses is from a third party.) I can confirm that other fonts don’t generate this problem, but ours do. To the degree that we are able to correct this problem, I think we should.

sm30 · June 26, 2019, 7:56pm

I think an alternative approach that should work would be to open one of the books in a browser (which should work for displaying the content), and copy from that to the book opened in Bloom itself.

JohnHatton · June 26, 2019, 9:55pm

Steve, there is a pesky style in the book preview that fades everything, preventing copying. But Paul, you could export to Word/LibreOffice, then copy from there.

paul_frank · June 28, 2019, 7:47pm

Thanks for the ideas, guys. For this case, I ended up exporting to XML for InDesign and copied from there. So we’ve got several work-arounds.