Extracting non-Unicode text from PDF (South Sudanese language)

I would like to extract text from a series of pdf publications produced for the Ma’di community of South Sudan. The materials I have access to are old, and so the fonts used will be legacy.
This is in support of a diaspora language project in Australia, in which the community would like to update and offer correction to old materials, and to adapt them for Bloom and other platforms.
We would really like not to have to re-type the text if at all possible.
Thanks for any suggestions.
Graham Scott (SIL Australia)

Hi Graham, I would like to recommend the ABBYY “FineReader” OCR program. It is my “go-to” program for getting documents scanned accurately.
You can create a “language” by defining a set of symbols, and you can train the system to recognize irregular or distorted characters.
Cheers, Ken

Thank you Ken.
I’ll look into that.

If the document was exported to PDF rather than scanned to PDF (i.e. there is a real text layer, so you’re not in OCR territory), then a good Linux command-line program is “pdftotext”. If you’re on a new enough version of Windows 10 you could make use of it through the Windows Subsystem for Linux, or you could use a machine running Wasta-Linux (wastalinux.org). It finds the text in the PDF file and spits it out into a file of the same name but ending in .txt. Note that its manual entry does say the following:

Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.

So I don’t know how it would handle your legacy font situation.
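In case it helps to see it concretely, basic pdftotext usage (it ships with poppler-utils) looks something like this; the filenames are made up, and the `-enc UTF-8` flag only guarantees UTF-8 output bytes, not that a legacy font’s private codepoints come out as proper Ma’di characters:

```shell
# Convert one PDF (hypothetical filename):
pdftotext -enc UTF-8 madi-primer.pdf madi-primer.txt

# Batch-convert every PDF in the current folder:
for f in *.pdf; do
    pdftotext -enc UTF-8 "$f" "${f%.pdf}.txt"
done
```

The `${f%.pdf}.txt` expansion just swaps the extension, mirroring pdftotext’s own default of writing a .txt file alongside the PDF.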
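If pdftotext does recover text but in the legacy font’s custom encoding (the common case with old non-Unicode fonts), you can often rescue it with a one-to-one conversion table rather than re-typing. A minimal Python sketch, where the mapping pairs are placeholders I invented for illustration — the real table has to be built by comparing the legacy font’s glyph slots against the Unicode characters the Ma’di orthography uses:

```python
# Placeholder mapping from legacy-font codepoints to Unicode.
# These two pairs are examples only, not real Ma'di data.
LEGACY_TO_UNICODE = {
    "\u00EE": "\u0268",  # hypothetical: legacy "î" slot -> ɨ (i with stroke)
    "\u00FB": "\u0289",  # hypothetical: legacy "û" slot -> ʉ (u bar)
}

def convert_legacy(text: str, table: dict[str, str] = LEGACY_TO_UNICODE) -> str:
    """Replace each legacy codepoint with its Unicode equivalent."""
    return text.translate(str.maketrans(table))
```

Characters not in the table pass through unchanged, so you can build the mapping up incrementally as you spot wrong characters in the extracted text.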