Convert multi-page table to single spreadsheet

hasso · February 2, 2023, 2:59pm

Posting on behalf of a field worker:

I’ve got this journal article from 1997 with pages and pages of word list comparisons: 267 columns (vocab items) across about fifty (probably scanned copies of original paper) pages. I need only two rows from this table, and ideally would like to flip the orientation so that the rows become columns and vice-versa. But first I need to convert the PDF scanned journal article tables into a spreadsheet. I’m sure the technology to do that is available somewhere so that it won’t involve manually copying items one at a time, but I don’t know what it would be. Can someone help?

Here’s the document I’m working with: [Dropbox - West Lakes Plains Wordlists.pdf - Simplify your life]

drowe · February 2, 2023, 5:05pm

In the past I’ve used Convert PDF to Word online for free for short text documents. I tried the Excel option at https://www.onlineocr.net/ (choosing English for language) and got results for the first ten pages (since there’s a limit on the free option), though the results are less than stellar (gamma becomes “y”, “^” becomes “A”, etc). Perhaps choosing a different language would give you better results.

Which two rows of the table do you need?

Possible work flow:

Break the PDF document into 7 or more smaller chunks in order to get below the 10-page limit.
Process each file with the converter
Remove unneeded rows and clean up the rest (so that does mean some retyping)
Manually copy and paste the data from each spreadsheet into one
Use a Paste Special with Transpose option to get the orientation you want.

But retyping might be quicker.

drowe · February 2, 2023, 5:07pm

Here’s a zipped copy of the .xlsx file that the online converter generated.

West Lakes Plains Wordlists.zip (14.0 KB)

Alex_Larkin · February 4, 2023, 1:52pm

I use ilovepdf.com. Their PDF to Excel feature worked really well for me. However, anytime you go into OCR territory, you run into trouble. The OCR takes it for granted that the text is going to be in one mejor language or another, so special characters get turned into something else and indigenous words can morph into English words that look similar, for example.

Also, OCR tends to add a lot of strange formatting.

Sometimes the work of scanning and correcting is more than the work of transcribing.

xofweber · February 8, 2023, 7:22pm

I tried Adobe Acrobat’s export tool. The results were not useful. I also tried simply cutting and pasting from the PDF into Excel. Also not helpful.

I then tried a free trial of Abby Finereader. It didn’t do much better. However, it has the option to add other languages (my scan used English and the English phonetic alphabet). Maybe you can create a “new language” by adding in all the symbols that the PDF contains and then “training” the OCR to recognize them using the “new language”. Here’s a link to an article that may be helpful. The free trial lasts only 7 days, but maybe it would be worth giving it a shot.

JDWood · February 9, 2023, 6:49pm

I use Okular as my main PDF viewer, and it has an option to copy tabular data as a table. It pastes well into LibreOffice Writer, and I suspect it would do reasonably well into Excel. I don’t do much with Excel, but I believe you can then use a pivot table to swap columns/rows.

It does require copying one page at a time, but it also allows for tweaking the selection in case the automatic selection misses some column/row divisions.