Correcting text formatting after import from Word

Despite all the new features added to every new release of RAB, I still find that the basic conversion of text and images from a Word document remains poor.

The supporting documentation (02-Building-Apps.pdf) gives almost no instructions on how to correctly prepare a Word document (1.1.1) but claims that “basic formatting will be preserved such as character styles (bold, italic, underline), numbered lists, bullet points, hyperlinks and simple tables.” In my experience this is not the case as often formatting appears quite differently.

For example, I have a document with text that mixes bold and italic, sometimes within the same sentence. Often I find words or complete lines of text with the wrong formatting, and sometimes text is even underlined, although underlining is never used in the document. The Viewer tab in RAB does not show the final formatting; I only see it when I use View in Browser. I’m not a programmer, but when I check the HTML files things look very strange to me. For example, a single line (2 sentences) which should just be all italic has multiple tags for italic:
<div class="p"><span class="it">Ahmed </span><span class="it">harjooni </span><span class="it">o nyawɗo. </span><span class="it">E mo</span><span class="it"> </span><span class="it">dow leeso</span><span class="it">. </span></div>

I’ve also given up including illustrations as they can appear in the wrong place or often a different picture appears in the same place.

If there are really no additional steps that are needed when creating a Word document then please can there be some way to edit/correct the format and layout inside RAB before building the app and without having to manually edit HTML files (which I presume will be overwritten anytime the book is Updated from Original Source).

Understand that the Word document is converted to SFM (the primary input source), so starting in a Word document is a compromise.

Word support was added since many people don’t understand SFM. But this was primarily aimed at simple picture books (one picture a page). Users have started using the Word import for many other document types, hence the picture issues.

If you want to have more control, then learn SFM (USFM), or use ePub if you want to have complex documents with chapters and illustrations. The result is much more in your control. If you only have a one-chapter book you can just use an HTML page.

There is an alternative if you start in Word. Once you have imported your Word DOCX file, you can Export the SFM form of that DOCX from that book’s Source tab. You can then remove the DOCX-imported book, modify the SFM file, and import that instead.
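If the exported SFM turns out to have the same kind of fragmentation as the HTML, a small script can tidy it before you re-import. The following is only a minimal Python sketch: it assumes the export uses USFM-style character markers such as `\it ...\it*`, which I have not confirmed for every RAB export, and the function name `join_split_runs` and the sample line are made up for illustration.

```python
import re

# A closing character marker immediately followed by the same opening marker,
# e.g. "\it*\it ". Removing that pair joins the two runs into one.
# Assumption: the exported SFM uses USFM-style "\xx ...\xx*" character markers.
REDUNDANT_PAIR = re.compile(r'\\(\w+)\*\\\1 ')

def join_split_runs(sfm: str) -> str:
    """Remove close/open pairs of the same character marker (hypothetical helper)."""
    return REDUNDANT_PAIR.sub('', sfm)

# Hypothetical sample line, not taken from a real export.
sample = r'\p \it Ahmed \it*\it harjooni \it*\it o nyawɗo. \it*'
cleaned = join_split_runs(sample)
print(cleaned)
```

Run it on a copy of the exported file and check the result by eye before importing, since it only handles a close marker directly followed by the same open marker.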

The internal viewer is not working if your Java JRE does not support JavaFX. Corretto Java unfortunately does not have JavaFX. It is in the long term plans to make the viewer not dependent on JavaFX.

The HTML may not look like what you expect, but what does the Word file it came from look like underneath? Unless you are very careful, Word files underneath look like a rat's nest. The HTML you show is probably ten times cleaner than the way it is stored in Word.

For the same surface form, you can have dozens of underneath forms to represent that surface form. For example:
This is all italic words.
This is all italic words.
But under the surface form is this:

*This is all italic words.*
*This* *is all* *italic* *words.*

That is written in Markdown markup.
The second one when converted to HTML would look similar to your HTML illustration.
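To show that the fragmented form and the clean form really are the same thing, here is a minimal Python sketch that merges directly adjacent spans sharing the same class. It is not something RAB does or provides; the function name `merge_adjacent_spans` is hypothetical, and the simple regular expression assumes spans that contain plain text only (no nested tags).

```python
import re

# Two directly adjacent spans with the same class attribute and plain-text content.
ADJACENT = re.compile(
    r'<span class="([^"]+)">([^<]*)</span><span class="\1">([^<]*)</span>'
)

def merge_adjacent_spans(html: str) -> str:
    """Merge neighbouring same-class spans until nothing changes (hypothetical helper)."""
    previous = None
    while previous != html:
        previous = html
        html = ADJACENT.sub(r'<span class="\1">\2\3</span>', html)
    return html

# The line quoted earlier in this thread.
sample = (
    '<div class="p"><span class="it">Ahmed </span>'
    '<span class="it">harjooni </span><span class="it">o nyawɗo. </span>'
    '<span class="it">E mo</span><span class="it"> </span>'
    '<span class="it">dow leeso</span><span class="it">. </span></div>'
)
merged = merge_adjacent_spans(sample)
print(merged)
```

The seven spans collapse to a single italic span with the same text, which is what you would expect to see. Remember that edits to the generated HTML are for inspection only; the source is what needs fixing.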

If you want help and want to stay in Word then please share your Word docx with me and I will help you work through this. Use a Personal Message by clicking on my head icon and choose Message.

Thank you for your response and helpful explanations. The particular book I have been having problems with is a textbook that was already prepared for print publication using Microsoft Publisher. The formatting was already a little complex and further underlying issues were probably created when I copied and pasted in Word.

I am a little familiar with SFM formatting and your tip on exporting from the Source tab was helpful. I was then able to manually correct the markers that were wrong and the result looks fine. Since the book is already getting quite large I think we are going to abandon adding any pictures for now as they are just decorative.

Going forward, I guess it might be better to avoid Word and rather copy and paste the text from Publisher into a text editor and then manually add the markers? Though that would be quite time consuming.

My Publisher 2010 can save as Word docx (or HTML). That may or may not be better than copy and paste into a Word docx.

@Ian_McQuay @MarkP
I understand why it is necessary to do final formatting in the SFM, but providing a decent editing environment for partners/users is difficult. Ideally they need a WYSIWYG editor for USFM. PT is an obvious choice, but I suspect we are not supposed to be creating projects for non-scripture, or registering people for PT when they only work on non-scripture projects. Correct?

Are there any other editing options available?

@MikeB Paratext has several sharing options besides the UBS servers. USB, Chorus Hub server or Network shared folder. I don’t know if you need to have registered projects to use those.

Any other option has issues with syncing all versions.

SIL had Translation Editor as part of Flex before they joined with Paratext. I have not used it, so I do not know if it is flexible enough.

There is Bibledit, but it looks totally tied to Scripture.

There are also Biblelator (on PyPI) and the Usfm Editor Style Guide.

Toolbox can give you partly formatted SFM. It would be totally unrestricted in what you write and in the different books. It is more a database than a document editor.

You could use Libre Office and save as RTF, then use the ancient RTF2SFM command line tool to get the SFM. It takes quite a bit of teaching to get people to use styles and not hard formatting.