
docx to xhtml and back

Posted: Wed Mar 22, 2017 11:00 pm
by robert
Hello. I am trying to do a round trip with docx to xhtml and back. As a sample file I am using
https://github.com/plutext/docx4all/blo ... reSet.docx

The code I'm using to generate the xhtml is, more or less, this:
https://github.com/plutext/docx4j/blob/ ... tHtml.java

Is there a ConvertInHtml example that does the exact opposite of the previous file? I am trying to see whether we can use docx4j for the round trip described above. I am aware of the docx4j importer project, but I have not found any useful code examples there, because all of them are years old (while the ConvertOutHtml sample is quite new).

EDIT
https://github.com/plutext/docx4j-Impor ... LFile.java
Using something similar to the above, I have managed to get a partial reverse conversion working. I had to use file:/// URLs to make images work, and I had to use a blank template document to generate headers and footers.
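For anyone trying the same thing, building the file:/// URL for the image directory can be sketched with plain Java as below. The class name ImageBaseUrl, the method toBaseUrl and the export/images directory are made up for illustration. Whether the importer wants this value as a base URL or wants each img src rewritten is something to check against the importer version in use.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImageBaseUrl {

    /**
     * Turns a local directory into an absolute file:/// URL that ends with a
     * slash, so that relative image paths can be resolved against it.
     * (Hypothetical helper, not part of docx4j.)
     */
    public static String toBaseUrl(Path dir) {
        String uri = dir.toAbsolutePath().normalize().toUri().toString();
        // Path.toUri() only appends the slash when the directory exists on disk.
        return uri.endsWith("/") ? uri : uri + "/";
    }

    public static void main(String[] args) {
        // "export/images" is an assumed location for the exported images.
        System.out.println(toBaseUrl(Paths.get("export", "images")));
    }
}
```

The trailing slash matters because relative references are resolved against the last path segment of a base URL.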

Unfortunately there seem to be bugs with:
1. numbering (bullets work fine)
2. Spacing. For some reason, the generated document has a lot of extra spacing before and after paragraphs. I'll need to see whether I can set it to zero.
3. Tables are recreated too big after exporting to xhtml. They don't fit on screen.
4. Because we use a template for headers and footers, I'll need to hardcode the removal of headers and footers in the export (headers and footers don't work in the xhtml import, right?)
5. Deletions in docx are recreated as red text with a strikethrough line ...
6. Insertions don't work.
7. Fonts such as Calibri (Body) don't seem to work in the docx to XHTML conversion. They get converted to standard Calibri.
8. Image scaling is wrong.
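
For item 2, one thing to try is forcing paragraph margins to zero in the XHTML before it goes into the importer. The sketch below only does the string manipulation. The class and method names are made up, and it is an assumption that the importer turns CSS margins into paragraph spacing, so that needs to be verified against the generated docx.

```java
public class ZeroParagraphSpacing {

    // CSS rule that removes the space before and after every paragraph.
    public static final String RULE =
            "<style type=\"text/css\">p { margin-top: 0; margin-bottom: 0; }</style>";

    /**
     * Inserts RULE just before the closing head tag, so it comes after any
     * stylesheet already in the head. Input without a head element is
     * returned unchanged. (Hypothetical helper, not part of docx4j.)
     */
    public static String addRule(String xhtml) {
        int headEnd = xhtml.indexOf("</head>");
        if (headEnd < 0) {
            return xhtml;
        }
        return xhtml.substring(0, headEnd) + RULE + xhtml.substring(headEnd);
    }

    public static void main(String[] args) {
        String in = "<html><head><title>t</title></head><body><p>x</p></body></html>";
        System.out.println(addRule(in));
    }
}
```

Inline style attributes on individual paragraphs would still take precedence over this rule, so those may need separate handling.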

It's a bit odd that I noticed so many bugs immediately when using APIs that seem rather mature.

Re: docx to xhtml and back

Posted: Sun Mar 26, 2017 10:02 am
by jason
We expect to release a high-quality component for editing Word documents in a web browser soon. I'm assuming in what follows that your application is web-based editing.

In the meantime, for docx -> XHTML -> edit XHTML -> docx, you can see https://github.com/plutext/docx-html-editor

To be clear, this approach has its limitations and we're not actively developing it. As mentioned, we're working on a completely new approach.

Regarding some of your specific comments, I think a number of them are handled by the sample github project referenced above, by maintaining state through the round trip.
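
As a rough illustration of what "maintaining state" can mean (a generic sketch, not the actual code in the project above): before export, remember each original docx-side setting under a key, put that key in the XHTML (for example as an id attribute), and look it up again on import. All names and the example value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class RoundTripState {

    private final Map<String, String> saved = new HashMap<>();
    private int next = 0;

    /**
     * Stores an original setting (for example a serialized table width) and
     * returns the key to embed in the exported XHTML.
     */
    public String remember(String originalSetting) {
        String key = "rt" + (next++);
        saved.put(key, originalSetting);
        return key;
    }

    /**
     * Returns the setting stored under key, or the fallback when the element
     * was created in the editor and has no saved state.
     */
    public String restore(String key, String fallback) {
        return saved.getOrDefault(key, fallback);
    }

    public static void main(String[] args) {
        RoundTripState state = new RoundTripState();
        String key = state.remember("tblW=5000"); // made-up example value
        System.out.println(key + " -> " + state.restore(key, "auto"));
    }
}
```

With this approach the import step does not have to guess values such as table widths from the XHTML, because untouched elements get their original settings back.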

Because Import-XHTML is entirely open source, it relies on contributions for its improvement (unlike the stuff in Docx4j Enterprise). Most users don't actively contribute improvements, so it improves only slowly. That said:

1. numbering (bullets work fine): There is some numbering support.

2. Spacing. For some reason, the generated document has a lot of extra spacing before and after paragraphs: would need to look at this in the context of your XHTML.

3. Tables are recreated too big after exporting to xhtml: see comment about maintaining state above.

4. Because we use a template for headers and footers, I'll need to hardcode the removal of headers and footers in the export (headers and footers don't work in the xhtml import, right?): feel free to contribute...

5. Deletions in docx are recreated as red text with a strikethrough line ... : you mean deletions in XHTML? If you are talking about docx to XHTML to docx here, it is helpful if you work out which step is the issue for you, and post separately about that specific issue.

Your issues 6 to 8 seem to be about docx to XHTML?

Re: docx to xhtml and back

Posted: Tue Mar 28, 2017 12:15 am
by robert
Thanks for the reply. If we decide to invest more time in this feature, I'll take a look at plutext and see how it works.

As per my post, I have used the sample github projects (docx4j & importer) to test all of the above features, and my findings are based on the results using the template provided in docx4all.
I have uploaded the test file and the result file so you can see where I'm coming from. In *Convert*.xml you can see the above described bugs, as well as the fact that the spacing is way off.
If you want to see the exact code I can upload that as well, though it is mainly copy paste from the github links I pasted above (with a small change - I used a template docx file with a header and footer to generate those).

I'm assuming some of them can be fixed with the way the docx to html is being done, though some of them are almost certainly html to docx bugs.
1. If the numbering is not nested, as in the sample docx you provide inside docx4all, I believe it works (it just doesn't seem to know the format of the numbering itself: number, letter, etc.).
2, 3. Maintaining the state of the settings: basically save them as a preset, then reload them. If we decide to invest more time in this, I'll certainly try that.
5. Yes. I have already found that some of these are docx to html conversion issues (for example, fonts such as 'Calibri (Body)' are converted to 'Calibri'). I'll just need to invest time and debug some things.
8. This might actually also be a state issue .... maybe. Will have to check it out.

Thanks for the reply and I'll keep this post bookmarked if we decide to use any of these libraries for our requirements. Have a nice day.