Library question

Posted: Thu Aug 20, 2015 2:28 pm
by stevej
Hi,

I am looking at using docx4j with the ImportXHTML library in a large project to convert HTML to docx. We currently have code that takes a DOM tree representing an HTML document and converts it to PDF using Flying Saucer. It works quite well and we'd like to do the same thing for docx.

I notice that the ImportXHTML library comes with a copy of the Flying Saucer library under the org.docx4j.org.xhtmlrenderer package. I assume this is so that it doesn't conflict with the standard Flying Saucer library, and that you have made some changes to the source. Can you tell me what version of Flying Saucer your library is based on, and what level of modifications have been made (e.g. just a few small changes, more substantial changes across the library, etc.)?

Would it be difficult to make docx4j use the standard flying saucer library? (I assume yes).
When there are new releases of flying saucer, do you update the docx4j library to match?

Thanks,
Steve

Re: Library question

Posted: Thu Aug 20, 2015 10:36 pm
by jason
Hi Steve

You can find our copy of flying saucer at https://github.com/plutext/flyingsaucer

If you look at https://github.com/plutext/flyingsaucer/commits/master you can see the last commit before our fork was in April 2011, namely https://github.com/plutext/flyingsaucer ... 279a29462c

Most of what we added is in https://github.com/plutext/flyingsaucer ... derer/docx
but there were a few other changes which you can see in the commit log.

It would've been good to:

(a) understand which if any of those other changes were actually necessary in the end, and

(b) offer the docx package code back to the upstream project,

but this hasn't happened yet.

It would be quite straightforward to try the standard flying saucer library:

(1) repackage org/docx4j/org/xhtmlrenderer/docx to org/xhtmlrenderer/docx and copy it into your copy of FS proper; build your new FS.

(2) change https://github.com/plutext/docx4j-ImportXHTML/ to use FS in (1) above. Mostly, this is the imports in https://github.com/plutext/docx4j-Impor ... rImpl.java
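
The package rename in steps (1) and (2) can be sketched as a plain text substitution over Java source. This is only an illustration under stated assumptions: the class RepackageSketch and its repackage method are hypothetical, and a real change would also need to move the directories and rebuild both projects.

```java
/** Hypothetical helper: rewrites shaded Flying Saucer package references in Java source text. */
class RepackageSketch {
    // Package prefix used by the copy bundled with docx4j-ImportXHTML (from the thread).
    static final String SHADED = "org.docx4j.org.xhtmlrenderer";
    // Package prefix used by the standard Flying Saucer library.
    static final String STANDARD = "org.xhtmlrenderer";

    /** Replaces every occurrence of the shaded prefix in package and import lines. */
    static String repackage(String javaSource) {
        return javaSource.replace(SHADED, STANDARD);
    }

    public static void main(String[] args) {
        String before = "import org.docx4j.org.xhtmlrenderer.render.BlockBox;";
        // prints: import org.xhtmlrenderer.render.BlockBox;
        System.out.println(repackage(before));
    }
}
```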

Maybe the docx package will make it upstream after all!

Re: Library question

Posted: Fri Aug 21, 2015 12:03 pm
by stevej
Hi Jason,

Thanks for all the info.
I probably won't have a chance to try this for a few weeks, but hopefully will have some success when I do.

Thanks,
Steve

Re: Library question

Posted: Wed Nov 04, 2015 5:09 pm
by stevej
Hi,

I have finally had a bit more time to look at this. I am working on a POC to ensure it will work for our needs. I have come across an issue and I'm not sure whether it's a bug, a problem with my usage, or expected behaviour.

I am using a custom XHTMLImageHandler based on the XHTMLImageHandlerDefault class, which includes this code:

Code:
Inline inline;
if (cx == null && cy == null) {
   inline = imagePart.createImageInline(null, e.getAttribute("alt"), 0, 1, false);
} else {
   inline = imagePart.createImageInline(null, e.getAttribute("alt"), 0, 1, cx, cy, false);
}


If I have an image in my HTML that has no size specified, such as:
Code:
<img src="image.png" />

then cx and cy are null. The image in the resulting docx file is sized correctly.

If I have an image in my HTML that has a specified size, such as:
Code:
<img src="image.png" width="306" height="336" />

or
Code:
<img src="image.png" style="width: 306px; height: 336px;" />

then cx and cy are not null, but they seem to be too large. The image in the resulting docx file is too big.

Please note that I come from a web world where everything is specified in pixels, and I have confused myself multiple times trying to convert between pixels, inches, twips, points, EMUs and so on. I don't know if I need to be using images with a specific DPI to get the results I want.

Can I just ignore the cx and cy values that are passed in to the addImage() function, and calculate them myself based on the image data?

Thanks,
Steve

Re: Library question

Posted: Wed Nov 04, 2015 8:30 pm
by jason
You could, but the problem is going to be where the image has been scaled in the CSS.

There was https://github.com/plutext/docx4j-Impor ... dad0e489c8

Do you have that?

And the related unit tests may be helpful:

https://github.com/plutext/docx4j-Impor ... eTest.java
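
For the option of calculating cx and cy from the image data, a minimal sketch follows. It assumes a DPI chosen by the caller (96 is only an example value), uses the JDK's ImageIO rather than any docx4j call, and the class and method names are hypothetical. As noted above, it ignores any scaling applied in the CSS.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

/** Hypothetical sketch: derive an extent in EMU from an image's intrinsic pixel size. */
class IntrinsicExtentSketch {
    static final long EMU_PER_INCH = 914_400L;

    /** Converts a pixel length to EMU at the assumed DPI. */
    static long pixelsToEmu(int pixels, int assumedDpi) {
        return pixels * EMU_PER_INCH / assumedDpi;
    }

    /** Returns {cx, cy} in EMU for the decoded image; CSS width/height are not considered. */
    static long[] intrinsicExtentEmu(byte[] imageBytes, int assumedDpi) throws IOException {
        BufferedImage img = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (img == null) {
            throw new IOException("Unsupported or corrupt image data");
        }
        return new long[] {
            pixelsToEmu(img.getWidth(), assumedDpi),
            pixelsToEmu(img.getHeight(), assumedDpi)
        };
    }

    public static void main(String[] args) throws IOException {
        // Build a blank 306x336 PNG in memory to stand in for image.png.
        BufferedImage img = new BufferedImage(306, 336, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);

        long[] extent = intrinsicExtentEmu(out.toByteArray(), 96);
        // prints: 2914650 x 3200400 EMU
        System.out.println(extent[0] + " x " + extent[1] + " EMU");
    }
}
```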

Re: Library question

Posted: Thu Nov 05, 2015 2:10 pm
by stevej
Hi Jason,

Yeah I thought that would be the case.

I do have that commit included in my build.

I have tracked it down to this code in XHTMLImporterImpl.addImage():

Code:
      Long cy = contentBounds.height==0 ? null :
         UnitsOfMeasurement.twipToEMU( dotsToTwip(contentBounds.height) );
      Long cx = contentBounds.width==0 ? null :
         UnitsOfMeasurement.twipToEMU( dotsToTwip(contentBounds.width) );


My image file is 306px wide and 336px high.
contentBounds contains width=6120 and height=6720 (exactly 20 times my pixel size).
I think the problem may be the dotsToTwip() method, which multiplies the value by 1.3333. My understanding is that there are 1440 twips to an inch, so twips = pixels * 1440 / dpi. Since contentBounds holds 20 dots per pixel, that is the same as twips = dots * 72 / dpi.

If I change the calculation in dotsToTwip() accordingly, and use a dpi of 72 (meaning twips == dots), my images come out looking correct in my docx.

I don't really understand how MS Word handles dpi, or how docx4j generally converts css pixel lengths into Word units.
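
The unit chain described above can be sketched as follows. This is an illustration, not the docx4j implementation: the class and method names are hypothetical, the 1.3333 factor and the 20 dots per pixel are taken from the observations in this post, and 635 EMU per twip follows from 914400 EMU and 1440 twips per inch.

```java
/** Hypothetical sketch of the dots -> twips -> EMU conversion discussed in the thread. */
class DotsToEmuSketch {
    static final int TWIPS_PER_INCH = 1440;
    static final int EMU_PER_TWIP = 914_400 / TWIPS_PER_INCH; // 635

    /** Proposed formula: twips = dots * 72 / dpi (assumes 20 dots per pixel). */
    static long dotsToTwipProposed(long dots, int dpi) {
        return Math.round(dots * 72.0 / dpi);
    }

    /** Fixed factor reported in the thread for the existing dotsToTwip() method. */
    static long dotsToTwipFixedFactor(long dots) {
        return Math.round(dots * 1.3333);
    }

    static long twipToEmu(long twips) {
        return twips * EMU_PER_TWIP;
    }

    public static void main(String[] args) {
        long dots = 6120; // contentBounds.width reported for the 306px image

        // prints: fixed factor: 8160 twips, 5181600 EMU
        long fixed = dotsToTwipFixedFactor(dots);
        System.out.println("fixed factor: " + fixed + " twips, " + twipToEmu(fixed) + " EMU");

        // prints: 72 dpi: 6120 twips, 3886200 EMU
        long at72 = dotsToTwipProposed(dots, 72);
        System.out.println("72 dpi: " + at72 + " twips, " + twipToEmu(at72) + " EMU");

        // prints: 96 dpi: 4590 twips, 2914650 EMU
        long at96 = dotsToTwipProposed(dots, 96);
        System.out.println("96 dpi: " + at96 + " twips, " + twipToEmu(at96) + " EMU");
    }
}
```

With a DPI of 72 the proposed formula leaves the value unchanged (twips == dots), which matches the behaviour described above.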

Re: Library question

Posted: Thu Nov 05, 2015 4:27 pm
by stevej
I had a look at the junit tests and I don't understand this one in the ImageResizeTest class:

Code:
   public void testFixedSizeImage() throws Exception {
      Inline inline1 = getInline("<div><img src='" + PNG_IMAGE_DATA + "'/></div>");
      Inline inline2 = getInline("<div><img src='" + PNG_IMAGE_DATA + "' width='40px' height='20px' /></div>");
      Assert.assertTrue(inline2.getExtent().getCx() / inline1.getExtent().getCx() == 26);
      Assert.assertTrue(inline2.getExtent().getCy() / inline1.getExtent().getCy() == 13);
   }


The commit you linked to changed the expected ratios from 20 and 10 to 26 and 13 - I don't understand why that is. The unscaled image is 2px by 2px, so I would expect the ratios to be 20 and 10 when the second image is scaled to 40px by 20px.

After making the changes I mentioned to the dotsToTwip() method, this test fails and the ratios are in fact 20 and 10.

I think I'm just missing a piece of the puzzle here.
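
One possible reading of the 26 and 13, offered as arithmetic rather than a confirmed explanation: if only the explicitly sized image goes through the 1.3333 factor, the true ratios become roughly 26.67 and 13.33, and the integer division in the test truncates them. The sketch below assumes the unscaled image works out to 20 twips per pixel (inferred from the ratios of 20 and 10 reported above) and 635 EMU per twip; all names are hypothetical.

```java
/** Hypothetical sketch: reproduces the test ratios under the assumptions stated above. */
class RatioSketch {
    static final int DOTS_PER_PIXEL = 20;  // observed: 6120 dots for 306px
    static final int EMU_PER_TWIP = 635;   // 914400 / 1440

    /**
     * Integer ratio of a sized image's extent to the unscaled image's extent.
     * The factor is applied only to the sized image (an assumption).
     */
    static long ratio(int scaledPx, int basePx, double dotsToTwipFactor) {
        long baseEmu = (long) basePx * DOTS_PER_PIXEL * EMU_PER_TWIP;
        long scaledTwips = Math.round(scaledPx * DOTS_PER_PIXEL * dotsToTwipFactor);
        long scaledEmu = scaledTwips * EMU_PER_TWIP;
        return scaledEmu / baseEmu; // long division truncates, as in the test
    }

    public static void main(String[] args) {
        // prints: 26 and 13
        System.out.println(ratio(40, 2, 1.3333) + " and " + ratio(20, 2, 1.3333));
        // prints: 20 and 10
        System.out.println(ratio(40, 2, 1.0) + " and " + ratio(20, 2, 1.0));
    }
}
```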

Re: Library question

Posted: Mon Nov 16, 2015 10:00 am
by stevej
Hi,

I have made some changes to allow docx4j to work with a "standard" Flying Saucer library (a custom Flying Saucer build is still required, but these changes could conceivably make it back into the main tree at some point). I'm not sure if this will be useful to anyone, but the changes are below:

Flying Saucer:
https://github.com/YellowfinBI/flyingsa ... yf-fs-docx
Changes from the upstream tree:
  • Includes changes from the docx4j project
  • Added code under org.xhtmlrenderer.docx package
  • A couple of small additions required for our project

docx4j:
https://github.com/YellowfinBI/docx4j/t ... -yf-dpifix
Changes from the upstream tree:
  • Made some small changes to image size calculations to respect the configured DPI setting
  • Modified some test classes to fix a problem when running multiple tests as part of the build script

docx4j-ImportXHTML:
https://github.com/YellowfinBI/docx4j-I ... mlrenderer
Changes from the upstream tree:
  • Changed references to org.docx4j.org.xhtmlrenderer to org.xhtmlrenderer
  • Small changes to respect the configured DPI setting
  • Modified test classes so that they succeed with different configured DPI settings

Any comments on these would be welcome.

Thanks,
Steve

Re: Library question

Posted: Sun Jan 28, 2018 11:24 am
by jason
Hi Steve

After some delay (ahem, sorry about that..), status is as follows:

A version of Flying Saucer that is current against upstream (i.e. to Jan 2018), but with the docx code from your branch, may be found at https://github.com/plutext/flyingsaucer ... ellowfinBI

A version of docx4j-ImportXHTML which uses this may be found at https://github.com/plutext/docx4j-Impor ... ree/FS2018

Your image size fixes are in that branch of docx4j-ImportXHTML and at https://github.com/plutext/docx4j/commi ... 25a800c53a

I would like to release a new docx4j-ImportXHTML v3.4.0 based on the FS2018 branch.

But to do that, the changes in plutext/flyingsaucer/tree/FS2018-YellowfinBI need to be accepted upstream and pushed to Maven Central as a new version (which we can use as our dependency).

So the next step is to make a pull request.

Now tracking this at https://github.com/plutext/docx4j-ImportXHTML/issues/41