
docx to pdf

Posted: Thu Feb 05, 2009 1:12 pm
by fiorenzo
Hi Jason,

with docx4j-2.1.0, and this code:
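Code:
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.stream.StreamResult;
import org.docx4j.convert.out.html.HtmlExporter;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

// The wrapper class name is arbitrary; the original post showed only the method body.
public class DocxToPdfTest {

    public static void main(String[] args) throws Exception {
        String inputfilepath = "/home/fiorenzo/test_docx.docx";
        WordprocessingMLPackage wordMLPackage =
                WordprocessingMLPackage.load(new File(inputfilepath));

        // Create temp file for the PDF output.
        File temp = File.createTempFile("output", ".pdf");

        // Both streams are closed automatically (the original never closed os2).
        try (OutputStream os = new FileOutputStream(temp);
             OutputStream os2 = new FileOutputStream("/home/fiorenzo/test_docx2.html")) {

            // Export to HTML.
            StreamResult result = new StreamResult(os2);
            HtmlExporter.html(wordMLPackage, result, inputfilepath + "_files");

            // Export to PDF.
            wordMLPackage.pdf(os);
        }
    }
}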


I attached the results of my conversion tests (docx to pdf and html), with different programs (docx4j/word 2007/batch program in windows):

http://liverockmedia.com/docx/test_docx.zip

Bye

fiorenzo

Re: docx to pdf

Posted: Sat Feb 07, 2009 12:50 am
by jason
Looking at your test case...

At the HTML stage, in Firefox:

- there is too much space between paragraphs
- bullets are missing
- probably, text flow adjacent to image could be improved

In IE, errant "A" characters with a hat are shown.
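
One common cause of stray "Â" characters is a UTF-8 page being decoded as ISO-8859-1 or Windows-1252, for example when the HTML carries no charset declaration. That is only an assumption here, not something confirmed for this test document. The sketch below reproduces the effect with a non-breaking space; `MojibakeDemo` and `misdecode` are made-up names for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: not part of docx4j.
final class MojibakeDemo {

    // Encode text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00A0 (non-breaking space) is the two bytes C2 A0 in UTF-8;
        // read as ISO-8859-1 they become U+00C2 ("A" with a circumflex)
        // followed by U+00A0.
        String garbled = misdecode("\u00A0");
        for (char ch : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch);
        }
    }
}
```

If this is what is happening, declaring the charset in the generated HTML would be the first thing to try.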

In PDF, the image at the end of the document is lost.

These things could definitely be improved. I need to remove the spam from docx4j's ticket tracker before this can usefully be tracked.

cheers,

Jason

Re: docx to pdf

Posted: Tue Mar 17, 2009 3:04 pm
by Mesni
Hi Jason,

(see the table at http://www.tony-franks.co.uk/UTF-8.htm)
I'm having a problem with the Slovenian letter č (numeric reference &#269;). The letters š and ž (&#353; and &#382;) are shown correctly, but č isn't displayed at all. Is that a problem with fop-fonts? Or can the encoding be set somewhere? I don't know where to look.
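
To double-check which numeric reference belongs to which letter, the code points can be printed with a few lines of Java. This is just a sketch; `NumericRefs` and `toNumericRefs` are hypothetical names and not part of docx4j or FOP.

```java
// Hypothetical helper: not part of docx4j or FOP.
final class NumericRefs {

    // Replace every non-ASCII code point with a decimal numeric character reference.
    static String toNumericRefs(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The three Slovenian letters discussed here, written as escapes
        // so the source file encoding does not matter.
        System.out.println(toNumericRefs("\u010D \u0161 \u017E"));
    }
}
```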

And another problem with the conversion: when I convert a docx that has a header and a footer, neither is added to the PDF. Is that not built in yet?

Oh, and I almost forgot: a page break isn't recognized in the PDF. How do I create a page break in the docx so that the PDF gets one too?

Lp, Aleš

Re: docx to pdf

Posted: Wed Mar 18, 2009 1:48 pm
by jason
I've looked at your PDF problems:

- re fonts, see detail below

- re header/footer - these don't seem to be handled in DocX2HTML.xslt; if they were, it looks like xhtmlrenderer could deal with them

- re <w:br w:type="page"/>, this should and does translate into <br style="page-break-after:always"> (or -before?) at line 3439, and there are xhtmlrenderer posts which say that this is honoured. So I'm not sure where things are going wrong.

As I think I've said before, I don't like using DocX2HTML.xslt to create PDF. I didn't write that xslt - someone at Microsoft did - and I find it difficult to follow. That said, the font problems are mine ...

So, tomorrow morning I'll look at creating a second PDF output method using iText. I'll see where I get with a couple of hours; and it won't try to do anything smart with fonts. It may be enough for you or someone else to think worth expanding.

Back to the fonts ... There are two things to note up front:

1. PDF output is via HTML, so it is useful to look at the intermediate HTML output
(one way to do this is to open the document in docx4all, then export as HTML)

2. there is a font substitution mechanism which tries to use the closest font available on the local system, where closeness is measured by Panose. Sometimes this yields an imperfect result.
The font is supposed to be substituted at the HTML stage, and then embedded at the PDF stage.
WordprocessingMLPackage.pdf() does font embedding following https://xhtmlrenderer.dev.java.net/guid ... tml#xil_32
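
To illustrate what "closest by Panose" can mean: a Panose classification is ten digits, and a simple way to compare two fonts is to sum the per-digit differences. The sketch below uses that simplified, unweighted distance. It is not docx4j's actual substitution code, and the class name, method names and Panose values are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: not docx4j's actual substitution code.
final class PanoseMatch {

    // Unweighted distance between two 10-digit Panose classifications.
    static int distance(int[] a, int[] b) {
        if (a.length != 10 || b.length != 10) {
            throw new IllegalArgumentException("Panose numbers have 10 digits");
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Return the name of the installed font closest to the wanted classification,
    // or null if no fonts were supplied.
    static String closest(int[] wanted, Map<String, int[]> installed) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, int[]> e : installed.entrySet()) {
            int d = distance(wanted, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up Panose values, for illustration only.
        Map<String, int[]> installed = new LinkedHashMap<>();
        installed.put("FontA", new int[] {2, 11, 6, 4, 2, 2, 2, 2, 2, 4});
        installed.put("FontB", new int[] {2, 2, 6, 3, 5, 4, 5, 2, 3, 4});
        int[] wanted = {2, 11, 6, 3, 2, 2, 2, 2, 2, 4};
        System.out.println(closest(wanted, installed));
    }
}
```

A nearest match always returns something when at least one font is installed, which is one reason the result can be imperfect: the closest available font may still lack the glyphs the document needs.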

In my example document in Word, I used the fonts Arial Unicode MS, Calibri and Times New Roman.

I explicitly applied the font to the text run, so the hard coded default at line 6299 of DocX2Html.xsl isn't used.

On both Windows and Linux, I saw only š (&#353;) in PDF output.

Although all three letters appear in the intermediate HTML output, I think this only tells you that the UTF-8 character made it into HTML unscathed.

In my case (Linux), Lucida Sans Typewriter replaced arialunicodems in the HTML, and was embedded in the PDF.
Similarly DejaVuSerif replaced timesnewroman.
No substitute was available for Calibri.

Do you know whether your letters are available in these fonts? If not, that would explain the behaviour.

Anyway, I've made a couple of minor fixes, so you could try the latest SVN (on Windows) and see whether it helps.

Re: docx to pdf

Posted: Thu Mar 19, 2009 9:33 am
by Mesni
I have to say that this is good work, but I have some problems with the converter you committed.
Have you committed everything? The transformer you call requires an InputStream, but you pass it a Source.
The problem is that it can't be cast to InputStream.
Sorry for bothering you, and thank you.

Lp, Mesni

Re: docx to pdf

Posted: Thu Mar 19, 2009 10:11 am
by Mesni
Oh, and one more thing, in case it helps: I checked out the FOP source code and fixed it so that it displays Slovenian characters. It needs a little more fixing, but that is no problem. The main thing is to create your own font, add it to the settings, and then use it for the conversion to PDF.

LP, Mesni

Re: docx to pdf

Posted: Thu Mar 19, 2009 11:18 am
by jason
jason wrote:
As I think I've said before, I don't like using DocX2HTML.xslt to create PDF. I didn't write that xslt - someone at Microsoft did - and I find it difficult to follow. That said, the font problems are mine ...

So, tomorrow morning I'll look at creating a second PDF output method using iText. I'll see where I get with a couple of hours; and it won't try to do anything smart with fonts. It may be enough for you or someone else to think worth expanding.


Today's changes are explained in the CreatePdf sample:

Code:

import org.docx4j.convert.out.pdf.PdfConversion;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class CreatePdf {

    public static void main(String[] args) throws Exception {

        String inputfilepath = "/home/dev/workspace/docx4all/sample-docs/docx4all-CurrentDocxFeatures.docx";

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        /* Choose which of the three methods you want to use...
         *
         * .. viaHTML uses docX2HTML.xslt and xhtmlrenderer,   <--- old approach
         *    and supports numbering, images,
         *    and tables, but is pretty hard to understand
         *
         * .. viaXSLFO uses docx2fo.xslt and FOP.  It is        <--- new option
         *    rudimentary right now, but should be
         *    easy enough to extend to include a basic
         *    feature set
         *
         * .. viaIText - for developers who don't like xslt    <--- new option
         *    at all! Or want to use iText's features.
         *    Displays images, but as at 2009 03 19
         *    doesn't try to scale them.
         */
        PdfConversion c
//          = new org.docx4j.convert.out.pdf.viaHTML.Conversion(wordMLPackage);
//          = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
            = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);

        // Show the PDF
        c.view();
    }
}


So now there are 3 ways to create a PDF, each with different capabilities.

The viaXSLFO conversion uses extension functions to access docx4j's model of the document to get paragraph and run properties etc. This makes sense (given that this is docx4j :) ), and I think is easier to understand and maintain than a pure xslt implementation.
(The HTML approach does this as well, but to a lesser extent, i.e. mainly for images and numbering.) The viaIText approach naturally uses the docx4j document model (since in its current implementation it doesn't use XSLT at all).

For my own needs, PDF output is currently a nice-to-have rather than a core requirement. So it would be ideal if community members could step up and enhance their preferred approach (which I imagine is FOP or iText, as opposed to the old HTML-based approach?).

Mesni wrote:
Have you committed everything? The transformer you call requires an InputStream, but you pass it a Source.
The problem is that it can't be cast to InputStream.


Everything should be committed now.

I still need to have a look to see how fop-fonts jar interacts with a real fop jar.

Re: docx to pdf

Posted: Thu Mar 19, 2009 11:35 am
by Mesni
Yesterday I was playing with the FOP source code, and since then I have had these problems:

19.03.2009 11:47:28 *WARN * TTFFile: Ascender and descender together are larger than the em box. This could lead to a wrong baseline placement in Apache FOP. (TTFFile.java, line 1264)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:237)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:219)
at org.apache.fop.fonts.truetype.FontFileReader.init(FontFileReader.java:45)
at org.apache.fop.fonts.truetype.FontFileReader.<init>(FontFileReader.java:74)
at org.apache.fop.fonts.truetype.TTFFontLoader.read(TTFFontLoader.java:88)
at org.apache.fop.fonts.truetype.TTFFontLoader.read(TTFFontLoader.java:75)
at org.apache.fop.fonts.FontLoader.getFont(FontLoader.java:171)
at org.apache.fop.fonts.FontLoader.loadFont(FontLoader.java:120)
at org.apache.fop.fonts.FontLoader.loadFont(FontLoader.java:98)
at org.apache.fop.fonts.autodetect.FontInfoFinder.find(FontInfoFinder.java:254)
at org.docx4j.fonts.Substituter.setupPhysicalFont(Substituter.java:256)
at org.docx4j.fonts.Substituter.setupPhysicalFonts(Substituter.java:214)
at org.docx4j.fonts.Substituter.<clinit>(Substituter.java:141)
at org.docx4j.convert.out.html.HtmlExporter.html(HtmlExporter.java:190)
at org.docx4j.convert.out.html.HtmlExporter.html(HtmlExporter.java:112)
at org.docx4j.convert.out.pdf.viaHTML.Conversion.output(Conversion.java:33)
at org.docx4j.convert.out.pdf.PdfConversion.view(PdfConversion.java:70)
at org.docx4j.samples.CreatePdf.main(CreatePdf.java:55)

Is this known to you? I don't know why it happens, because I'm not using the FOP build that I was modifying. I downloaded a fresh jar, version 0.95, and it still happens. I increased the Java heap space, but still nothing.
This error happened once before, when I was modifying the FOP source to use my own TTF (ArialUNI).
After I increased the heap space, FOP worked: č, š and ž were all there. But compiling the code from docx4j doesn't work. Are you able to help me?
Thank you very much

LP, Mesni

Re: docx to pdf

Posted: Thu Mar 19, 2009 12:21 pm
by jason
There could be issues if you are trying to use a fop.jar and fop-fonts-0.2.0.jar. However:

1. I wrote the org.docx4j.convert.out.pdf.viaXSLFO.Conversion stuff with both those present, with no ill effects

2. I wouldn't expect this to manifest as an OutOfMemoryError.

You mentioned a compile error .. what was that?

Given that you want to manage fonts yourself, you could disable/delete the org.docx4j.fonts classes, and remove fop-fonts.jar

Or it may be simpler to use viaXSLFO.

cheers

Jason

Re: docx to pdf

Posted: Fri Mar 20, 2009 5:51 am
by jason
jason wrote:
Given that you want to manage fonts yourself, you could disable/delete the org.docx4j.fonts classes, and remove fop-fonts.jar


Just to note I've found myself refactoring the font stuff this morning.

I should have something later today which works better for you - I'll post again here when it is done.

cheers,

Jason

Re: docx to pdf

Posted: Fri Mar 20, 2009 9:09 am
by Mesni
Thank you very much. First I have to find out why I am getting the error about the heap space.


Lp, Mesni

Re: docx to pdf

Posted: Fri Mar 20, 2009 10:01 am
by Mesni
Heh... Found the problem. FOP is trying to load the font mingliu.ttc, but this file is 28 MB, so it runs out of memory. That ain't good...

Re: docx to pdf

Posted: Fri Mar 20, 2009 2:13 pm
by jason
Ok, there is now in SVN an org.docx4j.fonts.SubstituterWindowsPlatformImpl class, which is intended to map fonts used in the document to Microsoft's actual font (provided it is in the Windows/fonts directory).

This class should only be used on the Windows platform. If you are on another platform, you can use SubstituterImplPanose (the existing approach), or extend Substituter yourself (which may appeal if your documents have predictable fonts, and you know what is available on your system).

If you use SubstituterWindowsPlatformImpl in combination with pdf.viaHTML, I'd expect your glyphs to work now.

There are a few things still to tidy up:

- there isn't an easy way to set your desired Substituter right now; at present it's buried in HtmlExporter

- I don't think bold or italic will work yet

I haven't made fonts work with pdf.viaIText or pdf.viaXSLFO (though I've made some notes on how to do the latter, if anyone is feeling brave).

You'll need this fop.jar (which is a complete build from fop's svn of today, with docx4j's minor extensions, i.e. the stuff previously in fop-fonts). You should remove fop-fonts from your class path.

cheers

Jason

Re: docx to pdf

Posted: Mon Mar 23, 2009 10:08 am
by Mesni
Hello Jason,

Nice work with the FOP changes, but there is still one problem: the application is trying to load too many fonts. It just goes into Windows/fonts and starts to map them all. I tried increasing the heap size, and it just loads a lot more fonts until it runs out of heap again. Did I miss something? If I did, I'm sorry. Is there any configuration needed?

If you have an answer I would be very grateful. Until then I will debug your application a little more.

Thank you very much

Lp, Mesni

Re: docx to pdf

Posted: Mon Mar 23, 2009 10:55 am
by Mesni
OK, I have a solution:

I changed something in your font code: I removed the searching for and use of TTC files, because they are too big (18 MB), and I removed the use of ARIALUNI (24 MB), so that the loading can complete.

And now Š, Č, Ć, Ž and Đ all work :D

Nice work mate. I like it :)

Lp, Mesni

Re: docx to pdf

Posted: Mon Mar 23, 2009 12:25 pm
by jason
Mesni wrote:
Nice work with the FOP changes, but there is still one problem: the application is trying to load too many fonts. It just goes into Windows/fonts and starts to map them all. I tried increasing the heap size, and it just loads a lot more fonts until it runs out of heap again. Did I miss something? If I did, I'm sorry. Is there any configuration needed?


PhysicalFonts represents the fonts on the system which are known to docx4j.

Its discoverPhysicalFonts() method is what maps all the fonts in Windows/fonts.

But you don't have to call discoverPhysicalFonts(); you can just use addPhysicalFont(URL fontUrl) if there are a few specific fonts of interest to you.

However, IdentityPlusMapper does do PhysicalFonts.discoverPhysicalFonts(); if this is a problem, you could create your own mapper which doesn't do that. Or we could take that out of its static initialiser, and make it the responsibility of the programmer to set up the physical fonts before calling populateFontMappings.
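
If only a handful of fonts matter, one option is to build the list of font URLs yourself and pass each one to addPhysicalFont(URL fontUrl) instead of discovering everything. The sketch below covers only the plain-Java part. `FontWhitelist` and `existingFontUrls` are hypothetical names, the directory and file names are placeholders, and the docx4j call is left as a comment because its behaviour should be checked against the version you have.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of docx4j.
final class FontWhitelist {

    // Return URLs for those named font files that actually exist in fontDir.
    static List<URL> existingFontUrls(File fontDir, String... fileNames)
            throws MalformedURLException {
        List<URL> urls = new ArrayList<>();
        for (String name : fileNames) {
            File f = new File(fontDir, name);
            if (f.isFile()) {
                urls.add(f.toURI().toURL());
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder directory and file names; adjust for your system.
        File fontDir = new File("C:/Windows/Fonts");
        for (URL url : existingFontUrls(fontDir, "arial.ttf", "times.ttf")) {
            // With docx4j on the classpath, this is where each URL would be
            // passed to PhysicalFonts.addPhysicalFont(url).
            System.out.println(url);
        }
    }
}
```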

Re: docx to pdf

Posted: Tue Apr 28, 2009 4:49 pm
by Leigh
Hi,

I was not sure whether to open a new thread or not ... but I just set up a local development environment and am receiving the same OutOfMemoryError when using the pdf.viaIText method. Do you have any suggestions or ideas for making this work with the iText method?

Re: docx to pdf

Posted: Wed Apr 29, 2009 5:53 am
by jason
Obviously you can give your runtime environment more memory. For example, in Eclipse's run dialog, you can add the VM arguments:

Code:
-Xmx512M -Xss1024K
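
To confirm that the setting has actually reached the JVM that runs the conversion, the heap limit can be printed from inside the program. A minimal sketch; `HeapCheck` is a hypothetical name.

```java
// Hypothetical helper: reports the heap limit of the running JVM.
final class HeapCheck {

    // Maximum heap in whole mebibytes, as reported by the JVM.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Depending on the garbage collector, the reported figure may be somewhat lower than the -Xmx value.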


Alternatively, as noted previously, you can extend org.docx4j.fonts.Mapper. Actually you need to do 3 things:

1. set up your mapper so docx4j knows that if fontX is used in the document, physical fontX2 should be used in the pdf. On Windows (assuming standard Office fonts are used in the document and available on your system), typically you just use the font specified in the document. This is what org.docx4j.fonts.IdentityPlusMapper does.

What you want is something similar to IdentityPlusMapper

2. In your mapper, add the fonts on your system you want to use to org.docx4j.fonts.PhysicalFonts

Where IdentityPlusMapper does:

Code:
   static {
      
      try {
         
         PhysicalFonts.discoverPhysicalFonts();
         
      } catch (Exception exc) {
         throw new RuntimeException(exc);
      }
   }


instead you want some other strategy, for example manually add the fonts you are using. (Here docx4j is using concepts/objects from FOP). Another strategy would be to look at the size of the font file, and just ignore large ones.

3. Tell docx4j to use your font mapper. WordprocessingMLPackage has a method setFontMapper(Mapper fm) which does this.
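
The "ignore large font files" strategy from step 2 can be sketched in plain Java. The size limit and the rule of skipping .ttc collections are assumptions, prompted by the reports earlier in this thread that very large font files and TTC files exhausted the heap. `FontFilter` and `usableFonts` are hypothetical names, and the docx4j call is only indicated in a comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: not part of docx4j.
final class FontFilter {

    // List .ttf files in fontDir that are no larger than maxBytes.
    // TrueType collections (.ttc) and other file types are skipped.
    static List<File> usableFonts(File fontDir, long maxBytes) {
        List<File> result = new ArrayList<>();
        File[] files = fontDir.listFiles();
        if (files == null) {
            return result; // not a directory, or not readable
        }
        for (File f : files) {
            String name = f.getName().toLowerCase(Locale.ROOT);
            if (f.isFile() && name.endsWith(".ttf") && f.length() <= maxBytes) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder directory and an arbitrary 5 MB limit.
        File fontDir = new File(args.length > 0 ? args[0] : "C:/Windows/Fonts");
        for (File f : usableFonts(fontDir, 5L * 1024 * 1024)) {
            // A custom mapper could register f.toURI().toURL() with
            // PhysicalFonts.addPhysicalFont(URL) at this point.
            System.out.println(f);
        }
    }
}
```

A size cut-off is crude, since it also drops large fonts with broad glyph coverage such as ARIALUNI, so it suits documents whose fonts are known in advance.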

Hope this helps .. Jason

Re: docx to pdf

Posted: Wed Apr 29, 2009 7:13 pm
by Leigh
Yes, that helps a lot. Thank you!

Re: docx to pdf

Posted: Tue May 19, 2009 9:40 pm
by td16
I am trying to use the iText conversion for the pdf. Here are the issues I am facing...

I get the following exception when I generate the wordMLPackage and then use the same package for conversion into PDF:

java.lang.NullPointerException
at org.docx4j.model.HeaderFooterPolicy.<init>(HeaderFooterPolicy.java:69)
at org.docx4j.openpackaging.packages.WordprocessingMLPackage.getHeaderFooterPolicy(WordprocessingMLPackage.java:117)
at org.docx4j.convert.out.pdf.viaIText.Conversion.<init>(Conversion.java:56)


In my second attempt I wrote the wordMLPackage to a docx file and then reloaded the file to convert it to PDF. I get the exception below:

java.lang.NullPointerException
at org.docx4j.convert.out.pdf.viaIText.Conversion$EndPage.onEndPage(Conversion.java:453)
at com.lowagie.text.pdf.PdfDocument.newPage(Unknown Source)
at com.lowagie.text.pdf.PdfDocument.close(Unknown Source)
at com.lowagie.text.Document.close(Unknown Source)
at org.docx4j.convert.out.pdf.viaIText.Conversion.output(Conversion.java:152)

I have a footer in my document with some text and page numbering.

I guess that without the page numbering in the footer, I get an out-of-memory error from the font mapper.

Re: docx to pdf

Posted: Wed May 20, 2009 7:00 am
by jason
I think the NPE at org.docx4j.model.HeaderFooterPolicy.<init>(HeaderFooterPolicy.java:69) was fixed in changeset 803 two weeks ago.

For the out of memory error, see the immediately preceding post in this thread.

Re: docx to pdf

Posted: Fri May 22, 2009 11:44 pm
by td16
How can I get this latest code? I tried to compile the latest code that I got through SVN with Eclipse, and it breaks. I'd really appreciate it if I could get the latest jar file.

Also, is there a simple example of how to write a mapper with a few basic fonts like Times, Arial etc.?

Thanks

Re: docx to pdf

Posted: Sat May 23, 2009 1:00 am
by jason
td16 wrote:
I tried to compile the latest code that I got through SVN with Eclipse, and it breaks.


What error do you get?

Re: docx to pdf

Posted: Wed May 27, 2009 8:21 pm
by td16
So here's the weird problem. I am able to build it successfully with Maven, but when I try to run a sample program (from the package), Eclipse throws an error saying the main class was not found. I also tried exporting the project as a jar file, but when I use the jar file in my program, I get several errors. My guess is that I am not compiling or exporting the jar properly. Any help here would be appreciated.

Re: docx to pdf

Posted: Wed May 27, 2009 9:33 pm
by jason
Sounds like you don't have your project (or the particular run) configured correctly in Eclipse.

Please create a new discussion thread, since that's off topic from this pdf discussion.