
Problem converting docx to pdf (missing parts)

Posted: Thu Dec 15, 2016 7:17 pm
by Lostman
Hi guys,

I am using docx4j to generate a report. I can generate it as .docx, but I have problems when converting it to PDF.

Let me explain what I do. I start from a template that contains keywords to replace: some with text, others with images or even tables.

To search for keywords I do this:

Code: Select all
    template = WordprocessingMLPackage.load(stream);

    HashMap<String, String> values = setValuesMap(route);

    replacePlaceholders(template, values);

    List<Object> texts = template.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);

    for (Object obj : texts) {

      List<?> objContent = getAllElementFromObject(obj, Text.class);

      for (Object obj1 : objContent) {

        Text text = (Text) obj1;
        String textValue = text.getValue();

        if (textValue.contains("KEYWORD_IMAGE")) {
          // Clear node content
          List<Object> content = ((R) obj).getContent();
          content.clear();

          // Add image replacement:
          // get a string with the base64 value of an image
          String image = getBase64Image();
          String imageDataBytes = image.substring(image.indexOf(",") + 1);
          byte[] decoded = org.apache.commons.codec.binary.Base64.decodeBase64(imageDataBytes.getBytes());

          R r = newImageR(template, decoded, null, null, 0, 1, 2000);
          content.add(r);
        }

        if (textValue.contains("KEYWORD_TABLE")) {
          List<Object> content = ((R) obj).getContent();
          content.clear();
          content.addAll(createTableFromData(data.getTableData()));
        }
      }
    }



The function that creates the image element is this:

Code: Select all
  public static R newImageR( WordprocessingMLPackage wordMLPackage,
      byte[] bytes,
      String filenameHint, String altText,
      int id1, int id2, long cx) throws Exception {
 
    BinaryPartAbstractImage imagePart = BinaryPartAbstractImage.createImagePart(wordMLPackage, bytes);
   
    Inline inline = imagePart.createImageInline( filenameHint, altText,
            id1, id2, cx, false);
   
    // Now add the inline in w:p/w:r/w:drawing
    org.docx4j.wml.ObjectFactory factory = Context.getWmlObjectFactory();
    org.docx4j.wml.R  run = factory.createR();
    org.docx4j.wml.Drawing drawing = factory.createDrawing();       
    run.getContent().add(drawing);     
    drawing.getAnchorOrInline().add(inline);
   
    return run;
   
  }


And the functions that create the table are:

Code: Select all
   private List<Object> createTableFromData(Set<Data> tableData) throws Exception{
      List<Object> elements = new ArrayList<Object>();
   
     Tbl tbl = createTable();
    
     for (Data row : tableData){
        // Create new row
        Tr tr = createTrData(row);
        tbl.getContent().add(tr);
      }
    
     elements.add(tbl);
    
     return elements;
   }
   
  private Tbl createTable() throws JAXBException{
    String text =
      "<w:tbl xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:tblPr>" +
            "<w:tblW w:type=\"dxa\" w:w=\"8930\"/>" +
            "<w:tblInd w:type=\"dxa\" w:w=\"344\"/>" +
            "<w:tblBorders>" +
                "<w:top w:color=\"BFBFBF\" w:space=\"0\" w:sz=\"4\" w:themeColor=\"background1\" w:themeShade=\"BF\" w:val=\"single\"/>" +
                "<w:bottom w:color=\"BFBFBF\" w:space=\"0\" w:sz=\"4\" w:themeColor=\"background1\" w:themeShade=\"BF\" w:val=\"single\"/>" +
                "<w:insideH w:color=\"BFBFBF\" w:space=\"0\" w:sz=\"4\" w:themeColor=\"background1\" w:themeShade=\"BF\" w:val=\"single\"/>" +
                "<w:insideV w:color=\"BFBFBF\" w:space=\"0\" w:sz=\"4\" w:themeColor=\"background1\" w:themeShade=\"BF\" w:val=\"single\"/>" +
            "</w:tblBorders>" +
            "<w:shd w:color=\"auto\" w:fill=\"FFFFFF\" w:themeFill=\"background1\" w:val=\"clear\"/>" +
            "<w:tblCellMar>" +
                "<w:top w:type=\"dxa\" w:w=\"15\"/>" +
                "<w:left w:type=\"dxa\" w:w=\"15\"/>" +
                "<w:bottom w:type=\"dxa\" w:w=\"15\"/>" +
                "<w:right w:type=\"dxa\" w:w=\"15\"/>" +
            "</w:tblCellMar>" +
            "<w:tblLook w:firstColumn=\"1\" w:firstRow=\"1\" w:lastColumn=\"0\" w:lastRow=\"0\" w:noHBand=\"0\" w:noVBand=\"1\" w:val=\"04A0\"/>" +
        "</w:tblPr>" +
        "<w:tblGrid>" +
            "<w:gridCol w:w=\"3130\"/>" +
            "<w:gridCol w:w=\"5800\"/>" +
        "</w:tblGrid>" +
      "</w:tbl>";

    Tbl tbl = (Tbl)XmlUtils.unmarshalString(text);
   
    return tbl;
  }
 
  private Tr createTrData(Data row) throws Exception{
    ObjectFactory wmlObjectFactory = new ObjectFactory();
    Tr tr = createTableRow(156);
   
    Tc tc1 = createTableCell(2126);
   
    JAXBElement<org.docx4j.wml.Tc> tcWrapped1 = wmlObjectFactory.createTrTc(tc1);
    tr.getContent().add(tcWrapped1);
               
     
    if (data.getTypesOfEquipment() != null && !data.getTypesOfEquipment().isEmpty()){
      tc1.getContent().add(createParagrahBulletListHeader("Types of equipment"));
       
      for (EquipmentType et : data.getTypesOfEquipment()){
        tc1.getContent().add(createBulletedParagraph(et.getName()));
      }
    }
     
    Tc tc2 = createTableCell(6804);
    JAXBElement<org.docx4j.wml.Tc> tcWrapped2 = wmlObjectFactory.createTrTc(tc2);
    tr.getContent().add( tcWrapped2);
   
    tc2.getContent().addAll(convertStringToParagraphs(data.getComments())); 
   
    return tr;
  }


I convert to PDF with this:

Code: Select all
    String inputfilepath = "C:\\Users\\user\\Downloads\\SPAIN_GENERATE_TEST_2.docx";
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
    OutputStream os = new java.io.FileOutputStream(inputfilepath.substring(0,inputfilepath.length()-5) + ".pdf");
   
    FOSettings foSettings = Docx4J.createFOSettings();
    foSettings.setFoDumpFile(new java.io.File(inputfilepath + ".fo"));
//    foSettings.setWmlPackage(template);
    foSettings.setWmlPackage(wordMLPackage);
   
    Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);



I haven't included all the code, but I think this is enough to show what I do. With this I generate the .docx correctly, but when I convert it to PDF the images and tables don't show. After some tries I found that if I open the generated docx and save it as a new docx (without doing anything else to the file) and use this duplicate to generate the PDF, the images and tables do show.
Why is this happening? What am I doing wrong?

I attach four files:

- SPAIN_GENERATE_TEST.docx (generated file from code)
- SPAIN_GENERATE_TEST.pdf (pdf generated from previous file)
- SPAIN_GENERATE_TEST_2.docx (duplicate created using Word 2010)
- SPAIN_GENERATE_TEST_2.pdf (pdf generated from duplicated file)

I unzipped both .docx files and see some differences between the generated docx and the duplicate. First, the content of [Content_Types].xml differs: the generated file includes references to the image file that the duplicate doesn't, and the image name is changed. I think these differences are related to the fact that the duplicate can be converted to PDF with all its parts, but I don't know if there is a way to reproduce this behaviour using docx4j.
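
A comparison like this can also be scripted rather than done by hand. Below is a minimal sketch, assuming only the JDK's java.util.zip and no docx4j; the class name DocxParts and the method listParts are hypothetical names chosen for illustration. It lists every part in a .docx package with its uncompressed size, so the generated file and the duplicate can be compared part by part.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of docx4j: a .docx is a zip package,
// so its parts can be listed with the plain JDK zip classes.
final class DocxParts {

    /** Maps each part name in the package to its uncompressed size in bytes. */
    static Map<String, Integer> listParts(InputStream docx) throws IOException {
        Map<String, Integer> parts = new TreeMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // read() returns -1 at the end of the current entry
                int size = 0;
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    size += n;
                }
                parts.put(entry.getName(), size);
            }
        }
        return parts;
    }
}
```

Calling listParts on a FileInputStream for each of the two files and comparing the resulting maps shows which parts were added, renamed or resized. It does not say why they differ; for that the XML of the differing parts still has to be read.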

Can anyone help me?

Thanks a lot for your help

And thanks a lot for your work.

Re: Problem converting docx to pdf (missing parts)

Posted: Sat Dec 17, 2016 7:40 pm
by jason
You have your drawing in a run in a run, which is wrong:

Code: Select all
            <w:r>
              <w:r>
                <w:drawing>
 


Word evidently fixes that.

For a table cell, you have:

Code: Select all
       <w:tc>
          <w:tcPr>
            <w:tcW w:w="6804" w:type="dxa"/>
          </w:tcPr>
          <w:p>
            <w:pPr>
              <w:rPr>
                <w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi"/>
                <w:color w:val="595959" w:themeColor="text1" w:themeTint="A6"/>
                <w:sz w:val="20"/>
              </w:rPr>
            </w:pPr>
            <w:r>
              <w:rPr>
                <w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi"/>
                <w:color w:val="595959" w:themeColor="text1" w:themeTint="A6"/>
                <w:sz w:val="20"/>
              </w:rPr>
              <w:p>
                <w:pPr>
                  <w:numPr>
                    <w:numId w:val="1"/>
                  </w:numPr>
                </w:pPr>
                <w:r>
                  <w:r>
                    <w:rPr>
                      <w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi"/>
                      <w:color w:val="595959" w:themeColor="text1" w:themeTint="A6"/>
                      <w:sz w:val="20"/>
                    </w:rPr>
                    <w:t>ETSI</w:t>
                  </w:r>
                </w:r>
              </w:p>
            </w:r>
          </w:p>
        </w:tc>
 


i.e. a w:p inside a w:r. That's wrong as well. You've done that in a number of places.

I stopped looking at line 1422.

If you correct the above and continue to have problems, please feel free to post again.
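
For anyone who wants to check their own output for this kind of nesting before converting, here is a minimal sketch. It is an illustration only: it uses the JDK's DOM parser rather than docx4j, and the class name NestingCheck and method findBadNesting are made up for this example. It takes the text of word/document.xml and reports every w:r, w:p or w:tbl whose direct parent is a w:r. The w:tbl case is an assumption added here because the posted code adds the table to a run's content; it is not one of the cases quoted above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical diagnostic, not part of docx4j.
final class NestingCheck {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one message per w:r, w:p or w:tbl that sits directly inside a w:r. */
    static List<String> findBadNesting(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so namespace URI and local name are set
        Document doc = dbf.newDocumentBuilder().parse(
            new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        List<String> problems = new ArrayList<>();
        walk(doc.getDocumentElement(), problems);
        return problems;
    }

    // Depth-first walk in document order, checking each element against its parent.
    private static void walk(Element el, List<String> problems) {
        if (isW(el.getParentNode(), "r")
                && (isW(el, "r") || isW(el, "p") || isW(el, "tbl"))) {
            problems.add("w:" + el.getLocalName() + " inside w:r");
        }
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                walk((Element) c, problems);
            }
        }
    }

    // True if the node is an element in the WordprocessingML namespace with this local name.
    private static boolean isW(Node n, String localName) {
        return n != null
            && n.getNodeType() == Node.ELEMENT_NODE
            && W_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }
}
```

Only the direct parent is checked, so a paragraph inside a text box that is itself anchored in a run is not flagged. An empty result does not prove the document is valid; it only rules out the nesting shown above.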

Re: Problem converting docx to pdf (missing parts)

Posted: Sat Dec 17, 2016 7:46 pm
by jason
https://github.com/plutext/docx4j/commi ... 9b601f44e5 guards against this.

It will be in the next nightly build and v3.3.2 (when released).

Re: Problem converting docx to pdf (missing parts)

Posted: Wed Dec 21, 2016 1:44 am
by Lostman
Hi Jason,

Your help has been extremely useful, thanks. I solved all my problems with missing parts. But now I am facing a new problem: I want to generate a PDF that prevents copying and editing.

So far I have gone through the examples and found how to protect a generated Word document from editing (I implemented it as a test), but not from copying (you can select all the content and copy it into a new file). And when I convert it to PDF I can do the same thing: select the whole document and copy it into an empty one.

I haven't found any clue about this. Do you know how to prevent copying from a generated PDF?

Really, thanks for your help.

Re: Problem converting docx to pdf (missing parts)

Posted: Sat Dec 24, 2016 5:50 pm
by jason