Unreadable content error after generating word from template

Posted: Mon May 17, 2021 5:17 pm
by Shreya1234
Hi

I am trying to generate a Word document from a template with docx4j. After replacing content through docx4j, opening the generated document in Word gives this warning: "Word found unreadable content in .docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes".

The error only occurs in one particular case: when converting HTML content taken from the internet that contains an image. Below is the HTML content I used:

Code:
<p style="margin-top:0.5em;margin-bottom:0.5em;color:#202122;font-family:sans-serif;font-size:14px;background-color:#ffffff;">ter the border was defined so to make the northern portion of the territory concerned part of the French mandated territory that became Lebanon, many Zionist geographers &mdash; and Israeli geographers in the state's early years &mdash; continued to speak of "The Upper Galilee" as being "the northern sub-area of the&nbsp;<a href="https://en.wikipedia.org/wiki/Galilee" title="Galilee" style="text-decoration-line:none;color:#0645ad;background:none;">Galilee</a>&nbsp;region of&nbsp;<a href="https://en.wikipedia.org/wiki/Israel" title="Israel" style="text-decoration-line:none;color:#0645ad;background:none;">Israel</a>&nbsp;and&nbsp;<a href="https://en.wikipedia.org/wiki/Lebanon" title="Lebanon" style="text-decoration-line:none;color:#0645ad;background:none;">Lebanon</a>".</p><p style="margin-top:0.5em;margin-bottom:0.5em;color:#202122;font-family:sans-serif;font-size:14px;background-color:#ffffff;"><img src="" /></p><p style="margin-top:0.5em;margin-bottom:0.5em;color:#202122;font-family:sans-serif;font-size:14px;background-color:#ffffff;">Under this definition, "The Upper Galilee" covers an area spreading over 1,500&nbsp;km&sup2;, about 700 in Israel and the rest in Lebanon. 
This included the highland region of&nbsp;<a href="https://en.wikipedia.org/wiki/Belad_Bechara" title="Belad Bechara" style="text-decoration-line:none;color:#0645ad;background:none;">Belad Bechara</a>&nbsp;in&nbsp;<a href="https://en.wikipedia.org/wiki/Jabal_Amel" title="Jabal Amel" style="text-decoration-line:none;color:#0645ad;background:none;">Jabal Amel</a>&nbsp;located in&nbsp;<a href="https://en.wikipedia.org/wiki/South_Lebanon" class="mw-redirect" title="South Lebanon" style="text-decoration-line:none;color:#0645ad;background:none;">South Lebanon</a>,<sup id="cite_ref-4" class="reference" style="line-height:1;unicode-bidi:isolate;white-space:nowrap;font-size:11.2px;"><a href="https://en.wikipedia.org/wiki/Upper_Galilee#cite_note-4" style="text-decoration-line:none;color:#0645ad;background:none;">[4]</a></sup>&nbsp;</p>

I am using Java 8 and the following docx4j libraries:

Code:
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>8.2.9</version>
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-ImportXHTML</artifactId>
    <version>8.2.1</version>
</dependency>

I used the code below to replace the placeholder with the HTML content in the Word document:

Code:
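// Imports this method relies on (docx4j, docx4j-ImportXHTML, jsoup).
// getAllElementFromObject, log, ApiRuntimeException, TemplateServiceException and
// HttpStatus are not part of docx4j and are not shown here.
import java.util.List;
import org.docx4j.convert.in.xhtml.XHTMLImporter;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.P;
import org.docx4j.wml.R;
import org.docx4j.wml.Tc;
import org.docx4j.wml.Text;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

public static void replaceCustomContent(WordprocessingMLPackage wordMLPackage, MainDocumentPart documentPart,
         String customContentFieldName, String replacedValue)  {

      // Collect every text node in the main document part.
      List<Object> textElements = getAllElementFromObject(documentPart, Text.class);

      for (Object textElement : textElements) {
         Text text = (Text) textElement;
         if (text.getValue().contains(customContentFieldName)) {
            try {
               // Walk up from the placeholder text to its run, paragraph and table cell.
               // These casts assume the placeholder always sits inside a table cell.
               R run = (R) (text.getParent());
               P p = (P) (run.getParent());
               Tc tc = (Tc) p.getParent();
               int cellIndex = tc.getContent().indexOf(p);
               if (cellIndex != -1) {
                  // Remove the placeholder paragraph; the converted content takes its place.
                  tc.getContent().remove(cellIndex);
                  XHTMLImporter xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
                  replacedValue = "<html><head></head><body>" + replacedValue + "</body></html>";
                  // Re-serialize through jsoup as XML so the importer receives well-formed XHTML.
                  final Document document = Jsoup.parse(replacedValue);
                  document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
                  document.outputSettings().escapeMode(EscapeMode.xhtml);
                  replacedValue = document.html();
                  List<Object> objects = xHTMLImporter.convert(replacedValue, null);
                  // Insert the converted objects in order, starting at the placeholder's position.
                  for (Object object : objects) {
                     tc.getContent().add(cellIndex, object);
                     cellIndex++;
                  }
               }
            } catch (Docx4JException e) {
               log.error("Docx4j exception while converting template");
               throw new ApiRuntimeException(TemplateServiceException.DOCX4J_TEMPLATE_CONERSION_EXCEPTION,
                     new Object[] {}, HttpStatus.INTERNAL_SERVER_ERROR.value(), e);
            }
            // Only the first matching placeholder is replaced.
            break;
         }
      }
   }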

Re: Unreadable content error after generating word from temp

Posted: Tue May 18, 2021 7:06 am
by jason
Can you attach the resulting docx?

Re: Unreadable content error after generating word from temp

Posted: Tue May 18, 2021 7:18 pm
by Shreya1234
I have attached the generated document

Re: Unreadable content error after generating word from temp

Posted: Fri May 21, 2021 7:15 pm
by Shreya1234
I analyzed this further and found that the issue arises when the HTML content contains a hyperlink. When I remove the hyperlinks, the docx opens fine without any error or warning.

One more thing: I have merged two different docx files here. Pages 1 and 2 come from one document, and pages 3 and 4 from another. During the merge, all the hyperlink relationships become header and footer references, which is strange to me. If I generate only the second document, without merging, it works fine.
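
Hyperlinks that end up pointing at header and footer relationships can be confirmed by looking inside the merged docx, which is a zip archive: each w:hyperlink in word/document.xml carries an r:id, and that id should resolve to a relationship of type .../relationships/hyperlink in word/_rels/document.xml.rels. A minimal sketch of such a check, using only the JDK. The class and method names are hypothetical, it looks at the main document part only, and it assumes both parts exist in the package:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of docx4j.
class HyperlinkRelCheck {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static final String HYPERLINK_TYPE = R_NS + "/hyperlink";

    static Document parse(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML part", e);
        }
    }

    /** Returns one message per w:hyperlink whose r:id is missing from the rels or has the wrong type. */
    static List<String> findBadHyperlinks(InputStream documentXml, InputStream relsXml) {
        // Index the relationship part: Id -> Type.
        Map<String, String> typeById = new HashMap<>();
        NodeList rels = parse(relsXml).getElementsByTagNameNS("*", "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            typeById.put(rel.getAttribute("Id"), rel.getAttribute("Type"));
        }

        List<String> problems = new ArrayList<>();
        NodeList links = parse(documentXml).getElementsByTagNameNS(W_NS, "hyperlink");
        for (int i = 0; i < links.getLength(); i++) {
            String id = ((Element) links.item(i)).getAttributeNS(R_NS, "id");
            if (id.isEmpty()) {
                continue; // internal link (w:anchor only), no relationship involved
            }
            String type = typeById.get(id);
            if (type == null) {
                problems.add(id + ": no relationship with this id");
            } else if (!HYPERLINK_TYPE.equals(type)) {
                problems.add(id + ": expected a hyperlink relationship but found " + type);
            }
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java HyperlinkRelCheck <file.docx>");
            return;
        }
        try (ZipFile zip = new ZipFile(args[0])) {
            // Assumes both entries are present; getEntry returns null otherwise.
            InputStream doc = zip.getInputStream(zip.getEntry("word/document.xml"));
            InputStream rels = zip.getInputStream(zip.getEntry("word/_rels/document.xml.rels"));
            findBadHyperlinks(doc, rels).forEach(System.out::println);
        }
    }
}
```

Running it against the merged file and against the second document on its own should show whether the mismatch only appears after the merge.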

Re: Unreadable content error after generating word from temp

Posted: Wed May 26, 2021 7:46 pm
by jason
So are you able to produce a minimal bit of html containing a hyperlink which when converted to docx exhibits the issue? This will make the issue easy to identify and fix.
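
A minimal input along these lines could be a single paragraph with one link. The sketch below, using only the JDK, builds such a fragment and checks that it parses as well-formed XML before it is handed to the importer. The class and method names are hypothetical, and the URL is borrowed from the HTML in the first post. The raw HTML there also uses &nbsp; and &mdash;, which are not predefined XML entities, so a string like that fails the same check unless it is first re-serialized (as the jsoup step in the original code does).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper for building a small test case; not part of docx4j.
class MinimalHyperlinkCase {
    /** One paragraph with one link, wrapped the same way as in the original code. */
    static String minimalXhtml() {
        return "<html><head></head><body>"
             + "<p>See <a href=\"https://en.wikipedia.org/wiki/Galilee\">Galilee</a>.</p>"
             + "</body></html>";
    }

    /** True if the string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false; // the default parser also reports the problem on stderr
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(minimalXhtml()));    // well-formed
        System.out.println(isWellFormed("<p>a&nbsp;b</p>")); // undeclared entity, not well-formed
    }
}
```

If the one-link fragment converts cleanly on its own, the next step would be to repeat the conversion inside the merged document to see whether the problem only appears there.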

When you "merge" your 2 docx files, how are you doing that? If you are doing this with your own code, you'll need to manage the relIds.
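
To illustrate what managing the relIds involves: relationship ids are only unique within one part's relationship list, so rId1 in the second document can mean something different from rId1 in the first. When content is appended, each relationship from the source needs a fresh id in the target, and every reference in the copied content has to be rewritten to that new id. The sketch below models only this bookkeeping with plain JDK collections. Rel, Part and append are hypothetical stand-ins, not docx4j classes or methods:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of relId bookkeeping during a merge; not docx4j API.
class RelIdMerge {
    /** Stand-in for one relationship: its type URI and its target. */
    static final class Rel {
        final String type;
        final String target;
        Rel(String type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    /** Stand-in for a document part: its relationships plus the ids its content refers to. */
    static final class Part {
        final Map<String, Rel> rels = new LinkedHashMap<>();
        final List<String> bodyRefs = new ArrayList<>();
    }

    /** Appends source to target, giving every source relationship an id that is unused in target. */
    static void append(Part target, Part source) {
        Map<String, String> renamed = new LinkedHashMap<>();
        int next = 1;
        for (Map.Entry<String, Rel> entry : source.rels.entrySet()) {
            while (target.rels.containsKey("rId" + next)) {
                next++;
            }
            String newId = "rId" + next;
            target.rels.put(newId, entry.getValue());
            renamed.put(entry.getKey(), newId);
        }
        // Rewrite each reference in the copied content to the id it now has in target.
        for (String ref : source.bodyRefs) {
            String newId = renamed.get(ref);
            if (newId == null) {
                throw new IllegalArgumentException("dangling reference: " + ref);
            }
            target.bodyRefs.add(newId);
        }
    }
}
```

If the references were copied without this renaming, a hyperlink's rId1 from the second document would resolve to whatever rId1 is in the first, for example a header, which would be consistent with the symptom described above.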