
MainDocumentPart getXML() is coming as null

Posted: Thu Apr 02, 2020 3:43 am
by arulface
Hi,

documentPart.getXML() returns null for some docx files. However, if I open such a file in Word 2010 and save it again, getXML() returns the content of the file as expected.
I have attached the docx file. Please advise how to handle the case where documentPart.getXML() returns null.

Below is the code:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new FileInputStream(fileNameWithPath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
String xml = documentPart.getXML(); // null for the affected files

Attachment:
CORNEA_CORNEA-D-17-01113_tud_ACE.docx
Not able to get the getXML() data for this file
(21.68 KiB) Downloaded 228 times


Appreciate your timely help to resolve this issue.

Regards,
Arul

Re: MainDocumentPart getXML() is coming as null

Posted: Thu Apr 02, 2020 3:17 pm
by jason
You may have noticed a stack trace?

Code:
For input string: "id_0001"
java.lang.NumberFormatException: For input string: "id_0001"
   at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)
   at java.base/java.lang.Integer.parseInt(Integer.java:658)
   at java.base/java.math.BigInteger.<init>(BigInteger.java:535)
   at java.base/java.math.BigInteger.<init>(BigInteger.java:673)
   at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:61)
   at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:766)
   :
   at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:356)


Looking at your main document part, it contains:

<w:bookmarkStart xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:tnq="http://www.tnq.co.in/ace/" w:id="id_0001" w:name="ACEDirectChange_0001"/>

<w:bookmarkEnd xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:tnq="http://www.tnq.co.in/ace/" w:id="id_0001"/>


But in CTMarkup, id is a BigInteger, so your w:id value needs to be an integer (for example w:id="1"), not a string such as "id_0001".
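
To find which w:id values would trip up that BigInteger conversion, something like the following might help. It is only a sketch using the JDK's XML APIs, not docx4j code. The class name WIdChecker, the method findBadIds and the short list of element names it checks are assumptions made for illustration; extend the list to whatever markup your documents actually contain.

```java
import java.io.StringReader;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class WIdChecker {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: only these elements are checked; add others as needed.
    static final Set<String> TARGETS = new HashSet<>(
            Arrays.asList("bookmarkStart", "bookmarkEnd", "ins", "del"));

    /** Lists every targeted element whose w:id cannot be parsed as a BigInteger. */
    static List<String> findBadIds(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        List<String> bad = new ArrayList<>();
        NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!TARGETS.contains(e.getLocalName()) || !e.hasAttributeNS(W_NS, "id")) {
                continue;
            }
            String id = e.getAttributeNS(W_NS, "id");
            try {
                new BigInteger(id.trim()); // same kind of conversion that fails in the stack trace
            } catch (NumberFormatException ex) {
                bad.add(e.getTagName() + " w:id=\"" + id + "\"");
            }
        }
        return bad;
    }
}
```

Pass it the contents of the main document part as a string; an empty list means none of the checked elements has a non-integer w:id.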

Re: MainDocumentPart getXML() is coming as null

Posted: Fri Apr 03, 2020 1:27 am
by arulface
Thanks Jason! Sorry, that file was corrupted.

I have attached two files for your perusal.
File 1: JCCT_01113_tud_ACE.docx - getXML() returns null for this file.
File 2: JCCT_01113_OpenedIn_2010WordAnd Saved.docx - this is File 1 opened in Word 2010, with a single space added at the end of the document, and saved. For this file, getXML() returns the document content.

Requesting your kind help.

Once again, sorry for attaching the wrong file last time.

JCCT_01113_tud_ACE.docx
File 1: Not getting values from document i.e getXML() is null
(23.77 KiB) Downloaded 178 times

Re: MainDocumentPart getXML() is coming as null

Posted: Fri Apr 03, 2020 9:52 am
by jason
You should turn logging on:

Code:
09:27:41.637 [main] ERROR o.d.openpackaging.parts.JaxbXmlPart 220 - Problem with part /word/document.xml
org.docx4j.openpackaging.exceptions.Docx4JException: Problem with part /word/document.xml
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:201)
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getJaxbElement(JaxbXmlPart.java:218)
   at org.docx4j.samples.PartsList.printInfo(PartsList.java:125)
   at org.docx4j.samples.PartsList.traverseRelationships(PartsList.java:212)
   at org.docx4j.samples.PartsList.handlePkg(PartsList.java:86)
   at org.docx4j.samples.PartsList.main(PartsList.java:69)
Caused by: javax.xml.bind.JAXBException: Zero length BigInteger
   at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:646)
   at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:353)
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getContents(JaxbXmlPart.java:198)
   ... 5 common frames omitted
Caused by: java.lang.NumberFormatException: Zero length BigInteger
   at java.base/java.math.BigInteger.<init>(BigInteger.java:485)
   at java.base/java.math.BigInteger.<init>(BigInteger.java:673)
   at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:61)
   at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:766)


So where is this Zero length BigInteger?

If you compare your broken docx to the working one in the OpenXML Productivity Tool, you'll see:

<w:ins w:id="" w:author="" track_off="">


I'm guessing you need to populate that w:id.
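
Since getXML() gives you nothing for this file, another way to look at the raw part (without the Productivity Tool) is to read it straight out of the zip. A minimal sketch, assuming the main part lives at word/document.xml (as the log above reports) and is UTF-8 encoded; RawPartReader and readPart are hypothetical names, not docx4j API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a docx is a zip archive, so a part can be read without JAXB.
class RawPartReader {

    /** Returns the named zip entry as text, e.g. partName = "word/document.xml". */
    static String readPart(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("No such part: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Assumption: the part is UTF-8, which is the usual case.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Searching the returned string for w:id="" should then point you at the offending elements.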

Re: MainDocumentPart getXML() is coming as null

Posted: Mon Apr 06, 2020 1:21 am
by arulface
Thanks Jason! I understand that w:id must be a BigInteger and must not be empty. However, the same document opens in Word 2010 without any issue, which suggests the document is not corrupted.

I don't understand why docx4j cannot process a file that Word 2010 can open. I have a few questions and would appreciate your clarification.
Is there a way to find those missing attributes, set values for them, and then process the document using docx4j?
or
Is there another way to process such a document and read only the necessary values, skipping w:id?

Note: Currently, I open the document in Word 2010 and save it.

Re: MainDocumentPart getXML() is coming as null

Posted: Mon Apr 06, 2020 11:57 am
by jason
docx4j is based on the OpenXML xsd schema, so generally speaking, it expects input conformant with that.

An exception is https://github.com/plutext/docx4j/blob/ ... essor.xslt

That's a mechanism for massaging the input before JAXB unmarshalling (actually, it is used to retry unmarshalling after a failure). You can override this with your own XSLT if you want: https://github.com/plutext/docx4j/blob/ ... rties#L152

Word, on the other hand, is lax in some respects and very brittle in others. As you've noticed, it handles some discrepancies silently and without user feedback. Other issues it will correct or attempt to correct. Some problems cause it to fail to open a document, without much info to go on.

So in summary, just because "it opens in Word" doesn't mean the document is "good". The same can be said for docx4j. But if it opens in docx4j and in Word, you can be more confident the document is good :-)

Ideally, you should fix problems like the one we've identified in your input documents. If you can't do that, the docx4j.jaxb.JaxbValidationEventHandler mechanism discussed above might help.
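
If you go the "fix the input" route, a pre-processing step could look roughly like this. It is a sketch under assumptions, not a docx4j feature: WIdRepair and repair are hypothetical names, only the listed elements are touched, and every invalid w:id is replaced with a number above the largest valid one found. A repeated non-empty bad value (such as a bookmarkStart/bookmarkEnd pair) gets the same replacement each time; each empty value gets its own.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class WIdRepair {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: only these elements are repaired; add others as needed.
    static final Set<String> TARGETS = new HashSet<>(
            Arrays.asList("bookmarkStart", "bookmarkEnd", "ins", "del"));

    static boolean isInteger(String s) {
        try {
            new BigInteger(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Returns the XML with invalid w:id values on targeted elements replaced by integers. */
    static String repair(String documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        // Collect the w:id attributes of the targeted elements, in document order.
        List<Attr> ids = new ArrayList<>();
        NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            Attr id = e.getAttributeNodeNS(W_NS, "id");
            if (id != null && TARGETS.contains(e.getLocalName())) {
                ids.add(id);
            }
        }

        // Start numbering above the largest valid id already in use.
        BigInteger next = BigInteger.ZERO;
        for (Attr id : ids) {
            if (isInteger(id.getValue())) {
                next = next.max(new BigInteger(id.getValue().trim()));
            }
        }

        Map<String, String> remapped = new HashMap<>();
        for (Attr id : ids) {
            String old = id.getValue();
            if (isInteger(old)) {
                continue;
            }
            boolean blank = old.trim().isEmpty();
            String fresh = blank ? null : remapped.get(old);
            if (fresh == null) {
                next = next.add(BigInteger.ONE);
                fresh = next.toString();
                if (!blank) {
                    remapped.put(old, fresh);
                }
            }
            id.setValue(fresh);
        }

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

The repaired XML still has to be written back into the docx (or handed to docx4j some other way) before loading; that part is left out here. Whether the invented ids are acceptable depends on what your downstream processing does with bookmarks and tracked changes.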