problem with document created by Google Docs

PostPosted: Fri Feb 21, 2014 9:47 pm
by sfmorais
Hi,

When I run this code:

Code: Select all
WordprocessingMLPackage wordprocessingMLPackage = WordprocessingMLPackage.load(new File("c:\\document.docx"));
System.out.println(XmlUtils.marshaltoString(wordprocessingMLPackage.getMainDocumentPart().getJaxbElement(), true, true));


it throws an exception:

Code: Select all
INFO [org.docx4j.utils.XPathFactoryUtil:22] - xpath implementation: org.apache.xpath.jaxp.XPathFactoryImpl
INFO [org.docx4j.openpackaging.io3.Load3:180] - package read;  elapsed time: 4635 ms
INFO [org.docx4j.openpackaging.parts.JaxbXmlPart:129] - Lazily unmarshalling /word/document.xml
INFO [org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware:299] - For org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart, unmarshall via binder
java.lang.NumberFormatException: For input string: "10206.0"
   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
   at java.lang.Integer.parseInt(Integer.java:458)
   at java.math.BigInteger.<init>(BigInteger.java:316)
   at java.math.BigInteger.<init>(BigInteger.java:451)
   at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:88)


I looked at document.xml, and it does indeed contain:

Code: Select all
... <w:tblW w:w="10206.0" w:type="dxa"/> ...


So the cause of the error seems to be the 10206.0 value of the w:w attribute on <w:tblW>: docx4j expects an integer, not a decimal.
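
The stack trace shows the JAXB runtime passing the attribute text to the BigInteger constructor, which rejects any string containing a decimal point. A minimal sketch that reproduces this in isolation (the class and method names are hypothetical, not part of docx4j or JAXB), together with a lenient parse via BigDecimal for comparison:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical demo class; only JDK classes are used.
final class TblWidthParseDemo {

    // True if the text parses as a BigInteger, which is what the
    // stack trace shows being attempted for the w:w attribute.
    static boolean parsesAsInteger(String text) {
        try {
            new BigInteger(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Lenient alternative: parse as a decimal, then drop the fraction.
    static BigInteger truncate(String text) {
        return new BigDecimal(text).toBigInteger();
    }

    public static void main(String[] args) {
        System.out.println(parsesAsInteger("10206"));
        System.out.println(parsesAsInteger("10206.0"));
        System.out.println(truncate("10206.0"));
    }
}
```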

I also tested with docx4j-3.0.1, but the problem is the same.

Note: document.docx was created in Google Docs and downloaded as docx.

Attached is the document.docx


Many Thanks

Regards
Sérgio

Re: problem with document created by Google Docs

PostPosted: Fri Feb 21, 2014 11:25 pm
by jason
A quick look suggests Google Docs should be writing an integer.

Code: Select all
        <xsd:simpleType name="ST_DecimalNumber">
                <xsd:annotation>
                        <xsd:documentation>Decimal Number Value</xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:integer"/>
        </xsd:simpleType>


See http://webapp.docx4java.org/OnlineDemo/ ... blW_2.html for @w:

The possible values for this attribute are defined by the ST_DecimalNumber simple type


http://webapp.docx4java.org/OnlineDemo/ ... umber.html

This simple type specifies that its contents will contain a whole decimal number (positive or negative)


Perhaps you could raise it with Google at https://productforums.google.com/forum/ ... ories/docs

Re: problem with document created by Google Docs

PostPosted: Mon Feb 24, 2014 8:50 pm
by sfmorais
Thanks, Jason

If I understand correctly, you are saying that the docx4j API itself can do nothing to solve this. The problem is that the conversion done by Google Docs does not respect the XSD, isn't it?

As you suggested, I posted a topic in the Google forum, but it may be difficult (or take a long time) to get a reply, given the number of posts every day.

How can I work around this, in your opinion?

Thanks
Sérgio

Re: problem with document created by Google Docs

PostPosted: Mon Feb 24, 2014 9:16 pm
by jason
Does it later say "encountered unexpected content; pre-processing"?

It should. If so, you could modify "org/docx4j/jaxb/mc-preprocessor.xslt" to replace the decimal with an integer.

If you are comfortable with XSLT, I'd be happy to accept an appropriate template rule as a contrib. If not, let me know...

Re: problem with document created by Google Docs

PostPosted: Mon Feb 24, 2014 10:13 pm
by sfmorais
Jason,

It does not say "encountered unexpected content; pre-processing".

The full stack trace is:

Code: Select all
...
INFO [org.docx4j.openpackaging.contenttype.ContentTypeManager:802] - Detected WordProcessingML package
INFO [org.docx4j.openpackaging.io3.Load3:161] - Instantiated package of type org.docx4j.openpackaging.packages.WordprocessingMLPackage
INFO [org.docx4j.utils.XPathFactoryUtil:22] - xpath implementation: org.apache.xpath.jaxp.XPathFactoryImpl
INFO [org.docx4j.openpackaging.io3.Load3:180] - package read;  elapsed time: 4504 ms
INFO [org.docx4j.openpackaging.parts.JaxbXmlPart:129] - Lazily unmarshalling /word/document.xml
INFO [org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware:299] - For org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart, unmarshall via binder
java.lang.NumberFormatException: For input string: "10206.0"
   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
   at java.lang.Integer.parseInt(Integer.java:458)
   at java.math.BigInteger.<init>(BigInteger.java:316)
   at java.math.BigInteger.<init>(BigInteger.java:451)
   at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:88)
   at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:733)
   at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$22.parse(RuntimeBuiltinLeafInfoImpl.java:736)
   at com.sun.xml.bind.v2.runtime.reflect.TransducedAccessor$CompositeTransducedAccessorImpl.parse(TransducedAccessor.java:241)
   at com.sun.xml.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:202)
   at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:449)
   at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:427)
   at com.sun.xml.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:71)
   at com.sun.xml.bind.v2.runtime.unmarshaller.SAXConnector.startElement(SAXConnector.java:137)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:240)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:277)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:246)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:277)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:246)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:277)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:246)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:277)
   at com.sun.xml.bind.unmarshaller.DOMScanner.visit(DOMScanner.java:246)
   at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:123)
   at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:106)
   at com.sun.xml.bind.unmarshaller.DOMScanner.scan(DOMScanner.java:99)
   at com.sun.xml.bind.v2.runtime.BinderImpl.associativeUnmarshal(BinderImpl.java:156)
   at com.sun.xml.bind.v2.runtime.BinderImpl.unmarshal(BinderImpl.java:127)
   at org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal(JaxbXmlPartXPathAware.java:316)
   at org.docx4j.openpackaging.parts.JaxbXmlPart.getJaxbElement(JaxbXmlPart.java:130)
   at exp.siga.aaa.docx4j.testes.Main.main(Main.java:36)


So I conclude that modifying mc-preprocessor.xslt cannot solve my problem. Is that right?

Thanks

Re: problem with document created by Google Docs

PostPosted: Mon Feb 24, 2014 11:28 pm
by jason
Hi Sérgio,

Here is what happens in org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware.unmarshal

Code: Select all
                        try {
                                jaxbElement =  (E) XmlUtils.unwrap(binder.unmarshal( doc ));
                        } catch (UnmarshalException ue) {
                                        // try the mc-preprocessor.xslt stuff


In this case the exception is NumberFormatException (as opposed to UnmarshalException) - interesting.

We can catch that and do the same sort of thing. It may be expedient to do it in mc-preprocessor.xslt; if we don't, then the exception handled there could still occur (we'd need to experiment to see what order they happen in).

Re: problem with document created by Google Docs

PostPosted: Tue Feb 25, 2014 2:11 am
by sfmorais
Thanks in advance, Jason.

I think I must do something like this:

- Create in my project a JaxbXmlPartXPathAware class that overrides the one in the jar, with:

Code: Select all
         try {
            jaxbElement = (E) XmlUtils.unwrap(binder.unmarshal(doc));
               // Unwrap, so we have eg CTEndnotes, not JAXBElement
         }
         catch (NumberFormatException numberFormatException) {
            // New: run the document through mc-preprocessor.xslt, then retry
            DOMResult result = new DOMResult();
            Templates mcPreprocessorXslt = JaxbValidationEventHandler.getMcPreprocessor();

            XmlUtils.transform(doc, mcPreprocessorXslt, null, result);
            doc = (org.w3c.dom.Document) result.getNode();
            try {
               jaxbElement = (E) XmlUtils.unwrap(binder.unmarshal(doc));
            }
            catch (ClassCastException e) {
               // Fall back to a plain unmarshaller if the binder fails
               Unmarshaller u = jc.createUnmarshaller();
               jaxbElement = (E) XmlUtils.unwrap(u.unmarshal(doc));
            }
         }
         catch (UnmarshalException ue) {
            ... // no changes
         }


Now my difficulty is understanding mc-preprocessor.xslt and adapting it to my issue.
Am I on the right track?


My other idea for working around the problem is to get the content of 'document.xml' as a plain string (like XmlUtils.marshaltoString() does, but without XSD validation). Then, with a regex pattern, I could replace the w:w= values with integers (truncating the decimal places). Is there any way to get the string content?
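
A rough sketch of that second idea, using only the JDK and nothing from docx4j. The class and method names are hypothetical. It assumes the main part is stored at word/document.xml inside the docx zip, that it is UTF-8 encoded, and that attributes are written with double quotes. readAllBytes needs Java 9 or later. Writing the fixed XML back into the package is not covered here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; not part of docx4j.
final class WidthAttributeFixer {

    // Matches w:w="<digits>.<digits>" (optionally negative), capturing
    // the attribute prefix and the integer part.
    private static final Pattern DECIMAL_W =
            Pattern.compile("(\\sw:w=\")(-?\\d+)\\.\\d+\"");

    // Reads the raw XML of the main document part, with no JAXB involved.
    // Assumption: the part is UTF-8 encoded.
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Drops the decimal places from every decimal w:w value;
    // values that are already integers are left untouched.
    static String truncateDecimalWidths(String xml) {
        return DECIMAL_W.matcher(xml).replaceAll("$1$2\"");
    }

    public static void main(String[] args) {
        String sample = "<w:tblW w:w=\"10206.0\" w:type=\"dxa\"/>";
        System.out.println(truncateDecimalWidths(sample));
    }
}
```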


Thanks

Re: problem with document created by Google Docs

PostPosted: Tue Feb 25, 2014 1:05 pm
by jason
Please see now https://github.com/plutext/docx4j/commi ... bd41be7bc3

I'll upload a nightly incorporating this later today.

I guess the question is: on what other elements/attributes does Google Docs make the same error? Please add to this thread if you find any others...

sfmorais wrote:My other way to around the problem is to get the content of 'document.xml' like string text


So far we've managed to avoid any optional string manipulation step prior to unmarshalling; that may need to change in the future.

Re: problem with document created by Google Docs

PostPosted: Tue Feb 25, 2014 8:18 pm
by jason

Re: problem with document created by Google Docs

PostPosted: Tue Feb 25, 2014 9:39 pm
by sfmorais
Many, many thanks, Jason.

Looking more closely at 'document.xml' (for the simple docx in my first post), I found the w:w= attribute on several other elements: <w:pgSz, <w:right, <w:bottom, <w:left, <w:top, <w:gridCol. More complex documents may contain others (I will tell you when I find them).

But some w:w= values are written as integers, for example on <w:pgSz. (I don't know what criteria Google uses to write some w:w= values as integers and others with decimal places.)

The best approach would be to accept both forms for every w:w (integer and decimal), and to round or truncate when the value is a decimal.
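
To illustrate what I mean, here is a sketch of such a rule as an XSLT 1.0 template, run through the JDK's built-in transformer. This is only an illustration under my own assumptions, not the rule from the docx4j commit; the class name is hypothetical. It copies everything unchanged, except that any w:w attribute containing a '.' is rewritten with the text before the '.' (so a value like ".5" would become empty, which a real rule would need to handle).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not the docx4j mc-preprocessor.
final class WidthXsltDemo {

    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
        // Identity rule: copy every node and attribute as-is.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Override for any w:w attribute with a decimal point: truncate it.
        + "<xsl:template match=\"@w:w[contains(., '.')]\">"
        + "<xsl:attribute name='w:w'>"
        + "<xsl:value-of select=\"substring-before(., '.')\"/>"
        + "</xsl:attribute>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result.
    static String fix(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        System.out.println(fix("<w:tbl xmlns:w=\"" + ns + "\">"
                + "<w:tblW w:w=\"10206.0\" w:type=\"dxa\"/></w:tbl>"));
    }
}
```

Because the rule matches the attribute rather than a particular element, it applies to w:w wherever it appears.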

Your latest jar fixes the first occurrence of w:w (<w:tblW w:w="10206.0"), but now the same exception appears for a "100.0" value.

Attached is the document.xml

Many Thanks

Regards,
Sergio

Re: problem with document created by Google Docs

PostPosted: Wed Feb 26, 2014 11:20 am
by jason
OK please see now https://github.com/plutext/docx4j/commi ... e00d5ca257

By the way, could you please add a link to your post in the Google Docs forum, to make it easier to monitor for any replies?

Re: problem with document created by Google Docs

PostPosted: Thu Feb 27, 2014 1:34 am
by sfmorais
Hi Jason

I did some tests and the problem is solved.
Your XSLT solution really is the best way to handle all w:w attributes on any node.

The related link in Google Docs forum is
https://productforums.google.com/forum/#!msg/docs/fcFovKihMtw/Ii7_mf1bO9wJ

Many thanks for the nice solution.

Regards,
Sérgio