Content Loss while converting Docx to Html :

Posted: Thu Apr 15, 2010 9:32 pm
by AnbuChezhian
Hi Jason,
I have attached some documents that docx4j fails to convert; in some of them content is also lost. I later found that docx4j fails when a style is applied to a paragraph (in MS Word 2007 you apply one by clicking a style in the styles tab at the top right corner). The equivalent XML for such a style is <w:pStyle w:val="NoSpacing"/>. Many paragraph styles like this are not handled properly.

Can this problem be solved?

Regards,
Anbu Chezhian.S

Re: Content Loss while converting Docx to Html :

Posted: Sat Apr 17, 2010 12:51 am
by jason
Hello Anbu Chezhian.S

I've had a look at the documents you supplied, thanks.

I've made some quick fixes in svn.

Several issues remain:

- underline (should be easy to fix)
- wmf/emf images (see recent posts)
- field code handling
- improvements to hanging indentation, numbering
- a JAXB error

Some of these I may look at soon, others not. So you can either try fixing them yourself, or avail yourself of Plutext professional services.

Out of interest, how many documents were there in total in the corpus these came from?

And did you resolve your MathML issue? If so, please consider posting/contributing your solution. Thanks!

Details:-

coming_content_loss.docx
------------------------

style missing basedOn - FIXED


content_los1.docx
-----------------

ignore w:smartTagPr - FIXED


content_missing.docx
--------------------

Image missing (tc containing wp:inline)
Caused by: java.lang.ClassCastException: org.docx4j.openpackaging.parts.WordprocessingML.MetafileEmfPart cannot be cast to org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage
    at org.docx4j.model.images.WordXmlPictureE20.handleImageRel(WordXmlPictureE20.java:416)
    at org.docx4j.model.images.WordXmlPictureE20.createWordXmlPictureFromE20(WordXmlPictureE20.java:239)
    at org.docx4j.model.images.WordXmlPictureE20.createHtmlImgE20(WordXmlPictureE20.java:293)

Table - no cell borders

erroe_UC-006_Ppmts_Manage Encumbrances.docx
-------------------------------------------

Handled java.lang.NullPointerException
at org.docx4j.model.listnumbering.Emulator.getNumber(Emulator.java:179)

Image missing (presume emf/wmf). Is that the main problem?
This doc is 18 pages - I only looked at it briefly.

error.docx
----------

Seems ok, maybe because of fixes above?

error_Use Case.docx
-------------------

Seems ok

error_vita_dfp2.docx
--------------------

NOT IMPLEMENTED: support for fldChar
NOT IMPLEMENTED: support for instrText

I think one of the other community members is working on support for field codes.

Also, question marks are being generated somewhere from empty paragraphs which look like:

<w:p>
  <w:pPr>
    <w:pStyle w:val="Achievement"/>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="0"/>
    </w:numPr>
    <w:ind w:left="245"/>
    <w:jc w:val="left"/>
  </w:pPr>
</w:p>
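
One thing worth checking here (not confirmed as the cause of the question marks): in WordprocessingML, a w:numId with w:val="0" does not point to a numbering definition; it means numbering has been removed for that paragraph. A minimal sketch of such a check, using only the JDK's DOM parser rather than docx4j's own object model; the class and method names (NumIdCheck, isNumbered, isNumberedXml) are made up for illustration, and numbering inherited from the paragraph style is deliberately not considered:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
final class NumIdCheck {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** True only if the paragraph directly carries a w:numId whose w:val is not "0". */
    static boolean isNumbered(Element paragraph) {
        NodeList ids = paragraph.getElementsByTagNameNS(W_NS, "numId");
        if (ids.getLength() == 0) {
            return false; // no direct numbering properties at all
        }
        String val = ((Element) ids.item(0)).getAttributeNS(W_NS, "val");
        // numId 0 means "numbering removed", so it must not produce a label.
        return !val.isEmpty() && !"0".equals(val);
    }

    /** Convenience wrapper: parses a single w:p element from a string. */
    static boolean isNumberedXml(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document d = f.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
            return isNumbered(d.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String p = "<w:p xmlns:w='" + W_NS + "'><w:pPr><w:numPr>"
                 + "<w:ilvl w:val='0'/><w:numId w:val='0'/>"
                 + "</w:numPr></w:pPr></w:p>";
        System.out.println(isNumberedXml(p)); // prints false
    }
}
```

If the list-number emulation is being asked for a label for paragraphs like this, a guard of this kind before the lookup might be enough, but that is a guess until someone traces where the question marks come from.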

exception_not_converted.docx
----------------------------

JAXB error

16.04.2010 22:36:21 *INFO * Part: Constructing /word/document.xml (Part.java, line 132)
java.lang.NumberFormatException: For input string: "62259f"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.math.BigInteger.<init>(Unknown Source)
at java.math.BigInteger.<init>(Unknown Source)
at com.sun.xml.bind.DatatypeConverterImpl._parseInteger(DatatypeConverterImpl.java:72)
at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$21.parse(RuntimeBuiltinLeafInfoImpl.java:674)

Needs to be looked at.
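
For anyone who wants to dig into this one: the trace shows JAXB handing the string "62259f" to java.math.BigInteger, which only accepts decimal digits by default. The snippet below is just a reproduction of that failure with plain JDK classes; which attribute in document.xml carries the value, and whether the trailing "f" is meaningful or corruption, is not known from the log. The class and method names are made up for illustration.

```java
import java.math.BigInteger;

// Hypothetical reproduction, not docx4j or JAXB code.
final class IntegerParseDemo {

    /** Parses text as a decimal integer; returns null when it is not one. */
    static BigInteger parseDecimalOrNull(String text) {
        try {
            return new BigInteger(text.trim());
        } catch (NumberFormatException e) {
            // Same exception type as in the log above.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDecimalOrNull("62259"));  // a valid decimal value
        System.out.println(parseDecimalOrNull("62259f")); // the value from the log
    }
}
```

Finding the offending attribute in the unzipped /word/document.xml (a plain text search for 62259f should do) would be the first step towards deciding whether the document or the schema mapping is at fault.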

fine_listing_not_coming.docx
----------------------------

Small caps come out as normal, that's all.

halfdone_exception1.docx
-------------------------

Underline missing
Bullet not aligned with others
Bulleted hanging indents aren't hanging


notopening.docx
----------------

Added <xsl:template match="w:tblPrEx"/> -- seems ok
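
For readers less familiar with XSLT: a template with a match pattern and no body makes the processor emit nothing for that element and skip its children, instead of falling through to the built-in rules. A small illustration using the JDK's built-in XSLT processor follows. The stylesheet here is a toy, not the docx4j one, and the text placed inside w:tblPrEx is artificial (the real element holds properties, not text); it is only there so the effect is visible. Class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of docx4j.
final class EmptyTemplateDemo {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Toy stylesheet: the only rule is the empty template for w:tblPrEx.
    // Everything else is handled by XSLT's built-in rules, which copy text.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='" + W_NS + "'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='w:tblPrEx'/>"
      + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String row = "<w:tr xmlns:w='" + W_NS + "'>"
                   + "<w:tblPrEx>stray</w:tblPrEx>"
                   + "<w:tc>cell text</w:tc></w:tr>";
        System.out.println(transform(row)); // prints "cell text"; "stray" is suppressed
    }
}
```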


underline_missing.docx
----------------------

Underline is indeed missing (should be easily fixed), as is an image (presumed to be emf/wmf).