
Can docx4j convert html into docx?

PostPosted: Wed Mar 31, 2010 10:05 pm
by debarcar
Hi Jason,

I wonder if docx4j can convert html to docx.

Thanks and Best Regards!

Re: Can docx4j convert html into docx?

PostPosted: Thu Apr 01, 2010 8:50 am
by jason
Do you mean any old HTML, or HTML produced by docx4j?

I have code which does a good job with HTML produced by docx4j, but it's not part of docx4j (I'm pondering what to do with it). I haven't tested it on any old HTML, since that's not my use case.

Re: Can docx4j convert html into docx?

PostPosted: Fri Apr 30, 2010 2:57 am
by bernardo
jason wrote: I have code which does a good job with HTML produced by docx4j, but it's not part of docx4j (I'm pondering what to do with it). I haven't tested it on any old HTML, since that's not my use case.


Hi Jason,

Could you put this HTML-to-docx code in the docx4j project? Maybe in an HTMLImporter class or something like that. I'm working on a collaborative platform with some document editing features, and I use docx4j (by the way, thanks for creating it :D) to import documents by converting from docx to HTML, which I then parse into the platform. The platform also converts its documents to HTML in order to export them, so if you could include this code in docx4j I would be able to export docx documents!
It would be much appreciated :D

Thanks in advance,
Bernardo

Re: Can docx4j convert html into docx?

PostPosted: Tue May 04, 2010 9:26 pm
by jason
Hi Bernardo

I'm not ready to include the code in docx4j yet, but I may be able to share the code with you. I'll send you an email (at the address you registered for this forum with).

cheers .. Jason

Re: Can docx4j convert html into docx?

PostPosted: Thu Oct 07, 2010 3:49 pm
by sharu1484
Hello Jason,
I need to insert an HTML table into Word. Will this code (the one you sent to Bernardo) help me do that? If so, please send it to me as well.

Regards,
Sharad

Re: Can docx4j convert html into docx?

PostPosted: Fri Oct 08, 2010 7:24 am
by jason
I'll look at adding the code to docx4j sometime next week.

Re: Can docx4j convert html into docx?

PostPosted: Tue May 03, 2011 7:54 am
by Mack143
Hi Jason,

I have created HTML from a .docx using docx4j. I would like to do the reverse (HTML -> .docx) for my application to work.
Can you please share your thoughts on this? I don't see any class in the docx4j API that can perform this operation.

Cheers :)
Mack.

Re: Can docx4j convert html into docx?

PostPosted: Wed May 04, 2011 12:00 am
by jason
I have converted docx4j's HTML (as opposed to any old bit of HTML) back to docx.

As a result there is a little stuff in docx4j to help you: classes such as org.docx4j.model.properties.run.Bold have constructors which take a CSSValue.

Other bits aren't there (eg the code which uses that, the code for converting an HTML table, and the code to import an image).
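
To illustrate the idea behind those CSS-based constructors, here is a minimal sketch that does not use docx4j or a real CSS parser at all. It splits an inline style attribute into declarations and maps a few of them to WordprocessingML run properties. The class and method names (CssToRunProps, parseStyle, toRunPropertyXml) are hypothetical, only three properties are handled, and the naive split on ";" would break on values that themselves contain a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of docx4j.
class CssToRunProps {

    /** Split e.g. "font-weight: bold; font-style: italic" into name/value pairs. */
    public static Map<String, String> parseStyle(String styleVal) {
        Map<String, String> props = new LinkedHashMap<>();
        if (styleVal == null) {
            return props;
        }
        for (String decl : styleVal.split(";")) {
            int colon = decl.indexOf(':');
            if (colon > 0) {
                String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = decl.substring(colon + 1).trim();
                if (!name.isEmpty() && !value.isEmpty()) {
                    props.put(name, value);
                }
            }
        }
        return props;
    }

    /** Map the few supported declarations to an rPr fragment (namespace declaration omitted). */
    public static String toRunPropertyXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<w:rPr>");
        for (Map.Entry<String, String> e : props.entrySet()) {
            String name = e.getKey();
            String value = e.getValue().toLowerCase(Locale.ROOT);
            if (name.equals("font-weight") && value.equals("bold")) {
                sb.append("<w:b/>");
            } else if (name.equals("font-style") && value.equals("italic")) {
                sb.append("<w:i/>");
            } else if (name.equals("text-decoration") && value.equals("underline")) {
                sb.append("<w:u w:val=\"single\"/>");
            }
            // Any other declaration is skipped in this sketch.
        }
        return sb.append("</w:rPr>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props =
                parseStyle("font-weight: bold; font-style: italic; color: #000000");
        System.out.println(toRunPropertyXml(props)); // <w:rPr><w:b/><w:i/></w:rPr>
    }
}
```

In docx4j itself, the equivalent work is done by the property classes and PropertyFactory rather than by string building.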

Attached is code to convert a table and an html2wordml.xslt you might want to adapt.

Those should help you to get started, until this stuff is formally incorporated. You are welcome to help with that if you'd like to :-)

cheers .. Jason

Re: Can docx4j convert html into docx?

PostPosted: Wed May 04, 2011 1:11 am
by Mack143
Hi Jason,

Nice to see a quick reply from you. I will try what you suggested. I will be thankful if you can find your old working code for this conversion.

Thank you,
Mack

Re: Can docx4j convert html into docx?

PostPosted: Wed May 11, 2011 12:33 am
by Mack143
Hi Jason,

The fact is I am new to XSL and XML. I may be able to contribute to the forum after gaining some experience in this field. I have a basic question: is it possible to support images in the HTML to docx conversion? (I don't see any in the XSL you attached in your reply.) Once again, thank you for your help and for sharing information.

Thank you,
Mack

Re: Can docx4j convert html into docx?

PostPosted: Wed May 11, 2011 7:49 am
by Mack143
Hi Jason,

I am trying to work with the pieces of code you provided in the reply above. I am unable to find the jar that contains "org.alfresco.repo.DocxWikiJavascript".
Also, "html2wordml.xslt" only supports tables, not images. Can you please help me out with this?

Thank you,
Mack

Re: Can docx4j convert html into docx?

PostPosted: Wed May 11, 2011 10:18 pm
by jason
Here are the XSLT extension functions from that class:


       
    /* ---------------Xalan XSLT Extension Functions ---------------- */    
   
    public static String getFirstStyleFromClass(String classValue) {
        int i = classValue.indexOf(" ");
        if (i>0) {
                return classValue.substring(0, i );
        } else {
                return classValue;
        }
    }    

    /**
     * Intent is to be able to detect a normal paragraph;
     * A Normal style is implied by Word, never explicit.
     *  
     * @param classValue
     * @return
     */

    public static boolean isNormalParagraph(String classValue) {
        if (classValue == null || classValue.trim().equals("") ) {
                return true;
        } else if (classValue.startsWith("Normal") ) {
                return true;
        } else {
                return false;
        }
    }    

    /**
     * Word has @xml:space="preserve" iff the string starts or
     * ends with a space.
     *
     * @param theText
     * @return
     */

    public static boolean needPreserveSpace(String theText) {
        if (theText == null ) {
                return false;
        } else if (theText.startsWith(" ")
                        || theText.endsWith(" ")) {
                return true;
        } else {
                return false;
        }
    }    
   
   
    public static DocumentFragment createPPr(String classVal, String styleVal ) {
       
        // create a pPr object
                PPr pPr = Context.getWmlObjectFactory().createPPr();                   
       
        // if @class!=null
                if (classVal!=null && !classVal.equals("") ) {
                       
                        if (isNormalParagraph(classVal)) {
                                // Don't attach Normal, it's implied
                               
                        } else {                       
                        // create <w:pStyle>
                                PStyle pStyle = Context.getWmlObjectFactory().createPPrBasePStyle();
                       
                                // set @w:val from getFirstStyleFromClass($class)
                                pStyle.setVal( getFirstStyleFromClass(classVal) );
                               
                        // add this to pPr     
                                pPr.setPStyle(pStyle);
                        }
                }
       
        // other pPr
                if (styleVal!=null && !styleVal.equals("") ) {
                        setPPr(pPr, styleVal);
                }
               
        // convert our pPr to a document fragment, and return it
                Document document = XmlUtils.marshaltoW3CDomDocument(pPr);
                DocumentFragment docfrag = document.createDocumentFragment();
                docfrag.appendChild(document.getDocumentElement());
                return docfrag;
    }
   
    public static void setPPr(PPr pPr, String styleVal ) {

        // Create a CSSStyleDeclaration
                if (cssOMParser==null) {
                        cssOMParser = new CSSOMParser();
                }
               
                CSSStyleDeclaration cssStyleDeclaration = null;
                try {
                        cssStyleDeclaration = cssOMParser.parseStyleDeclaration(
                                        new org.w3c.css.sac.InputSource(new StringReader(styleVal)) );
                } catch (IOException e) {
                        log.error(e);
                        return;
                }
       
        // iterate through it
        // for each entry,
                for (int i=0 ; i < cssStyleDeclaration.getLength() ; i++) {
                        String propertyName = cssStyleDeclaration.item(i);
                        CSSValue value = cssStyleDeclaration.getPropertyCSSValue(propertyName);

                // create a docx4j property object
                        Property property = PropertyFactory.createPropertyFromCssName(propertyName, value);
                       
                        // apply it only if it is a paragraph-level property;
                        // a run property (eg color) can't be cast to AbstractParagraphProperty
                        if (property instanceof AbstractParagraphProperty) {
                                ((AbstractParagraphProperty)property).set(pPr);
                        }
                }      
    }    
   
    public static DocumentFragment createRPr(NodeIterator spanNodeIt ) {
       
        // create an rPr object
                RPr rPr = Context.getWmlObjectFactory().createRPr();                   
       
               
                // Most nested is last child,
                DTMNodeProxy n = (DTMNodeProxy)spanNodeIt.nextNode();
                int i = 0;
                String styleName = null;
                do {
                       
                        log.debug("In node " + i);
                       
                        if (n.getNodeType()==org.w3c.dom.Node.DOCUMENT_NODE) {
                               
                                log.debug("unexpected DOCUMENT_NODE!");
                               
                                // The following is just debug
                               
                NodeList nodes = n.getChildNodes();
                if (nodes != null) {
                                log.debug("it has child nodes");
                    for (int j=0; j<nodes.getLength(); j++) {
                       
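                                        // getLocalName() can be null (eg for text nodes), so compare this way round
                                        if ("span".equals(((org.w3c.dom.Node)nodes.item(j)).getLocalName()) ) {
                                        log.debug("is a span");
                                               
                                                // index with j (the child loop counter), not i
                                                if ( ((org.w3c.dom.Node)nodes.item(j)).hasChildNodes() ) {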
                                                log.debug("nested child nodes");                                                       
                                                }
                                               
                                                // ignore
                                                log.debug(".. ignoring <span/> ");
                                        } else {
                                                log.debug("is a " + ((org.w3c.dom.Node)nodes.item(j)).getLocalName()) ;                                                
                                        }
                    }
                }                                      
                        } else {
                               
                                // Expected case
                               
                                // handle <w:rStyle> - we only want the most nested
                                // so the last time we set it, we get it right
                                String classVal = n.getAttribute("class");                             
                                if (classVal!=null && !classVal.equals("")
                                                && !classVal.equals("Apple-style-span")) {
                                        log.debug("@class=" + classVal);
                                        styleName = getFirstStyleFromClass(classVal);
                                }      
                               
                        // other rPr
                                String styleVal = n.getAttribute("style");                             
                                if (styleVal!=null && !styleVal.equals("") ) {
                                       
                                        if (classVal == null ||
                                                        (classVal!=null && !classVal.equals("Apple-style-span")) ) {
                                                // Ignore style="color: rgb(0, 0, 0)
                                                // if there is an Apple-style-span
                                                // which Chrome inserts when you backspace into a prior para
                                               
                                                log.debug("@style=" + styleVal);
                                                setRPr(rPr, styleVal);                                         
                                        }
                                       
                                }
                               
                        }
                       
                        // next
                        n = (DTMNodeProxy)spanNodeIt.nextNode();
                        i++;
                       
                } while ( n !=null );
               
        // create <w:rStyle>
                if (styleName!=null) {
                        RStyle rStyle = Context.getWmlObjectFactory().createRStyle();
                        rStyle.setVal( styleName );
                        rPr.setRStyle(rStyle);
                }
               
        // convert our rPr to a document fragment, and return it
                Document document = XmlUtils.marshaltoW3CDomDocument(rPr);
                DocumentFragment docfrag = document.createDocumentFragment();
                docfrag.appendChild(document.getDocumentElement());
                return docfrag;
    }
   
    public static void setRPr(RPr rPr, String styleVal ) {

        // Create a CSSStyleDeclaration
                if (cssOMParser==null) {
                        cssOMParser = new CSSOMParser();
                }
               
                CSSStyleDeclaration cssStyleDeclaration = null;
                try {
                        cssStyleDeclaration = cssOMParser.parseStyleDeclaration(
                                        new org.w3c.css.sac.InputSource(new StringReader(styleVal)) );
                } catch (IOException e) {
                        log.error(e);
                        return;
                }
       
        // iterate through it
        // for each entry,
                for (int i=0 ; i < cssStyleDeclaration.getLength() ; i++) {
                        String propertyName = cssStyleDeclaration.item(i);
                        CSSValue value = cssStyleDeclaration.getPropertyCSSValue(propertyName);

                // create a docx4j property object
                        Property property = PropertyFactory.createPropertyFromCssName(propertyName, value);
                                               
                // apply it only if it is a run-level property;
                // a paragraph property can't be cast to AbstractRunProperty
                        if (property instanceof AbstractRunProperty) {          
                                ((AbstractRunProperty)property).set(rPr);
                        }
                }      
    }    
   

    private static CSSOMParser cssOMParser = null;
   
    /* Extension function to create an HTML <img> element
     * from "E2.0 images"
     *      //w:drawing/wp:inline
     *     |//w:drawing/wp:anchor
     */
   
    public static DocumentFragment createHtmlImgE20ForMSIE(WordprocessingMLPackage wmlPackage,
                String docID,
                NodeIterator pictureData, NodeIterator picSize,
                NodeIterator picLink, NodeIterator linkData) {

        WordXmlPicture picture = createWordXmlPicture( wmlPackage,
                         docID, pictureData,  picSize,
                         picLink,  linkData);
       
        Document d = picture.createHtmlImageElement();
       
        log.info( XmlUtils.w3CDomNodeToString(d) );

                DocumentFragment docfrag = d.createDocumentFragment();
                docfrag.appendChild(d.getDocumentElement());

                return docfrag;
    }
   
    public static WordXmlPicture createWordXmlPicture(WordprocessingMLPackage wmlPackage,
                String docID,
                NodeIterator pictureData, NodeIterator picSize,
                NodeIterator picLink, NodeIterator linkData) {
       
               
        // incoming objects are org.apache.xml.dtm.ref.DTMNodeIterator
        // which implements org.w3c.dom.traversal.NodeIterator
               
        WordXmlPicture picture = new WordXmlPicture();
        picture.readStandardAttributes( pictureData.nextNode() );
       
        org.w3c.dom.Node picSizeNode = picSize.nextNode();
        if ( picSizeNode!=null ) {
            picture.readSizeAttributes(picSizeNode);                   
        }
       
        org.w3c.dom.Node linkDataNode = linkData.nextNode();
        if (linkDataNode == null) {
                log.warn("Couldn't find a:blip!");
        } else {
            String imgRelId = ConvertUtils.getAttributeValueNS(linkDataNode, "http://schemas.openxmlformats.org/officeDocument/2006/relationships", "embed");  // Microsoft code had r:link here

            if (imgRelId!=null && !imgRelId.equals(""))
            {
                picture.setID(imgRelId);
                Relationship rel = wmlPackage.getMainDocumentPart().getRelationshipsPart().getRelationshipByID(imgRelId);              
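                if (rel == null) {
                        // the relationship isn't in the rels part, so there is nothing to link to
                        log.warn("Couldn't find relationship " + imgRelId);
                } else if (rel.getTargetMode() == null
                                                || rel.getTargetMode().equals("Internal")) {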

                        // Give them a link to the cached image which we'll serve dynamically
                        picture.setSrc("/share/proxy/plutext/img"+docID+"?id="+imgRelId);  // Can't use #, since the client doesn't send fragment to the server
                       
                                } else {
                                       
                                        // Give them a link to an External image
                                        picture.setSrc(rel.getTarget());
                                }
               
               
                        }

                        // if the relationship isn't found, produce a warning
                        // if (String.IsNullOrEmpty(picture.Src))
                        // {
                        // this.embeddedPicturesDropped++;
                        // }
                }

                return picture;
        }
   
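
The three string helpers at the top of that extract (getFirstStyleFromClass, isNormalParagraph, needPreserveSpace) don't depend on docx4j or Xalan, so they can be tried in isolation. The following is a small standalone sketch of the same logic; the class name ExtensionFunctionDemo and the sample class/text values are made up for illustration.

```java
// Hypothetical standalone copy of the three string helpers, for experimentation.
class ExtensionFunctionDemo {

    /** Keep only the first name from a space-separated @class value. */
    public static String getFirstStyleFromClass(String classValue) {
        int i = classValue.indexOf(' ');
        return (i > 0) ? classValue.substring(0, i) : classValue;
    }

    /** A missing, blank or "Normal..." class means the implied Normal style. */
    public static boolean isNormalParagraph(String classValue) {
        return classValue == null
                || classValue.trim().isEmpty()
                || classValue.startsWith("Normal");
    }

    /** True if the text starts or ends with a space, so needs xml:space="preserve". */
    public static boolean needPreserveSpace(String theText) {
        return theText != null
                && (theText.startsWith(" ") || theText.endsWith(" "));
    }

    public static void main(String[] args) {
        System.out.println(getFirstStyleFromClass("Heading1 DocDefaults")); // Heading1
        System.out.println(isNormalParagraph("Normal DocDefaults"));        // true
        System.out.println(isNormalParagraph("Heading1"));                  // false
        System.out.println(needPreserveSpace("trailing space "));           // true
        System.out.println(needPreserveSpace("no edge spaces"));            // false
    }
}
```

Note that a class value with a leading space is returned unchanged by getFirstStyleFromClass, since the space is found at index 0; trim the input first if that matters for your HTML.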
 

Re: Can docx4j convert html into docx?

PostPosted: Wed May 11, 2011 11:39 pm
by Mack143
Thanks a lot, Jason! I will try it out and keep you updated if I find anything new ;)

Re: Can docx4j convert html into docx?

PostPosted: Tue Aug 16, 2011 10:00 am
by sanjeevkoppal
Did you try this out?
I am still not able to find a jar which contains org.alfresco.repo.DocxWikiJavascript.

Re: Can docx4j convert html into docx?

PostPosted: Tue Aug 16, 2011 10:04 am
by sanjeevkoppal
Oops, I missed the extract from that code posted by Jason.

Re: Can docx4j convert html into docx?

PostPosted: Tue Aug 30, 2011 4:59 am
by adilturbo
Dear folks,

Did you manage to convert HTML to docx? I need the same scenario, but have found no tips on how to achieve it.

Any help?

By the way, special thanks to the creators of docx4j. I have also contributed to a great open source ORM solution, JPMapper, which you can find on SourceForge.

Thanks again

Re: Can docx4j convert html into docx?

PostPosted: Wed Sep 07, 2011 1:32 am
by jason
Beyond what docx4j currently provides (which is best suited to the HTML docx4j itself emits), there is also https://github.com/AndreyLevchenko/html-convertor/ for docx4j.

Alternatively, one could possibly use flying saucer (xhtml-renderer) as a basis for conversion.

There is also http://notesforhtml2openxml.codeplex.com/ which could be ported from C#

Please report back on your experiences. Thanks.

Re: Can docx4j convert html into docx?

PostPosted: Fri Dec 30, 2011 4:04 am
by vgunaselan
Hi,
Has the TableModelFromHtml code been incorporated in the core package?

Can you help me find the com.plutext.editor package, so I can extend this method?

Thanks
Guna

Re: Can docx4j convert html into docx?

PostPosted: Fri Dec 30, 2011 7:22 am
by vgunaselan
Hi,
Also, similar to converting an HTML table to docx, can we convert a whole HTML document to docx?

I have tried using docx4j as below:
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");

I am wondering whether there is an alternative to the approach above.

Re: Can docx4j convert html into docx?

PostPosted: Fri Dec 30, 2011 9:15 am
by jason
See http://www.docx4java.org/svn/docx4j/tru ... /in/xhtml/

It doesn't convert tables yet.

You can try it with the nightly from http://www.docx4java.org/docx4j/ (you'll also need docx4j-xhtmlrenderer-nightly).

vgunaselan wrote: I have tried using docx4j as below:
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");


You probably don't want to be doing that.

Re: Can docx4j convert html into docx?

PostPosted: Wed Jan 04, 2012 8:02 am
by vgunaselan
Hi,
I used the code provided to convert my HTML file. The conversion completed successfully, but when I tried to open the Word file in Word 2007 I got the error "The file output cannot be opened because there are problems with the content". Can you guide me?


Thanks
Guna

Re: Can docx4j convert html into docx?

PostPosted: Wed Jan 04, 2012 8:05 am
by vgunaselan
Attaching the output docx file, renamed to .zip.

Re: Can docx4j convert html into docx?

PostPosted: Wed Jan 04, 2012 2:09 pm
by jason
Your attachment contains input.html, but not the Java code you ran it through.

I ran your input.html through the main method in the Importer class, and it produced a docx which opens fine for me.

So what does your code do?

Please note that the html importing is not yet complete, so you should only be using it if you are prepared to get your hands dirty!

Re: Can docx4j convert html into docx?

PostPosted: Fri Jan 06, 2012 10:12 am
by vgunaselan
Hi,
Sorry, I had a classpath issue, which corrupted the docx file.

Meanwhile, can I know when the HTML importing code will be complete?

I appreciate your help on this.

Thanks
Guna

Re: Can docx4j convert html into docx?

PostPosted: Sat Jan 07, 2012 6:07 am
by vgunaselan
Hi,
I am using docx4j-nightly-20120105.jar and docx4j-xhtmlrenderer-nightly-20111219.jar, with the attached code.

It generated the docx, but opening it gives an error (there is a problem with the content). Can you guide me?

The input file is inside the zip.

Re: Can docx4j convert html into docx?

PostPosted: Sat Mar 10, 2012 3:38 am
by benpoole
I'm somewhat confused about using the new Importer code. I have an older copy of the org.docx4j.convert.in.css.Importer class which is currently throwing an error when used with the flying saucer xhtmlrenderer (snapshot 1.0.0) package (and docx4j 2.7.0 or docx4j 2.7.1 JARs):

128163 [main] INFO org.docx4j.convert.in.css.Importer - CSS name: color
128163 [main] INFO org.docx4j.convert.in.css.Importer - CSS value: #000000
Exception in thread "main" java.lang.UnsupportedOperationException
at org.docx4j.org.xhtmlrenderer.css.parser.PropertyValue.getRGBColorValue(PropertyValue.java:145)
at org.docx4j.model.properties.run.FontColor.<init>(FontColor.java:45)
at org.docx4j.model.properties.PropertyFactory.createPropertyFromCssName(PropertyFactory.java:510)
at org.docx4j.convert.in.css.Importer.addParagraphProperties(Importer.java:271)
at org.docx4j.convert.in.css.Importer.traverse(Importer.java:160)

Is there a specific code combination we should be using? Specifically, I note that the trunk now has two importer classes:

org/docx4j/convert/in/css/Importer.java; and
org/docx4j/convert/in/xhtml/Importer.java

… there don't seem to be many differences between them though. Should we be using the 2.8 trunk, plus the xhtmlrenderer daily?

Re: Can docx4j convert html into docx?

PostPosted: Sat Mar 10, 2012 9:52 am
by benpoole
By way of follow-up, I built from trunk today, built the flyingsaucer code as well (1.0.0 SNAPSHOT) and ran some test code against the XHTML importer. It worked!

Interestingly, the colour stuff still throws errors, but they no longer get in the way (previously these errors stopped the whole process):

128555 [main] ERROR org.docx4j.model.properties.PropertyFactory - Can't create property from: color:#000000

Anyway, good stuff. I can get on and parse some simple HTML now! :)

Re: Can docx4j convert html into docx?

PostPosted: Mon Apr 02, 2012 10:45 pm
by jason
Today's nightly builds of docx4j and docx4j-xhtmlrenderer include the proposed final code for XHTML importing for the forthcoming docx4j 2.8.0. With these builds, you'll need iText 2.1.7 to import XHTML.

This includes some support for importing tables.

I've also committed a couple of new samples; you can find a link to these at http://www.docx4java.org/trac/docx4j/changeset/1772

Ben, thanks for your note about colours. I haven't added any code to honour font color yet.
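
For anyone wanting to experiment with font colour in the meantime, the first step is turning a CSS colour into the six-digit hex value that w:color/@w:val expects. The following is a minimal sketch of that conversion only; it is not docx4j code, and the class and method names (CssColor, toWordHex) are hypothetical. It handles just #rrggbb, #rgb and integer rgb(r, g, b); named colours, percentages and alpha are not covered, and out-of-range components are rejected rather than clamped, for simplicity.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of docx4j.
class CssColor {

    private static final Pattern HEX6 = Pattern.compile("#([0-9a-fA-F]{6})");
    private static final Pattern HEX3 =
            Pattern.compile("#([0-9a-fA-F])([0-9a-fA-F])([0-9a-fA-F])");
    private static final Pattern RGB = Pattern.compile(
            "rgb\\(\\s*(\\d{1,3})\\s*,\\s*(\\d{1,3})\\s*,\\s*(\\d{1,3})\\s*\\)");

    /** Returns e.g. "FF8800", or null if the value is not in a supported form. */
    public static String toWordHex(String cssValue) {
        if (cssValue == null) {
            return null;
        }
        String v = cssValue.trim();

        Matcher m = HEX6.matcher(v);
        if (m.matches()) {
            return m.group(1).toUpperCase(Locale.ROOT);
        }

        m = HEX3.matcher(v);
        if (m.matches()) {
            // #abc is shorthand for #aabbcc
            String expanded = m.group(1) + m.group(1)
                    + m.group(2) + m.group(2)
                    + m.group(3) + m.group(3);
            return expanded.toUpperCase(Locale.ROOT);
        }

        m = RGB.matcher(v.toLowerCase(Locale.ROOT));
        if (m.matches()) {
            int r = Integer.parseInt(m.group(1));
            int g = Integer.parseInt(m.group(2));
            int b = Integer.parseInt(m.group(3));
            if (r > 255 || g > 255 || b > 255) {
                return null; // rejected here; real CSS would clamp
            }
            return String.format("%02X%02X%02X", r, g, b);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(toWordHex("#000000"));        // 000000
        System.out.println(toWordHex("#f80"));           // FF8800
        System.out.println(toWordHex("rgb(255,128,0)")); // FF8000
        System.out.println(toWordHex("red"));            // null
    }
}
```

The resulting string would then be set on the run's colour property; how that is wired into PropertyFactory is left open here.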

Re: Can docx4j convert html into docx?

PostPosted: Mon Apr 16, 2012 9:22 am
by benpoole
Great stuff, thanks Jason. The HTML parsing is working pretty well for us (being able to import ordered and unordered lists is very useful), and the ability to import basic tables would be great.