html altchunk

Posted: Tue Dec 06, 2011 4:19 pm
by nmcuong2005
Please help me!
It doesn't work (error: "There was an error opening the file"):
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

String html = "<html><head><title>Import me</title></head><body><p>Hello World!</p></body></html>";
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(html.getBytes());
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
// .. the bit in document body
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId() );
wordMLPackage.getMainDocumentPart().addObject(ac);
// .. content type
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
wordMLPackage.save(new java.io.File("C:\\test.docx"));

Re: html to docx

Posted: Wed Dec 07, 2011 9:34 am
by jason
Your code works for me - the resulting docx opens in Word 2010 without error and I can see the text:

Code:
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.contenttype.ContentType;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.AlternativeFormatInputPart;
import org.docx4j.relationships.Relationship;
import org.docx4j.wml.CTAltChunk;

public class AltChunkHtml {

        /**
         * @param args
         * @throws Exception
         */

        public static void main(String[] args) throws Exception {

                WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

                String html = "<html><head><title>Import me</title></head><body><p>Hello World!</p></body></html>";
                AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
                afiPart.setBinaryData(html.getBytes());
                afiPart.setContentType(new ContentType("text/html"));
                Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
                // .. the bit in document body
                CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
                ac.setId(altChunkRel.getId() );
                wordMLPackage.getMainDocumentPart().addObject(ac);
                // .. content type
                wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
                wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/test.docx"));           
        }

}
 

Re: html to docx

Posted: Wed Dec 07, 2011 5:39 pm
by nmcuong2005
Thanks for the help.

I use Word 2007.

Re: html to docx

Posted: Thu Dec 29, 2011 5:48 am
by vgunaselan
Hi,
How can I open this document in Word 2007?

Re: html to docx

Posted: Thu Dec 29, 2011 9:42 am
by jason
It works fine for me in Word 2007 as well. Any file created by docx4j should work fine on both.

What version of Word 2007 are you using, exactly? For example, I tested it on 12.0.6545.5000, which is SP2, and part of a Microsoft Office Standard 2007 installation.

I'd be surprised if some difference in the version of Word 2007 is the issue though, unless changes to AltChunk behaviour were introduced in a service pack.

Re: html to docx

Posted: Fri Dec 30, 2011 3:09 am
by vgunaselan
Thanks Jason,
It worked for me in Word 2007. I had a bug in my customized code.

Re: html altchunk

Posted: Thu Nov 22, 2012 11:34 pm
by Zaeem
If I want to collect all the "Tbl" objects from the DOCX created by this method using the function below, none of the tables in the document are returned, but I can retrieve the tables fine if I use a "normal" DOCX created in Word having tables in it. Is there any way to convert the AltChunk based DOCX to "normal" DOCX, so Tbl retrieval is possible?

Code:
static List<Object> getAllElementFromObject(Object obj, Class<?> toSearch) {
   List<Object> result = new ArrayList<Object>();
   if (obj instanceof JAXBElement) obj = ((JAXBElement<?>) obj).getValue();

   if (obj.getClass().equals(toSearch))
      result.add(obj);
   else if (obj instanceof ContentAccessor) {
      List<?> children = ((ContentAccessor) obj).getContent();
      for (Object child : children) {
         result.addAll(getAllElementFromObject(child, toSearch));
      }

   }
   return result;
}

Re: html altchunk

Posted: Fri Nov 23, 2012 7:22 am
by jason
An altChunk isn't normal docx content, so you won't find WordML objects in there...

You could use org.docx4j.convert.in.xhtml.XHTMLImporter to create real WordML content.

That said, there is also convertAltChunks() in

Code:
public abstract class JaxbXmlPartXPathAware<E> extends JaxbXmlPart<E> implements AltChunkInterface
 

Re: html altchunk

Posted: Fri Nov 23, 2012 5:12 pm
by Zaeem
Thanks for the reply.
When I use XHTMLImporter for converting HTML to DOCX, I get an error:

org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.

And when I use convertAltChunks() as in the code below (beneath the comment // CONVERTING ALTCHUNKS), I still can't find WordML objects in the converted DOCX. Am I missing something here?

Code:
      static void HTMLtoDOCX (String html) throws Docx4JException, JAXBException{
               WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
               
               AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
               afiPart.setBinaryData(html.getBytes());
               afiPart.setContentType(new ContentType("text/html"));
               Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
               // .. the bit in document body
               CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
               ac.setId(altChunkRel.getId() );
               wordMLPackage.getMainDocumentPart().addObject(ac);
               wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
               
               // CONVERTING ALTCHUNKS
               WordprocessingMLPackage pkgOut = wordMLPackage.getMainDocumentPart().convertAltChunks();
               pkgOut.save(new java.io.File(Constants.Path.defaultPath + "input.docx"));           
   }

Re: html altchunk

Posted: Mon Nov 26, 2012 9:21 am
by jason
Zaeem wrote: org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.


&nbsp; isn't an entity built in to XML, so your XML is invalid. Try replacing it with  

Zaeem wrote: I still can't find WordML objects in the converted DOCX.


Unzip the docx you saved and manually inspect document.xml to confirm it contains the w:tbl elements you expect. If it does, then something is wrong with the code that searches for them. Note that an org.docx4j.wml.Tbl element can be wrapped in a JAXBElement, which might be why you're not seeing them (though your getAllElementFromObject method does cover that possibility).
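
The manual check can also be scripted. What follows is only a minimal sketch using java.util.zip rather than docx4j. The class name DocxInspector and both method names are hypothetical, chosen for illustration. It assumes the main part is stored at word/document.xml, is UTF-8 encoded, and uses the customary w: prefix, none of which a package is strictly required to do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of docx4j.
final class DocxInspector {

    /** Returns the text of word/document.xml, or null if the zip has no such entry. */
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Assumption: the part is UTF-8 encoded.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    /** Counts start tags (or empty-element tags) with exactly this prefixed name. */
    static int countElements(String xml, String qName) {
        // The trailing character class keeps "w:tbl" from matching "w:tblPr".
        Matcher m = Pattern.compile("<" + Pattern.quote(qName) + "[\\s/>]").matcher(xml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocxInspector <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String xml = readDocumentXml(in);
            if (xml == null) {
                System.out.println("no word/document.xml entry found");
                return;
            }
            System.out.println("w:tbl elements:      " + countElements(xml, "w:tbl"));
            System.out.println("w:altChunk elements: " + countElements(xml, "w:altChunk"));
        }
    }
}
```

If this reports zero w:tbl elements and one or more w:altChunk elements, the tables are still inside the embedded HTML part and have not been converted to WordML.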

Re: html altchunk

Posted: Tue Nov 27, 2012 4:23 pm
by Zaeem
Thanks for the reply Jason!
jason wrote: Unzip the docx you saved, and manually inspect document.xml to confirm it contains the w:tbl elements you expect.

Just examined the file and I don't see any w:tbl elements in it. The AltChunk, however, is there:
Code:
<w:altChunk r:id="rId2"/>


jason wrote: &nbsp; isn't an entity built in to XML, so your XML is invalid. Try replacing it with

Do you mean I have to replace that sub-string (&nbsp;) with "" (i.e. empty)?

Re: html altchunk

Posted: Tue Nov 27, 2012 8:18 pm
by jason
If you look at the source code for public abstract class JaxbXmlPartXPathAware<E>, or your logs (at WARN level) you'll see it doesn't attempt to convert altChunks of type HTML. If you have XHTML content, change your line afiPart.setContentType

Regarding nbsp, sorry, the editor mangled what I wrote. You can change it to a numeric entity (for a non-breaking space, that is &#160;). You can google '&nbsp numeric entity'.
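
To illustrate the numeric entity suggestion, here is a minimal sketch that uses only the JDK's XML parser, not docx4j. The class name NbspFixer and its methods are hypothetical. It handles only &nbsp;, and other HTML-only named entities (for example &copy;) would need the same treatment.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of docx4j.
final class NbspFixer {

    /** Replaces the HTML named entity with the numeric reference for U+00A0. */
    static String replaceNbsp(String html) {
        return html.replace("&nbsp;", "&#160;");
    }

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // An undeclared entity such as nbsp ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Hello&nbsp;World!</p></body></html>";
        System.out.println(isWellFormed(html));              // false
        System.out.println(isWellFormed(replaceNbsp(html))); // true
    }
}
```

Running the replacement before handing the markup to an XML-based importer should avoid the SAXParseException quoted earlier in the thread, provided the rest of the input is already well-formed XHTML.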