
Embed Images from an HTML output to doc

Posted: Thu Aug 11, 2011 12:28 pm
by sanjeevkoppal

I have a RendererFilter (a ServletFilter) in which I capture the ServletResponse content, which is basically HTML.
Opening this HTML in Word works fine.
The issue is that the images referenced by URL in the HTML are not rendered in Word when there is no internet connection, so I want to embed these images while converting.
I tried the basic HTML img syntax with a base64-encoded value using the data: URI scheme, but Word doesn't render these images.
Now I am trying to use docx4j to do this. I was able to add an image at the end of the document using the AddImage example from the samples.
But I have lots of images in the HTML which are referenced by URL, and these are not rendered in the Word document without internet access. When I use AlternativeFormatInputPart, the entire HTML is stored as one big block of binary data, so I can't parse it as XML with an XPath query. I am not sure what the best way of doing this is.

Here is the code using AlternativeFormatInputPart:
Code:

AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/organize.html"));
// content4 is the HTML captured from the ServletResponse
afiPart.setBinaryData(content4.getBytes());
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);

// .. the bit in document body
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId());
// add the altChunk to the document body so Word picks it up
wordMLPackage.getMainDocumentPart().addObject(ac);

Now, with the above code, I have to replace a special text with an image.

Please help....

Any help/solution is highly appreciated.


Re: Embed Images from an HTML output to doc

Posted: Fri Aug 12, 2011 3:13 pm
by jason
Hello Sanjeev

It's not entirely clear to me what you are trying to do, but if you use docx4j to add an AlternativeFormatInputPart containing HTML, you'll need to open the docx in Word in order for the HTML content to be converted to normal docx content. docx4j currently can't do that for you (at least for HTML content).

As an alternative, consider adding the images directly to the docx.

cheers .. Jason

Re: Embed Images from an HTML output to doc

Posted: Fri Aug 12, 2011 4:27 pm
by sanjeevkoppal
Hey Jason,

Thanks for the quick reply...
I have HTML content that was not generated by docx4j. I am using docx4j so that I can convert the URL-based images into embedded ones.
The content is pretty large and contains lots of images, so I cannot build it from scratch with docx4j (creating text, paragraphs, etc.).
My only goal is to somehow convert this HTML content into a Word document with all the images embedded.
I tried using AlternativeFormatInputPart and added an altChunk, thinking that all the content would be converted, but I still see URLs for the images in the generated doc.

Let me know if you need more info; I can attach or send the files I am working with.

Thanks again!!!

Re: Embed Images from an HTML output to doc

Posted: Mon Nov 07, 2011 1:01 am
by llf2003912
Please help me, I have the same problem.

Embed Images from an HTML

Posted: Mon Nov 07, 2011 1:08 am
by llf2003912
I have a problem importing HTML into Word.
I use CTAltChunk to handle it, but the images in the HTML do not work well: they are referenced by URL, so Word cannot render them without a network connection. How can I change the URL images to embedded images in Word?

Re: Embed Images from an HTML output to doc

Posted: Tue Nov 08, 2011 12:14 am
by jason
An image in a docx is either linked or embedded.

If it is "linked", it is outside the docx, and the location (whether a file or on the web) must be available when the docx is opened in Word (or whatever) if it is to be displayed.

If it is to be embedded, it must be included correctly in the docx at the time the docx is created (either by Word or docx4j).
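
A rough way to see which case a generated docx falls into is to look inside the package, since a docx is a zip file. The sketch below lists the entries under word/media/, which is where Word conventionally stores embedded images. That folder name is a convention rather than a rule of the format, and the class and method names (DocxMediaLister, listEmbeddedMedia) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxMediaLister {
    // Returns the names of zip entries under word/media/.
    // Assumption: embedded images live there, as they conventionally do
    // in files written by Word; other locations are possible.
    static List<String> listEmbeddedMedia(InputStream docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith("word/media/")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

An empty result for a document that shows pictures when online suggests the images are only linked, not embedded.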

So if you do the HTML conversion yourself, you'll need to bear the above principles in mind.

If you rely on Word to do the conversion (via AlternativeFormatInputPart aka AltChunk), Word will need to be able to find the image at the time the docx is opened. If end users are opening the docx and may not be connected, maybe you should do the HTML conversion yourself, rather than relying on Word. (By the way, the plan is to include HTML importing in the next version of docx4j).
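
As a starting point for doing the conversion yourself, the image URLs first have to be found in the HTML and the image bytes fetched, so that they can be embedded instead of linked. A minimal sketch in plain Java follows. The class and method names (ImgSrcFinder, findImageUrls, download) are hypothetical, the regular expression is only a rough stand-in for a real HTML parser, and the step of adding the bytes to the docx is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFinder {
    // Matches the src attribute of an <img> tag, quoted with " or '.
    // Assumption: the HTML is simple enough for a regex to cope with.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values of all <img> tags, in document order.
    static List<String> findImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    // Fetches the raw bytes of one image (requires Java 9+ for readAllBytes).
    // This must run where the images are reachable, e.g. on the server.
    static byte[] download(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }
}
```

The point of fetching the bytes when the docx is generated is that the end user no longer needs a connection when opening it.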