OutOfMemoryError while converting docx to html

Posted: Sat Aug 29, 2015 1:43 am
by lucas_tk
Hi,

I'm getting an OutOfMemoryError while converting docx to HTML.

I'm using this Java application:

package test;

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.Docx4jProperties;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class TestDocxConverter {

    public static void main(String[] args) throws Exception{
        String inputfilepath = "C:\\dokumenty testowe\\Portal Agenta - Instrukcja użytkownika_v 3.0pop.docx";
        WordprocessingMLPackage wordMLPackage;
        wordMLPackage = Docx4J.load(new java.io.File(inputfilepath));
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();

        htmlSettings.setImageDirPath("C:\\dokumenty testowe" + "\\media");
        htmlSettings.setImageTargetUri("C:\\dokumenty testowe" + "\\media");
        htmlSettings.setWmlPackage(wordMLPackage);

        OutputStream os, os2;
        //os = new FileOutputStream("C:\\test6.html");
        os2 = new ByteArrayOutputStream();

        Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
        Docx4jProperties.setProperty("docx", true);

        String output="";

        try {
            try {
                //Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
                Docx4J.toHTML(htmlSettings, os2, Docx4J.FLAG_EXPORT_PREFER_XSL);
            } catch (Docx4JException e) {
                System.out.println(e.toString());
            }
        } catch (OutOfMemoryError E) {
            System.out.println(E.toString());
        }
        output = ((ByteArrayOutputStream)os2).toString("UTF-8");

        System.out.println("------------------------");
        System.out.println("output:");
        System.out.println(output);
    }
}

As you can see, I'm trying to convert docx to HTML. My test document is "Portal Agenta - Instrukcja użytkownika_v 3.0pop.docx". Its size is about 5 MB; it has 64 pages and about 110 images. When I run my converter, after some time I get OutOfMemoryError errors:
[main] ERROR org.docx4j.XmlUtils - java.lang.OutOfMemoryError: Java heap space
javax.xml.transform.TransformerException: java.lang.OutOfMemoryError: Java heap space

I run Java with the maximum heap size available on Win2003 Server with 32-bit Java (1536 MB). Is it possible that the converter consumes that much memory to convert a 5 MB docx? How can I reduce memory consumption?
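
For reference, a minimal sketch of how the heap limit the JVM actually received could be checked from inside the application. It uses only java.lang.Runtime; the class name HeapReport is made up for illustration and is not part of docx4j.

```java
class HeapReport {

    /** Formats the JVM's current heap figures in megabytes. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;          // upper bound the heap may grow to (what -Xmx controls)
        long committed = rt.totalMemory() / mb;  // heap currently reserved by the JVM
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        return "max=" + max + "MB committed=" + committed + "MB used=" + used + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling HeapReport.describe() before and after the conversion gives a rough picture of how close the run gets to the limit.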

Regards,
Lucas

Re: OutOfMemoryError while converting docx to html

Posted: Sat Aug 29, 2015 7:32 am
by jason
Try Docx4J.toHTML(htmlSettings, os2, Docx4J.FLAG_EXPORT_PREFER_NONXSL);

It doesn't have feature parity, but should use less memory.

Re: OutOfMemoryError while converting docx to html

Posted: Tue Sep 01, 2015 12:14 am
by lucas_tk
Hi,

Thanks for the suggestion.

I've tried using the NONXSL flag. Memory usage is better, but the output HTML is not as good as it was when using XSL.

Is there a chance to optimize the docx4j converter code? I can provide the test document (I've already provided the test code).

Regards,
Lucas

Re: OutOfMemoryError while converting docx to html

Posted: Tue Sep 01, 2015 12:32 am
by jason
You are welcome to improve it; unfortunately I don't have much spare capacity at the moment.

May I ask what you're using the HTML output for? Is it just to display the docx in a web browser, or something else (e.g. using an HTML editor)?

Re: OutOfMemoryError while converting docx to html

Posted: Tue Sep 01, 2015 1:25 am
by lucas_tk
I'm using the converter in a web HTML editor (CKEditor). The user can drag and drop a docx document into the editor. The document is converted to HTML and the output is presented in the editor. Then the user can customize the output.

Re: OutOfMemoryError while converting docx to html

Posted: Thu Sep 03, 2015 1:02 am
by mcgullen
We faced a similar OutOfMemoryError when working with the attached document. The walkJAXBElements method of the traversal util never got past the getChildren stage.

We ended up brute-forcing it by specifying -Xmx16g when launching the JVM.

Jason, given that we process lots of documents in a pipeline and are willing to skip the occasional outliers that spike memory usage, we are thinking about writing another traversal util that checks if(Thread.interrupted()) from time to time. That way we can call futureTask.cancel(true) whenever there is a timeout. Otherwise docx4j's logic blocks uninterruptibly.

We will post the code here or issue a pull request if you think that would be helpful.
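
To illustrate the idea, here is a rough sketch of a traversal that polls the interrupt flag and is cancelled on timeout. It is not docx4j code: Node is a hypothetical stand-in for a document element, and countNodes and runWithTimeout are names invented for this example. Only JDK classes are used.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class InterruptibleTraversal {

    /** Hypothetical stand-in for a document node; not a docx4j type. */
    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    /** Counts nodes depth-first, polling the interrupt flag every 1024 nodes. */
    static long countNodes(Node root) throws InterruptedException {
        final int checkEvery = 1024;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        long visited = 0;
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            visited++;
            if (visited % checkEvery == 0 && Thread.interrupted()) {
                throw new InterruptedException("traversal cancelled after " + visited + " nodes");
            }
            for (Node child : node.children) {
                stack.push(child);
            }
        }
        return visited;
    }

    /** Runs the task with a time limit; returns -1 if it timed out and was cancelled. */
    static long runWithTimeout(Callable<Long> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Long> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // sets the worker thread's interrupt flag
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would use something like runWithTimeout(() -> countNodes(root), 30000) and skip the document when -1 comes back. Cancellation only takes effect at the points where the traversal polls the flag, so a single long call between polls still blocks until it returns.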

In addition, you mentioned just displaying the docx in a web browser, without the need for editing. We currently use (i) https://github.com/yeokm1/docs-to-pdf-converter plus (ii) https://github.com/mozilla/pdf.js/

Is there a better way using docx4j?

mcgullen

Re: OutOfMemoryError while converting docx to html

Posted: Thu Sep 03, 2015 9:16 pm
by jason
Hi mcgullen

That sounds interesting; yes, please post the code here, or make a pull request, whichever you prefer.

Regarding alternative ways of displaying a docx document in a web browser, docx4j supports converting to XHTML, or PDF (as you know).

Plutext will soon introduce a way to display the docx in the browser (using JavaScript in the client and some server-side tooling), without the need to first convert to HTML or PDF. The rendering fidelity will be similar to Plutext's existing commercial PDF Converter (it leverages the same code base). If you're interested in trying this out, just let me know off list.

cheers .. Jason

Re: OutOfMemoryError while converting docx to html

Posted: Fri Sep 04, 2015 12:45 am
by lucas_tk
I've noticed that the method:
void javax.xml.transform.Transformer.transform(Source xmlSource, Result outputTarget)
which is invoked in XmlUtils.java from the method:
public static void transform(javax.xml.transform.Source source, javax.xml.transform.Templates template, Map<String, Object> transformParameters, javax.xml.transform.Result result)
consumes a lot of memory. This method comes from the built-in Java library, so I don't think optimizing its code is possible.
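
The JAXP call itself cannot be changed, but the caller does choose the Source and Result it is given. Below is a minimal sketch of the same Templates/Transformer API writing straight to a file through a StreamResult, instead of collecting everything in a ByteArrayOutputStream and then a String. The stylesheet is a toy one and the class and method names are invented for this example; none of it is docx4j code. Whether this lowers peak heap for a real docx is an open question, since an XSLT processor typically still builds the input tree in memory. It only avoids holding extra copies of the output.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StreamingTransformSketch {

    // Toy stylesheet (not docx4j's): turns each <p> under <doc> into an HTML paragraph.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
      + "<xsl:template match='/doc'><html><body><xsl:apply-templates select='p'/></body></html></xsl:template>"
      + "<xsl:template match='p'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    /** Transforms the XML with the toy stylesheet and streams the result to a file. */
    static void transformToFile(String xml, Path target) throws TransformerException, IOException {
        // Templates is the compiled, reusable form of the stylesheet.
        Templates templates = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(XSLT)));
        Transformer transformer = templates.newTransformer();
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        }
    }

    /** Small demo: transforms a two-paragraph document via a temp file and returns the output. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("sketch", ".html");
            try {
                transformToFile("<doc><p>Hello</p><p>World</p></doc>", tmp);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In the test program above, the equivalent change would be passing a file-backed OutputStream to Docx4J.toHTML, as the commented-out FileOutputStream line already hints.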