
Out of memory error in converting docx to html

Posted: Fri Dec 20, 2013 1:14 am
by sswelling
We are getting an out of memory error (Java heap space) while converting docx to HTML: once the document size goes above 300KB, the server hangs. We don't have control over the server parameters.

The code is given below. Any suggestions would be helpful.

// Convert a .docx file to an .html file and return the .html file as a File object
public File convertDocxToHtml(File inputFile)
{
    String inputfilepath = inputFile.getAbsolutePath();
    // Use the last dot so that a name such as "report.v2.docx" keeps its full base name
    int index = inputFile.getName().lastIndexOf(".");
    String outputfilepath = tempDir + "\\" + inputFile.getName().substring(0, index) + ".html";
    File outputFile = null;
    try
    {
        //XHTMLImporter.setHyperlinkStyle("Hyperlink");
        if (inputfilepath.endsWith("docx"))
        {
            WordprocessingMLPackage docx = WordprocessingMLPackage.load(inputFile);
            AbstractHtmlExporter exporter = new HtmlExporterNG2();

            HtmlSettings htmlSettings = new HtmlSettings();
            htmlSettings.setImageDirPath(outputfilepath + "_Images");
            // The image URI is relative to the output file, so take the name from the output path
            // (the original code mixed an index from inputfilepath into outputfilepath)
            htmlSettings.setImageTargetUri(outputfilepath.substring(outputfilepath.lastIndexOf("\\") + 1) + "_Images");

            OutputStream os = new java.io.FileOutputStream(outputfilepath);
            try
            {
                javax.xml.transform.stream.StreamResult result = new javax.xml.transform.stream.StreamResult(os);
                exporter.html(docx, result, htmlSettings);
            }
            finally
            {
                // Close the stream before the file is read back below
                os.close();
            }

            outputFile = new File(outputfilepath);
            // Read the generated HTML back into a StringBuilder
            StringBuilder stringBuilder = this.getFileContents(outputFile);
            //stringBuilder = stringBuilder.replace(stringBuilder.indexOf("<style>"),stringBuilder.indexOf("</style>")+8 , "");
            String htmlPage = stringBuilder.toString().replaceAll("%0A ", "");
            stringBuilder = null;
            String htmlPage1 = htmlPage.replaceAll("height: 5mm;", "height: auto;");
            FileUtils.writeStringToFile(new File(outputfilepath), htmlPage1);
        }
        else
        {
            LOGGER.error(" Error : ConvertDocxToHTML.java convertDocxToHtml(File inputFile) : Extension is other than .docx");
        }
    }
    catch (Exception e)
    {
        e.printStackTrace();
        LOGGER.error(" Error : ConvertDocxToHTML.java convertDocxToHtml(File inputFile) : " + e.getMessage());
    }

    return outputFile;
}

Re: Out of memory error in converting docx to html

Posted: Fri Dec 20, 2013 8:52 am
by jason
sswelling wrote: We don't have control over the server parameters.


If you aren't able to give Java enough memory (-Xmx etc.), then you may be out of luck.

Having said that, there is one area where you can reduce the amount of memory used: fonts.

Code:
      // Font regex (optional)
      // Set regex if you want to restrict to some defined subset of fonts
      // Do this early in your code.
      String regex = null;
      // Windows:
      // regex=".*(calibri|cour|arial|times|comic|georgia|impact|LSANS|pala|tahoma|trebuc|verdana|symbol|webdings|wingding).*";
      // Mac:
      // regex=".*(Courier New|Arial|Times New Roman|Comic Sans|Georgia|Impact|Lucida Console|Lucida Sans Unicode|Palatino Linotype|Tahoma|Trebuchet|Verdana|Symbol|Webdings|Wingdings|MS Sans Serif|MS Serif).*";
      PhysicalFonts.setRegex(regex);
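
To see which fonts such a regex would let through, it can be tried against a list of font locations using plain java.util.regex. The following is only a sketch: the class FontRegexDemo, the sample font paths, and the assumptions that the regex is matched against the whole font location string and that null means "no restriction" are illustrative, not taken from docx4j itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FontRegexDemo {
    // Windows regex quoted from the post above
    static final String WINDOWS_REGEX =
        ".*(calibri|cour|arial|times|comic|georgia|impact|LSANS|pala|tahoma|trebuc|verdana|symbol|webdings|wingding).*";

    // Returns the candidates accepted by the regex (whole-string match).
    // Assumption: a null regex means "no restriction", so everything is kept.
    static List<String> filter(List<String> candidates, String regex) {
        if (regex == null) {
            return new ArrayList<>(candidates);
        }
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (pattern.matcher(candidate).matches()) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical font locations, for illustration only
        List<String> fonts = Arrays.asList(
            "file:/C:/Windows/Fonts/arial.ttf",
            "file:/C:/Windows/Fonts/calibri.ttf",
            "file:/C:/Windows/Fonts/mingliu.ttc",
            "file:/C:/Windows/Fonts/simsun.ttc");
        // Only the arial and calibri entries contain one of the listed fragments
        System.out.println(filter(fonts, WINDOWS_REGEX));
    }
}
```

The fewer fonts that pass the filter, the fewer need to be loaded, which is where the memory saving would come from. A document that uses a font outside the list would then presumably fall back to a substitute.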

Re: Out of memory error in converting docx to html

Posted: Mon Dec 23, 2013 9:15 pm
by sswelling
Thanks Jason. We tried your font option, but when the file size was 300KB the server hung. Java memory provided at that time was 1.25GB.

In fact, for file sizes above 200KB the memory requirement shoots beyond 1GB. Is there any other way to reduce the memory usage?

Re: Out of memory error in converting docx to html

Posted: Tue Dec 24, 2013 8:58 am
by jason
If you must run docx4j in a memory-constrained environment, you should probably use a profiler to understand how memory is being consumed.
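
Where attaching a full profiler to the server is not possible, a rough first look can be had from the JVM's own counters. The sketch below uses only the standard java.lang.management API; the class HeapProbe and its method names are hypothetical, and the numbers are approximate because garbage collection can run at any point during the measured step.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapProbe {
    // Heap currently in use, as reported by the JVM
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Upper limit the JVM will try to use (what -Xmx effectively controls)
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Runs one step and logs used heap before and after it.
    // The returned delta may be negative if a collection ran during the step.
    static long measure(String label, Runnable step) {
        long before = usedHeapBytes();
        step.run();
        long after = usedHeapBytes();
        System.out.printf("%s: used heap %d KB -> %d KB (max %d KB)%n",
            label, before / 1024, after / 1024, maxHeapBytes() / 1024);
        return after - before;
    }

    public static void main(String[] args) {
        // Stand-in workload; in practice wrap the load, export and post-processing steps separately
        final byte[][] holder = new byte[1][];
        measure("allocate 8 MB", () -> holder[0] = new byte[8 * 1024 * 1024]);
    }
}
```

Wrapping the load, the export, and the post-processing steps of the conversion in separate measure calls would at least show which stage the heap growth comes from, and the reported maximum confirms how much heap the server actually grants.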

Having said that, there are 2 ways to do HTML output. The default uses XSLT (Xalan), and the other doesn't.

You could try the non-XSLT approach by using the flag Docx4J.FLAG_EXPORT_PREFER_NONXSL, as in https://github.com/plutext/docx4j/blob/ ... tHtml.java

But be warned: NONXSL does not have feature parity yet.
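
Separately from the exporter, the post-processing in the first post's code reads the whole generated HTML into a StringBuilder and then makes two further full copies with replaceAll. That is probably not the main consumer, but it can be streamed line by line instead. A minimal sketch using only the JDK, assuming the HTML is UTF-8 and contains line breaks (the class HtmlPostProcessor and its methods are hypothetical names):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class HtmlPostProcessor {
    // Same two literal replacements as in the original code, applied to one line
    static String fixLine(String line) {
        return line.replace("%0A ", "").replace("height: 5mm;", "height: auto;");
    }

    // Rewrites the file via a temp file in the same directory,
    // holding only one line in memory at a time
    static void rewriteInPlace(Path html) throws IOException {
        Path dir = html.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "post", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(html, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(fixLine(line));
                out.newLine();
            }
        }
        // Replace the original only after the rewrite has completed
        Files.move(tmp, html, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling HtmlPostProcessor.rewriteInPlace(outputFile.toPath()) would take the place of the getFileContents / replaceAll / writeStringToFile sequence. Line endings are normalized to the platform default, and if the exporter writes the HTML as one very long line the saving is small.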