
Docx to HTML formatting is lost

Posted: Wed Mar 31, 2021 3:26 am
by Prashanth
Hello,

I am trying to convert docx to HTML in a web application (PEGA).
I imported the jars this requires, based on what I found by searching.
Issue 1: Some of the converted HTML loses its formatting.
Issue 2: If I use a byte array instead of a file, the resulting HTML is messed up.

Not sure where I am going wrong.

Code:
// Assumes imports: java.io.*, org.docx4j.Docx4J, org.docx4j.convert.out.HTMLSettings,
// org.docx4j.openpackaging.packages.WordprocessingMLPackage

// Decode the Base64-encoded Case Document (passed in as the String param Word)
byte[] bs = java.util.Base64.getDecoder().decode(Word.getBytes());
InputStream is = new ByteArrayInputStream(bs);

WordprocessingMLPackage wordMLPackage = Docx4J.load(is);

HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(FilePath);
htmlSettings.setWmlPackage(wordMLPackage);

// Convert twice: once into memory, once to a file on disk
ByteArrayOutputStream os = new ByteArrayOutputStream();
OutputStream ou = new FileOutputStream(FilePath);

Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
Docx4J.toHTML(htmlSettings, ou, Docx4J.FLAG_EXPORT_PREFER_XSL);
ou.close();

// Return the in-memory HTML (assuming docx4j wrote it as UTF-8);
// ou.toString() would only return the stream object's identity, not its content
String result = os.toString("UTF-8");
return result;


Generated HTML file
Attachment: HTML.PNG (HTML File)

HTML as a byte array output
Attachment: HTML_Stream.PNG (Generated Stream)

Source content
Attachment: Source.PNG (Source document)

Re: Docx to HTML formatting is lost

Posted: Tue Apr 06, 2021 7:59 am
by jason
I expect there's something "unusual" about your source docx, so I would need to see it.

Please attach it, or if it is sensitive, anonymise it first using https://github.com/plutext/docx4j/blob/ ... ingle.java
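
Separately from the source docx, it may be worth checking how the in-memory result is read back, since a file and a byte array are turned into text differently. The following is only a minimal sketch using the plain JDK, not docx4j: the class name StreamReadBackSketch and the helper writeHtml are hypothetical stand-ins for whatever writes the HTML, and it assumes the HTML bytes are UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamReadBackSketch {

    // Hypothetical stand-in for a converter that writes UTF-8 HTML to a stream.
    static void writeHtml(String html, OutputStream out) throws IOException {
        out.write(html.getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    // Captures the output in memory and decodes it with an explicit charset,
    // so the result does not depend on the platform default encoding.
    static String captureAsString(String html) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeHtml(html, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>caf\u00e9</p>";

        // In-memory capture: the bytes are decoded back into the same text.
        System.out.println(captureAsString(html));

        // File capture: the content lives on disk, not in the stream object.
        Path tmp = Files.createTempFile("sketch", ".html");
        try (OutputStream fileOut = new FileOutputStream(tmp.toFile())) {
            writeHtml(html, fileOut);
            // FileOutputStream does not override toString(), so this prints an
            // identity string of the form java.io.FileOutputStream@<hash>.
            System.out.println(fileOut.toString());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the sketch is that only a ByteArrayOutputStream holds the written bytes, and that decoding them with an explicit charset avoids garbled non-ASCII characters on servers whose default encoding is not UTF-8.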