
Html to docx, how to remove blank spaces on top of the page

Posted: Sun Mar 08, 2015 2:36 am
by barnaba_hunters
Hi!
I've been looking into this for a while, but so far I have no clue how to achieve it. I have dynamically generated HTML content, which I then convert to docx using docx4j. The look of the docx elements is acceptable at the moment, but since a significant amount of dynamic content is populated with the help of JSTL tags, the outcome is all over the place. Here is how I do it:

first get page source:
Code:
String content = renderViewHelper.renderView("/report/testPrint2",model);


then I process this string:

@Override
public File createDocx(String content, ReportDto report, Entity entity,
        ScpUser user) {

    File tempHtmlFile = null;
    try {
        tempHtmlFile = File.createTempFile("tempHtmlFile", ".html");
        File tempDocxFile = File.createTempFile("report", ".docx");

        // Write the rendered page as UTF-8; try-with-resources closes the writer
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(tempHtmlFile), "UTF-8"))) {
            out.write(content);
        }

        // Limit font discovery to Arial and Times
        PhysicalFonts.setRegex(".*(arial|times).*");

        // NOTE: htmlSettings is never passed to the importer below, so the
        // image paths and the CSS reset set on it do not reach the docx.
        String inputfilepath = "tempHtmlFile.html";
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setImageDirPath(inputfilepath + "_files");
        htmlSettings.setImageTargetUri(inputfilepath
                .substring(inputfilepath.lastIndexOf("/") + 1) + "_files");

        String userCSS = "html, body, div, span, h1, h2, h3, h4, h5, h6, p, a, img, ol, ul, li, table, caption, tbody, tfoot, thead, tr, th, td "
                + "{ margin: 0; padding: 0; border: 0;}"
                + "body {line-height: 1;} ";
        htmlSettings.setUserCSS(userCSS);

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

        // Map the fonts used in the HTML to Word run fonts
        RFonts arialRFonts = Context.getWmlObjectFactory().createRFonts();
        arialRFonts.setAscii("Arial");
        arialRFonts.setHint(org.docx4j.wml.STHint.DEFAULT);
        arialRFonts.setHAnsi("Arial");
        XHTMLImporterImpl.addFontMapping("Arial", arialRFonts);

        RFonts timesRFonts = Context.getWmlObjectFactory().createRFonts();
        timesRFonts.setAscii("Times");
        timesRFonts.setHint(org.docx4j.wml.STHint.DEFAULT);
        timesRFonts.setHAnsi("Times");
        XHTMLImporterImpl.addFontMapping("Times New Roman", timesRFonts);

        RFonts serifRFonts = Context.getWmlObjectFactory().createRFonts();
        serifRFonts.setAscii("sans-serif");
        serifRFonts.setHint(org.docx4j.wml.STHint.DEFAULT);
        serifRFonts.setHAnsi("sans-serif");
        XHTMLImporterImpl.addFontMapping("MS Sans Serif", serifRFonts);

        XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
        xHTMLImporter.setHyperlinkStyle("Hyperlink");
        // xHTMLImporter.setParagraphFormatting(FormattingOption.IGNORE_CLASS);

        // Page setup: A4 (second argument true = landscape), narrow margins
        wordMLPackage.getDocumentModel().getSections().get(0)
                .getPageDimensions().setPgSize(PageSizePaper.A4, true);
        wordMLPackage.getDocumentModel().getSections().get(0)
                .getPageDimensions().setMargins(MarginsWellKnown.NARROW);

        // Convert the XHTML and append the result to the document body
        wordMLPackage.getMainDocumentPart().getContent()
                .addAll(xHTMLImporter.convert(tempHtmlFile, null));

        wordMLPackage.save(tempDocxFile);
        System.out.println("Saved: " + tempDocxFile.getAbsolutePath());

        return tempDocxFile;
    } catch (Docx4JException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // Also covers FileNotFoundException
        System.out.println("Something went wrong while writing the temporary files...");
        e.printStackTrace();
    } finally {
        // tempHtmlFile is still null if createTempFile itself failed
        if (tempHtmlFile != null) {
            tempHtmlFile.delete();
        }
    }
    return null;
}

Can anyone help me on this subject?
Best regards,
Mat

Re: Html to docx, how to remove blank spaces on top of the p

Posted: Wed Mar 11, 2015 12:43 pm
by jason
Sorry, it is not clear what you are asking.

Re: Html to docx, how to remove blank spaces on top of the p

Posted: Tue Apr 28, 2015 11:48 pm
by barnaba_hunters
Hi Jason,
In the attached file there is an example of a page from a docx generated with the docx4j library from an XHTML document. I would like to remove the space highlighted with the red box, so that content starts at the top of every page. What would you suggest? I hope this is clearer than my previous post.
Best regards,
Mat
Attachment: spacetodel.png (28.24 KiB)

Re: Html to docx, how to remove blank spaces on top of the p

Posted: Thu Apr 30, 2015 11:31 pm
by jason
At a high level, you could pre-process the XHTML, or post-process the resulting docx.

Is there a manual page break before each space you want removed? Or some other way to identify space which you don't want (whilst keeping space you do want)?
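
For the pre-processing route, a minimal sketch might look like the following. It assumes the unwanted space comes from empty <p> elements or <br/> tags sitting directly after <body> or directly after a page-break marker. The marker markup (an empty div with "page-break-after: always") and the class and method names are hypothetical, so adjust them to whatever your JSTL templates actually emit.

```java
import java.util.regex.Pattern;

final class XhtmlSpaceTrimmer {

    // A run of "blank" items: empty <p> elements (only whitespace, &nbsp;
    // or <br/> inside) and bare <br/> tags.
    private static final String BLANK =
            "(?:\\s*(?:<p\\b[^>]*>(?:\\s|&nbsp;|&#160;|<br\\s*/?>)*</p>|<br\\s*/?>))+";

    // ASSUMPTION: hypothetical page-break marker; change to match your markup.
    private static final String PAGE_BREAK =
            "<div\\b[^>]*page-break-after\\s*:\\s*always[^>]*>\\s*</div>";

    private static final Pattern LEADING_BLANK = Pattern.compile(
            "(<body\\b[^>]*>|" + PAGE_BREAK + ")" + BLANK,
            Pattern.CASE_INSENSITIVE);

    /** Removes blank paragraphs and line breaks that directly follow body or a page break. */
    static String stripLeadingBlanks(String xhtml) {
        // $1 keeps the <body> tag or the page-break marker itself
        return LEADING_BLANK.matcher(xhtml).replaceAll("$1");
    }

    public static void main(String[] args) {
        String in = "<html><body><p>&nbsp;</p><br/><p>First</p>"
                + "<div style=\"page-break-after: always\"></div>"
                + "<p> </p><p>Second</p><p></p><p>Third</p></body></html>";
        System.out.println(stripLeadingBlanks(in));
    }
}
```

You would call stripLeadingBlanks(content) before writing content to the temp file. Blank paragraphs elsewhere (such as the one between "Second" and "Third" above) are left alone, so the space you do want is kept. A regex like this is only adequate for simple, predictable markup; if the blank elements can be nested or carry other content, an XML parser would be the safer choice.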