Content not allowed in prolog

Posted: Wed May 13, 2020 7:26 pm
by avelin
Hello,

I have a docx file (content-not-allowed-in-prolog.docx) whose first character is a byte order mark (BOM), also known as a zero-width no-break space (U+FEFF).

That's why docx4j, using the SAXParser, cannot parse it.

If a docx file starts with a special character such as a BOM, the following code does not detect the file type correctly, and execution lands in the last else branch (see "Assuming Flat OPC XML").
Such docx files are usually encoded as UTF-8 with BOM instead of plain UTF-8.

Please advise how to fix this.

Greetings,
Angelina

Code:

    private static org.docx4j.openpackaging.packages.OpcPackage load(PackageIdentifier pkgIdentifier, InputStream inputStream, String password) throws Docx4JException {
        BufferedInputStream bis = new BufferedInputStream(inputStream);
        bis.mark(0);
        byte[] firstTwobytes = new byte[2];
        boolean var5 = false;

        int read;
        try {
            read = bis.read(firstTwobytes);
            bis.reset();
        } catch (IOException var7) {
            throw new Docx4JException("Error reading from the stream", var7);
        }

        if (read != 2) {
            throw new Docx4JException("Error reading from the stream (no bytes available)");
        } else if (firstTwobytes[0] == 80 && firstTwobytes[1] == 75) {
            // 0x50 0x4B = "PK": ZIP signature, i.e. a normal zipped docx
            return load(pkgIdentifier, bis, Filetype.ZippedPackage, (String)null);
        } else if (firstTwobytes[0] == -48 && firstTwobytes[1] == -49) {
            // 0xD0 0xCF: first two bytes of the OLE2 compound file signature
            log.info("Detected compound file");
            return load(pkgIdentifier, bis, Filetype.Compound, password);
        } else {
            log.info("Assuming Flat OPC XML");
            return load(pkgIdentifier, bis, Filetype.FlatOPC, (String)null);
        }
    }

Re: Content not allowed in prolog

Posted: Thu May 14, 2020 10:20 am
by jason
Clearly we could address this case, but in the meantime, I'm curious, how was the docx created/what is the source of these files?

A file created with a Microsoft text editor will start with a byte order mark (BOM): http://msdn.microsoft.com/en-us/library ... 01(v=vs.85).aspx
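
As a stopgap, something along the following lines might work on the caller's side: check for the BOM and skip it before handing the stream to docx4j. This is only a sketch. `BomStripper` and `stripUtf8Bom` are made-up names, not part of docx4j, and it assumes the BOM is the three-byte UTF-8 form (EF BB BF) and that the bytes following it are an otherwise normal zipped docx.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of docx4j.
public class BomStripper {

    // U+FEFF encoded as UTF-8 (assumption: the files use this form of the BOM)
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    public static InputStream stripUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];

        // read() may return fewer bytes than requested, so loop until full or EOF
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean hasBom = (n == UTF8_BOM.length);
        for (int i = 0; hasBom && i < UTF8_BOM.length; i++) {
            hasBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }

        // No BOM: push the bytes back so the caller sees the stream unchanged
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'P', 'K'};
        InputStream cleaned = stripUtf8Bom(new ByteArrayInputStream(withBom));
        // The next two bytes should now be the ZIP signature
        System.out.println((char) cleaned.read());
        System.out.println((char) cleaned.read());
    }
}
```

The returned stream could then be passed to the usual load call in place of the original one, so that the first two bytes seen by the type detection are "PK" again. Whether the rest of such a file is actually intact is something to verify on a sample first.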

Re: Content not allowed in prolog

Posted: Thu May 14, 2020 7:56 pm
by avelin
Hi Jason,

I don't know how the Word files were created. I assume someone had a pretty old MS Office installation on an old Windows machine, and there we go. My task is to parse a large number of files and analyse them.
Right now, the only thing I can do to get around this "Content not allowed in prolog" issue is to open each docx file in MS Word and re-save it, so that it is encoded correctly and the leading BOM character is removed.

It would be great if you had a suggestion or a fix for this issue.

Thank you again for your input and help.

Angelina