Node Path Utilization: Targeting Paragraphs and Text Exclusio

Posted: Wed Apr 13, 2022 5:53 am
by themanoneof
Using: 11.45 Version Reference
Using Java 11 with IntelliJ

I have been working with a List<Object> retrieved using NodePath = "//w:p", trying to target only paragraphs, not bullet points or other independent elements. I just want paragraphs. I have cast the Object to P, but I am not sure whether I have to pull the text out of the paragraph and then read that text's value. The documentation, even with the Utils it provides, has not helped me get only paragraphs. I want to target the XML directly, but the documentation does not seem to work that way.

Supplied below is some of the code I have been using. It feeds into a JavaScript UI, with a Spring Boot backend for the mapping portion, which shouldn't be a problem. In the Java program I want to get only text: if the document contains a table, I want to ignore it based on the XML, and the same goes for bullet points. If I have to filter by the JAXBElement, how would I do that? Would I just go after it directly? Is there a data relation of some sort? If so, any links, videos or anything else would be helpful. When I do the for(Object paragraph:

List<Object> SeriesNodes = DocumentFull.getJAXBNodesViaXPath(NodePath, true);
List<Object> paragraphs = getAllElementFromObject(DocumentFull.getContent(), P.class);

The first call grabs the full document, and the second, which uses a different node path, does the same. I am unsure why nothing is filtered at all. Do I have to apply styling settings, or create a new document from the given XML?

Code:
 public String uploadNewFile(@NotNull @RequestParam("file") MultipartFile ProposalGenerate) throws IOException, Docx4JException, JAXBException, XMLStreamException {
        // Can do it in one function; need to finish the input stream reading portion of docx4j
        System.out.println("File Uploaded");
        WordprocessingMLPackage dataUpload = WordprocessingMLPackage.load(ProposalGenerate.getInputStream());
        System.out.println("Lazy 1:" + dataUpload);
        WordprocessingMLPackage.load(ProposalGenerate.getInputStream());
        System.out.println("Lazy 2:" + dataUpload);
        MainDocumentPart DocumentFull = dataUpload.getMainDocumentPart();
        System.out.println("Document Full" + DocumentFull);
        System.out.println("Lazy" + dataUpload);
        ProposalGenerate.getOriginalFilename();
        System.out.println("Number 1" + ProposalGenerate.getOriginalFilename());
        ProposalGenerate.getInputStream();
        System.out.println("Number 2" + ProposalGenerate.getInputStream());
        //Get the file contents and save it to the console
        byte[] bytes = ProposalGenerate.getBytes();
        System.out.println("Number 3" + Arrays.toString(bytes));
        //Create a new file
        File file = new File("C:\\Users\\" + ProposalGenerate.getOriginalFilename());
        System.out.println("Number 4" + file);
        //Write the file to the file system
        //  file.createNewFile();
        //  System.out.println("Number 5" + file);
        FileEntity fileEntity = new FileEntity(file);
        System.out.println("Number 6" + fileEntity);
        ProposalGenerate.getContentType();
        ProposalGenerate.getBytes();
        System.out.println("Number 7" + Arrays.toString(ProposalGenerate.getBytes()));
        // Specify what type of content to grab from the document
        String NodePath = "//w:t";
        //        String NodePath = "//w:p[1]/w:r[1]/w:t[1]";
        // Temporary variable holding the node path to apply to the document
        // refreshXmlFirst (the boolean argument; may refresh the XML first)
        List<Object> SeriesNodes = DocumentFull.getJAXBNodesViaXPath(NodePath, true);
        System.out.println("Number 8" + SeriesNodes);
        //P.content.get(SeriesNodes);
        // Just a Object Declaration, with the Series of Nodes going to be used, with the document given.
        for (Object ignored : SeriesNodes) {
            // Piece of Garbage Code
            Text place = (Text) ((JAXBElement<?>) ignored).getValue();
            // Get value of the string (getValue may be wrong for getting a paragraph)
            // Questioning: place.getValue();            .getValue();
            String placeValue = place.getValue();
            // Just a way to show the data can return the object
            System.out.println("Index 1:" + placeValue);
        }//placevalue
        System.out.println("Number 8" + SeriesNodes);
        System.out.println("Number 9" + DocumentFull);
        System.out.println("Plake Number 10.5" + DocumentFull.getContent());
        //        System.out.println("Plake Number 10.6" + DocumentFull.getContent().contains(this.getAllElementFromObject(1, P.class)));
        //        System.out.println("Plake Number 10.6" + DocumentFull.getContent().contains(this.getAllElementFromObject(1, P.class)));

        // Really Important: "//w:p";
        String NodePathing = "//w:p";
       
         //System.out.println("NEW Runnner Scene: Before Bed Number 22" + DocumentFull.getJAXBNodesViaXPath(NodePathing, P.class, true));

        // Pass NodePathing to Object Xpath
        //        List<Object> SeriesPara = DocumentFull.getJAXBNodesViaXPath(NodePathing, true);
      //  List<Object> SeriesPara = DocumentFull.getJAXBNodesViaXPath(NodePathing, true);
        //        List <Object> SerialNodes = Collections.singletonList(DocumentFull.getContent().contains(this.getAllElementFromObject(NodePathing, P.class)));
        //Only Reads Text(Paragraph reads only Text, while Text Reads everything)
        List<Object> paragraphs = getAllElementFromObject(DocumentFull.getContent(), P.class);
            System.out.println("Number 11" + paragraphs);

        //Worst thing: ", ," for new line and startIndex and endIndex
        List <Object> SerialNodes = DocumentFull.getJAXBNodesViaXPath(NodePathing, false);
        System.out.println("Number 11" + SerialNodes);
        //System.out.println("NEW Runnner Scene: Before Bed Number 22" + DocumentFull.getJAXBNodesViaXPath(NodePathing, P.class, true));
        //        System.out.println("NEW Runnner Scene: Before Bed Number 22" + DocumentFull.getJAXBNodesViaXPath(1, String.valueOf(P.class), true));
        //        System.out.println("NEW Runnner Scene: Before Bed Number 22" + DocumentFull.getJAXBNodesViaXPath(String.valueOf(P.class), true));

        //System.out.println("NEW Runnner Scene: Before Bed Number 22" + DocumentFull.getJAXBNodesViaXPath(String.valueOf(P.class), false));
// IF statement
        System.out.println("Plake Number 11.65" + SerialNodes);
        for (Object drake : SerialNodes) {
            //if(drake instanceof P) {
           //     P drakeP = (P) drake;
           //     System.out.println("Plake Number 11.66" + drakeP);
          //  }
            System.out.println("Plake Number 10.7" + drake);
            //P place = P.class.cast(drake);
            //Get Paragraph from the Document
            P ParaI = (P) drake;
            System.out.println("Plake Number 10.8" + ParaI);
            System.out.println("Plake Number 10.85" + ParaI.getParaId());
            System.out.println("Plake Number 10.9" + ParaI.getTextId());
            System.out.println("Plake Number 10.10" + ((P) drake).getContent());
            // System.out.println(place);
            System.out.println("Plake Number 10.9" + ParaI.getContent());
          //  System.out.println("Plake Number 10.10" + ParaI.getContent().get(0));
          //  System.out.println("Plake Number 10.11" + ParaI.getContent().get(0).getClass());
         //   System.out.println("Plake Number 10.12" + ParaI.getContent().get(0).getClass().getName());
          //  System.out.println("Plake Number 10.13" + ParaI.getContent().get(0).getClass().getName().equals("org.docx4j.wml.Text"));
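            // pPr and pStyle are optional, so guard against null before reading the style id
            String Placer = (ParaI.getPPr() != null && ParaI.getPPr().getPStyle() != null)
                    ? String.valueOf(ParaI.getPPr().getPStyle().getVal()) : "null";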
            System.out.println("Plake Number 10.14" + Placer);
            System.out.println("Plake Number 10.15" + Placer.contains("\n"));
            String ParaText = Placer.replace("\n", " ");
            System.out.println("Plake Number 10.16" + ParaText);
            //String testingP = ParaI.getContent().get(0).toString();
          //  String placeValue = place.getPPr().getPStyle().getVal();
            //String placeValue = place.getValue();
            String placeValue = ParaI.getParaId();
           // String placeValue2 = ParaI.getValue();
            System.out.println("ERROR SEEN" + placeValue);
            System.out.println("Number 13" + drake);
            System.out.println("Number 14" + ParaI);
        }//placevalue
        System.out.println("Number 12" + SerialNodes);

       // System.out.println("Number 8" + SeriesPara);
        //DocumentFull.getContent(); //This is the content of the document in the form of a string can pass to the front end
        System.out.println("Number 10" + DocumentFull.getContent());
        System.out.println("Number 11" + dataUpload);

     //   PPr pPr = ((P) XmlUtils.unwrap(DocumentFull)).getPPr();

     //   if (pPr != null) {
     //       PPrBase.PStyle pStyle = pPr.getPStyle();

      //      if (pStyle != null) {
     //     //      String style = pStyle.getVal();
       //         System.out.println("Style: " + style);
          //  }
    //    }
   //     System.out.println("Number 12" + DocumentFull);
    //    System.out.println("Number 13" + pPr);

//        return DocumentFull.getContent() + "Hi There";
        return DocumentFull.getContent() + "Hi There";
    }
 


Links or any help would be appreciated.

If I grab the document, do I need to use the Factory in some manner, or is my loading incorrect? I tried working from the description that a paragraph is org.docx4j.wml.P, and that a paragraph is basically made up of runs of text.
@XmlRootElement(name = "p")
public class P implements Child, ContentAccessor
But I ran into problems with the XML root, and I am unsure how to access the different portions. I tried using the webapp for more information, but I only know about the three different portions: core, extended and custom.

Listed below are the two documents I have been testing with. The current UI is able to send over any document chosen.

Re: Node Path Utilization: Targeting Paragraphs and Text Excl

Posted: Fri Apr 15, 2022 10:07 am
by jason
If I'm understanding you correctly, your issue is that //w:p also returns bullet points and table cell content?

That XPath returns table cell content because a w:tc can contain a w:p.

To avoid that, you can use an XPath like /w:document/w:body/w:p

Note that that simple XPath would also leave out any w:p which is inside a content control. Whether that matters to you depends on the predictability of the content of your input documents.
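To make the difference between the two expressions concrete, here is a minimal sketch that runs both against a hand-written WordprocessingML fragment using only the JDK's XPath API, not docx4j's getJAXBNodesViaXPath. The class name BodyParagraphXPathDemo, the count helper and the sample XML are illustrative assumptions, not anything from docx4j or the original post.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BodyParagraphXPathDemo {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one body-level paragraph and one paragraph inside a table cell.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
        + "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
        + "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell text</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        + "</w:body></w:document>";

    /** Counts the nodes that the given XPath expression selects in SAMPLE. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required so the w: prefix can be resolved
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return W_NS.equals(uri)
                            ? Collections.singletonList("w").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The descendant axis also reaches the paragraph inside the table cell.
        System.out.println("//w:p -> " + count("//w:p"));
        // The absolute path only matches paragraphs that are direct children of w:body.
        System.out.println("/w:document/w:body/w:p -> " + count("/w:document/w:body/w:p"));
    }
}
```

On this sample the first expression selects both paragraphs and the second selects only the body-level one. A paragraph wrapped in a content control would likewise be missed by the second expression, as noted above.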

Regarding bullet points, these are returned because in Word, numbered and bulleted items are just ordinary w:p elements, with numbering properties. Your challenge is that the numbering properties can be directly attached to the paragraph (ie w:pPr/w:numPr) or present in a style referenced by the paragraph.

So given your requirements, XPath is probably not the best approach to getting the paragraphs you want. You'd be better off iterating through documentPart.getContents().getBody().getContent(), then for each item returned, if it is a paragraph (instanceof P), check whether it is numbered.

For that, have a look at https://github.com/plutext/docx4j/blob/ ... .java#L129
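The iterate-and-check idea can be sketched on the raw XML with the JDK's DOM API. This is not the docx4j object model: the class PlainParagraphFilter, its helpers and the sample XML are hypothetical, and the check only sees numbering attached directly to the paragraph (w:pPr/w:numPr). Numbering that comes from a referenced style is not detected here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class PlainParagraphFilter {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: plain paragraph, numbered paragraph, table, plain paragraph.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
        + "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
        + "<w:p><w:pPr><w:numPr><w:ilvl w:val=\"0\"/><w:numId w:val=\"1\"/></w:numPr></w:pPr>"
        + "<w:r><w:t>Item one</w:t></w:r></w:p>"
        + "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Name:</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        + "<w:p><w:r><w:t>Closing</w:t></w:r></w:p>"
        + "</w:body></w:document>";

    /** Returns the first direct child in the w: namespace with this local name, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** True when the paragraph carries w:pPr/w:numPr itself (style-based numbering is not seen). */
    static boolean hasDirectNumbering(Element paragraph) {
        Element pPr = child(paragraph, "pPr");
        return pPr != null && child(pPr, "numPr") != null;
    }

    /** Collects the text of body-level paragraphs that have no direct numbering. */
    static List<String> plainParagraphTexts(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element body = child(doc.getDocumentElement(), "body");
            if (body == null) {
                return texts;
            }
            // Only direct children of w:body are inspected, so table cells are never reached.
            for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element
                        && W_NS.equals(n.getNamespaceURI())
                        && "p".equals(n.getLocalName())
                        && !hasDirectNumbering((Element) n)) {
                    // getTextContent concatenates every text node under the paragraph.
                    texts.add(n.getTextContent());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(plainParagraphTexts(SAMPLE));
    }
}
```

On the sample this keeps "Intro" and "Closing" and drops both the numbered item and the table cell. The same shape carries over to the docx4j objects described above: loop over the body content, test each item with instanceof P, then test its numbering properties.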

Re: Node Path Utilization: Targeting Paragraphs and Text Excl

Posted: Sun Apr 17, 2022 5:05 am
by themanoneof
Thanks for that. I have two things to address. First, I also have a problem when I give it a document with a large number of pages and large input strings, in the thousands. The document is not read, and I am looking for a way to fix that. This is the output I am given:
Code:
INFO 12052 --- [nio-8080-exec-5] o.d.o.parts.JaxbXmlPartXPathAware        : encountered unexpected content in /word/document.xml; pre-processing
2022-04-16 13:16:10.312  INFO 12052 --- [nio-8080-exec-5] org.docx4j.XmlUtils                      : Using org.apache.xalan.transformer.TransformerImpl
2022-04-16 13:16:11.380 ERROR 12052 --- [nio-8080-exec-5] o.d.o.parts.JaxbXmlPartXPathAware        : For input string: "100.0"

It seems to be mostly about the input string length being too large. I get three errors with it, and it only occurs on a specific document; I don't have any other large test cases. I tested with 5 random paragraphs copied out to 70 pages and got no issues. Do you have any idea what could be causing it?
Second, how would I check for numbered lists? The table portion seems to work. For anything that has no newline break, I want to filter only the paragraphs, with headings like name, date and so on being enclosed inside tables. I understand the importance of the node path, but applying the information you gave me is where I am having a bit of an issue. How exactly would I substitute for XPath? I think that may be the very reason I am having a problem reading that document. If you have any guide for that portion, it would be helpful. I'm unsure how to use a node path without XPath.

How would iterating with documentPart.getContents().getBody().getContent() work? Would it replace the MainDocumentPart portion, and how would it recognize paragraphs? I want to be able to grab a paragraph solely from the XML, in the manner of the post linked below, but grab the contents instead of the formatting, with traversal and a callback.
https://www.docx4java.org/forums/docx-java-f6/iterating-over-properties-t2764.html#p9566

I want to exclude regular headings, like name:, date: and so on. I know those will get picked up as paragraphs, so I am looking for a way to exclude them with traversal, which is what the properties link above uses.

Re: Node Path Utilization: Targeting Paragraphs and Text Excl

Posted: Sun Apr 17, 2022 10:05 am
by jason
themanoneof wrote:
INFO 12052 --- [nio-8080-exec-5] o.d.o.parts.JaxbXmlPartXPathAware        : encountered unexpected content in /word/document.xml; pre-processing
2022-04-16 13:16:10.312  INFO 12052 --- [nio-8080-exec-5] org.docx4j.XmlUtils                      : Using org.apache.xalan.transformer.TransformerImpl
2022-04-16 13:16:11.380 ERROR 12052 --- [nio-8080-exec-5] o.d.o.parts.JaxbXmlPartXPathAware        : For input string: "100.0"


There is unexpected content in your input docx. How was it created?

For fixing such issues, docx4j ships with and tries to apply https://github.com/plutext/docx4j/blob/ ... essor.xslt

You can override this with your own, via https://github.com/plutext/docx4j/blob/ ... rties#L199

But better of course to fix the program which created the docx if you can.
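The text of the ERROR line has the same shape as the message Java's integer parsers attach to a NumberFormatException, which would fit a value such as 100.0 sitting where a whole number is expected. That reading is an assumption: the log shows neither the exception type nor which element or attribute is involved. A minimal sketch with a hypothetical NumberFormatDemo class shows where such a message can come from:

```java
class NumberFormatDemo {

    /** Returns the parser's error message for raw, or null when raw is a valid int. */
    static String parseFailureMessage(String raw) {
        try {
            Integer.parseInt(raw);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A decimal string is rejected by the integer parser.
        System.out.println(parseFailureMessage("100.0"));
        // A whole number parses cleanly, so no message is produced.
        System.out.println(parseFailureMessage("100"));
    }
}
```

If that is what is happening, the size of the document would not be the cause; a single out-of-place decimal value would be enough.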

Re: Node Path Utilization: Targeting Paragraphs and Text Excl

Posted: Sun Apr 17, 2022 10:41 am
by themanoneof
I've been going through it many times today, and I know the reference has to do with XPath. I have been going over the content traversal portion, but I am only able to get at the properties and paragraph properties. I can't figure out for the life of me how to select only paragraphs with traversal; the class distinction is hard to understand. The callback makes some sense, but I am honestly unsure whether it is even doing any selection based on the P class. The content is a giant string; the first one was 7770. It has to do with the length and sorting of the document, plus each document is read continually, which makes traversal way better.
Code:
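import java.util.ArrayList;
import java.util.List;
import org.docx4j.TraversalUtil;
import org.docx4j.TraversalUtil.CallbackImpl;
import org.docx4j.wml.P;

static class PFinder extends CallbackImpl {

    // Paragraphs collected during traversal
    List<P> paragraphList = new ArrayList<P>();

    @Override
    public List<Object> apply(Object o) {
        // Keep only paragraph objects; everything else is ignored
        if (o instanceof P) {
            paragraphList.add((P) o);
        }
        return null;
    }
}

// Walk the content and collect every P that is visited
PFinder pFinder = new PFinder();
new TraversalUtil(paragraphs, pFinder);

for (P p : pFinder.paragraphList) { ...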
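
A traversal that collects paragraphs but never descends into tables can be sketched on the raw XML with the JDK's DOM API. This is an illustration of the pruning idea only, not docx4j's TraversalUtil; the class ParagraphWalker, its methods and the sample XML are hypothetical. Unlike a loop over the body's direct children, it still finds a paragraph nested in a content control.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ParagraphWalker {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: plain paragraph, paragraph in a content control, paragraph in a table.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
        + "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
        + "<w:sdt><w:sdtContent><w:p><w:r><w:t>Inside control</w:t></w:r></w:p></w:sdtContent></w:sdt>"
        + "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Name:</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        + "</w:body></w:document>";

    /** Depth-first walk: records each w:p, skips every w:tbl subtree, recurses into the rest. */
    static void collect(Node parent, List<String> out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            boolean inWordNamespace = W_NS.equals(n.getNamespaceURI());
            if (inWordNamespace && "tbl".equals(n.getLocalName())) {
                continue; // prune: nothing inside a table is visited
            }
            if (inWordNamespace && "p".equals(n.getLocalName())) {
                out.add(n.getTextContent()); // concatenated text of the paragraph
                continue;
            }
            collect(n, out);
        }
    }

    /** Parses the XML and returns the text of every paragraph that is not inside a table. */
    static List<String> paragraphsOutsideTables(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphsOutsideTables(SAMPLE));
    }
}
```

On the sample this returns "Intro" and "Inside control" and leaves out the "Name:" cell. Excluding numbered paragraphs would be a further check on each w:p before it is added.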

I've looked at some other code online, but nothing seems to work as I intend.

A similar problem I am having would be this. docx-java-f6/problem-with-document-created-by-google-docs-t1802.html
I also would like to state I have no document.xml file.

Re: Node Path Utilization: Targeting Paragraphs and Text Excl

Posted: Thu Apr 21, 2022 7:39 pm
by jason
I'm not quite sure what you are asking here or how I can help further.

It is generally easy enough to understand the unexpected content. I can look at that for you if you send me your docx.

Regarding working with paragraphs, you aren't doing it the way I suggested, which is fine. If you need to look beyond the properties into styles and numbering, have a look at PropertyResolver