
How to count number of characters in a docx file

Posted: Wed Jun 22, 2011 1:44 am
by suncity65
Hi,
Could someone please tell me how I can get the number of characters in a Word docx file?
Thanks!

Re: How to count number of characters in a docx file

Posted: Wed Jun 22, 2011 3:02 pm
by jason
I assume you want the number of characters of printed text, not the file length.

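If you are looking at an existing docx created by Word, you'll find this in the document properties: Word stores its counts (Characters and CharactersWithSpaces) in the extended properties part, docProps/app.xml.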
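For the properties route, here is a minimal sketch that uses only the JDK (no docx4j). It assumes the file still carries the docProps/app.xml part that Word writes; the class and method names are made up for illustration. The stored number is whatever the last saving application wrote, so it may be stale if another tool modified the document afterwards.

```java
import java.io.File;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads the counts Word stored in docProps/app.xml.
class DocxStoredCounts {

    // Namespace of the extended-properties part.
    static final String EP_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    /** Returns the stored value of the named property, or -1 if it is absent. */
    static long readCount(InputStream appXml, String property) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(appXml);
        NodeList nodes = doc.getElementsByTagNameNS(EP_NS, property);
        if (nodes.getLength() == 0) {
            return -1;
        }
        return Long.parseLong(nodes.item(0).getTextContent().trim());
    }

    /** Opens the docx as a zip archive and reads CharactersWithSpaces. */
    static long charactersWithSpaces(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("docProps/app.xml");
            if (entry == null) {
                return -1; // part missing, e.g. file not produced by Word
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return readCount(in, "CharactersWithSpaces");
            }
        }
    }
}
```

Usage would be something like DocxStoredCounts.charactersWithSpaces(new File("c:\\sample1.docx")), with -1 meaning the property was not found.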

If you want to count characters in a docx you have created or modified in docx4j, there are two main approaches:

- traverse the main document part, and count the characters in each text run; or

- easier: use org.docx4j.TextUtils.extractText to write all the text into a StringWriter, then take the length of the result
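To illustrate the first approach (walking the text runs), here is a rough sketch that deliberately avoids docx4j and uses only the JDK: it opens the docx as a zip, parses word/document.xml, and sums the lengths of all w:t elements. The class and method names are invented for this example. It only sees text stored in w:t, so tabs, line breaks, and paragraph marks are not counted, and headers, footers, and footnotes live in other parts.

```java
import java.io.File;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: counts characters held in w:t elements.
class RunTextCounter {

    // WordprocessingML main namespace (the "w:" prefix).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Sums the length of every w:t element in the given document.xml stream. */
    static int countRunText(InputStream documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(documentXml);
        NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
        int total = 0;
        for (int i = 0; i < texts.getLength(); i++) {
            total += texts.item(i).getTextContent().length();
        }
        return total;
    }

    /** Reads word/document.xml out of the docx archive and counts its run text. */
    static int countInDocx(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countRunText(in);
            }
        }
    }
}
```

Spaces inside a run are part of the w:t content, so they are included in the total.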

Re: How to count number of characters in a docx file

Posted: Wed Jun 22, 2011 3:13 pm
by suncity65
Hi Jason,

Thank you.
I want to count the number of characters, including spaces, in a document that has been uploaded (created in Word).

Re: How to count number of characters in a docx file

Posted: Thu Jun 23, 2011 12:41 am
by suncity65
Thanks for the info, Jason. Here is my code, which might be helpful for others.

Code:
/* Count the characters (including spaces) in a docx using docx4j */
import java.io.File;
import java.io.StringWriter;

import org.docx4j.TextUtils;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

public class CountCharacters {

    public static void main(String[] args) {
        try {
            File file = new File("c:\\sample1.docx");
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
            MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

            // Root JAXB object of the main document part
            org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document) documentPart.getJaxbElement();

            // Write all the document text into a StringWriter
            StringWriter str = new StringWriter();
            TextUtils.extractText(wmlDocumentEl, str);

            String strString = str.toString();
            System.out.println("Count....." + strString.length());
        } catch (Exception e) {
            // load() throws Docx4JException; extractText() declares Exception
            e.printStackTrace();
        }
    }
}
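One caveat on the count in the snippet above: String.length() returns the number of UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for example) counts as two. If that matters for your documents, codePointCount gives one per Unicode code point. A small sketch, with a made-up class name:

```java
// Hypothetical demo of the two ways to measure a Java string.
class CharCountDemo {

    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For plain Latin text the two numbers are identical, so this only changes the result when the document contains supplementary characters.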

Re: How to count number of characters in a docx file

Posted: Wed Jun 03, 2015 8:41 pm
by andreas
How do you handle tracked changes? Is there a way to exclude them from the count?

Re: How to count number of characters in a docx file

Posted: Wed Sep 02, 2015 10:48 pm
by andreas
Here is one way to handle deleted ("removed") text in a character count: add up the lengths of all DelText elements and subtract that from the total.

Code:
  // Imports for the enclosing class (package names as in docx4j 3.x)
  import java.io.File;
  import java.io.StringWriter;

  import org.docx4j.TraversalUtil;
  import org.docx4j.finders.ClassFinder;
  import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
  import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
  import org.docx4j.wml.DelText;

  public void testWordCountInsertedDeletedText() {
    /* Count text using docx4j, subtracting text deleted under Track Changes */

    try {
      File file = new File("eksempler/Samledokument_2015-01-23_track_changes.docx");
      WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(file);
      MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

      org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document) documentPart.getJaxbElement();

      // Collect every DelText element, i.e. text marked as deleted
      ClassFinder finderDelText = new ClassFinder(DelText.class);
      new TraversalUtil(documentPart, finderDelText);

      int deletedChars = 0;
      for (Object found : finderDelText.results) {
        if (found instanceof DelText) {
          DelText delText = (DelText) found;
          if (delText.getValue() != null) {
            deletedChars += delText.getValue().length();
          }
        }
      }

      // The subtraction below assumes the extracted text still contains the deleted text
      StringWriter str = new StringWriter();
      org.docx4j.TextUtils.extractText(wmlDocumentEl, str);

      String strString = str.toString();
      String strStringClean = strString.replaceAll("[\\n\\t ]", ""); // remove newlines, tabs, and spaces

      System.out.println(strString.length() + " count with whitespace..... ");
      System.out.println(deletedChars + " deleted chars..... ");
      System.out.println((strString.length() - deletedChars) + " count without deleted text (Track Changes), still with whitespace..... ");
      System.out.println(strStringClean.length() + " count without whitespace (deleted text still included)..... ");
    } catch (Exception e) {
      // load() throws Docx4JException; extractText() declares Exception
      e.printStackTrace();
    }
  }
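A different way to get at the same numbers, sketched with the JDK only rather than docx4j: in WordprocessingML, text in a deleted run is stored in w:delText while live text (including tracked insertions) is stored in w:t, so counting the two element names separately avoids the subtraction. The class and method names below are invented for the example, and the input is the word/document.xml part taken from the docx zip.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: separates live text from tracked deletions.
class TrackedChangeCounter {

    // WordprocessingML main namespace (the "w:" prefix).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Sums the text length of all elements with the given local name. */
    static int sumLengths(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(W_NS, localName);
        int total = 0;
        for (int i = 0; i < nodes.getLength(); i++) {
            total += nodes.item(i).getTextContent().length();
        }
        return total;
    }

    /**
     * Returns {live, deleted}: characters in w:t (kept and inserted text)
     * and characters in w:delText (text deleted under Track Changes).
     */
    static int[] countLiveAndDeleted(InputStream documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(documentXml);
        return new int[] { sumLengths(doc, "t"), sumLengths(doc, "delText") };
    }
}
```

The first number corresponds to what you would get after accepting all changes, since tracked insertions are counted and deletions are not; text in tabs, breaks, and other parts such as headers is not covered by this sketch.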