A possible Docx4j addon?

Posted: Mon Jul 27, 2015 12:37 am
by derekpk
Hello All,

I have written a class that searches docx files. How do I publish it or demo it to the community to find out whether it could be added to the Docx4j GitHub repo?

Regards
Derek.

Re: A possible Docx4j addon?

Posted: Mon Jul 27, 2015 10:31 pm
by jason
Hi Derek

How much code is there?

You could make a git pull request.

Or you could create a standalone project on github - either temporarily or permanently.

kind regards .. Jason

Re: A possible Docx4j addon?

Posted: Tue Jul 28, 2015 12:19 am
by derekpk
Thanks for the response Jason,

If I add it to my GitHub account and post a link to it here, could you take a look at it and let me know what you think?

Derek.

Re: A possible Docx4j addon?

Posted: Tue Jul 28, 2015 5:35 pm
by jason
Sure, I'd be happy to.

Re: A possible Docx4j addon?

Posted: Thu Jul 30, 2015 11:10 am
by derekpk
Hello Jason,

Here is a link to my GitHub repository: https://github.com/derekpk/DocxSearchAndTag
(I modified the structure to follow the Maven folder layout.)

Have a look at the README; it explains everything in detail. The project is at about version 0.5.

Basically, the search uses XML to define what the search pattern is.
It searches, and if it finds a match it wraps the found text in <tags> at the first and last locations of the match.

There is no limit to the number of searches, so if you had a piece of text like the following
"Hello World, Hello everyone, I love searching a string more than one time(but not today)."

You could define searches that would produce the following
"<ONE>Hello</ONE> World, <ONE>Hello</ONE> every<TWO>one</TWO>, I love searching a string more than <TWO>one</TWO> time<THREE>(but not today)</THREE>."

The character indexes are also available if you only need those and not the applied tags.
For instance, using the above example, you can access the coordinates of each match.

Search name: ONE
Start position: 1
End position: 5

Search name: ONE
Start position: 13
End position: 18

Search name: TWO
Start position: 24
End position: 27

Search name: TWO
Start position: 66
End position: 69

....... and so on
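
To make the example concrete, here is a minimal sketch in plain Java that reproduces the tagged sentence and a list of match positions. It is not the DocxSearchAndTag code: it uses literal string matching instead of the XML sequence definitions, the class and method names (`TagSketch`, `Hit`, `find`, `tag`) are made up for illustration, and it assumes matches never overlap. Positions here are 0-based with an exclusive end, which may not be the convention the tool itself reports.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; names and behaviour are assumptions, not the project's API.
final class TagSketch {

    /** One match: search name, 0-based start (inclusive) and end (exclusive). */
    static final class Hit {
        final String name;
        final int start;
        final int end;

        Hit(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Finds every non-overlapping occurrence of a literal string. */
    static List<Hit> find(String text, String name, String literal) {
        List<Hit> hits = new ArrayList<>();
        int at = text.indexOf(literal);
        while (at >= 0) {
            hits.add(new Hit(name, at, at + literal.length()));
            at = text.indexOf(literal, at + literal.length());
        }
        return hits;
    }

    /** Wraps each match in <NAME>...</NAME>; assumes the matches do not overlap. */
    static String tag(String text, List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.start));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Hit h : sorted) {
            out.append(text, pos, h.start)
               .append('<').append(h.name).append('>')
               .append(text, h.start, h.end)
               .append("</").append(h.name).append('>');
            pos = h.end;
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Hello World, Hello everyone, I love searching a string "
                + "more than one time(but not today).";
        List<Hit> hits = new ArrayList<>();
        hits.addAll(find(text, "ONE", "Hello"));
        hits.addAll(find(text, "TWO", "one"));
        hits.addAll(find(text, "THREE", "(but not today)"));

        for (Hit h : hits) {
            System.out.println(h.name + " " + h.start + " " + h.end);
        }
        System.out.println(tag(text, hits));
    }
}
```

Keeping the matches as a separate list means a caller can use the positions alone and skip the tagging step.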

If you have any questions or comments (good or bad), please respond.

Thanks, Derek

Re: A possible Docx4j addon?

Posted: Wed Aug 26, 2015 7:53 pm
by derekpk
Jason, did you get to look at the project?

Re: A possible Docx4j addon?

Posted: Thu Aug 27, 2015 10:10 am
by jason
I checked it out from GitHub just now. Sorry for the delay - it's been flu season here in the southern hemisphere :-(

Do you need to update your JAXB model or the SequenceMatch.xml example?

I get Attribute 'p:type' is not allowed to appear in element 'p:sequence':

Code:
You chose to search this file: C:\Users\jharrop\git\DocxSearchAndTag\src\main\resources\Example.docx
With this sequence file: C:\Users\jharrop\git\DocxSearchAndTag\SequenceMatch.xml
DocumentEventHandler :
Event: Severity:  2, Message:  cvc-complex-type.3.2.2: Attribute 'p:type' is not allowed to appear in element 'p:sequence'., Linked Exception:  org.xml.sax.SAXParseException; systemId: file:/C:/Users/jharrop/git/DocxSearchAndTag/SequenceMatch.xml; lineNumber: 7; columnNumber: 65; cvc-complex-type.3.2.2: Attribute 'p:type' is not allowed to appear in element 'p:sequence'., LOCATOR   Line Number:  7, Column Number:  65, Offset:  -1, Object:  null, Node:  null, Url:  file:/C:/Users/jharrop/git/DocxSearchAndTag/SequenceMatch.xml
java.lang.Exception: javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException; systemId: file:/C:/Users/jharrop/git/DocxSearchAndTag/SequenceMatch.xml; lineNumber: 7; columnNumber: 65; cvc-complex-type.3.2.2: Attribute 'p:type' is not allowed to appear in element 'p:sequence'.]
   at ie.decoder.docx.searchandtag.Unmarshall.UnmarshallTheDocument(Unmarshall.java:85)
   at ie.decoder.docx.searchandtag.BlobFinder.BlobSetup(BlobFinder.java:94)
   at ie.decoder.docx.searchandtag.BlobFinder.Search(BlobFinder.java:153)
   at ie.decoder.docx.searchandtag.Main.main(Main.java:64)
Caused by: javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException; systemId: file:/C:/Users/jharrop/git/DocxSearchAndTag/SequenceMatch.xml; lineNumber: 7; columnNumber: 65; cvc-complex-type.3.2.2: Attribute 'p:type' is not allowed to appear in element 'p:sequence'.]
   at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(Unknown Source)
   at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(Unknown Source)
   at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(Unknown Source)
   at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(Unknown Source)
   at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
   at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
   at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
   at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(Unknown Source)
   at ie.decoder.docx.searchandtag.Unmarshall.UnmarshallTheDocument(Unmarshall.java:80)
   ... 3 more
Caused by: org.xml.sax.SAXParseException; systemId: file:/C:/Users/jharrop/git/DocxSearchAndTag/SequenceMatch.xml; lineNumber: 7; columnNumber: 65; cvc-complex-type.3.2.2: Attribute 'p:type' is not allowed to appear in element 'p:sequence'.
   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.processAttributes(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorHandlerImpl.startElement(Unknown Source)
   at com.sun.xml.internal.bind.v2.runtime.unmarshaller.ValidatingUnmarshaller.startElement(Unknown Source)
   at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.startElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
   ... 10 more
Exception in thread "main" java.lang.NullPointerException
   at ie.decoder.docx.searchandtag.BlobFinder.SequenceSearcher(BlobFinder.java:235)
   at ie.decoder.docx.searchandtag.BlobFinder.Search(BlobFinder.java:169)
   at ie.decoder.docx.searchandtag.Main.main(Main.java:64)
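
The NullPointerException at the end of the trace suggests the search carried on after unmarshalling had already failed. Below is a minimal sketch of validating a sequence file up front and stopping on failure, using the JDK's `javax.xml.validation` API. The inline schema, the unprefixed `sequence` element and the class name `ValidateFirstSketch` are hypothetical stand-ins, not the project's actual schema or code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Illustrative sketch only; the schema below is an assumption, not the project's XSD.
final class ValidateFirstSketch {

    // Hypothetical schema: <sequence> is empty and allows only a 'name' attribute.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='sequence'>"
          + "<xs:complexType>"
          + "<xs:attribute name='name' type='xs:string'/>"
          + "</xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";

    /** Returns null when the XML is valid, otherwise the validator's message. */
    static String validate(String xml) throws IOException {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        String problem = validate("<sequence name='ONE' type='literal'/>");
        if (problem != null) {
            // Stop here instead of searching with a null, half-built model.
            System.err.println("Sequence file rejected: " + problem);
            return;
        }
        System.out.println("Sequence file is valid; safe to unmarshal and search.");
    }
}
```

Returning early on a validation failure keeps the report to the one error that matters, without a follow-on exception from the search code.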

Re: A possible Docx4j addon?

Posted: Thu Aug 27, 2015 8:00 pm
by derekpk
Sorry about that. It's always the same: everything works great until the first person looks at it, and then it breaks :oops:

Anyway, that sequence file was an old version and is now gone. You can use the example files in the resources folder, Example.docx and Example.xml.

Derek

Re: A possible Docx4j addon?

Posted: Fri Aug 28, 2015 11:59 am
by jason
Hi Derek

It worked :-)

Perhaps you could explain the main ways you see people using it (ie key use cases)?

It is great to have new ways to find stuff in your docx. IMHO, it is valuable to highlight what you can do once you've found it.

From a quick look at the code, you're conducting the search on the main document part marshalled to an XML string.

But somewhere I guess you're discarding the OpenXML tags?

So the user can search for just document text, or OpenXML tags (eg w:p), or some hybrid (eg "p>A continent") - not sure at what point the tags are getting discarded.

When the search is complete, we have text (not OpenXML), plus your tags?

Then what? :-)

Does the resulting data structure allow you to then manipulate the docx? Put another way, is this a read/write technique, or just read?

kind regards .. Jason

Re: A possible Docx4j addon?

Posted: Mon Aug 31, 2015 7:57 pm
by derekpk
Hello Jason,

Thanks for the feedback.

jason wrote:
"Perhaps you could explain the main ways you see people using it (ie key use cases)?"

This is a little bit like I just invented the wheel and am waiting for someone to invent the motor car ;-)

I suppose the main motivation and use case is as an alternative to regex: the syntax for creating complex multiple searches is straightforward XML.

One use case as I see it is processing multiple documents, as in, too many to process manually.

Let's say you have documents that require localization, or legal documents, where you know there will be predefined patterns that need some mechanism for identifying segments, but you don't know the exact phrase in advance.
For instance, as in the example I gave, finding content within brackets, "(Hello World)" and "(Goodbye All)"
OR
"21/06/72"
You only need to know part of the match, not all of it.
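
For comparison, here is a rough sketch of the same two partial matches written as regular expressions in plain Java, which is the kind of syntax the XML sequence definitions are meant to spare the user. The patterns and the names `RegexComparisonSketch` and `findAll` are assumptions chosen for this illustration; they are not taken from DocxSearchAndTag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; patterns are assumptions made for comparison.
final class RegexComparisonSketch {

    // Anything between an opening bracket and the next closing bracket.
    static final Pattern BRACKETED = Pattern.compile("\\([^)]*\\)");

    // Two digits, slash, two digits, slash, two digits (e.g. 21/06/72).
    static final Pattern SHORT_DATE = Pattern.compile("\\b\\d{2}/\\d{2}/\\d{2}\\b");

    /** Collects every match of the pattern, in document order. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Say (Hello World) then (Goodbye All) on 21/06/72.";
        System.out.println(findAll(BRACKETED, text));
        System.out.println(findAll(SHORT_DATE, text));
    }
}
```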

For example, you could tag certain sequences that DON'T require localization: people's names, place names, scientific or mathematical formulae.

Or you could tag certain sequences that DO require localization: date and time formats.


jason wrote:
"From a quick look at the code, you're conducting the search on the main document part marshalled to an XML string.

But somewhere I guess you're discarding the OpenXML tags?

So the user can search for just document text, or OpenXML tags (eg w:p), or some hybrid (eg "p>A continent") - not sure at what point the tags are getting discarded.

When the search is complete, we have text (not OpenXML), plus your tags?"


At present I am only searching the plain text; I'm not looking at the OpenXML. That is something I intend to address if the interest is there.
At the end of the current process, in addition to the marked-up plain text, you have access to a structure containing the indexes of all the found sequences.
So with the indexes you can manipulate the document yourself.
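
As a rough sketch of what that manipulation step might involve: a plain-text index has to be mapped back to the run of text it came from before anything in the docx can be changed. The example below models the document's runs as a simple list of strings. It is an assumption about one possible approach, not code from DocxSearchAndTag or docx4j, and the names `RunOffsetSketch` and `locate` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; runs are modelled as plain strings, not docx4j objects.
final class RunOffsetSketch {

    /**
     * Maps a 0-based offset in the concatenated text to
     * {run index, offset inside that run}.
     */
    static int[] locate(List<String> runs, int offset) {
        if (offset < 0) {
            throw new IndexOutOfBoundsException("negative offset: " + offset);
        }
        int base = 0; // offset of the current run within the whole text
        for (int i = 0; i < runs.size(); i++) {
            int length = runs.get(i).length();
            if (offset < base + length) {
                return new int[] {i, offset - base};
            }
            base += length;
        }
        throw new IndexOutOfBoundsException("offset past end of text: " + offset);
    }

    public static void main(String[] args) {
        // The same sentence start, split across three runs as Word might split it.
        List<String> runs = Arrays.asList("Hello ", "World, Hel", "lo everyone");
        int[] where = locate(runs, 13); // 13 is the start of the second "Hello"
        System.out.println("run " + where[0] + ", offset " + where[1]);
    }
}
```

A match can start in one run and end in another, as the second "Hello" does here, so any edit based on the indexes would need to handle both ends separately.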

Looking forward to your response.

Derek.