Comparing Word Documents for similarities and differences

lameid · Nov 2, 2018

I know VBA well but have barely played in the realm of Word's object model.

My situation is that I may have 20 or more Word Documents used to define the text that is sent amongst various markets. They will likely be similar but with some differences.

Generally we look at these in terms of paragraphs in the documents and then ultimately program an Access report to generate the documents, with code specifying the differences in a textbox for each paragraph. What I would like to be able to do is identify which documents are the same in certain paragraphs and which are different.

Ultimately I would like to automate at least the writing of that code. I know exactly how to do this on the Access side. My problem is on the Word side: I don't know how I should be doing this. Should I literally just read each document paragraph by paragraph and then start analyzing that data? Or should I compare documents and iteratively compare multiple files? I think the first idea is better because there are more than two.

So what Word Object model elements would you look at using for this? This is a big undertaking and honestly a back burner project but the payoff is huge so I am hoping to make some progress with it.

dhookom · Nov 2, 2018

I would expect there are many tools for doing this like this built-in one.

Duane
Minnesota
Hook'D on Access
MS Access MVP 2001-2016

lameid · Nov 5, 2018

I think you have completely missed the scope here. I want to read with VBA and as OUTPUT construct code that deals with the differences. Word has a compare function. I know there is a way to inspect the compares but comparing multiple documents one on one does not really solve the possibilities of the differences and sameness.

macropod · Nov 5, 2018

Au contraire, if you have a base document (or one you can use for that purpose) you can use for the comparison, you can compare each of the others against that. If you want a many-to-many comparison, even that could be done using Word's own tools but you're also going to need a database or something such where you can perform whatever analyses of the results you care to undertake.

Cheers
Paul Edstein
[MS MVP - Word]

DjangMan · Nov 12, 2018

It looks like if you use the document's Compare method that you can then work through the changes/revisions. You would want to start with a 'base' document but it would be possible to compare each document against each other document if you needed to.

https://word.tips.net/T001303_Finding_Changes_by_Editor.html

https://stackoverflow.com/questions...ments-how-to-check-revisions-objects-contents

https://www.thedoctools.com/word-macros-tips/word-macros/extract-tracked-changes-to-new-document/
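
For illustration, here is a minimal sketch of working through the revisions after a compare. It assumes it is run from Word's own VBA editor and uses Application.CompareDocuments (available since Word 2007). The procedure name ListDifferences and the two path arguments are hypothetical; only insertions and deletions are spelled out, and everything else is reported by its numeric type.

```vba
' Hypothetical helper: compares two files with Word's own engine and lists
' each resulting revision in the Immediate window.
Sub ListDifferences(ByVal basePath As String, ByVal otherPath As String)
    Dim docBase As Document, docOther As Document, docResult As Document
    Dim rev As Revision

    Set docBase = Documents.Open(FileName:=basePath, AddToRecentFiles:=False)
    Set docOther = Documents.Open(FileName:=otherPath, AddToRecentFiles:=False)

    ' CompareDocuments returns a new document holding the differences
    ' as tracked changes; the two source files are left untouched.
    Set docResult = Application.CompareDocuments( _
        OriginalDocument:=docBase, RevisedDocument:=docOther, _
        Destination:=wdCompareDestinationNew)

    For Each rev In docResult.Revisions
        Select Case rev.Type
            Case wdRevisionInsert
                Debug.Print "Inserted: " & rev.Range.Text
            Case wdRevisionDelete
                Debug.Print "Deleted:  " & rev.Range.Text
            Case Else
                Debug.Print "Other (type " & rev.Type & ")"
        End Select
    Next rev

    ' Nothing is saved; the comparison document is only read.
    docResult.Close SaveChanges:=wdDoNotSaveChanges
    docOther.Close SaveChanges:=wdDoNotSaveChanges
    docBase.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

Instead of printing, the loop could just as well write each revision to a table for later analysis.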

lameid · Nov 12, 2018

Macropod, how does one do a many to many comparison analysis in word? I have only ever seen the ability to compare two documents at a time.

Otherwise what is the advantage of using the comparison feature within Word vs. say just pulling in bits of text into a table and then using Access code to compare?

mintjulep · Nov 12, 2018

lameid said:
have 20 or more Word Documents used to define the text that is sent amongst various markets.

lameid said:
program an Access report to generate the documents with code specifying differences

I suggest that you first decide if you have a chicken and need an egg, or have an egg and need a chicken.

Depending on that answer, this doesn't seem like much more than a mail merge.

lameid · Nov 12, 2018

mintjulep said:
I suggest that you first decide if you have a chicken and need an egg, or have an egg and need a chicken.

It is more of an ecosystem than a species. I am just looking to code something repetitive that I frequently do. I have my reasons, they are good ones, and I am not going to spend hours laying them out. Most certainly no rational human being can conceive of something this involved and be willing to do it without having very good reasons.

A combination of the paragraphs collection and range object along with reading some format information is starting to look promising.
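
As a rough sketch of that idea, the following walks the Paragraphs collection and prints each paragraph's text with a couple of font properties. It assumes it runs inside Word; the name DumpParagraphs is hypothetical, and the choice of Font.Name and Font.Bold is just an example of "some format information".

```vba
' Hypothetical helper: prints index, font name, bold flag and text
' for every paragraph of the document passed in.
Sub DumpParagraphs(ByVal doc As Document)
    Dim para As Paragraph
    Dim txt As String
    Dim i As Long

    For Each para In doc.Paragraphs
        i = i + 1
        txt = para.Range.Text
        ' A paragraph's Range.Text ends with the paragraph mark (vbCr); drop it.
        If Right$(txt, 1) = vbCr Then txt = Left$(txt, Len(txt) - 1)

        ' Font.Bold is True (-1), False (0) or wdUndefined when the
        ' paragraph mixes bold and non-bold text; Font.Name is empty
        ' when the paragraph mixes fonts.
        Debug.Print i & vbTab & para.Range.Font.Name & vbTab & _
                    para.Range.Font.Bold & vbTab & txt
    Next para
End Sub
```

Calling `DumpParagraphs ActiveDocument` from the Immediate window lists the open document. Mixed formatting inside a paragraph would need a finer walk (for example over words or characters), which this sketch does not attempt.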

On the other hand, somehow pushing each paragraph into a record with rich text format on the paragraph field is interesting too, although the logistics of this are elusive to me. This is perhaps the most promising, as it would break everything down to a simple text analysis with some HTML tags embedded (which is what data in the rich text format looks like as a string). I am just not having much luck understanding the clipboard and formatting, other than that the object built into VBA does not play nice, which means the Windows API. Ironically I already poked that bear on another issue and it wasn't worth pursuing at the time. This may be more intriguing.

mintjulep · Nov 12, 2018

lameid said:
I am just looking to code something repetitive that I frequently do.

And you are here seeking help.

And it sounds like a possibly interesting challenge, that several here would be willing to help with.

Except that you thus far can't or won't clearly explain what your something repetitive is.

macropod · Nov 12, 2018

Lameid said:
Macropod, how does one do a many to many comparison analysis in word? I have only ever seen the ability to compare two documents at a time.

Obviously, you'd compare each document against each of the others in turn. Even with just 10 documents, that requires 45 two-document comparisons... Why you'd need to compare each document against all the others, rather than against just the base escapes me.
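
The arithmetic behind that figure is n * (n - 1) / 2 pairs for n documents. A small sketch, with the hypothetical function name PairCount:

```vba
' Number of two-document comparisons needed to compare every document
' in a set of n against every other one: n * (n - 1) / 2.
Function PairCount(ByVal n As Long) As Long
    PairCount = n * (n - 1) \ 2
End Function

Sub ShowPairCounts()
    Debug.Print PairCount(10)   ' 45, the figure quoted above
    Debug.Print PairCount(20)   ' 190 for a 20-document set
    Debug.Print 20 - 1          ' 19 when comparing only against one base
End Sub
```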

Cheers
Paul Edstein
[MS MVP - Word]

lameid · Nov 12, 2018

I repeatedly take a set of documents for a Purpose, which typically comes in 20 or so versions. I/we compare and contrast them, generally at the paragraph level. Then I program the content parts that are not part of the standard report implementation, which involves a header and footer in the report and most of the footer in Word, but I do need the document numbers. Obviously I am doing things conditionally based on some key identifying what the text is for, and for paragraphs that are the same everywhere or the same for a group, I do not explicitly program the text for each case but for the group. The question then is which things are the same and can be grouped together for the code. The text and document-wide fonts are literals in the code. Other fonts I typically implement as an Access "rich text" format font tag.

Purposes come in repeatedly. What I came here for is help with what can be done on the Word side of the shop. I offered two separate methodologies: iterative compares of documents, which seems difficult to manage at best, or reading document pieces and then comparing and contrasting. I have always favored the latter as it seems more direct: ultimately a list has to be read to implement the process, and this seems a more linear route to it. I am open to better methods but I am guessing there are none.

SkipVought · Nov 12, 2018

What happens in vagueness, stays in vagueness. And Elvis has left the building!

I hope that someone else understands your latest post, but it reminds me of political doublespeak.

Maybe I’m not up to speed with “purposes.” Kind of sounds to me like “I didn’t know the gun was loaded/and I’m so sorry my friend/I didn’t know the gun was loaded/and I’ll never, never do it again!”

So are there parameters that describe a specific “purpose?”

And how is one “purpose” differentiated from another?

How does a set of documents reside in distinct “purpose?”

Skip,
Just traded in my OLD subtlety...
for a NUance!

macropod · Nov 13, 2018

SkipVought said:
What happens in vagueness, stays in vagueness. And Elvis has left the building!

Indeed, the descriptions so far are about as opaque/obscure/vague as any I've seen. A bit of perspicacity wouldn't go astray...

Cheers
Paul Edstein
[MS MVP - Word]

lameid · Nov 13, 2018

Purpose is just a general word indicating that there are distinct sets of letters. The letter has a reason or purpose for being sent. Use whatever word you want. Only documents within a purpose are being compared amongst themselves as described by the OP. The question was asked what am I doing repeatedly.

I see a reply from Macropod that must have happened between the time I loaded the page last time and posted... Yes, the many comparisons are why I think it would be easier to read everything from Word and do the analysis vs. multiple binary comparisons. All the documents have to be compared because out of my example of 20 there may be 3 sets of 4 that are very similar within each grouping of 4, with minor differences in a few out of n paragraphs. Then the remaining 8 (20 - 3 * 4) will vary more substantially. Differences within each group and between individual documents may be that all the sentences are worded differently, or that 6 sentences are used instead of 4 to convey the same contextual meaning, etc. (people exist, a myriad of possibilities). This sameness by nature sounds like grouping and counting with queries to identify groups and individuals, or equivalent logic since long text is involved. The trick of course will be to add the segmenting logic, but that is not really the Word piece I am asking for help with. I am also contemplating flattening the logic to just include everything for every version. The code would be longer but easier to generate automatically; it may be a good alpha test and should give me some impression of performance versus fewer literals with groups / segments in the report code.
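
To make the "grouping and counting" idea concrete, here is a minimal sketch in plain VBA that groups document IDs by the exact text of one paragraph. The names GroupByText and DemoGrouping and the parallel-array inputs are hypothetical; it only catches exact matches (after trimming), not near matches, and the dictionary's default comparison is case-sensitive.

```vba
' Groups document IDs by the exact text of one paragraph.
' docIDs and texts are parallel arrays (hypothetical inputs).
' Returns a Scripting.Dictionary: key = text, value = comma-separated DocIDs.
Function GroupByText(docIDs As Variant, texts As Variant) As Object
    Dim groups As Object
    Dim i As Long
    Dim key As String

    Set groups = CreateObject("Scripting.Dictionary")
    For i = LBound(docIDs) To UBound(docIDs)
        key = Trim$(texts(i))
        If groups.Exists(key) Then
            ' Same text seen before: add this document to that group.
            groups(key) = groups(key) & "," & docIDs(i)
        Else
            groups.Add key, CStr(docIDs(i))
        End If
    Next i
    Set GroupByText = groups
End Function

Sub DemoGrouping()
    Dim g As Object, k As Variant
    Set g = GroupByText(Array(1, 2, 3, 4), _
                        Array("New Product!", "New Product!", _
                              "Widgets are here!", "Widgets are here!"))
    For Each k In g.Keys
        Debug.Print g(k) & " -> " & k
    Next k
End Sub
```

With the sample data, documents 1 and 2 land in one group and 3 and 4 in another. A paragraph whose group holds every document needs no Select Case at all.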

The question remains what is the best way to pull this out of Word?

macropod · Nov 13, 2018

IMHO, I have already answered your final question...

Cheers
Paul Edstein
[MS MVP - Word]

lameid · Nov 13, 2018

Why am I wrong in my analysis methodology? What does Compare do that is exceptionally helpful here? It seems to me 'Compare' tells me the contrasts but, like I said, I am here for help with Word, and with the objects side of it in particular, as I really don't know what is under the hood there, which is why I need the details.

macropod · Nov 14, 2018

Have you actually tried doing a document comparison using Word's tools? If so which method did you use?

Cheers
Paul Edstein
[MS MVP - Word]

lameid · Nov 14, 2018

I use the Compare button on the review ribbon to compare two documents most frequently. Usually I do this if I am comparing a revision to previous version to implement any changes. I then ensure markup is turned on so I can see what has changed showing all three of the markup document and two compared documents.

I am then usually verifying code matches the original while I make the changes indicated by markup. Since the code is layered it is more involved than that. USUALLY content literals are entered by paragraph and done at an appropriate point.

The report code is generally structured like below. This is an oversimplification, but it shows the basic structure with contrived literals as an example.

Code:

Me!txtP01 = "Dear " & Me!txtFname & " " & Me!txtLname & "," 'this control would be "paragraph" 1; the other controls on the right side of the equal sign are hidden,
                                                             'for piping in values from the recordsource
                                                             'At this point doing this in code is a team style preference so all content is in the same place,
                                                             'as it would be easier to implement this in the control source of the control with source fields

Select Case Grouping 'some sort of grouping is known, the sameness problem
  Case "Group A"
    Me!txtP02 = "<div><b>New Product!</b></div>"
    Select Case DocID 'a key that ties back to the original document
      Case 1
        Me!txtP03 = "This is the best Widget on the market!"
      Case 2
        Me!txtP03 = "This Widget gives you the best bang for the buck!"
    End Select
  Case "Group B"
    Me!txtP02 = "<div><b>Widgets are here!</b></div>"
    Select Case DocID
      Case 3
        Me!txtP03 = "The Blue ones are the coolest"
      Case 4
        Me!txtP03 = "The Red ones are the coolest"
    End Select
End Select

I had mentioned the possibility of removing the grouping altogether in which case all paragraphs would be specified for each document.
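
A rough sketch of what generating that flattened form could look like, assuming the paragraph texts have already been read into arrays. The function name BuildSelectCase and its parameters are hypothetical; it handles single-line text only and doubles embedded quotes so the emitted literal stays valid VBA.

```vba
' Builds the text of a flattened Select Case block for one paragraph control.
' controlName, docIDs and texts are hypothetical inputs for illustration.
Function BuildSelectCase(ByVal controlName As String, _
                         docIDs As Variant, texts As Variant) As String
    Dim i As Long
    Dim code As String

    code = "Select Case DocID" & vbCrLf
    For i = LBound(docIDs) To UBound(docIDs)
        code = code & "  Case " & docIDs(i) & vbCrLf
        ' Double any embedded quotes inside the literal.
        code = code & "    Me!" & controlName & " = """ & _
               Replace(texts(i), """", """""") & """" & vbCrLf
    Next i
    BuildSelectCase = code & "End Select"
End Function
```

For example, `Debug.Print BuildSelectCase("txtP03", Array(1, 2), Array("This is the best Widget on the market!", "This Widget gives you the best bang for the buck!"))` prints a block shaped like the inner Select Case above.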
