Custom Word DOC to XML - Sentences collection 1

MakeItSo · May 7, 2007

Hello friends,

I've posted this in the XML forum, but it might be better placed here.
I am currently developing a method to transform Word documents (and later other Office formats) to XML, storing their content in one XML file and their formatting in another.
I know that Office 2003+ already provides a "save as XML" function. However, this does not fully meet my requirements, which is why I want a customized solution.

This is all still quite crude, and the second XML file (the formatting one) will hopefully become a true stylesheet in the future.

My problem now is that not all of the Word document's content is being extracted, apparently because not ALL content belongs to the "Sentences" collection. Is that so?

I need to segment the content by sentences or sentence-like units.

So far, this is my code, using Chilkat XML:

Code:

Sub Doc2XML()

    Dim myXML As New ChilkatXml, myXSL As New ChilkatXml, Knoten As ChilkatXml   ' Knoten = node
    Dim doc As Document
    Dim Sens As Long, ctr As Long, ran As Range   ' counters are whole numbers

    On Error GoTo errhandl

    Set doc = ActiveDocument
    '****************
    'Write content & formatting xml headers
    '****************
    myXML.Encoding = "utf-8"
    'myXML.Version = "1.0"
    myXML.NewChild "document", doc.Name

    myXSL.Encoding = "utf-8"
    'myXSL.Version = "1.0"
    myXSL.NewChild "document", doc.Name
    '******************
    '*****Calculate number of sentences*****
    Sens = doc.Sentences.Count              ' <<< problem here
    ctr = 1

    Do While ctr <= Sens
        Set ran = doc.Sentences(ctr)        ' <<< problem here
        '**************
        'write content
        Set Knoten = myXML.NewChild("sentence", "")
        Knoten.AddAttribute "number", CStr(ctr)
        Knoten.NewChild "text", ran.Text
        '***************
        'write formatting
        '...
        'and so forth
        '...
        ctr = ctr + 1                       ' move on to the next sentence
    Loop

    myXML.SaveXml doc.Path & "\" & Left(doc.Name, Len(doc.Name) - 4) & "_content.xml"
    myXSL.SaveXml doc.Path & "\" & Left(doc.Name, Len(doc.Name) - 4) & "_formatting.xml"

So, my problem is with the two lines marked "<<< problem here". Only "true" sentences are being extracted. A lot of the document content remains untouched.

Do I have to cycle through all the words, or is there a more elegant approach?

All sorts of ideas welcome!

Thanks & regards,
Andy

[blue]Help us, join us, participate
IAHRA - International Alliance of Human Rights Advocates[/blue]

PHV · May 7, 2007

You may perhaps have to take a look at StoryRanges

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886

MakeItSo · May 7, 2007

Hmmm. I looked into that, and StoryRanges covers an even wider range than Paragraphs.
I WILL need story ranges to include all text from headers and footers, but it still skips quite a lot of sentences.

I have altered my code like this:

Code:

For StoRan = 1 To doc.StoryRanges.Count
    Sens = doc.StoryRanges(StoRan).Sentences.Count
    '....
        Set ran = doc.StoryRanges(StoRan).Sentences(ctr)

Still won't let me access the text.
Some parts can only be reached through the Paragraphs collection. However, that does not break the text down into small enough chunks.
[sadeyes]
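
For reference, a minimal sketch of how the story ranges could be walked. This is not the code from the post above; it assumes that iterating the StoryRanges collection with For Each, and following NextStoryRange for linked stories such as the headers and footers of later sections, is what is wanted here. The procedure name ListAllStorySentences and the variable names are made up for illustration.

Code:

Sub ListAllStorySentences()
    Dim story As Range
    Dim r As Range
    Dim sen As Range
    Dim n As Long

    ' StoryRanges is indexed by story type, not by 1..Count,
    ' so For Each is used instead of a counter.
    For Each story In ActiveDocument.StoryRanges
        Set r = story
        ' Follow the chain of linked stories of the same type.
        Do While Not r Is Nothing
            For Each sen In r.Sentences
                n = n + 1
                Debug.Print n, r.StoryType, sen.Text
            Next sen
            Set r = r.NextStoryRange
        Loop
    Next story
End Sub

The sketch only prints each sentence to the Immediate window; writing the XML nodes would go where the Debug.Print line is.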


fumei · May 7, 2007

Only "true" sentences are being extracted. Lots of document content remains untouched.

Could you elaborate on this?

What content - an example? - remains untouched?
What is an example of a "false"(?) sentence?

Gerry
My paintings and sculpture

fumei · May 7, 2007

Are there tables in the document?

Gerry

MakeItSo · May 7, 2007

Hi Gerry,

Yes, there are tables in the document. They show the same behaviour. Now here's what I mean by "true" sentences:

I have found out that it apparently depends not only on whether a sentence ends with a separator (period, semicolon, colon), but also on whether it starts with a capital letter.

So, e.g. in one table, only "Mr." from "Mr. Harry Sample" was extracted.
"Harry Sample" was not considered a sentence, hence not extracted.
One funny example is how "z. B.", a German abbreviation for "for example", was misinterpreted:
The sentence:

Text text text text text z. B. other text, text, and so forth.

came out like this:

Text text text text text z.<144> other text, text, and so forth.

Only the "B." was treated as a sentence, the rest was not...

Or a really long sentence containing "bzw." (respectively) was treated as a sentence up to and including "bzw.", while the two following words remained untouched.

This is weird, isn't it?


MakeItSo · May 7, 2007

Addition:

No, there is no difference in font, style, or other formatting (underline, hidden, etc.) between extracted and non-extracted text.


fumei · May 7, 2007

Using the Sentences collection is tricky. For example:

This is a sentence 6.<p>

The above is the text "This is a sentence 6." followed by the paragraph mark.

There are TWO sentences in this. A paragraph mark is considered a sentence.

A blank cell in a table is a sentence. As is the End-of-row marker of a table.

I have been testing with the following code, looking at how sentences are parsed.

Code:

Sub TestSentences()

Dim r As Range
Dim var
Dim sEnd As String
Dim sStart As String
sStart = "start  "
sEnd = "  end"
For var = 1 To ActiveDocument.Range.Sentences.Count
   Set r = ActiveDocument.Sentences(var)
      Debug.Print sStart & r.Text & sEnd
Next
End Sub

I made a table with:

CellA said:
Text text text text text z. B. other text, text, and so forth.

CellB said:
Mr. Harry Sample

NOTE: CellB is the last cell in the row.

Here is the result.

From CellA
start Text text text text text z. end
start B. end
start other text, text, and so forth.
end
From CellB
start Mr. end
start Harry Sample

end

Notice that CellA has three sentences, AND the last sentence includes the end-of-cell marker (which is why "end" lands on a new line).

CellB does parse Harry Sample as a separate sentence, but also includes in that sentence the end-of-cell marker AND the end-of-row marker.

So I really do not know why you are getting untouched words after.

Gerry

MakeItSo · May 7, 2007

Neither do I, Gerry.
By the way, I already recognized that end of cell and paragraph marker issue, so I exclude that.
I make the range selection shrink until there is only text left:

Code:

' Skip leading tabs and spaces
Do While Left(ran.Text, 1) = vbTab Or Left(ran.Text, 1) = " "
    ran.Start = ran.Start + 1
Loop
' Skip trailing line breaks and cell-end markers (Chr(7) = table cell end character)
Do While (Right(ran.Text, 1) = vbCr Or Right(ran.Text, 1) = vbLf Or Right(ran.Text, 1) = Chr(7))
    ran.End = ran.End - 1
Loop

This way, I exclude leading spaces and tabs, hard and soft line breaks, and cell end markers.

After another close look, it seems that this weird problem occurs in every sentence that contains an abbreviation.
Example:

...mit evtl. notwendigen...(="...with possibly required..."

In this case, too, the sentence was recognized up to "evtl.", but the rest of it was skipped.

I have added a "stop" statement that fires if the range text contains the first two words following "evtl.".
The stop was never executed; no "sentence" contained these words.

Frankly: I don't get it.


fumei · May 7, 2007

I don't get it either, because when I run my code against:

...mit evtl. notwendigen...(="...with possibly required..."

I get:

start ... end
start mit evtl. end
start notwendigen...(="... end
start with possibly required..."
end

It definitely catches the following text as "sentences". The Range object is set for that text. You are saying the range object is never set for anything after "evtl."?

Hmmmm.

Gerry

MakeItSo · May 7, 2007

Sorry for that mess - I think I sorted it:

When your code did not show this problem, I thought it must be something I do during extraction. I replace each sentence with a placeholder. Never thought Word would care whether the sentence said "this I do" or "sentence1".
That obviously makes the Sentences collection go nuts.
The problem disappeared as soon as I deactivated the placeholder function.
[banghead]

I should have guessed...

I'll try again by cycling backwards through the sentences...
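
A rough, untested sketch of what cycling backwards could look like. It assumes that replacing a later sentence leaves the boundaries of the earlier sentences unchanged, which is the whole point of going from last to first. The procedure name ReplaceSentencesBackwards is made up, and the placeholder format is only modeled on the "<144>" seen earlier in the thread.

Code:

Sub ReplaceSentencesBackwards()
    Dim doc As Document
    Dim ran As Range
    Dim i As Long

    Set doc = ActiveDocument

    ' Walk from the last sentence to the first, so that an edit never
    ' changes the text in front of sentences that are still to be visited.
    For i = doc.Sentences.Count To 1 Step -1
        Set ran = doc.Sentences(i)

        ' Keep paragraph marks and cell-end markers out of the replacement.
        Do While Right(ran.Text, 1) = vbCr Or Right(ran.Text, 1) = Chr(7)
            ran.End = ran.End - 1
        Loop

        ' A "sentence" that was only a marker is now empty and is skipped.
        If Len(ran.Text) > 0 Then
            Debug.Print i, ran.Text           ' the extraction would happen here
            ran.Text = "<" & CStr(i) & ">"    ' assumed placeholder format
        End If
    Next i
End Sub

Note that a sentence range normally includes its trailing space, so the placeholder swallows that space as well; trimming it first, as in the earlier shrinking loops, would preserve it.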

Sorry for that. You mentioned such things often enough in other posts - should have known...


MakeItSo · May 7, 2007

Yepp, confirmed!
Now, the code extracts everything, including all table cells, headers, footers, hidden text....

Jeeez. Life could be so easy if you paid attention once in a while...

Will post back soon with an amended code piece.


fumei · May 7, 2007

Hey! Hey!

Gerry
