Hello friends,
I've posted this in the XML forum, but it might be better posted here.
I am currently developing a method to transform Word documents (and later other Office formats) to XML, storing their content in one XML file and their formatting in another.
I know that Office 2003+ already provides a "save as XML" function. However, it does not fully meet my requirements, which is why I want a customized solution.
This is all still quite crude, and the second (formatting) XML file will hopefully become a true stylesheet in the future.
My problem now is that not all of the Word document's contents are being extracted, apparently because not ALL content belongs to the "Sentences" collection. Is that so?
I need to segment the contents by sentences or equivalent to sentences.
So far, this is my code, using Chilkat XML:
Code:
Sub Doc2XML()
    Dim myXML As New ChilkatXml, myXSL As New ChilkatXml, Knoten As ChilkatXml
    Dim doc As Document
    Dim Sens As Long, ctr As Long, ran As Range
    On Error GoTo errhandl
    Set doc = ActiveDocument
    '****************
    'Write content & formatting xml headers
    '****************
    myXML.Encoding = "utf-8"
    'myXML.Version = "1.0"
    myXML.NewChild "document", doc.Name
    myXSL.Encoding = "utf-8"
    'myXSL.Version = "1.0"
    myXSL.NewChild "document", doc.Name
    '******************
    '*****Calculate number of sentences*****
    Sens = doc.Sentences.Count          ' <-- problem line (the "bolded part")
    ctr = 1
    Do While ctr <= Sens
        Set ran = doc.Sentences(ctr)    ' <-- problem line (the "bolded part")
        '**************
        'write content
        Set Knoten = myXML.NewChild("sentence", "")
        Knoten.AddAttribute "number", CStr(ctr)
        Knoten.NewChild "text", ran.Text
        '***************
        'write formatting
        '...
        'and so forth
        '...
        ctr = ctr + 1                   ' advance to the next sentence
    Loop
    'Strip the last 4 characters of the name (assumes an extension such as ".doc")
    myXML.SaveXml doc.Path & "\" & Left(doc.Name, Len(doc.Name) - 4) & "_content.xml"
    myXSL.SaveXml doc.Path & "\" & Left(doc.Name, Len(doc.Name) - 4) & "_formatting.xml"
    Exit Sub
errhandl:
    MsgBox "Doc2XML failed: " & Err.Description
End Sub
So, my problem is with the bolded part, i.e. the two lines that use doc.Sentences. Only "true" sentences are being extracted, and a lot of the document content remains untouched.
Do I have to cycle through all the words, or is there a more elegant approach?
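One direction I am considering, as an untested sketch only: as far as I understand, doc.Sentences covers just the main text story, so headers, footers, footnotes and text boxes would have to be reached through doc.StoryRanges instead. The sub name ListAllStorySentences and the Debug.Print output are placeholders of mine, and I have not verified that this picks up every piece of missing content.

```vba
Sub ListAllStorySentences()
    Dim story As Range, part As Range, sen As Range
    Dim n As Long

    ' StoryRanges holds one range per story type present in the document
    ' (main text, headers, footers, footnotes, text boxes, ...)
    For Each story In ActiveDocument.StoryRanges
        Set part = story
        ' Stories of the same type can be chained (e.g. one header per
        ' section), so follow NextStoryRange until it runs out
        Do While Not part Is Nothing
            For Each sen In part.Sentences
                n = n + 1
                ' Placeholder output; the real version would write XML nodes
                Debug.Print n, part.StoryType, sen.Text
            Next sen
            Set part = part.NextStoryRange
        Loop
    Next story
End Sub
```

If this is the right idea, the StoryType value could go into the XML as an extra attribute on each sentence node, so the content file records where each piece of text came from.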
All sorts of ideas welcome!
Thanks & regards,
Andy
[blue]Help us, join us, participate
IAHRA - International Alliance of Human Rights Advocates[/blue]