find great grandchild tags in xml

Status
Not open for further replies.

grnzbra

Programmer
Mar 12, 2002
1,273
US
I have a Word .docx document which, when one looks at its XML, has, among other things, the following nested tags:
<w:p>
<w:r>
<w:t>Some text</w:t>
<w:t>whch needs</w:t>
<w:t>to be concatinated.</w:t>
</w:r>
</w:p>

I am able to write the script that could go to the <w:r> tags and then work with the <w:t> tags. Unfortunately, I need to start at the <w:p> tags. All the references to this type of thing stop at the grandchild nodes. What is the format for reaching the grandchild node of a specific node?

ie starting at
set xmlDoc=CreateObject("Microsoft.XMLDOM")
xmlDoc.async="false"
xmlDoc.load("document.xml")

for each x in xmlDoc.selectNodes("//w:p")

how do I say
for each <w:r>
for each <w:t>
 
instead use

set objTNodes = xmlDoc.getElementsByTagName("w:t")

-Geates

"I hope I can chill and see the change - stop the bleed inside and feel again. Cut the chain of lies you've been feeding my veins; I've got nothing to say to you!"
-Infected Mushroom

"I do not offer answers, only considerations."
- Geates's Disclaimer
 
Thank you for the suggestion.
Unfortunately, I must start at the <w:p> tag because in some of the <w:p> tags there are two or more <w:r> tags with one or several <w:t> tags that are concatenated within the <w:r> tags and processed separately. At other times there are <w:p> tags that have individual <w:r> tags with only one <w:t> tag, and that needs to be handled as an individual item. And then there is the one in the example in my first post.

On simple test XML files such as:
<States>
<State ref="FL">
<name>Florida</name>
<capital>Tallahassee</capital>
</State>
<State ref="IA">
<name>Iowa</name>
<capital>Des Moines</capital>
</State>
</States>
I can use

For i = 0 to 2
Set objChildNodes = objXMLDoc.documentElement.childNodes.item(i).childNodes
For Each strNode In objChildNodes
document.write(strNode.xml & "<br>")
Next
document.write("<br>")
Next

However, the <w:p> tag is several tags in from the <w:body> tag so what I need to do is go to the <w:p> tags in the manner you described, and for each <w:p> tag run the loop process shown in the simple example. My problem is starting the loops once I have gotten to the level of the <w:p> tags.
 
For a multi-element static structure (like your States example):

Code:
set xmlDoc = CreateObject("Microsoft.XMLDOM")
xmlDoc.async = false
xmlDoc.load("document.xml")

'use the same variable the document was loaded into
set objElements = xmlDoc.getElementsByTagName("w:p")

for each objPNode in objElements
    for each objRNode in objPNode.childNodes
        for each objTNode in objRNode.childNodes
            msgbox objTNode.text
        next
    next
next

But if there is only one child in the parent (like in your <w:p> example), it looked in my testing as though getElementsByTagName returned a single object rather than a list. A for..each loop doesn't work on single objects, and coding for these conditions will only cause frustration. You will have MUCH better luck and MUCH LESS stress if you open the file as a text file and concatenate all the words between the <w:t> tags.

Code:
set objFSO = CreateObject("Scripting.FileSystemObject")
set objText = objFSO.OpenTextFile("c:\temp\test.xml", 1, true, 0)

do while NOT (objText.AtEndOfStream)
	strLine = objText.ReadLine
	intBegin = inStr(strLine, "<w:t>")
	intEnd = inStr(strLine, "</w:t>")
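	'test each position explicitly; "intBegin and intEnd" is a bitwise And and can be 0 even when both tags are found
	if (intBegin > 0 and intEnd > intBegin) then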
		strText = strText & mid(strLine, intBegin + 5, intEnd - (intBegin + 5)) & " "
	end if
loop
	
msgbox strText

-Geates
 
The problem is that within the first <w:p> tag, the contents of the <w:t> tags have to be taken individually. However, they don't have any identifiers like <state>, <name> or <capital>.

In the first <w:p> tag, the first <w:t> is the address, the second is the owner name, the third is the move-in date, the fourth is the move-out date and the fifth is the landlord. Easy enough.

However, down a few <w:p> tags, there is a sentence called description that starts out as a full string of text. When it gets into the XML tags, it is split between four <w:t> tags. Or it could be three <w:t> tags, or five, or seven, or however many it ends up being split into. These all have to be concatenated.

Is there any way to determine which <w:p> tag any given <w:t> tag is in?

By the way, how did you get your sample code into the code window?
 
Again, the problem remains about what type of object .getElementsByTagName returns. If there is more than one child, .getElementsByTagName will return a list of objects, traversed only by a for..each loop. If there is only one child, a single object is returned and a for..each loop will ignore the object because it is not a list.

I don't completely understand the workings of the XML object. What I have said is simply based off of my observations and may be incorrect. Perhaps someone else will know more.
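
One way to test that observation is a throwaway script like the sketch below. It assumes the file declares the w prefix (the made-up test string here declares it with the standard WordprocessingML namespace URI), and the variable names are only illustrative. As far as I can tell from the MSXML documentation, getElementsByTagName and childNodes hand back a node list even when there is a single match, so each loop should simply run once. The parentNode calls at the end are one possible answer to the question of which <w:p> a given <w:t> sits in.

```vbscript
' Sketch only: hypothetical test data, not the real document.xml.
Const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

Set xmlDoc = CreateObject("Microsoft.XMLDOM")
xmlDoc.async = False
xmlDoc.loadXML "<w:body xmlns:w='" & W_NS & "'>" & _
               "<w:p><w:r><w:t>Only one run, one text node</w:t></w:r></w:p>" & _
               "</w:body>"

' One matching element, but the result is still a list with a length
Set lstP = xmlDoc.getElementsByTagName("w:p")
MsgBox "Number of w:p nodes: " & lstP.length

For Each objP In lstP
    For Each objR In objP.childNodes
        For Each objT In objR.childNodes
            ' Walk back up from w:t to w:r to w:p
            MsgBox objT.text & " (inside " & objT.parentNode.parentNode.nodeName & ")"
        Next
    Next
Next
```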

However, if there is a way to distinguish between those <w:t> that need concatenating and those that do not, I strongly recommend using the alternative method I posted. Post the actual XML (or a snippet) that contains those descriptions that need concatenation and those that do not. I'll try and make it happen.

Sorry, I can't be of further help with the method you wish to use.


-Geates
 
Any way you can change the tag names? In my testing, if a tag has a ":", then the XML.Load() method will fail to load any data into the XML object.

Code:
set objXML = CreateObject("Microsoft.XMLDOM")
objXML.load("c:\temp\grnxbra.xml")
msgbox objXML.text

If any tags in the XML contain a ":" then objXML.text returns nothing. Once the ":" are removed, objXML.text returns the file content

XML that doesn't work:
Code:
<w:p>
   <w:r>
      <w:t>Some text</w:t>
      <w:t>whch needs</w:t>
      <w:t>to be concatinated.</w:t>
   </w:r>
</w:p>

XML that works:
Code:
<wp>
   <wr>
      <wt>Some text</wt>
      <wt>whch needs</wt>
      <wt>to be concatinated.</wt>
   </wr>
</wp>

Can someone else chime in as to why this is?
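
A possible explanation, offered as a sketch rather than a definitive answer: MSXML is namespace-aware, so a prefix such as w: has to be declared with an xmlns:w attribute somewhere above the element that uses it. A hand-made test file that copies only the inner tags has no such declaration, and the load should fail with a parse error rather than silently returning nothing. The real document.xml from a .docx declares the prefix on its root element. The check below uses loadXML on in-line strings (hypothetical test data) and reads parseError.reason to see why a load failed.

```vbscript
' Sketch only: compares the same markup with and without the namespace declaration.
Const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

Set objXML = CreateObject("Microsoft.XMLDOM")
objXML.async = False

' 1) Prefix used but never declared: the parser is expected to reject this
If Not objXML.loadXML("<w:p><w:r><w:t>Some text</w:t></w:r></w:p>") Then
    MsgBox "Load failed: " & objXML.parseError.reason
End If

' 2) Same markup with the prefix declared on the outermost element
If objXML.loadXML("<w:p xmlns:w='" & W_NS & "'><w:r><w:t>Some text</w:t></w:r></w:p>") Then
    MsgBox "Loaded: " & objXML.text
End If
```

Checking the return value of load or loadXML, and parseError when it is False, is a cheap habit that would also catch a wrong file path.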

-Geates
 
Got it. I wrote a recursive function that traverses parents and children and collects the text between tags with a specified tag name.

test XML:
Code:
<wbody>	
	<wp>
		Parent 1
		<wr>
			Child 1.1
			<wt>Soup without</wt>
			<wt>chicken is</wt>
		</wr>
		<wr>
			Child 1.2
			<wt>a lot like</wt>
			<wt>soup that</wt>
			<wt>would taste</wt>
		</wr>
	</wp>
	<wp>
		Parent 2
		<wr>
			Child 2.1
			<wt>better if</wt>
			<wt>it had</wt>
		</wr>
		<wr>
			Child 2.2
			<wt>chicken</wt>
			<wt>in it.</wt>
		</wr>
	</wp>
</wbody>

VBS:
Code:
function getNodeText(lstParentNodes, strTagName)
	for each objParent in lstParentNodes
		if (objParent.hasChildNodes) then
			strText = strText & getNodeText(objParent.childNodes, strNodeName)
		else
			for each objNode in lstParentNodes
				if (objNode.nodeName = strTagName) then
					strText = strText & objNode.text & " "
				end if
			next
		end if
	next
	getNodeText = strText
end function

'Create and populate XML object
set objXML = CreateObject("Microsoft.XMLDOM")
objXML.load("c:\temp\grnxbra.xml")

'Get a parent node
set lstParentNodes = objXML.getElementsByTagName("wp")

'print text between "wt" tags
msgbox getNodeText(lstParentNodes, "wt")

output:
Soup without chicken is a lot like soup that would taste better if it had chicken in it.

-Geates
 
Thank you for the help.
I have saved your code as an hta file and added "document.write" statements to see what is happening.
Code:
<html>
<body>
<script language="VBScript">
function getNodeText(lstParentNodes, strTagName)
document.write("We've just entered the function. <br>")
document.write(" The value of strTagName is " & strTagName & ".<br>")
    for each objParent in lstParentNodes
        if (objParent.hasChildNodes) then
document.write("The parent has child nodes.<br>")
            strText = strText & getNodeText(objParent.childNodes(1), strNodeName)
document.write("STRTXT " & strText & "<br>")
        else
document.write("The parent has no child nodes.<br>")
            for each objNode in lstParentNodes
document.write("The node.Name is " & objNode.nodeName & ".<br>")
                if (objNode.nodeName = strTagName) then
                    strText = strText & objNode.text & " "
document.write("STRTXT " & strText & "<br>")
                end if
            next
        end if
    next
    getNodeText = strText
end function

document.write("We start here. <br>")
'Create and populate XML object
set objXML = CreateObject("Microsoft.XMLDOM")
objXML.load("c:\temp\grnxbra.xml")

'Get a parent node
set lstParentNodes = objXML.getElementsByTagName("wp")


'print text between "wt" tags
document.write("TO THE FUNCTION, JEEVES!<br>")
msgbox getNodeText(lstParentNodes, "wt")
document.write("DONE")
</script>
</body>
</html>

This results in an error message indicating for line 7 that "Object does not support this property or method" and the following lines being displayed:

We start here.
TO THE FUNCTION, JEEVES!
We've just entered the function.
The value of strTagName is wt.
The parent has child nodes.
We've just entered the function.
The value of strTagName is .

Obviously, we are only calling the function once, but it seems that the function is looping back to a point before the "for each" line.

After the output line indicating that the parent has child nodes, there should be a line saying "STRTXT Soup without". But the text is not being picked up. The next output line indicates that we have looped back to the beginning of the function. However, the last output line indicates that we have lost the value of strTagName. I assume that is what causes the error message.

Also, the "strNodeName" variable doesn't seem to appear anywhere. Just to see what would happen, I deleted it and got an error message indicating that I didn't have enough arguments in the "getNodeText" statement. However, when I Googled "getNodeText" the examples all seemed to have only one parameter. What does the "strNodeName" do?
 
By the way, the following is the actual document from the <w:body> to the first <w:p> tag
Code:
<w:body>
	<w:sdt>
		<w:sdtPr>
			<w:rPr>
		  		<w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi" w:cstheme="majorBidi"/>
		  		<w:sz w:val="76"/>
		  		<w:szCs w:val="72"/>
		  	</w:rPr>
			<w:id w:val="30347306"/>
			<w:docPartObj>
				<w:docPartGallery w:val="Cover Pages"/>
				<w:docPartUnique/>
			</w:docPartObj>
		</w:sdtPr>
		<w:sdtEndPr>
			<w:rPr>
				<w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/>
				<w:sz w:val="22"/>
				<w:szCs w:val="22"/>
			</w:rPr>
		</w:sdtEndPr>
		<w:sdtContent>
			<w:tbl>
				<w:tblPr>
					<w:tblpPr w:leftFromText="187" w:rightFromText="187" w:vertAnchor="page" w:horzAnchor="page" w:tblpXSpec="center" w:tblpYSpec="center"/>
					<w:tblW w:w="5000" w:type="pct"/>
					<w:tblCellMar>
						<w:top w:w="216" w:type="dxa"/>
						<w:left w:w="216" w:type="dxa"/>
						<w:bottom w:w="216" w:type="dxa"/>
						<w:right w:w="216" w:type="dxa"/>
					</w:tblCellMar>
					<w:tblLook w:val="04A0"/>
				</w:tblPr>
				<w:tblGrid>
					<w:gridCol w:w="5040"/>
					<w:gridCol w:w="4703"/>
					<w:gridCol w:w="3649"/>
				</w:tblGrid>
				<w:tr w:rsidR="007F6B36">
					<w:sdt>
						<w:sdtPr>
							<w:rPr>
								<w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi" w:cstheme="majorBidi"/>
								<w:sz w:val="76"/>
								<w:szCs w:val="72"/>
							</w:rPr>
							<w:alias w:val="Title"/>
							<w:id w:val="276713177"/>
							<w:placeholder>
								<w:docPart w:val="D0FCC0AA5B894515B92675413FCB5D01"/>
							</w:placeholder>
							<w:dataBinding w:prefixMappings="xmlns:ns0='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' xmlns:ns1='http://purl.org/dc/elements/1.1/'" w:xpath="/ns0:coreProperties[1]/ns1:title[1]" w:storeItemID="{6C3C8BC8-F283-45AE-878A-BAB7291924A1}"/>
							<w:text/>
						</w:sdtPr>
						<w:sdtContent>
							<w:tc>
								<w:tcPr>
									<w:tcW w:w="3525" w:type="dxa"/>
									<w:tcBorders>
										<w:bottom w:val="single" w:sz="18" w:space="0" w:color="808080" w:themeColor="background1" w:themeShade="80"/>
										<w:right w:val="single" w:sz="18" w:space="0" w:color="808080" w:themeColor="background1" w:themeShade="80"/>
									</w:tcBorders>
									<w:vAlign w:val="center"/>
								</w:tcPr>
 
Oops. I missed the <w:p> tag. It's the next line.
 
2 errors, both on line 7.

1. objParent.childNodes(1) returns a single node (the second child, since the index is zero-based). Remember, for..each loops can't traverse objects, only lists (collections). objParent.childNodes returns a list.
2. strNodeName should be strTagName.

Code:
strText = strText & getNodeText(objParent.childNodes, strTagName)


-Geates
 
Just one more problem.

Part of the document.xml file is the following:

Code:
<w:bookmarkStart w:id="4" w:name="Check5" /> 
    <w:p w:rsidR="00BB42AB" w:rsidRDefault="00346677" w:rsidP="00AD44BD">
      <w:pPr>
        <w:tabs>
          <w:tab w:val="left" w:pos="1800" /> 
          <w:tab w:val="left" w:pos="7200" /> 
        </w:tabs>
        <w:ind w:left="1080" /> 
      </w:pPr>
      <w:r>
        <w:fldChar w:fldCharType="begin">
          <w:ffData>
            <w:name w:val="Check5" /> 
            <w:enabled /> 
            <w:calcOnExit w:val="0" /> 
            <w:checkBox>
              <w:sizeAuto /> 
              <w:default w:val="1" /> 
            </w:checkBox>
          </w:ffData>
        </w:fldChar>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve">FORMCHECKBOX</w:instrText> 
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="end" /> 
      </w:r>
  <w:bookmarkEnd w:id="4" />

The checked or unchecked status of the checkbox seems to be given by the <w:default w:val="1" /> tag.

By changing the <w:p> (by the way, I was able to extract the text from the document with the colons in place) to <w:default>, I am able to select the <w:default> tags, but can't seem to get the "1" or "0" value from them. I've tried "objNode.text" and "objNode.nodeValue", which gave me nothing and "objNode.nodeVal" which gave me an error message.

Any suggestions for how to extract the number associated with w:val?
 
I was able to extract the text from the document with the colons in place

What did you change in the function? I tried and no nodes are returned from .getElementsByTagName

To answer your question, the .xml property will contain ALL the text from the first "<" to the last ">" of the node.

When iterating the objNode, check to see if ":val=" exists in the first tag of the node. If it does, MID it.

Code:
for each objNode in lstParentNodes
	intEndTag = inStr(objNode.xml, ">")
	intPos = inStr(left(objNode.xml, intEndTag), ":val=")
	if (intPos) then 
		strVal = mid(objNode.XML, intPos, intEndTag - intPos)
		msgbox strVal
	end if
...
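
If string slicing feels fragile, an alternative sketch is to ask the element for its attribute directly. getAttribute is a standard MSXML element method; the test string and variable names below are assumptions for illustration. w:val is an attribute rather than element content, which is why .text and .nodeValue come back empty for it.

```vbscript
' Sketch only: hypothetical test data modelled on the checkbox snippet above.
Const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

Set xmlDoc = CreateObject("Microsoft.XMLDOM")
xmlDoc.async = False
xmlDoc.loadXML "<w:checkBox xmlns:w='" & W_NS & "'>" & _
               "<w:sizeAuto/><w:default w:val='1'/>" & _
               "</w:checkBox>"

For Each objNode In xmlDoc.getElementsByTagName("w:default")
    ' getAttribute returns Null when the attribute is not present
    strVal = objNode.getAttribute("w:val")
    If Not IsNull(strVal) Then MsgBox "w:val = " & strVal
Next
```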

-Geates
 
After I set up the function, I found that I got, as you said, "Soup without chicken is a lot like soup that would taste better if it had chicken in it.". What I needed was:

"Soup without chicken is a lot like soup that would taste"

and

"better if it had chicken in it."

I couldn't get the function to work that way, so I went back to just grabbing the <wt> tags and using VBScript IF statements to start and stop the concatenation. There is a problem with this: if one of the <wt> tags starts with the word "section", the concatenation will fail to stop for that particular portion. Hence, I was hoping to be able to concatenate the <wt> tags based on the <wp> tags. In any event, I went back to simply calling the node by name.
Code:
 Set xmlDoc=CreateObject("Microsoft.XMLDOM")
 xmlDoc.async="false"
 xmlDoc.load("document.xml")
 set nodes = xmlDoc.selectNodes("//w:t")
 document.write("Nodes: " & nodes.length & "<br/>")
 for i = 1 to nodes.length - 1

I used the function with the <wt> tag configuration and, when I went to the code above, also used the <wt> configuration. However, when I started trying to get the "val=" information, I thought of trying it with the <w:t> configuration and, while I didn't get the info for which I was looking, I found that all the info from the <wt> configuration was also available with the <w:t> configuration.

It's bad enough that someone is going to have to go through the process of taking the .docx file, adding ".zip" to the end of it, expanding it, finding the document.xml file and saving it somewhere. I think I would hear no end of complaints if I then told them they had to open it in Notepad and do a find and replace on all the "<w:" and "</w:" tags.

I just have to hope that none of the <w:t> tags following the

<w:t>Comments/Deficiencies:</w:t>

tag starts with

"Section"

(or that the

<w:t>Comments/Deficiencies:</w:t>

(which starts the concatenation) isn't split into
Code:
<w:t>Comments</w:t>
<w:t>/</w:t>
<w:t>Deficiencies</w:t>
<w:t>:</w:t>
or something, which is what happens to what should be:

<w:t>Section - 1 A/C</w:t>

but ends up

Code:
<w:t>Section -</w:t>
<w:t>1</w:t>
<w:t>A</w:t>
<w:t>/</w:t>
<w:t>C</w:t>
Being brand spanking new to XML, this is all very confusing. I do very much appreciate the help.
 
Could you suggest any books that would cover this type of thing? All I can seem to find is how to write XML or how to extract simple stuff. Nothing seems to deal with the strange stuff that comes out of a .docx file.
 
I'm quite confused. Start again. In the simplest terms and most convenient definitions (Breakfast Club reference), what are you trying to do?

-Geates
 
In the original problem, I have paragraph nodes <w:p> (parent), run nodes <w:r> (child) and text nodes <w:t> (grandchild).

I need to concatenate the text nodes into complete runs and apply each completed run to a paragraph, which would be a field in a table row (each document would be a row in the table). But the field name is also a text node, so I have to note the .text of one node so I know the associated field name, then capture and concatenate the .text of the next few nodes into a variable until I get a node where .text is the same as the next field name (and be sure not to concatenate that to the previous variable). That starts the process all over again.

In addition, I have a bunch of textboxes which, luckily, are the only places where <w:default> tags show up in the XML. However, no such luck that they would use the .text to hold a 0 or 1.

In any event, I couldn't figure out how to make the function work with the original problem, but it worked beautifully with the checkbox problem. However, now that I've seen it work with the checkboxes, I think I can see how to use it with the text problem. It certainly would be a lot neater than what I've got.
 
I would recommend converting the .docx into a .txt to get rid of ALL the useless crap MSO (Microsoft Office) puts in its documents (the ":" in the tag names comes from a document saved as .xml via MSO). Loop through the text file and replace all "<w:" and "</w:" with "<w" and "</w", respectively. Save it with an XML extension. Use the getNodeText function to get the data you need.

Code:
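dim intRecursion, objFSO

intRecursion = 0
'objFSO is used by convertToText and by the main script below
set objFSO = CreateObject("Scripting.FileSystemObject")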

sub convertToText(strFilespec)
	set objWord = CreateObject("Word.Application")
	objWord.Visible = true
	
	if (objFSO.FileExists(strFileSpec)) then
		strFileName = objFSO.GetBaseName(strFileSpec)
		strFilePath = objFSO.GetParentFolderName(strFileSpec)
		strFile = strFilePath & "\" & strFileName & ".txt"
	else
		msgbox "the file does not exist"
		objWord.Quit
		exit sub
	end if

	if NOT (objFSO.FileExists(strFile)) then
		objWord.Documents.Open strFileSpec
		set objDoc = objWord.ActiveDocument
		objDoc.saveAs strFile, 3 'Text format
		objDoc.Close
		objWord.Quit
	end if
end sub 

function getNodeText(lstParentNodes, strNodeName)
	intRecursion = intRecursion + 1
	for each objParent in lstParentNodes
		if (objParent.hasChildNodes) then
			strText = strText & getNodeText(objParent.childNodes, strNodeName)
			if (intRecursion = 1) then strText = strText & vbNewLine & vbNewLine
		else
			for each objNode in lstParentNodes
				intEndTag = inStr(objNode.xml, ">")
				intPos = inStr(left(objNode.xml, intEndTag), "val=")
				if (intPos) then 
					strVal = mid(objNode.XML, intPos, intEndTag - intPos)
					msgbox strVal
				end if
				if (inStr(objNode.nodeName, strNodeName)) then
					strText = strText & objNode.text & " "
				end if
			next
		end if
	next
	intRecursion = intRecursion - 1
	getNodeText = strText
end function

'Convert .docx to txt file. *REQUIRES MSO BE INSTALLED
convertToText "file.docx"
	
'Loop through the text file and replace all "<w:" and "</w:" with "<w" and "</w", respectively.
strText = objFSO.OpenTextFile("file.txt", 1, true, 0).ReadAll
strText = replace(strText, "<w:", "<w")
strText = replace(strText, "</w:", "</w")

'Save with an XML extension (2 = ForWriting, true = create the file if it does not exist)
set objFile = objFSO.OpenTextFile("file.xml", 2, true)
objFile.Write strText
objFile.Close

'Open XML file and use getNodeText to gather the data.
set objXML = CreateObject("Microsoft.XMLDOM")
objXML.load("file.xml")

set lstParentNodes = objXML.getElementsByTagName("wp")

msgbox getNodeText(lstParentNodes, "wt")
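
If the end goal is one string per <w:p>, a sketch that works on the unmodified XML (colons and all) might look like the following. It assumes MSXML 3.0 or later, where the SelectionLanguage and SelectionNamespaces properties are available, and it uses a made-up test string in place of document.xml. The relative XPath .//w:t picks up every text node under the current paragraph, no matter how many runs the text was split into.

```vbscript
' Sketch only: hypothetical test data; variable names are illustrative.
Const W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

Set xmlDoc = CreateObject("Microsoft.XMLDOM")
xmlDoc.async = False
' Tell selectNodes to use XPath and what the w: prefix means
xmlDoc.setProperty "SelectionLanguage", "XPath"
xmlDoc.setProperty "SelectionNamespaces", "xmlns:w='" & W_NS & "'"

xmlDoc.loadXML "<w:body xmlns:w='" & W_NS & "'>" & _
    "<w:p><w:r><w:t>Soup without</w:t><w:t>chicken is</w:t></w:r>" & _
    "<w:r><w:t>a lot like</w:t><w:t>soup that</w:t><w:t>would taste</w:t></w:r></w:p>" & _
    "<w:p><w:r><w:t>better if</w:t><w:t>it had</w:t></w:r>" & _
    "<w:r><w:t>chicken</w:t><w:t>in it.</w:t></w:r></w:p>" & _
    "</w:body>"

For Each objP In xmlDoc.selectNodes("//w:p")
    strPara = ""
    ' Every w:t below this paragraph, whichever w:r it belongs to
    For Each objT In objP.selectNodes(".//w:t")
        strPara = strPara & objT.text & " "
    Next
    MsgBox Trim(strPara)
Next
```

With the test string above, this should show one message box per paragraph. To try it against the real file, the loadXML call could be swapped for xmlDoc.load("document.xml").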

-Geates
 