I'm working with a legacy VB6/XPath XML shredder to load a SQL Server database. It's working, but it's really slow. I'm looking for suggestions to speed it up.
Each of the xml files I'm working with contains only a single set of elements, not repeated sets of elements. For example, by analogy to Microsoft's familiar books.xml sample file, my xml files look like this:
Code:
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
My xml files do **not** look like this:
Code:
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
  <book category="web" cover="paperback">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
As shown in the VB6/XPath pseudocode below, the shredder currently works by recursively walking each XML file, testing every node against a Select Case inside a For Each loop. That works, but it's slow because every element in every XML file is considered. The problem is that I'm working with hundreds of thousands of xml files, each of which contains many hundreds of elements, and I'm only interested in about 3 dozen of those hundreds of elements. I know the tags that identify the elements I'm interested in; they're always the same. Are there any obvious ways to speed this up? For example, instead of recursively parsing the entire XML file, can I somehow extract and parse only the 3 dozen or so elements that I am interested in? (I've sketched the sort of thing I'm imagining after the code below.) One complication: a few of the elements I'm interested in have an indeterminate number of child nodes, and I need to extract information from every one of those child nodes.
Code:
Public Sub shredXML(ByRef Nodes As MSXML2.IXMLDOMNodeList)
    Dim xNode As MSXML2.IXMLDOMNode
    For Each xNode In Nodes
        If xNode.nodeType = NODE_ELEMENT Then
            Select Case xNode.nodeName
                Case "element1"
                    'extract stuff from element1 and load into database
                Case "element2"
                    'extract stuff from element2 and load into database
                Case "element3"
                    'extract stuff from element3 and load into database
                '...
                Case "elementN"
                    'extract stuff from elementN and load into database
            End Select
        End If
        If xNode.hasChildNodes Then   'parse the xml file
            shredXML xNode.childNodes 'recursively
        End If
    Next xNode
End Sub
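For what it's worth, here's the sort of targeted extraction I'm imagining: load the document, then call selectNodes once with an XPath union query listing only the tags I care about, so only matching elements are ever visited. This is an untested sketch; element1/element2/elementN are placeholders for my real tag names, and the extraction lines are still pseudocode comments:

Code:
Public Sub shredWanted(ByVal sPath As String)
    Dim oDoc As MSXML2.DOMDocument60
    Dim xNode As MSXML2.IXMLDOMNode
    Dim xChild As MSXML2.IXMLDOMNode

    Set oDoc = New MSXML2.DOMDocument60
    oDoc.async = False
    If Not oDoc.Load(sPath) Then Exit Sub 'TODO: report oDoc.parseError

    'One union query for the ~3 dozen known tags, extended Case by Case
    For Each xNode In oDoc.selectNodes("//element1 | //element2 | //elementN")
        Select Case xNode.nodeName
            Case "element1"
                'extract stuff from element1 and load into database
            Case "element2"
                'extract stuff from element2 and load into database
            Case "elementN" 'one of the tags with a variable number of children
                For Each xChild In xNode.childNodes
                    'extract stuff from every child node
                Next xChild
        End Select
    Next xNode
End Sub
Since each file contains only one set of elements, I assume absolute paths (e.g. /bookstore/book/title) would be faster than the // scans above. Is selectNodes the right idea here, or should I be looking at something like SAX instead?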