Using 1 XSLT on Multiple (millions of) XML Files - Batch

Status
Not open for further replies.

genesiusj

Programmer
Dec 18, 2013
10
US
Hi,
Totally new to XML, so my apologies for my beginner to advanced(?) questions below. I have to get up to race speed, even though I am just learning to walk. I have searched and read several articles and tutorials on the Internet to try to learn and find answers, but the amount of useful information I was able to find has been limited, or beyond my comprehension.

I have millions (not an exaggeration) of XML files that I need to convert to text files. A third party provides us, and their other clients, with XML files for each customer. From my research, the easiest way appears to be an XSL style-sheet. Each XML file could contain 10 to well over 100 different elements, with several parents and child layers 3-4+ deep.

Here is a sample of my XSLT.
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:strip-space elements="*" />

<xsl:template match="/">
<xsl:for-each select="/TPlist/TP">
<xsl:apply-templates select="document(@file)"/>
</xsl:for-each>
</xsl:template>

<xsl:template match="Jurisdiction">Jurisdiction,
<xsl:value-of select="."/>,
</xsl:template>

This XSLT uses an <xsl:template match="..."> rule for each of the 200+ elements. These elements are taken (discovered) from about 10-15 XSD files.

I created a test XML as a master XML file (from what I read on the Internet) to "apply" my XSLT to each of the million++ XML files. However, I don't know how to get that to work with my XSLT.

Here is the test XML master file.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="XSLTTest.xsl"?>
<TPlist>
<TP file="20121.xml"/>
<TP file="20122.xml"/>
</TPlist>

While this is using actual XML file names, it would be impossible to list all million++ files in this master. Is there code to have the XSLT applied to all the XML files in a folder?

Another question I have: what is the code for XSLT to transform to text? I found various sites transforming to XHTML, HTML, PDF, etc., but none to text.

Last question. After I created the XSLT, I added this code "<?xml-stylesheet type="text/xsl" href="XSLTTest.xsl"?>" to my two test XML files, "20121.xml" and "20122.xml". When I double-click one of the XML files, it opens in Internet Explorer. Is there software that I need to install (or is it already on my PC) that will open XML? Also, once/if I get the text transform working in my XSLT, where will the output be?

Thanks and God Bless, Genesius
 
Hi Genesius,

Welcome to Tek-Tips.

Genesius said:
What is the code for XSLT to transform to text?
Use the following as a top-level element (that is, a child of the <xsl:stylesheet> element):
Code:
<xsl:output method="text"/>

Since this appears to be something you will be doing in batch, you probably want to use a command line XSLT processor, rather than embedding the <?xml-stylesheet> processing instruction in the XML documents that are being supplied.

You seem to emphasize the fact that there are millions of input documents, but are still too vague about the actual output you wish to generate.

You need to specify the platform (Windows, Linux, etc?) on which this processing will take place, as each platform has its own 'best choice' for XSLT processing.

So, take a deep breath, and let us help you break this into smaller pieces that you can then assemble for a solution.

Tom Morrison
Hill Country Software
 
Tom,
Thanks for the quick response.

This code <xsl:output method="text"/> replaces
<xsl:template match="/">
<html>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>

in my XSL?

Therefore, this is what the first few lines of code in my XSL should look like?
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>


Here are 2 example XML files (without the declaration and other beginning code/statements).
<body>
<Customer>
<Name>Oscar Jones</Name>
<City>Newtown</City>
<State>Ohio</State>
</Customer>
<PurchaseInfo>
<WigitA>
<Quantity>100</Quantity>
<Discount>12</Discount>
</WigitA>
<WigitB>
<Quantity>45</Quantity>
</WigitB>
</PurchaseInfo>
</body>

<body>
<Customer>
<Name>Carol Smith</Name>
<City>Perriville</City>
<State>Indiana</State>
</Customer>
<PurchaseInfo>
<WigitB>
<Quantity>11</Quantity>
<Discount>4</Discount>
</WigitB>
<WigitC>
<Quantity>33</Quantity>
<Discount>1</Discount>
</WigitC>
</PurchaseInfo>
</body>



Required output. As these are millions of XML files, the output would be thousands(?) of text files as follows.
Name,Oscar Jones|Cty,Newtown|St,Ohio|QtyA,100|DscntA,12|QtyB,45;Name,Carol Smith|Cty,Perriville|St,Indiana|QtyB,11|DscntB,4|QtyC,33|DscntC,1;etc.

The delimiters used are a comma between an element name and its value; a pipe between elements; and a semi-colon between XML files (records). These delimiters are not set in stone. I only need to know how to code them in the XSL style-sheet.

For the moment, to test a few hundred XML files, I will be using a Windows XP PC. Full scale: Windows 2000 Server. I am going to download Saxon to start, but because of issues with our IT dept (PCs are locked down) I would like to find a portable (USB memory stick) version of an XSLT processor, if one exists.

Here is a tutorial I found on the Internet. Are there any others you can recommend? http://infomotions.com/musings/getting-started/getting-started.html

Thanks again for all your help. Let me know if there is any other information you require.

God Bless,
Genesius
 
Since you are going to be on a Windows machine, I would suggest downloading msxsl.exe from the Microsoft web site. From your description, I would think that a combination of batch scripting using msxsl might serve you best. (Time will tell.) MSXSL will use the MSXML.DLL that is already on your system.

w3schools and ZVON have tutorials that I have found useful before. Don't dive in to those yet. Let's wait until you have a better specification, and then you can use the tutorials a bit more efficiently.

So it seems like you do not want all of the data that is in each XML document. Or do you? If you want the data from a limited set of elements then we would use one technique. If there are an unlimited, or at least unknown, number of elements from which you are harvesting data, then another technique would be better.

The <xsl:output> element does not replace <xsl:template>. Rather (using indentation to help):
Code:
<xsl:stylesheet ...>
    <xsl:output method="text" encoding="utf-8"/>
    
    <xsl:template match="/">
    ...
    </xsl:template>
    ...
</xsl:stylesheet>
XML output is the default output method, so one must specify <xsl:output> if text or HTML is desired. You should probably also specify the encoding, since MSXML/MSXSL will probably choose UTF-16 by default.
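
To see why the encoding choice matters for a downstream program that expects one byte per character, the two encodings can be compared directly. This is only an illustrative sketch in Python (not something used in this thread, and the sample record is made up); it shows that the same text takes roughly twice the bytes in UTF-16 and starts with a byte-order mark.

```python
import codecs

# A made-up sample record in the format discussed in this thread.
record = "Name,Oscar Jones|Cty,Newtown;"

utf8_bytes = record.encode("utf-8")
utf16_bytes = record.encode("utf-16")  # Python's "utf-16" codec prepends a byte-order mark

# For plain ASCII text, UTF-8 uses one byte per character and UTF-16 uses two.
print(len(utf8_bytes), len(utf16_bytes))

# The UTF-16 bytes begin with a byte-order mark, which a legacy reader may not expect.
has_bom = utf16_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom)
```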

You are going to learn a lot about recursion and template matching, which is a bit of a change from the non-recursive, procedural programming experience that most folks have.

So, please respond with the information described in my third paragraph, and we can get you started on this adventure.



Tom Morrison
Hill Country Software
 
I also note the following:
QtyA,100|DscntA,12|QtyB

In these name-value pairs, it seems like the name is some sort of mash-up derived from the parent element (e.g. <WigitA>) and the element containing the data used for the value (e.g. <Quantity> or <Discount>).

Do you have an algorithm for this mash-up?

Tom Morrison
Hill Country Software
 
Genesius?

Tom Morrison
Hill Country Software
 
Tom,
I was out yesterday; dental appointment. I am going to check what you wrote now and I will post what I find.

Thanks and God Bless,
Genesius
 
Tom,
Again, thank you so much for your help. Where I work they lock down everything. When I searched for msxml.dll, it did not exist. So, I have another question for you, if I may. At home, I have Vista-64. What files do I need to run an XSLT processor on that OS? If I can get this to work at home, I can use it as a proof of concept for my manager to have the IT dept unlock my PC and install the necessary bits on it.
Thanks in advance.
God Bless,
Genesius
 
Hi Genesius,

I was at a business meeting all day today.

I think I understand why you have such a lock-down situation. I did some cybersleuthing on another social site, where we are actually a 3rd level connection possibility. Go figure! So I understand that your examples are also not precisely what you are really dealing with.

MSXML.DLL is installed if Internet Explorer of any recent vintage is installed. However, in your lock-down situation, you may not even have IE installed. MSXML is a generic name; look for msxml4.dll or msxml6.dll. It should be there, in windows\system32; on a 64-bit system the 32-bit copy is in windows\syswow64.

The one thing you will have to install is msxsl.exe. It is a command line interface for MSXML, which has the XSLT processor in it.

Try to make your examples as close to your actual requirements as possible. That will help me get to the point exactly without too much wasted misdirection. I think this is going to be easier than you ever imagined.

Tom Morrison
Hill Country Software
 
Tom,
Perhaps we can connect on that network if you want.

As you know then I cannot give too detailed info; however, here is a bit more that I can let you know. We used to receive this data in text files for each "member" processed by the third party. Now they send us this yearly data in XML files. You can find the schemas by going to their website (not mine). The XML files might contain all, some, or none of the elements contained in each of the schemas from their site.

I need to transform the data back into text format for our purposes.

Thanks, Merry Christmas and God Bless,
Genesius
 
Merry Christmas, Genesius. Let's pick it back up on Thursday.

I cannot connect on the other network without paying their very high fee. I will see if the common link can be helpful.

Meanwhile, I will give you some ideas on how to use MSXSL to append text from XML documents onto a single text file, since that seems to be what you want.

Best regards,

Tom Morrison
Hill Country Software
 
Dear Genesius,

I slightly modified your example so that we could make some progress.

First, the two data documents (document1.xml and document2.xml), followed by the TPList.xml document. We really want to get rid of TPList.xml if we can, because it has to be created from some other list (I surmise).
Code:
<body>
	<Customer>
		<Name>Carol Smith</Name>
		<City>Perriville</City>
		<State>Indiana</State>
	</Customer>
	<PurchaseInfo>
		<WigitA>
			<Quantity>11</Quantity>
			<Discount>4</Discount>
		</WigitA>
		<WigitC>
			<Quantity>33</Quantity>
			<Discount>1</Discount>
		</WigitC>
	</PurchaseInfo>
</body>
Code:
<body>
	<Customer>
		<Name>Oscar Jones</Name>
		<City>Newtown</City>
		<State>Ohio</State>
	</Customer>
	<PurchaseInfo>
		<WigitA>
			<Quantity>100</Quantity>
			<Discount>12</Discount>
		</WigitA>
		<WigitB>
			<Quantity>45</Quantity>
		</WigitB>
	</PurchaseInfo>
</body>
Code:
<TPlist>
<TP file="document1.xml"/>
<TP file="document2.xml"/>
</TPlist>
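
If a master list is needed at all, it does not have to be typed by hand. The following is a minimal sketch in Python (not used anywhere in this thread; the function name `write_master_list` and the default file name are assumptions) that writes a TPlist document containing one TP entry per .xml file in a folder. It assumes the master list is kept in the same folder as the data files, so that the relative names in the file attributes resolve against it.

```python
import os
import sys
import xml.etree.ElementTree as ET


def write_master_list(xml_dir, master_name="TPList.xml"):
    """Write a TPlist document listing every .xml file in xml_dir.

    The master list itself is skipped so that re-running the script
    does not make the list refer to itself. Returns the path written.
    """
    root = ET.Element("TPlist")
    # Sort the names so repeated runs produce the same list.
    for name in sorted(os.listdir(xml_dir)):
        is_xml = name.lower().endswith(".xml")
        is_master = name.lower() == master_name.lower()
        if is_xml and not is_master:
            ET.SubElement(root, "TP", {"file": name})
    master_path = os.path.join(xml_dir, master_name)
    ET.ElementTree(root).write(master_path, encoding="utf-8", xml_declaration=True)
    return master_path


if __name__ == "__main__":
    # Usage: python make_master.py <folder>
    print(write_master_list(sys.argv[1]))
```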
The first XSLT processes the documents using TPList.xml. I will show another way below that eliminates TPList.xml.
Code:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" encoding="utf-8"/>

<!-- Delimiters: change these three values to change the output format -->
<xsl:variable name="nameValueSeparator" select="','"/>
<xsl:variable name="fieldSeparator" select="'|'"/>
<xsl:variable name="recordSeparator" select="';'"/>

<!-- Start from the master list and visit each TP entry -->
<xsl:template match="/">
	<xsl:apply-templates select="TPlist/TP"/>
</xsl:template>

<!-- Load the data document named in @file and process its Customer -->
<xsl:template match="TP">
	<xsl:apply-templates select="document(@file)/body/Customer"/>
</xsl:template>

<!-- Emit one record per Customer -->
<xsl:template match="Customer">
	<xsl:variable name="nameField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'Name'"/>
			<xsl:with-param name="theValue" select="./Name"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="cityField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'Cty'"/>
			<xsl:with-param name="theValue" select="./City"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="stateField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'St'"/>
			<xsl:with-param name="theValue" select="./State"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="qtyDscntFields">
		<xsl:apply-templates select="../PurchaseInfo/*[substring(local-name(),1,5) = 'Wigit']" mode="widgetMode"/>
	</xsl:variable>
	<xsl:variable name="allFields" select="concat($nameField, $cityField, $stateField, $qtyDscntFields)"/>
	<!-- $allFields has a trailing field separator; drop it and close the record -->
	<xsl:value-of select="concat(substring($allFields,1,string-length($allFields)-1),$recordSeparator)"/>
</xsl:template>

<!-- Name the fields after the part of the element name that follows 'Wigit' -->
<xsl:template match="*" mode="widgetMode">
	<xsl:variable name="myWidgetName" select="substring(local-name(),6)"/>
	<xsl:call-template name="outputNameValue">
		<xsl:with-param name="theName" select="concat('Qty',$myWidgetName)"/>
		<xsl:with-param name="theValue" select="Quantity"/>
	</xsl:call-template>
	<xsl:if test="Discount">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="concat('Dscnt',$myWidgetName)"/>
			<xsl:with-param name="theValue" select="Discount"/>
		</xsl:call-template>
	</xsl:if>
</xsl:template>

<!-- Emit name, name-value separator, value, then a field separator -->
<xsl:template name="outputNameValue">
	<xsl:param name="theName"/>
	<xsl:param name="theValue"/>
	<xsl:value-of select="concat($theName,$nameValueSeparator,$theValue,$fieldSeparator)"/>
</xsl:template>

</xsl:stylesheet>

This XSLT produces the following:
Name,Carol Smith|Cty,Perriville|St,Indiana|QtyA,11|DscntA,4|QtyC,33|DscntC,1;Name,Oscar Jones|Cty,Newtown|St,Ohio|QtyA,100|DscntA,12|QtyB,45;
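
For readers coming from procedural programming, the rules the stylesheet above applies can be mirrored in a few lines of Python. This is only a sketch for comparison, not a replacement for the XSLT and not part of the approach in this thread; the function name `record_from_xml` and the label table are assumptions. It follows the same naming rule: the text after 'Wigit' in the element name is appended to 'Qty' and 'Dscnt'.

```python
import xml.etree.ElementTree as ET

NAME_VALUE_SEP = ","
FIELD_SEP = "|"
RECORD_SEP = ";"

# (element name, output label) pairs for the Customer children.
CUSTOMER_LABELS = [("Name", "Name"), ("City", "Cty"), ("State", "St")]


def record_from_xml(xml_text):
    """Return one delimited record for a <body> document, or "" if it has no Customer."""
    body = ET.fromstring(xml_text)
    customer = body.find("Customer")
    if customer is None:
        return ""

    fields = []
    for tag, label in CUSTOMER_LABELS:
        fields.append(label + NAME_VALUE_SEP + customer.findtext(tag, default=""))

    purchase = body.find("PurchaseInfo")
    for item in (purchase if purchase is not None else []):
        if not item.tag.startswith("Wigit"):
            continue
        # 'WigitA' -> 'A', which gives the field names 'QtyA' and 'DscntA'.
        suffix = item.tag[len("Wigit"):]
        fields.append("Qty" + suffix + NAME_VALUE_SEP + item.findtext("Quantity", default=""))
        # Discount is optional, so it is emitted only when the element is present.
        if item.find("Discount") is not None:
            fields.append("Dscnt" + suffix + NAME_VALUE_SEP + item.findtext("Discount"))

    return FIELD_SEP.join(fields) + RECORD_SEP
```

Joining the fields with the separator avoids the trailing-separator trim that the stylesheet does with substring().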

In my next post, I will slightly modify this stylesheet to process only one of the data documents. Combining the strengths of two tools gets the desired result.

Tom Morrison
Hill Country Software
 
(a continuation from the previous post)

The XSLT stylesheet is slightly modified, so that only a single data document is processed.
Code:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" encoding="utf-8"/>

<!-- Delimiters: change these three values to change the output format -->
<xsl:variable name="nameValueSeparator" select="','"/>
<xsl:variable name="fieldSeparator" select="'|'"/>
<xsl:variable name="recordSeparator" select="';'"/>

<!-- The input is now a single data document, so go straight to its Customer -->
<xsl:template match="/">
	<xsl:apply-templates select="body/Customer"/>
</xsl:template>

<!-- Emit one record per Customer -->
<xsl:template match="Customer">
	<xsl:variable name="nameField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'Name'"/>
			<xsl:with-param name="theValue" select="./Name"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="cityField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'Cty'"/>
			<xsl:with-param name="theValue" select="./City"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="stateField">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="'St'"/>
			<xsl:with-param name="theValue" select="./State"/>
		</xsl:call-template>
	</xsl:variable>
	<xsl:variable name="qtyDscntFields">
		<xsl:apply-templates select="../PurchaseInfo/*[substring(local-name(),1,5) = 'Wigit']" mode="widgetMode"/>
	</xsl:variable>
	<xsl:variable name="allFields" select="concat($nameField, $cityField, $stateField, $qtyDscntFields)"/>
	<!-- $allFields has a trailing field separator; drop it and close the record -->
	<xsl:value-of select="concat(substring($allFields,1,string-length($allFields)-1),$recordSeparator)"/>
</xsl:template>

<!-- Name the fields after the part of the element name that follows 'Wigit' -->
<xsl:template match="*" mode="widgetMode">
	<xsl:variable name="myWidgetName" select="substring(local-name(),6)"/>
	<xsl:call-template name="outputNameValue">
		<xsl:with-param name="theName" select="concat('Qty',$myWidgetName)"/>
		<xsl:with-param name="theValue" select="Quantity"/>
	</xsl:call-template>
	<xsl:if test="Discount">
		<xsl:call-template name="outputNameValue">
			<xsl:with-param name="theName" select="concat('Dscnt',$myWidgetName)"/>
			<xsl:with-param name="theValue" select="Discount"/>
		</xsl:call-template>
	</xsl:if>
</xsl:template>

<!-- Emit name, name-value separator, value, then a field separator -->
<xsl:template name="outputNameValue">
	<xsl:param name="theName"/>
	<xsl:param name="theValue"/>
	<xsl:value-of select="concat($theName,$nameValueSeparator,$theValue,$fieldSeparator)"/>
</xsl:template>

</xsl:stylesheet>

Now, use MSXSL.EXE from a command line to process document1.xml (for simplification, all assumed to be in the same directory) with the following command line:
    msxsl document1.xml TekTips3.xsl
The following will be sent to standard output:
Name,Carol Smith|Cty,Perriville|St,Indiana|QtyA,11|DscntA,4|QtyC,33|DscntC,1;

Repeat for document2 and you get:
Name,Oscar Jones|Cty,Newtown|St,Ohio|QtyA,100|DscntA,12|QtyB,45;

If we now run the commands redirecting standard output to extend a file:
    msxsl document1.xml TekTips3.xsl >>output.txt
    msxsl document2.xml TekTips3.xsl >>output.txt
and then inspect the contents of output.txt we find the following:
Name,Carol Smith|Cty,Perriville|St,Indiana|QtyA,11|DscntA,4|QtyC,33|DscntC,1;Name,Oscar Jones|Cty,Newtown|St,Ohio|QtyA,100|DscntA,12|QtyB,45;

At this point, you can see that the possibility exists for processing large numbers of these data documents with the results output in the form that the legacy program can process.

Unfortunately, this being an XML technical forum, I won't delve into CMD/batch scripting for Windows. I have found this reference quite helpful. In particular, the FOR command allows you to invoke MSXSL for each XML document in a directory.
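
The per-document loop described above can also be written in a general-purpose scripting language. The following is a minimal sketch in Python, which is an assumption on my part: nothing in this thread establishes that Python is available on the machines involved, and the function names `build_commands` and `run_batch` are hypothetical. It runs the processor once for every .xml file in a folder and appends each result to a single output file, which is the same effect as the `>>output.txt` redirection shown earlier.

```python
import glob
import os
import subprocess


def build_commands(xml_dir, stylesheet, processor=("msxsl",)):
    """Return one argument list per .xml file in xml_dir, in sorted order.

    Each command has the form: <processor...> <document> <stylesheet>.
    """
    pattern = os.path.join(xml_dir, "*.xml")
    return [list(processor) + [path, stylesheet] for path in sorted(glob.glob(pattern))]


def run_batch(xml_dir, stylesheet, output_path, processor=("msxsl",)):
    """Run the processor once per document and append its standard output to output_path.

    Returns the number of documents processed. A failing run raises
    subprocess.CalledProcessError, so a bad document stops the batch.
    """
    count = 0
    with open(output_path, "ab") as out:
        for command in build_commands(xml_dir, stylesheet, processor):
            result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
            out.write(result.stdout)
            count += 1
    return count
```

Passing the processor as a tuple keeps the loop independent of which command-line XSLT tool is installed; only the leading words of each command change.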

Genesius, since you are new to XSLT processing, I would expect the example I have shown to raise some questions. Please feel free to ask for explanation(s).



Tom Morrison
Hill Country Software
 
Tom,
Hope you had a wonderful Christmas and are enjoying your holiday. I did not fall off the face of the earth; I am working on another project that I was hoping to complete before the weekend (almost there). BTW it is not XML.

I briefly looked at the code you sent and I will delve further into it next week (maybe over the weekend if I have a chance to on my Vista machine).

I don't think you are looking at the correct site. The site with these schemas is free to the public.

Have a great weekend.
God Bless,
Genesius
 
Thank you for the wishes. I am indeed enjoying myself. While I work quite a bit from home, it has its benefits. My granddaughter (3.5 years old) came in to give me a hug just a few minutes ago.

I was referring to making a connection on a social web site where professionals are Linked, and where I can see your picture and some limited information. My niece is, generally speaking, in the same industry, for a commercial firm used by many of your customers; I have asked her for an introduction as suggested by that social web site. Should it become necessary, you may be able to derive a means to contact me via my first name at my work address...

I have visited the other site and seen the schema to which you refer. It is quite complex! I hope we can keep the examples here relatively simple and that you can extrapolate to your specific need.



Tom Morrison
Hill Country Software
 
Hi Genesius,

Since you have difficulty getting things installed, I did a bit of research and came up with a JScript version of XSLT that you can run with Windows Script Host (WSH) -- no installation required. WSH should be available on all the systems. At a command prompt simply type cscript; you should get a header with the version information and usage much like the following obtained on a Win8 machine:
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

Usage: CScript scriptname.extension [option...] [arguments...]

Options:
//B Batch mode: Suppresses script errors and prompts from displaying
//D Enable Active Debugging
//E:engine Use engine for executing script
//H:CScript Changes the default script host to CScript.exe
//H:WScript Changes the default script host to WScript.exe (default)
//I Interactive mode (default, opposite of //B)
//Job:xxxx Execute a WSF job
//Logo Display logo (default)
//Nologo Prevent logo display: No banner will be shown at execution time
//S Save current command line options for this user
//T:nn Time out in seconds: Maximum time a script is permitted to run
//X Execute script in debugger
//U Use Unicode for redirected I/O from the console

Here is the JScript version of XSLT:
Code:
// xslt.js: apply an XSLT stylesheet to an XML document and write the result to standard output
var args, xml, xsl;

try
{
    args = WScript.Arguments;

    if (args.length != 2)
    {
        WScript.Echo("Usage: xslt.js file.xml file.xsl");
        WScript.Quit(1);
    }
    else
    {
        xml = args(0);
        xsl = args(1);
        var xmlDoc = new ActiveXObject("Msxml2.DOMDocument.6.0");
        var xslDoc = new ActiveXObject("Msxml2.DOMDocument.6.0");

        // Load synchronously so that load() reports success or failure directly
        xmlDoc.async = false;
        xslDoc.async = false;

        // load() returns false on failure; parseError.reason explains why
        if (xmlDoc.load(xml) == false)
        {
            throw new Error("Could not load XML document " + xmlDoc.parseError.reason);
        }

        if (xslDoc.load(xsl) == false)
        {
            throw new Error("Could not load XSL document " + xslDoc.parseError.reason);
        }

        // transformNode returns the result of the transformation as a string
        WScript.Echo(xmlDoc.transformNode(xslDoc));
    }
}
catch (e)
{
    WScript.Echo(e.message);
    WScript.Quit(1);
}
This code will only send the output to standard output.

Using my previous posts as an example, you run this from the command line as:
cscript //nologo document1.xml TekTips3.xsl >>output.txt
cscript //nologo document2.xml TekTips3.xsl >>output.txt

You get identical output as you would from msxsl.exe.

I hope this is useful to you.


Tom Morrison
Hill Country Software
 
Oops!

The command lines I provided in the previous post failed to include the script name. That should be (assuming the JScript is named xslt.js):

cscript xslt.js //nologo document1.xml TekTips3.xsl >>output.txt
cscript xslt.js //nologo document2.xml TekTips3.xsl >>output.txt

Tom Morrison
Hill Country Software
 