Convert XML from PDF to usable XML

Status
Not open for further replies.

dv1

Technical User
May 15, 2009
8
US
I need to strip out extra attributes from an xml file. I have a very basic idea of how to add extra attributes, since that's what most tutorials show, but I don't know how to remove attributes.

I need to convert this:
Code:
- <Page01_a>
- <body xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xmlns="http://www.w3.org/1999/xhtml" xfa:APIVersion="Acroform:2.7.0.0" xfa:spec="2.1">
- <p style="margin-top:0pt;margin-bottom:0pt;text-valign:middle;font-family:'Myriad Pro';font-size:10pt">
  Here 
  <span style="xfa-spacerun:yes"> </span> 
  <span style="font-style:italic">is</span> 
...
</Page01_a>

To this:
Code:
<Page01_a>Here <italic>is</italic></Page01_a>
 
For that specific structure, the transformation may look awkward because it has to deal with special cases; that's normal. There may be some finer namespace issues that you will have to take care of separately.
Code:
<!-- These templates go inside an xsl:stylesheet element. -->
<!-- Copy the Page01_a wrapper and process the elements two levels down (the p elements). -->
<xsl:template match="Page01_a">
  <xsl:copy>
    <xsl:apply-templates select="*/*" />
  </xsl:copy>
</xsl:template>
<!-- Inside each p, keep only text nodes and span children. -->
<xsl:template match="Page01_a/*/*">
  <xsl:apply-templates select="text()|*[local-name()='span']" />
</xsl:template>
<!-- Trim and collapse whitespace in the text of p. -->
<xsl:template match="Page01_a/*/*/text()">
  <xsl:value-of select="normalize-space()" />
</xsl:template>
<!-- Italic spans become italic elements; every other span contributes only its text. -->
<xsl:template match="Page01_a/*/*/*[local-name()='span']">
  <xsl:choose>
    <xsl:when test="@style='font-style:italic'">
      <italic>
        <xsl:call-template name="spacing" />
      </italic>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="spacing" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
<!-- A whitespace-only node yields a single space; anything else is normalized. -->
<xsl:template name="spacing">
  <xsl:choose>
    <xsl:when test="string-length(normalize-space())=0">
      <xsl:text> </xsl:text>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="normalize-space()" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
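
On the more general question of how to remove attributes: a common XSLT 1.0 idiom is an identity template plus an empty template for whatever should disappear. The sketch below is not part of the solution posted here, and the choice of the style attribute as the one to drop is only an assumption for illustration. It copies the document unchanged except that every style attribute is left out.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copies every node and attribute as-is
       unless a more specific template matches. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: a style attribute matches here (higher default
       priority than @*) and produces no output, so it is dropped. -->
  <xsl:template match="@style" />
</xsl:stylesheet>
```

Note that xmlns declarations are namespace nodes rather than attributes, so a template like this does not remove them; xsl:copy carries them over with the element.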
 
OK, that did exactly what I had asked, but I guess my knowledge of XSL is more limited than I realized. I only posted one example because I figured that I could adapt it to work for my needs, but I can't get it to loop and fix the whole document.
This is what I have:
Code:
 <?xml version="1.0" encoding="UTF-8" ?> 
- <form1>
- <Page01_a>
- <body xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xmlns="http://www.w3.org/1999/xhtml" xfa:APIVersion="Acroform:2.7.0.0" xfa:spec="2.1">
- <p style="margin-top:0pt;margin-bottom:0pt;text-valign:middle;font-family:'Myriad Pro';font-size:10pt">
  Here is 
  <span style="xfa-spacerun:yes"> </span> 
  <span style="font-weight:bold">bold</span> 
- <span style="font-weight:normal">
  text. 
  <span style="xfa-spacerun:yes"> </span> 
  </span>
  </p>
  </body>
  </Page01_a>
- <Page02_a>
- <body xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xmlns="http://www.w3.org/1999/xhtml" xfa:APIVersion="Acroform:2.7.0.0" xfa:spec="2.1">
- <p style="margin-top:0pt;margin-bottom:0pt;text-valign:middle;font-family:'Myriad Pro';font-size:10pt;font-style:italic">
  Here is italic text. 
- <span style="font-style:normal">
  <span style="xfa-spacerun:yes"> </span> 
  </span>
  </p>
  </body>
  </Page02_a>
- <Page02_b>
- <body xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xmlns="http://www.w3.org/1999/xhtml" xfa:APIVersion="Acroform:2.7.0.0" xfa:spec="2.1">
- <p style="margin-top:0pt;margin-bottom:0pt;text-valign:middle;font-family:'Myriad Pro';font-size:10pt">
  Here is regular text. 
  <span style="xfa-spacerun:yes"> </span> 
  </p>
  </body>
  </Page02_b>
- <Page02_c>
- <body xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xmlns="http://www.w3.org/1999/xhtml" xfa:APIVersion="Acroform:2.7.0.0" xfa:spec="2.1">
  <p style="margin-top:0pt;margin-bottom:0pt;text-valign:middle;font-family:'Myriad Pro';font-size:10pt;font-weight:bold;font-style:italic">This is bold-italic text.</p> 
  </body>
  </Page02_c>
  <Page02_f /> 
  <Page02_e /> 
  <Page02_d /> 
  <Page02_e /> 
  <Page02_f /> 
  <Page02_g /> 
  <Page02_h /> 
  <Page02_i /> 
  <Page02_j /> 
  </form1>

And this is what I need:
Code:
 <?xml version="1.0" encoding="UTF-8" ?> 
- <body>
- <Page01_a>Here is <bold>bold</bold> text.</Page01_a>
- <Page02_a><italic>Here is italic text.</italic></Page02_a>
- <Page02_b>Here is regular text.</Page02_b>
- <Page02_c><bold-italic>This is bold-italic text.</bold-italic></Page02_c>
  <Page02_f /> 
  <Page02_e /> 
  <Page02_d /> 
  <Page02_e /> 
  <Page02_f /> 
  <Page02_g /> 
  <Page02_h /> 
  <Page02_i /> 
  <Page02_j /> 
  </body>
So basically I just need to keep any bold/italic styling in the document.
 
[0] The updated XML differs from the original post in a couple of respects.
[0.1] Specifically:
[0.1.1] span elements may appear nested.
[0.1.2] font-style and font-weight (an additional consideration) can now appear on the p tag, the span tag, etc.
[0.2] Fortunately, XSL can handle these complications. I use a mode to simplify the referencing, so that the ancestor reference is implicit and does not clog up the match and select expressions.
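
As a minimal illustration of what a mode does in isolation: the element names item, result, picked and ignored and the mode name demo are made up for this sketch and do not come from the poster's file. An apply-templates call that names a mode only considers templates declared with that same mode, so the match patterns can stay short.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/">
    <result>
      <!-- Only templates declared with mode="demo" are candidates here. -->
      <xsl:apply-templates select="//item" mode="demo" />
    </result>
  </xsl:template>
  <!-- Chosen for the apply-templates call above. -->
  <xsl:template match="item" mode="demo">
    <picked><xsl:value-of select="." /></picked>
  </xsl:template>
  <!-- Same match pattern but no mode: never used by the mode="demo" call. -->
  <xsl:template match="item">
    <ignored />
  </xsl:template>
</xsl:stylesheet>
```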

[1] Try this implementation.
Code:
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" />

<!-- Wrap the result in body and process each Page* child of the root element. -->
<xsl:template match="/">
  <body>
    <xsl:apply-templates select="*/*[starts-with(local-name(),'Page')]" mode="acf-proc" />
  </body>
</xsl:template>

<!-- Copy the Page* element itself and descend to the p elements two levels down. -->
<xsl:template match="*[starts-with(local-name(),'Page')]" mode="acf-proc">
  <xsl:copy>
    <xsl:apply-templates select="*/*[local-name()='p']" mode="acf-proc" />
  </xsl:copy>
</xsl:template>

<!-- p contributes no element of its own, only its text and spans. -->
<xsl:template match="*[local-name()='p']" mode="acf-proc">
  <xsl:apply-templates select="text()|*[local-name()='span']" mode="acf-proc" />
</xsl:template>

<!-- Style each text node according to the style attribute of its immediate parent. -->
<xsl:template match="text()" mode="acf-proc">
  <xsl:variable name="b" select="contains(parent::*/@style,'font-weight:bold')" />
  <xsl:variable name="t" select="contains(parent::*/@style,'font-style:italic')" />
  <xsl:choose>
    <!-- Bold and italic together are emitted as nested bold/italic elements,
         not as the single bold-italic element shown in the request. -->
    <xsl:when test="$b and $t">
      <bold><italic>
        <xsl:call-template name="spacing" />
      </italic></bold>
    </xsl:when>
    <xsl:when test="$b and (not($t))">
      <bold>
        <xsl:call-template name="spacing" />
      </bold>
    </xsl:when>
    <xsl:when test="(not($b)) and $t">
      <italic>
        <xsl:call-template name="spacing" />
      </italic>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="spacing" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- Spans recurse, so nested spans are handled. -->
<xsl:template match="*[local-name()='span']" mode="acf-proc">
  <xsl:apply-templates select="text()|*[local-name()='span']" mode="acf-proc" />
</xsl:template>

<!-- A whitespace-only node yields a single space; anything else is normalized. -->
<xsl:template name="spacing">
  <xsl:choose>
    <xsl:when test="string-length(normalize-space())=0">
      <xsl:text> </xsl:text>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="normalize-space()" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
</xsl:stylesheet>
[2] I have not reviewed the spacing template. If other fine adjustments are needed, they will likely concern white-space rendering and XSLT-processor-specific settings. I am not going to go into that detail; you will have to look into it and make those adjustments yourself, if any are needed.
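
One possible adjustment of that kind, offered only as a sketch and not tested against the poster's data: XSLT 1.0 lets the stylesheet say which whitespace-only text nodes of the source are kept. The prefix x is an arbitrary choice bound to the XHTML namespace used in the posted file. Whether indentation whitespace reaches the stylesheet at all depends on how the processor parses the input, so this may or may not change anything in a given setup.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">
  <!-- Drop whitespace-only text nodes everywhere in the source tree... -->
  <xsl:strip-space elements="*" />
  <!-- ...except directly inside XHTML span elements, so that the
       xfa-spacerun spans keep their blank. The more specific name
       takes precedence over the * wildcard. -->
  <xsl:preserve-space elements="x:span" />
  <!-- The templates of the stylesheet posted above would follow here unchanged. -->
</xsl:stylesheet>
```

With the indentation whitespace stripped, a styled p no longer yields text nodes that contain nothing but a space, while the spacer spans still do.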
 
Thank you so much! This solution was definitely out of my league. BTW, it works perfectly.
 