Best method for updating variables in an xml file

Status
Not open for further replies.

epatton

Technical User
Aug 22, 2006
10
CA
I have a pretty big (306-line) xml file that holds information (metadata) about geological data collected by a government agency. The data in this file is ugly and hard to parse. However, only about 20 fields actually change from one geological survey to the next. Every dataset collected in every survey needs an accompanying xml file that describes its metadata.

I would like to write some kind of shell/sed/grep/awk script that replaces the data in only these 20 or so fields with information the user provides, either through shell variables passed to the program or through interactive question-and-answer, and outputs the rest of the lines in the file unchanged. I'm not sure whether awk is the best approach, or whether I need some kind of awk/sed/shell hybrid script. Here's a snippet of the format of the xml file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata>
<idinfo>
<citation>
<citeinfo>
<origin>John Doe</origin>
<pubdate>Unpublished Material</pubdate>
<title>Coloured, Shaded-Relief Image of Multibeam Bathymetry of Tilt Cove, Newfoundland, Canada</title>
<geoform Sync="TRUE">remote-sensing image</geoform>
<serinfo></serinfo>
<pubinfo></pubinfo>
<ftname Sync="TRUE">TiltCove_2_ave_fill_shade_comb.tif</ftname></citeinfo>
</citation>
<descript>
<timeperd>
<timeinfo>
<rngdates>
<begdate>20010627</begdate>
<enddate>20010708</enddate>
</rngdates>
</timeinfo>
<current>ground condition</current>
</timeperd>
<status>
<progress>Complete</progress>
<update>None planned</update>
</status>
<spdom>
<bounding>
<westbc>-59.680215</westbc>
<eastbc>-59.616513</eastbc>
<northbc>44.013563</northbc>
<southbc>43.950450</southbc>
</bounding>

I know it looks like gibberish, but the key point here is that only a select few of these xml tags need to change from one dataset to the next (i.e., tags that represent time, dates, persons, location names, etc.)

Can anyone think of a general approach for automating the production of these xml files (given a finished one as a template), allowing for the modification of key xml tags in every one?
 
I have been working on something similar for a while: creating HTML files from information in plain ASCII text files. I chose to use Perl, but the choice of scripting language is not so important. This is the method I used:
- create a 'template' file in html format
- create the text file (this can be edited later & the Perl script re-run to produce a new version of the html file)
- write a Perl script to 'make' the new page(s), as follows:

  1. Open the text file and read the data into variables & arrays (also prompt the user for information)
  2. Open an output file in a temporary location (don't want to overwrite the existing html file yet)
  3. Open the 'template' file and read it line by line in a loop
  4. Write each line to the output file as is, or modified by the data in the variables & arrays
  5. At the end of the loop, close all files
  6. Overwrite the existing html file with the new one

- save a copy of the existing html file
- test the script
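
The same steps can be sketched in plain shell rather than Perl. This is only an illustration of the file handling (temporary output, backup copy, final overwrite), not Mike's actual script: the function name make_page, its arguments, and the single ###title### placeholder are assumptions made for the example, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Sketch only: make_page, its arguments and the ###title### placeholder
# are illustrative assumptions.

# Usage: make_page TEMPLATE OUTPUT TITLE
make_page() {
    template=$1
    output=$2
    title=$3
    marker='###title###'

    # Write to a temporary file next to OUTPUT so that a failed run
    # never leaves a half-written page behind.
    tmp=$(mktemp "$output.XXXXXX") || return 1

    # Read the template line by line. IFS= and -r keep leading
    # whitespace and backslashes intact.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *"$marker"*)
                # Modified line: text before the first marker, the
                # value, then the text after that marker.
                before=${line%%"$marker"*}
                after=${line#*"$marker"}
                printf '%s%s%s\n' "$before" "$title" "$after"
                ;;
            *)
                # Any other line is copied unchanged.
                printf '%s\n' "$line"
                ;;
        esac
    done < "$template" > "$tmp" || { rm -f "$tmp"; return 1; }

    # Save a copy of the existing page, then replace it.
    if [ -f "$output" ]; then
        cp -p "$output" "$output.bak" || { rm -f "$tmp"; return 1; }
    fi
    mv "$tmp" "$output"
}
```

It could be called as make_page template.xml survey.xml "Some title". Only the first marker on a line is replaced, which is enough when each placeholder sits on its own line.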


I hope that helps, to get you started.

Mike
 
Thanks, Mike! This is a great start - exactly what I needed. I have a question for you though: how did you decide whether a line should be copied from the template to the output as is, or filled in from the variables/arrays?

Thanks very much,

~ Eric.
 
Hi Eric,

In the template there would be lines like (example taken from your post but modified):
epatton said:
<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata>
<idinfo>
<citation>
<citeinfo>
<origin>###author###</origin>
<pubdate>Unpublished Material</pubdate>
<title>###title###</title>

The first few lines get copied as is because they don't match any of the search criteria, which work as follows:

if the line contains ###author### then replace that string with data from a variable/array before writing it to the output file
else-if the line contains ###title### then replace that string with data from a variable/array before writing it to the output file
else write the line as is to the output file
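
As a rough illustration of that if / else-if chain, here is a minimal sketch using awk from a shell function. The name fill_template and the way the two values are passed in (as arguments, handed to awk through the environment) are assumptions for the example.

```shell
#!/bin/sh
# Sketch only: fill_template and its argument order are assumptions.

# Usage: fill_template TEMPLATE AUTHOR TITLE
fill_template() {
    # The values travel through the environment, so awk does not
    # reinterpret backslashes in them.
    author=$2 title=$3 awk '
        # Replace the first literal occurrence of marker in $0 with value.
        # index() and substr() are used instead of sub() so that "&" and
        # backslashes in the value are inserted as-is.
        function fill(marker, value,    pos) {
            pos = index($0, marker)
            $0 = substr($0, 1, pos - 1) value substr($0, pos + length(marker))
        }
        {
            if (index($0, "###author###")) {
                fill("###author###", ENVIRON["author"])
            } else if (index($0, "###title###")) {
                fill("###title###", ENVIRON["title"])
            }
            # Otherwise the line matched nothing and is written as is.
            print
        }
    ' "$1"
}
```

For example, fill_template template.xml "John Doe" "Some title" > new.xml. Each new field means one more else-if branch, so the chain grows with the number of placeholders.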

I hope that makes sense.

Mike
 
Or, rather than lots of ifs and else-ifs, you could just pull out the string between the ### markers and use it as the index into the array.

Annihilannic.
 
Thanks Mike, I understand now.

Annihilannic, could you elaborate a bit more? Thanks!

~ Eric.
 
Hi Eric,

Glad to hear it and thanks very much for the 'star'.

Regards.

Mike
 
By example:

template.xml input file:
Code:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <metadata>
            <idinfo>
                    <citation>
                            <citeinfo>
                                    <origin>###author###</origin>
                                    <pubdate>Unpublished Material</pubdate>
                                    <title>###title###</title>

variables input file:
Code:
author:Arthur Conan Doyle
title:The Resident Patient

updatexml script:
Code:
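awk -F: '
        # First file (variables): store each name:value pair
        NR==FNR { vars[$1]=$2 ; next }
        # Second file (template): handle lines containing a ###name### token
        /###(.*)###/ {
                # The whole token, e.g. ###author###
                token=gensub(".*(###[^#]*###).*","\\1",1)
                # The name between the markers, e.g. author
                tokenname=gensub(".*###(.*)###.*","\\1",1)
                if (tokenname in vars) {
                        sub(token,vars[tokenname])
                } else {
                        sub(token,"undefined")
                }
        }
        # Print every line, changed or not
        { print }
' variables template.xml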

updatexml output:
Code:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <metadata>
            <idinfo>
                    <citation>
                            <citeinfo>
                                    <origin>Arthur Conan Doyle</origin>
                                    <pubdate>Unpublished Material</pubdate>
                                    <title>The Resident Patient</title>

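Note that this solution depends on the gensub() function, which I believe is only available in GNU awk. It also handles only one token per line, and because the variables file is split on every colon, a value containing a colon is cut off at that colon.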
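
Where GNU awk is not available, a similar effect can probably be had with the standard match() and substr() functions. The following is a sketch under that assumption, not a drop-in replacement that has been tried on the real metadata files; the function name fill_tokens is made up for the example. It splits each variables line at the first colon only and replaces every ###name### token on a line.

```shell
#!/bin/sh
# Sketch only: a portable variation on the updatexml idea.
# The function name fill_tokens is an assumption.

# Usage: fill_tokens VARIABLES TEMPLATE
fill_tokens() {
    awk '
        # First file: "name:value" lines. Split at the first colon only,
        # so values may themselves contain colons.
        NR == FNR {
            sep = index($0, ":")
            if (sep) vars[substr($0, 1, sep - 1)] = substr($0, sep + 1)
            next
        }
        # Second file: rebuild each line, swapping every ###name### token
        # for its value, or for "undefined" when the name is unknown.
        {
            out = ""
            while (match($0, /###[^#]+###/)) {
                name = substr($0, RSTART + 3, RLENGTH - 6)
                out = out substr($0, 1, RSTART - 1) ((name in vars) ? vars[name] : "undefined")
                $0 = substr($0, RSTART + RLENGTH)
            }
            print out $0
        }
    ' "$1" "$2"
}
```

Usage would be fill_tokens variables template.xml > new.xml. As with the original script, the NR==FNR test assumes the variables file is not empty.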

Annihilannic.
 