Convert TXT to XML 1

Goppin · Mar 29, 2012

Hi, I heard that Perl is good for text manipulation, which should be useful for converting TXT file data to XML-based files. I have text files that are somewhat structured; they contain data which I usually copy and paste by hand into Aperture (Apple's image-handling application), attaching it to a specific picture.

The TXT file is structured like so:

"Title"
"URL"
"Comments"

The custom data then follows each heading, which you can see here for example.

I need to create a Perl script to export an XML file so that the data looks like this.

Aperture uses IPTC metadata over XMP and so I figure I need to adhere to the following fields:

"Title" -> "photoshop:Headline"

"Comments" -> "dc:description"

"URL" -> "Iptc4xmpCore:SubjectCode"

So, I guess I'm asking for rudimentary help here. I cannot program in Perl but was told that a "simple" script would get me there. Am I on the right track?

Thanks in advance!

Annihilannic · Mar 29, 2012

All I can see at the first link you provided is:

Code:

Title:
Dribbble - Cat & Giraffe by Luke Bott

URL:
http://dribbble.com/shots/453717-Cat-Giraffe

It doesn't adhere to the format you describe, nor can I see any of the "custom data" which "follows"?

There are a number of XML Perl modules, but if the output file format is very consistent I'd be tempted to just insert the field data into a template in the appropriate places. I guess it depends on whether the source data is sufficiently "clean" for insertion directly into XML.

Annihilannic
tgmlify - code syntax highlighting for your tek-tips posts

Goppin · Mar 29, 2012

Ahh, you're right, I got it mixed up with another file, sorry. It would look like this:

<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9-9, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='' xmlns:Iptc4xmpCore='http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/'>
<Iptc4xmpCore:SubjectReference>http://dribbble.com/shots/453717-Cat-Giraffe</Iptc4xmpCore:SubjectReference>
<Iptc4xmpCore:SubjectCode>
<rdf:Bag>
<rdf:li>http://dribbble.com/shots/453717-Cat-Giraffe</rdf:li>
</rdf:Bag>
</Iptc4xmpCore:SubjectCode>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photoshop='http://ns.adobe.com/photoshop/1.0/'>
<photoshop:Headline>Dribbble - Cat & Giraffe by Luke Bott</photoshop:Headline>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:description><rdf:Alt><rdf:li xml:lang='x-default'>TEST COMMENT</rdf:li></rdf:Alt></dc:description>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photomechanic='http://ns.camerabits.com/photomechanic/1.0/'>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:Rating>0</xap:Rating>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>

By "custom data" I meant, in essence, that each file has its own data. The structure is always "Title" and then "URL", but, as you can see from the first link, the data under those headings is specific to each picture and thus custom. Basically the structure is fixed, but not the data belonging to each heading, of course.

Your solution sounds interesting. I know I'm dealing with a plain text format and so it's not strictly structured, but the data could be split according to the headers, e.g. searching for "Title:" would take that chunk of data and not confuse it with "URL:". Is that the kind of consistency you mean?

I have two widely different examples here of an outputted TXT file, but they both adhere to the same structure:

1. http://cl.ly/2I0k2Z3L3m1U1f1V0M3i

2. http://cl.ly/0R0b0x2G1j0k3k3C1v2U

What do you think? And thanks for your reply.

Annihilannic · Mar 29, 2012

Here is something quick'n'dirty:

Perl:

#!/usr/bin/perl -w

use strict;

my $template="<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9-9, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='' xmlns:Iptc4xmpCore='http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/'>
    <Iptc4xmpCore:SubjectReference>URL</Iptc4xmpCore:SubjectReference>
<Iptc4xmpCore:SubjectCode>
    <rdf:Bag>
        <rdf:li>URL</rdf:li>
    </rdf:Bag>
</Iptc4xmpCore:SubjectCode>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photoshop='http://ns.adobe.com/photoshop/1.0/'>
    <photoshop:Headline>TITLE</photoshop:Headline>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <dc:description><rdf:Alt><rdf:li xml:lang='x-default'>COMMENT</rdf:li></rdf:Alt></dc:description>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photomechanic='http://ns.camerabits.com/photomechanic/1.0/'>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
    <xap:Rating>0</xap:Rating>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
";

foreach my $file (@ARGV) {
	open F, $file or die "unable to open $file";
	local $/;
	my $content=<F>;
	if ($content =~ /Title:\n(.*?)\n\nURL:\n(.*?)\n\nComment:\n(.*)/) {
		my $title=$1;
		my $url=$2;
		my $comment=$3;
		print "Title is $title\n\n";
		print "URL is $url\n\n";
		print "Comment is $comment\n\n";
		my $xml = $template;
		$xml =~ s/TITLE/$title/g;
		$xml =~ s/URL/$url/g;
		$xml =~ s/COMMENT/$comment/g;
		open OUTPUT, ">${file}.xml" or die "unable to create output ${file}.xml";
		print OUTPUT $xml;
		close OUTPUT;
	} else {
		print "$file has unexpected format\n";
	}
	close F;
}

You would execute it using perl scriptname.sh file1.txt file2.txt etc. It works by using a template and substituting in the required fields where the capitalised placeholder keywords are in the template.

Where I half expect this to break though is when the supplied text is not "XML-safe"... I'm no XML wizard so not sure how to recognise that reliably.
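On the "XML-safe" worry: in element text the characters that need attention are &, < and >. Below is a minimal sketch of an escaping helper, not a full XML serializer. The name xml_escape is hypothetical (it is not part of the script above), and it only covers element content, not attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the three characters that are unsafe
# in XML element text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Title taken from the sample file quoted earlier in the thread.
print xml_escape('Dribbble - Cat & Giraffe by Luke Bott'), "\n";
```

Each captured field would be passed through such a helper before being substituted into the template.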

Goppin · Mar 30, 2012

This looks great, even though I don't really understand it. I've gone over it though and sort of get what's going on. I'll see what I can do and get back to you.

Thanks for your help!

Goppin · Mar 31, 2012

I found that loads of errors occurred, too many to list. The Terminal bit worked, but, for example I received many errors similar to this:

Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 1, near "ansi\"
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 1, near "ansicpg1252\"
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 1, near "cocoartf1038\"
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 2, near "fonttbl\"

I was trying it with the following file:

http://cl.ly/1f2e1T3c2z0W1y3S0U2p

...and then other errors like this:

(Missing semicolon on previous line?)
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 8, near "f0\"
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 8, near "fs24 \"
(Do you need to predeclare fs24?)
Backslash found where operator expected at /Users/zzz/Desktop/Test/xmlconvert.sh line 9, near "cf0 #!/usr/bin/perl -w\"

A bit of a bugger. I think it's going to be something simple that trips the script up over and over, so it looks worse than it is.

Thanks again.

Annihilannic · Apr 1, 2012

Hmm... I'm sensing some corruption of the script contents. What environment are you working in (OS, shell, perl version, etc)? How exactly are you invoking the script?

Especially odd is the fact that the shebang line ("#!/usr/bin/perl -w"), which needs to be the very first line of a script, appears to be on line 9.

I had a typo in my previous post, I meant to call it scriptname.pl... however the name really doesn't matter.
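The tokens in the error messages (ansi, ansicpg1252, cocoartf1038, fonttbl) look like RTF control words, which would be consistent with the script having been saved as rich text rather than plain text. That is an inference from the messages, not something confirmed in the thread. A small sketch that checks whether a file starts with the RTF signature; looks_like_rtf is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text begins with the RTF signature "{\rtf".
sub looks_like_rtf {
    my ($text) = @_;
    return $text =~ /\A\{\\rtf/ ? 1 : 0;
}

# Check every file named on the command line.
foreach my $path (@ARGV) {
    open my $fh, '<', $path or die "unable to open $path: $!";
    my $head = '';
    read $fh, $head, 16;    # the signature sits in the first few bytes
    close $fh;
    print "$path: ",
        (looks_like_rtf($head) ? "looks like RTF, re-save as plain text"
                               : "no RTF header found"),
        "\n";
}
```

Running it against the script file itself (for example perl checkrtf.pl xmlconvert.sh, where checkrtf.pl is an assumed file name) would show whether the editor wrapped the code in RTF.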

Goppin · Apr 3, 2012

Right, well, the script has moved on since we last spoke, thanks largely to the time and expertise of Phil Harvey (ExifTool: http://www.sno.phy.queensu.ca/~phil/exiftool/).

So now the script reads the text files and spits out an XMP file with the data from each text file in the right places:

#!/usr/bin/perl -w

use strict;

my $commentTemplate="<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:description><rdf:Alt><rdf:li xml:lang='x-default'>COMMENT</rdf:li></rdf:Alt></dc:description>
</rdf:Description>
";

my $template="<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9-9, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='' xmlns:Iptc4xmpCore='http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/'>
<Iptc4xmpCore:SubjectReference>URL</Iptc4xmpCore:SubjectReference>
<Iptc4xmpCore:SubjectCode>
<rdf:Bag>
<rdf:li>URL</rdf:li>
</rdf:Bag>
</Iptc4xmpCore:SubjectCode>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photoshop='http://ns.adobe.com/photoshop/1.0/'>
<photoshop:Headline>TITLE</photoshop:Headline>
</rdf:Description>
COMMENT
<rdf:Description rdf:about='' xmlns:photomechanic='http://ns.camerabits.com/photomechanic/1.0/'>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:Rating>0</xap:Rating>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
";

foreach my $file (@ARGV) {
    open F, $file or die "unable to open $file";
    local $/;
    my $content=<F>;
    print "==== $file\n";
    if ($content =~ /Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?$/) {
        my $title=$1;
        my $url=$2;
        my $comment=$4;
        print "Title is $title\n\n";
        print "URL is $url\n\n";
        if (defined $comment) {
            print "Comment is $comment\n\n";
        } else {
            $comment = '';
        }
        my $xml = $template;
        # escape the characters that are not safe in XML text
        foreach ($title, $url, $comment) {
            s/&/&amp;/g;
            s/>/&gt;/g;
            s/</&lt;/g;
        }
        $xml =~ s/TITLE/$title/g;
        $xml =~ s/URL/$url/g;
        # only emit the dc:description block when there is a comment
        $xml =~ s/COMMENT/$commentTemplate/ if length $comment;
        $xml =~ s/COMMENT/$comment/g;
        # strip all extensions so the output is named filename.xmp
        my $outfile = $file;
        $outfile =~ s/\.[^\\\/]*$//;
        open OUTPUT, ">$outfile.xmp" or die "unable to create output $outfile.xmp";
        print OUTPUT $xml;
        close OUTPUT;
    } else {
        print "Unexpected format\n";
    }
    close F;
}

The code *does* work for me, but a problem I've found is that some files do not have the "Comment" field filled out (i.e. only "Title" and "URL" are present), and this stops the script from working as it should.

Apparently the following line is of concern:

if ($content =~ /Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?$/) {

I thought the problem might be that matching "Comment" wasn't case sensitive, but it is. I've followed how the code moves from step to step and I can't tell why it doesn't simply skip the Comment search when there is no Comment in a text file.

One section of the script makes sure that the output file always ends in a single .xmp extension. A picture file that cannot hold metadata itself (a BMP, for example) is accompanied by a text file, so "filename.bmp" comes with "filename.bmp.txt". If the script simply appended an extension, the output would be "filename.bmp.txt.xmp", which would not work when importing into Aperture: the output must have the same file name as the picture, with .xmp as its only extension.
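The renaming step described here can be tried on its own. The sketch below wraps the substitution from the script in a hypothetical helper, xmp_name; the second path is only an illustration assembled from names that appear in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: strip every extension from the file name part
# (using the same substitution as the script) and append ".xmp".
sub xmp_name {
    my ($file) = @_;
    (my $base = $file) =~ s/\.[^\\\/]*$//;
    return "$base.xmp";
}

print xmp_name($_), "\n"
    for ('filename.bmp.txt', '/Users/zzz/Desktop/Test/1333350837348_1.png.txt');
```

Note that the pattern removes everything from the first dot in the file name, so a name such as my.photo.bmp.txt would become my.xmp.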

Apparently the problem lies in the "$content" test: the regular expression doesn't match, so the script reports the file as unexpected. I have a link about how regular expressions work, but it's as long as my Santa list and full of stuff I don't understand, so I was hoping somebody else might be able to help.

link:

http://perldoc.perl.org/perlre.html#Regular-Expressions

txt with "Title", "URL" and "Comment":

http://cl.ly/1X2a2t2o0P3l2H1y2B1V

txt with just "Title" and "URL":

http://cl.ly/0N3H2r3Y1p0M0h0I143M

Thanks!

Annihilannic · Apr 3, 2012

Progress indeed.

I ran your code against the data on the second link you provided there and it seems to work as you describe. This was the XMP produced:

Code:

<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9-9, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='' xmlns:Iptc4xmpCore='http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/'>
    <Iptc4xmpCore:SubjectReference>http://smokingdesigners.com/choosing-typeface-projects/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+smokingdesigners%2FXkob+%28Smoking+Designers%29&amp;utm_content=Google+Reader</Iptc4xmpCore:SubjectReference>
<Iptc4xmpCore:SubjectCode>
    <rdf:Bag>
        <rdf:li>http://smokingdesigners.com/choosing-typeface-projects/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+smokingdesigners%2FXkob+%28Smoking+Designers%29&amp;utm_content=Google+Reader</rdf:li>
    </rdf:Bag>
</Iptc4xmpCore:SubjectCode>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:photoshop='http://ns.adobe.com/photoshop/1.0/'>
    <photoshop:Headline>Choosing the Appropriate Typeface for your Projects | SmokingDesigners | Graphic Design Fashion and Photography</photoshop:Headline>
</rdf:Description>

<rdf:Description rdf:about='' xmlns:photomechanic='http://ns.camerabits.com/photomechanic/1.0/'>
</rdf:Description>
<rdf:Description rdf:about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
    <xap:Rating>0</xap:Rating>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>

So... I don't understand what's not working?

Goppin · Apr 3, 2012

When I ran the code it didn't work and produced the "Unexpected format" error.

I'm typing:

perl xmlconvert.pl 1333350837348_1.png.txt

How did you get it to work?

Annihilannic · Apr 3, 2012

I just copied and pasted the text from the linked page into notepad, saved as a text file, and then ran it the same way as you did using Strawberry Perl under Windows. (I wouldn't normally do it under Windows but I thought initially that you might be...)

What operating system are you working in? Could there be funny characters in the file that I inadvertently filtered out? If it's a Unix-like environment, what is the output of cat -vet 1333350837348_1.png.txt?

Goppin · Apr 3, 2012

I'm actually on OS X (10.6), the output is:

Title:$
Nezinscot Farm | Sustainable Packaging Design$
$
URL:$
http://ambalaj.se/2011/11/16/nezinscot-farm/$
$
Comment:$
Lindsay Perkins [http://www.lindsayperkins.com/] ?M-^@M-^]To design a brand and package line that reflected the farm?M-^@M-^Ys eco-friendly agricultural methods. All packaging is reproduced individually, making each piece a little bit different, just as products from an organic farm. The concept The Grass is Greener on our side is emphasized throughout the brand and product lines by the use of biodegradable paper made with grass seeds. So wherever and however the packaging is disposed, grass will grow one way to give back to the farm?M-^@M-^Ys free range animals.?M-^@M-^]

But the "1333350837348_1.png.txt" file has a "Comment" section. I know you mentioned the second file, which is the one without (and the one I'm having trouble with), but "1333350837348_1.png.txt" works for me.

I don't know what the problem is. I've tried them together and separately, but the one without the "Comment" entry won't work!

Annihilannic · Apr 3, 2012

Ah, I hadn't tried that one.

When I did, it also failed due to the lack of an end-of-line character after the comment. As soon as I added one it worked fine.

You can remove the $ from the end of your regular expression if the EOLs aren't always present.

As you can see from the cat -vet output the directional double-quote characters may also present challenges... see how it goes.
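The two things noted here, a missing final end-of-line and non-ASCII characters such as the curly quotes, can also be checked from Perl where cat -vet is not available. This is a rough sketch only; describe_text is a hypothetical helper, and it counts raw bytes rather than decoded characters.

```perl
use strict;
use warnings;

# Hypothetical helper: report a few properties of a file's raw bytes.
sub describe_text {
    my ($bytes) = @_;
    my %info = (
        ends_with_newline => ($bytes =~ /\n\z/ ? 1 : 0),
        has_cr            => ($bytes =~ /\r/   ? 1 : 0),
        # number of bytes outside the 7-bit ASCII range
        non_ascii_count   => scalar(() = $bytes =~ /[^\x00-\x7F]/g),
    );
    return \%info;
}

# Report on every file named on the command line.
foreach my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "unable to open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    $bytes = '' unless defined $bytes;
    my $info = describe_text($bytes);
    printf "%s: final newline=%s, carriage returns=%s, non-ASCII bytes=%d\n",
        $path,
        ($info->{ends_with_newline} ? 'yes' : 'no'),
        ($info->{has_cr}            ? 'yes' : 'no'),
        $info->{non_ascii_count};
}
```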

Goppin · Apr 4, 2012

Well, I thought I had it, I removed the "$" from the end of the following line:

if ($content =~ /Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?/) {

And the script processed all the files, but this time it only converted the "Title" data, nothing else. This applies both to txt files with a "Comment" section and to those without, so now all the XMP files have just the "Title" data in them.
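A likely explanation for the Title-only result: without the trailing $, nothing forces the lazy (.*?) after "URL:" to grow, and the Comment group is optional, so the regex can succeed with an empty URL and no comment. The sketch below shows this on made-up sample data; match_unanchored is a hypothetical helper name.

```perl
use strict;
use warnings;

# Made-up sample in the Title/URL/Comment layout described in the thread.
my $content = "Title:\nSome title\n\nURL:\nhttp://example.com/\n\nComment:\nA comment\n";

# Hypothetical helper: apply the pattern without the trailing "$"
# and return what each capture group holds.
sub match_unanchored {
    my ($text) = @_;
    if ($text =~ /Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?/) {
        return { title => $1, url => $2, comment => $4 };
    }
    return;
}

my $m = match_unanchored($content);
printf "title=[%s] url=[%s] comment=[%s]\n",
    $m->{title}, $m->{url},
    defined $m->{comment} ? $m->{comment} : '(undef)';
# prints: title=[Some title] url=[] comment=[(undef)]
```

The title is only captured because the literal "URL:" that follows it forces that first lazy group to expand; nothing comparable follows the second group once the anchor is gone.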

Annihilannic · Apr 5, 2012

I haven't had time to test much, but does this work for you?

Code:

    if ($content =~ /Title:\n(.*?)\s*URL:\n(.*?)(\s*Comment:\n(.*?)\s*)?$/) {

Goppin · Apr 5, 2012

Afraid it only works on txt files with a Comment in them, not on the ones without. I tried removing the trailing "$" again, but that only reproduced the issue of "URL" and "Comment" being ignored.

Thanks for trying.

prex1 · Apr 6, 2012

Try this

Code:

  if ($content =~ /Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?$/s) {

The s modifier allows (.*?) to match newlines as well, which may be present within the fields. Those newlines, except the terminating ones, will stay in the captured fields; you should decide whether to accept or remove them. Also, if the Comment field may have no terminating newline, you should replace the last \s+ with \s*.
BTW I wouldn't have used a regex for this task in the first place, but this is a matter of taste for complications... :)
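To see the effect of the s modifier in isolation, here is a small sketch on made-up data: a file with only Title and URL, followed by a blank line. Whether the real files look exactly like this is an assumption. The helper name extract and the trimming step are illustrative additions, not part of the script in the thread.

```perl
use strict;
use warnings;

# Made-up sample: Title and URL only, with a blank line after the URL.
my $content = "Title:\nSome title\n\nURL:\nhttp://example.com/\n\n";

# The pattern from the thread, without and with the /s modifier.
my $without_s = qr/Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?$/;
my $with_s    = qr/Title:\n(.*?)\s+URL:\n(.*?)(\s+Comment:\n(.*?)\s+)?$/s;

# Hypothetical helper: return the fields on a match, nothing otherwise.
sub extract {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    my ($title, $url, $comment) = ($1, $2, $4);
    # Trim whitespace (including newlines) the lazy groups picked up.
    for ($title, $url, $comment) {
        next unless defined;
        s/^\s+//;
        s/\s+$//;
    }
    return { title => $title, url => $url, comment => $comment };
}

print "without /s: ", (extract($without_s, $content) ? "match" : "no match"), "\n";
print "with /s:    ", (extract($with_s,    $content) ? "match" : "no match"), "\n";
```

Without /s the dot cannot cross the extra newline after the URL, so the final $ is never reached; with /s the URL group absorbs that newline, which is why the trimming step is worth keeping.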

Franco
http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

Annihilannic · Apr 6, 2012

How would you have done it, Franco? Just processing the input file line-by-line, or something fancier?

As I said originally... quick and dirty; I wasn't expecting it to get this complicated.

Goppin · Apr 6, 2012

That's it Prex1, you nailed it.

It works!

Many thanks to you and the continued efforts of those here on the thread.

prex1 · Apr 6, 2012

Yes, line by line, Annihilannic.
Something like

Code:

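# which heading we are under (0 = none yet), and the fields collected so far
my ($where, $title, $URL, $comment) = (0, '', '', '');
while(<F>){
  if(index($_,'Title:')==0){
    $where=1;
  }elsif(index($_,'URL:')==0){
    $where=2;
  }elsif(index($_,'Comment:')==0){
    $where=3;
  }else{
    chomp;
    s/\s+/ /g;          # collapse runs of whitespace
    next unless $_;     # skip empty lines
    if($where==1){
      $title.=$_.' ';
    }elsif($where==2){
      $URL.=$_.' ';
    }elsif($where==3){
      $comment.=$_.' ';
    }
  }
}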

This too could be described as quick and dirty (intentionally using elsif in place of other more elegant constructs), but it is far more readable and modifiable, and would be faster if thousands of files had to be processed.
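For anyone wanting to try the line-by-line idea in isolation, here is a self-contained variation. It is a sketch, not the code above: parse_fields is a hypothetical name, headings are matched exactly rather than by prefix, and the sample data is made up (and deliberately has no Comment section).

```perl
use strict;
use warnings;

# Hypothetical helper: read Title/URL/Comment sections from an open filehandle.
sub parse_fields {
    my ($fh) = @_;
    my %field   = (title => '', url => '', comment => '');
    my %heading = ('Title:' => 'title', 'URL:' => 'url', 'Comment:' => 'comment');
    my $current;                      # field the current line belongs to
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;           # drop the newline and trailing whitespace
        if (exists $heading{$line}) { # a heading line switches the section
            $current = $heading{$line};
            next;
        }
        next unless defined $current && length $line;
        # join multi-line values with single spaces
        $field{$current} .= ($field{$current} eq '' ? '' : ' ') . $line;
    }
    return \%field;
}

# In-memory filehandle over made-up sample data.
my $sample = "Title:\nSome title\n\nURL:\nhttp://example.com/\n\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $f = parse_fields($fh);
close $fh;
print "title=[$f->{title}] url=[$f->{url}] comment=[$f->{comment}]\n";
# prints: title=[Some title] url=[http://example.com/] comment=[]
```

A missing Comment section simply leaves that field empty, so no optional-group logic is needed.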
