Script won't open xml file for editing

Status
Not open for further replies.

bingoldsby

Technical User
Jan 24, 2002
68
0
0
US
In another thread, I have been asking for assistance getting an HTML form to call a script which will open an XML file, find a specific line, and add another line using the form's input data to modify that new line. (it's adding an email address to a whitelist)

I have the form and a script, but the script has a problem which I can not discern and fix. Hoping for help here.

The problem seems to be that the script cannot open the xml file to gather it all up into an "array of strings." The webserver (Apache on Windows) gives that error message in its error log. Depending on how and from where I try to access the cgi, there are other problems, it seems.

Here is the script.

Code:
#!c:\perl\bin\perl.exe

# The following accepts the data from the form and splits it into its component parts

read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    
@pairs = split(/&/, $buffer);
    
foreach $pair (@pairs) {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $FORM{$name} = $value;
}

# Read XML file to array of strings

$xml_file="c:\prpgram files\apache group\apache\cgi-bin\whitelist_test.xml";
open(XML, $xml_file) || die("Could not open file!");
@xml=<XML>;
close(XML);

# Write XML back to file, adding email

open(XML,">>$xml_file") || die("Cannot Open File");
foreach $line (@xml) {
    print XML $line;
    if ($line == "<expression casesensitive=\"no\" type=\"regex\" onmatch=\"score += 1\">zyxxxyz</expression>") {
        print XML "<expression casesensitive=\"no\" type=\"regex\" onmatch=\"score += 1\">$FORM{email}</expression>"
    }
}
close(XML);

# Write the thank you page

print "Content-type: text/html\n\n";

print <<EndStart;

    <html>
    <head>
    <title>Thank You</title>
    </head>
    
    <body bgcolor="#ffffff" text="#000000">
    
    <h1>Thank You</h1>
    
    <p>Your email address has been stored.</p>
    
    </body>
    </html>

EndStart

There may be other problems with this script, but this is the only one I have been able to identify so far. The section of the script: "#read XML file to array of strings" and in particular, the line: "open(XML, $xml_file) || die("Could not open file!");" is where the first and halting error occurs when the script is run directly from a browser (not via the form - there are other issues there for later)

I'll post or email the other parts of the puzzle - the form and the xml file for anyone who can help. It seems like this should be an easy go for several persons who hang out here, according to what I've been reading in other threads.

I'm very hopeful on getting a solution to this problem.

Thank you all very much.

Brian - Union Gospel Mission
Yakima, WA
 
Your main problem is this line:
Code:
$xml_file="c:\prpgram files\apache group\apache\cgi-bin\whitelist_test.xml";

Within double-quoted strings, a backslash is an escape character. You could double them up - \\ codes for \ - but the better solution is to (a) use single quotes (why turn on interpolation when you're not interpolating?) and (b) use forward slashes. DOS and 'doze are perfectly happy with either, and getting into the habit of using forward slashes protects you from making this mistake again and makes your code more portable, as *nix and many other OSs use forward slashes.
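
To illustrate the escape problem, here is a minimal sketch using a made-up path (c:\temp\new.xml is an assumption for the demo, not the real file). In double quotes, \t and \n turn into a tab and a newline; in single quotes the backslashes survive, and forward slashes avoid the issue altogether.

```perl
use strict;
use warnings;

# Hypothetical path, chosen because \t and \n are well-known escapes.
my $double  = "c:\temp\new.xml";   # \t becomes a tab, \n becomes a newline
my $single  = 'c:\temp\new.xml';   # single quotes: backslashes stay as typed
my $forward = 'c:/temp/new.xml';   # forward slashes need no escaping at all

printf "double-quoted length: %d\n", length $double;   # two characters shorter
printf "single-quoted length: %d\n", length $single;
print  "double-quoted string contains a tab\n" if $double =~ /\t/;
```

The real path in the script is mangled in the same way, just with different escapes.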

I also suspect that "prpgram" should read "program".

And now a plea: this sort of code
Code:
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    
@pairs = split(/&/, $buffer);
    
foreach $pair (@pairs) {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $FORM{$name} = $value;
}
is doomed to failure and early obsolescence. Parsing http data is much more complicated than this and your code will fall over if it encounters unicode or any character-set-encoding change as well as several other fairly common http extensions.

You should use CGI.pm. It makes your code clearer and simpler, copes with lots of arcane details that you may never have heard of, let alone coded round, and offers a large measure of future-proofing.

Parsing queries is very hard but all of the work has been done for you for free if you use the module.

I also note that your XML parsing code makes huge assumptions about the layout of the XML - in particular, that at least one opening tag, content and closing tag all live on the same line.

If you are absolutely sure that this will always be the case then you will get away with it. If you are not generating the XML with your own code, you may have a nasty shock in the future, which the use of an XML module would solve. Your call - real XML parsing is not without overhead.

Yours,

fish


"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maur
 
Thank you, Fish,

I'm sure I like the way you think also. Unfortunately, I don't understand anything about what you're thinking - except the part about the quotes around the xml file location string. I did notice some errors talking about unexpected escaped characters in that line. I have some Regex experience, so know the concept and importance.

As for using "modules" and such, that's way beyond me. This code was written by a fellow on the xml board at this site. He was kind enough to give it a go for me but doesn't have Perl installed on his machine and couldn't try it out. In spite of that, of course, I greatly appreciate his efforts, as it has given me a place to start, before which I had none.

The form/cgi/xml setup is to be used by the few Email users on our local network (about 30) to add email addresses to an xml spam-filtering-system whitelist. We would have no issues with performance what-so-ever, and no outside access would be given. This specific script will do no more than just find the given line and add another beneath it (leaving the original search-line intact)

As I try to continue to implement this script, I'll be ever so grateful if you will stand by and give me enough assistance to get it working properly.

Last night, I downloaded a trial of the personal edition of Komodo code editor from ActiveState. I suppose I should have gotten on here and asked which one was recommended. I have no experience in Perl scripting, except for setting some up for a few forms I'm running. Never had to do something out of the ordinary. It's frustrating for a beginner, but an educational challenge, none-the-less. I'm almost 60, so the old dog is reaching.

Thanks again. I'll give a go with what I can upon your suggestions and let you know where it goes. Probably quite soon today.

Brian
 
I've got a long drive to a gig tonight so I'm off-line until tomorrow, crack of lunch. I don't have time to test this now, but give this a shot:
Code:
#!c:\perl\bin\perl.exe

use CGI qw/ :standard /;

# Read XML file to array of strings

$xml_file=[red]'[/red]c:[red]/[/red]pr[red]o[/red]gram files[red]/[/red]apache group[red]/[/red]apache[red]/[/red]cgi-bin[red]/[/red]whitelist_test.xml[red]'[/red];
open(XML, $xml_file) [red]or[/red] die("Could not open file [red]$![/red]");
@xml=<XML>;
close(XML);

# Write XML back to file, adding email

open(XML,">>$xml_file") || die("Cannot Open File");
foreach $line (@xml) {
    print XML $line;
    if ($line == "<expression casesensitive=\"no\" type=\"regex\" onmatch=\"score += 1\">zyxxxyz</expression>") {
        print XML "<expression casesensitive=\"no\" type=\"regex\" onmatch=\"score += 1\">[red]param('email')[/red]</expression>"
    }
}
close(XML);

Note the use of the $! system variable - this gives an automatic explanation of why the open failed. In addition, using [tt]or[/tt] rather than || can save you from subtle problems. The only difference is the precedence - [tt]or[/tt] binds much more loosely - and using [tt]or[/tt] guarantees that the die() doesn't get sucked into an expression earlier on the line.
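
A small sketch of that precedence difference, using a file name that is assumed not to exist (no_such_file_for_demo.xml is made up). It only bites when open is written without parentheses, which the code above does not do, but it shows why or is the safer habit.

```perl
use strict;
use warnings;

my $missing = 'no_such_file_for_demo.xml';   # assumed not to exist
my $fh;

# Without parentheses, || grabs the file name: open $fh, '<', ($missing || die ...).
# The name is a true value, so die never runs even though open fails.
my $died_with_pipes = eval { open $fh, '<', $missing || die "cannot open\n"; 1 } ? 0 : 1;

# 'or' binds more loosely than the comma, so it tests the result of open itself.
my $died_with_or = eval { open $fh, '<', $missing or die "cannot open: $!\n"; 1 } ? 0 : 1;

print "with ||: ", ($died_with_pipes ? "died" : "failure went unnoticed"), "\n";
print "with or: ", ($died_with_or    ? "died" : "failure went unnoticed"), "\n";
```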

Apologies if there are bugs or typos there.

Yours,

fish

 
Ok... Two more issues:

Software error:

Can't find string terminator "EndStart" anywhere before EOF at c:\PROGRA~1\APACHE~1\apache\cgi-bin\test.cgi line 39.

This comes up when attempting to run the CGI directly from the browser. Line 39 now is the start of the "Print return message" section - Print <<EndStart; - specifically.

And... when trying to run the script by entering data in the form, from the same machine across the network (not using the machine that the script is on), I get a Windows dialogue box which says "Alert - c is not a registered protocol" - that from Firefox. IE6 just does nothing.

If I try to run the CGI directly from the browser on the Server, I see another Windows dialogue box which is "file download" running the progress bar - like it is actually downloading a file. That's curious.

Will the form code help?

And so far, no changes to the xml file are being made yet.

As before, I really appreciate your suggestions about the changes to the over-all structure of the script, but I'm not capable of doing such or even understanding the benefit.

Thanks,

Brian
 
OOps,

I missed your response this morning, Fish, before I wrote the immediate above. I'm just finding it and will try it as soon as possible.

Thanks.

Brian
 
Well, although the result of your change (Fish) to the script is unusual (not yet correct), I consider this to be progress.

Running the script directly (without the final section to write back a "Success Notification") the xml file is opened, and written to (rewritten?). The entire original file contents remain the same - then it is written again (directly continuing from the bottom of the original), but with the line:

<expression casesensitive=\"no\" type=\"regex\" onmatch=\"score += 1\">param('email')</expression>

directly preceding every line of the original.

I have yet to be able to run the script via the form and get any kind of response. (read the preceding message)

Looking for your return. Played pretty, I hope.

Brian
 
Fish, and all,

Please read my final post in the other thread I got into. I respectfully request to be excused from this discussion.

Thank you,

Brian
 
I was really moody when I posted that - I hate playing weddings at the best of times and this one was late and distant! We're very close to a solution now - I've just read all the various posts on this subject and now better understand what you want to do.

If you want to continue, the code beneath might work.

As to the errors above, "'c' is not a registered protocol" gave me a few moments of pause. Perl's open uses the underlying operating system's open, as you would expect, and 'doze now includes support for opening URLs, so I think that your line "open ('c:/....." was being interpreted as a URL - cf open(' or open('irc://.... - hence the error: c isn't http, ftp, mailto or any recognised protocol. This could only be determined at run-time as new apps can register new protocols dynamically. Despite my earlier rant about forward-slashes and back-slashes in paths, it seems that drive designators break under new 'doze. This is news to me and, I suspect, many others so I'm going to post it as a tip. The solution is to go back to backslashes. They would need escaping as \\ in double-quoted strings but not in single-quoted strings, so that's what we'll use.

Next:
The entire original file contents remain the same - then it is written again

This is a consequence of
Code:
open(XML,"[red]>>[/red]$xml_file") || die("Cannot Open File");
where you are opening the file in append mode. You can use a single '>' to open in "clobber" mode.

You could also use [tt]+<[/tt] to open in read-write mode and use [tt]seek()[/tt] to go back to the top before writing, which is slicker than closing and re-opening. It will also help if you want later to add file locking to protect against multiple, simultaneous invocation.

Now for the write-back loop. I think that you are trying to add a line for the email address in the right place in the xml file and you are using zyxxxyz as a flag to indicate the correct position. Is this correct?

There are several reasons why your code isn't currently doing this. Firstly, == compares numbers and the strings you are looking at, when evaluated as numbers, are zero, so the conditional block would always be executed. We need to use [tt]eq[/tt] to compare scalars as strings.
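
A quick sketch of the difference, with two made-up strings standing in for lines of the file (the email address is hypothetical):

```perl
use strict;
use warnings;
no warnings 'numeric';   # silence "isn't numeric" so the demo output stays clean

my $flag  = '<expression>zyxxxyz</expression>';
my $other = '<expression>someone@example.com</expression>';

my $numeric_equal = ($flag == $other) ? 1 : 0;   # both strings count as 0
my $string_equal  = ($flag eq $other) ? 1 : 0;   # compared character by character

print "== says equal: $numeric_equal\n";
print "eq says equal: $string_equal\n";
```

With warnings enabled and the 'numeric' category left on, perl would also complain that the arguments to == aren't numeric, which is a handy hint that the wrong operator is in use.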

Before we go on: a quick note on quoting. Perl has several types of quotes and quote-like operators and these can be used to reduce the number of \ characters needed in strings. Rather than [tt]"this contains a \" character"[/tt], it is neater (and, actually, more efficient) to write [tt]'this contains a " character'[/tt].

Combining these ideas, (and using eq instead of == for string comparison)
Code:
#!c:\perl\bin\perl.exe

use CGI qw/ :standard /;

# Read XML file to array of strings

$xml_file='c:\program files\apache group\apache\cgi-bin\whitelist_test.xml';
open(XML, '+<', $xml_file) or die("$0: $xml_file $!");
@xml=<XML>; # slurp

# Write XML back to file, adding email
seek( XML, 0, 0 );

my $preamble = '<expression casesensitive="no" type="regex" onmatch="score += 1">';
[red]# did I not see leading spaces in another thread? - easy to add them if you need[/red]
my $flag = $preamble . 'zyxxxyz</expression>';

foreach $line (@xml) {
  print XML $line;
  if ($line eq $flag ) {
    print XML $preamble, param('email'), '</expression>', "\n";
  }
}
close(XML);
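
One thing to check if the flag line never seems to match: each line read with <XML> keeps its line ending, while $flag as written has none, and the file may have leading tabs or spaces. A minimal sketch of a more forgiving comparison follows; is_flag_line is a made-up helper name, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: compare a line from the file with the flag text,
# ignoring leading and trailing whitespace (including the line ending).
sub is_flag_line {
    my ($line, $flag) = @_;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    return $line eq $flag ? 1 : 0;
}

my $flag = '<expression casesensitive="no" type="regex" onmatch="score += 1">zyxxxyz</expression>';

# A line as it might come out of the file: leading tab, Windows line ending.
my $from_file = "\t$flag\r\n";

print "plain eq:     ", ($from_file eq $flag ? "match" : "no match"), "\n";
print "is_flag_line: ", (is_flag_line($from_file, $flag) ? "match" : "no match"), "\n";
```

In the loop, the test would then read if (is_flag_line($line, $flag)).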

Some other thoughts:

I think it would be clearer to use an XML comment to mark the insert point. Something like
Code:
<!-- INSERT POINT FOR AUTOMATIC ENTRIES -->
would make your code more self-documenting, which can be a huge bonus in 18 months or so when you are trying to modify it.

There are several performance improvements that spring to mind but they all complicate the code, which is probably not what you want. Here's a mildly hot-rodded version for comparison:
Code:
#!c:\perl\bin\perl.exe

use CGI qw/ :standard /;

# Read XML file to array of strings

use constant MARK => '<!-- END OF AUTOMATIC ENTRIES -->' . "\n";
use constant PRE => '<expression casesensitive="no" type="regex" onmatch="score += 1">';
use constant PST => "</expression>\n";
$xml_file='c:\program files\apache group\apache\cgi-bin\whitelist_test.xml';

open(XML, '+<', $xml_file) or die("$0: $xml_file $!");

while (<XML>) {
  if ($_ eq MARK ) {
    my $mark = tell;  # remember this location
    my @rest = <XML>; # only store the second half of the file
    seek( XML, $mark - length(MARK), 0 );
    print XML
      PRE, param('email'), PST, # prepend - less to copy next time
      MARK;
    print XML foreach @rest;
    last; # break out of the while() loop
  }
}
close(XML);

Untested, but it should suffice to illustrate my point.

Have I atoned?

f

 
Thank you very much, Fish. (hope that's an acceptable use of name calling). I presume you are in GB. Perhaps the time difference is somewhat of an advantage, as "sleeping on it" has always been good.

It seems like I'm always trying on something new (to me) or different, but usually want instant gratification. If I can get something that works first, I can then use it to try to understand the underlying principles. Plus, I have a bunch of people here that I have promised a way for them to add their own contacts to the whitelist and am anxious over that.

But really, this is better. I have been poking through some tutorials and am about to make the purchase of the code editor (Komodo by ActiveState - seems like it would be a good companion to the ActiveState Perl distro, I have). Unless you have a suggestion for something better?

I started to read and consider your writings above, but wanted to get this out first. Now I'll study what you have so kindly provided me and try to implement some changes.

Once again, I do deeply appreciate your assistance and accept it in the spirit of "learning," as it is desired that I do here. Good for you, I needed the redirection.

Brian
 
OK...

I've been able to run the form, get it to run the script (your first rewrite - haven't tried the revved up sample yet), and the script runs to the end (little "no boo boo" print out at the end of it). That's progress for me.

The "c is not a registered protocol" thing was coming from the form, which had as the "action" argument, the entire file location starting with "c:\program files\ etc." I moved the form to another directory, changed the form action to "../cgi-bin/etc." and it works -- in as far as the target xml file looks fine. But, no change or added email line is being made yet.

Because I see no change in the xml, I don't know if the script is actually acting on it or not. Is there something I can temporarily place in the script that will write something to the xml so I can actually see some action?

Next, I'm going to make the change to the XML "comment" to mark the insert point, as you suggest (good one) and see how badly I can mangle that.

Brian
 
OK, again... excellent this time!

I tried your souped up version and, BINGO!, that worked almost exactly as I've hoped - almost.

The added email addresses are being placed BEFORE the comment line <!-- END OF AUTOMATIC ENTRIES --> Seems like it should come after.

...and the <expression, etc. line is being written with two "<<" at the beginning of the line. That probably breaks the xml.

Also, when an email address is added, the "dot, or any dot" needs to be escaped because it's a regular expression. I could instruct each user to add the backslash, but it would be safer if it was inserted automatically. I made the mistake of not escaping a dot in the first part of an email address the other day and got a 32 GB log file generated as a result. Swamped the whole machine. Glad I found that pesky little period.

Perhaps I can figure out how to do it, but you might give me a clue.

Wow, this is great and I appreciate your helpfulness.

Thanks again,

Brian
 
Yes I live in the UK and, yes, fish is my name (legally changed many years ago). I'm glad you're seeing progress.

form, which had as the "action" argument, the entire file location starting with "c:\program files\ etc."
The value of the "action" attribute of a FORM tag is interpreted as a URL rather than as a filename, which explains your observations. I'm surprised that it's working with ../cgi-bin/etc as .. is usually banned at the top of a URL as an attempt to navigate the webserver's filesystem. It looks as if it's being ignored by your webserver. Try dropping the .. so that you have [tt]action="/cgi-bin/etc"[/tt] - you may end up with more portable HTML.

The added email addresses are being placed BEFORE the comment line

This is deliberate: the script runs faster before it encounters the trigger line - all it's doing is reading the file. Adding new items before the trigger line means that what you have added this time goes through the fast, non-copying loop next time rather than the slower copying part. It also adds to the end of the list, which seems more natural. You might want to move the mark to as near the end of the file as you can.

...and the <expression, etc. line is being written with two "<<" at the beginning of the line
I suspected this might happen. Change
Code:
seek( XML, $mark - length(MARK), 0 );
to
Code:
seek( XML, $mark - length(MARK) -1, 0 );
I was simply not going back far enough.
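
An alternative that avoids the length arithmetic, as a sketch only: remember where each line starts with tell() before reading it, then seek() straight back to that offset. The likely reason for the off-by-one is that on Windows the line ending occupies two bytes on disk while length(MARK) counts one character, though that is my assumption rather than something tested here. insert_before_mark is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: write $entry just before the first line equal to $mark.
# Returns 1 if the marker was found, 0 if the file was left unchanged.
sub insert_before_mark {
    my ($path, $mark, $entry) = @_;
    open(my $fh, '+<', $path) or die "$0: $path: $!";
    my $line_start = tell($fh);          # offset where the next line begins
    while (my $line = <$fh>) {
        if ($line eq $mark) {
            my @rest = <$fh>;            # everything after the marker line
            seek($fh, $line_start, 0);   # back to the start of the marker line
            print {$fh} $entry, $mark, @rest;
            close($fh) or die "$0: $path: $!";
            return 1;
        }
        $line_start = tell($fh);         # remember the start of the following line
    }
    close($fh);
    return 0;
}
```

It would be called along the lines of insert_before_mark($xml_file, MARK, PRE . param('email') . PST).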

when an email address is added, the "dot, or any dot" needs to be escaped

Hmmmm. Bit of a problem there, because it's not just dots - there may be other characters of significance to the regex engine that could occur in an email, or be injected maliciously. Quite which characters are unsafe depends on the regex engine being used and I assume that this is hidden in an application somewhere. The safest thing to do is probably to use perl's quotemeta function to escape all non-word characters:
Code:
    print XML
      PRE, quotemeta(param('email')), PST,
      MARK;
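
To see what quotemeta does, here is a small sketch with a made-up address. Whether the spam filter's regex engine is happy with an escaped @ is an assumption on my part; in Perl's own regexes it is harmless.

```perl
use strict;
use warnings;

my $email   = 'john.doe@example.com';   # made-up address
my $escaped = quotemeta($email);        # backslash before every non-word character

print "$escaped\n";

# The escaped form matches only the literal address, not "any character" for the dot.
my $matches_literal  = ('john.doe@example.com' =~ /^$escaped$/) ? 1 : 0;
my $matches_wildcard = ('johnxdoe@example.com' =~ /^$escaped$/) ? 1 : 0;
print "literal: $matches_literal, wildcard: $matches_wildcard\n";
```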

I hope that this gets your script running. Once it is, you should think about the problem of multiple simultaneous invocation which would, currently, trash your XML file. Are you familiar with the concepts of file locking?

Yours,

fish

 
Fish,

Got all that and understand. I fixed the double << thing and moved the comment line down to the bottom of the Email address section - seeing it add email addresses correctly to the bottom of the section. Good thinking, of course.

I'll look into the "file lock concept" and also try to figure out how to add a tab to the front of the <expression... line to match the rest of it - or just remove all the existing tabs and make the lines start without.

This form/script is only going to be available to a few trusted users here on the LAN. So I'm trying to be confident that no hacker is going to take advantage of its existence.

By the way, I am a musician also - piano/organ (theatre organ to be specific). I worked for over twenty years at a pipe organ factory (small). And at the same time played evenings several days a week at a number of pizza parlours which had Wurlitzer pipe organs installed. Great years, those.

Thanks and have a good evening.

 
I used to play church organ and once had a tour round Harrison & Harrison's, a traditional church organ builder. I watched guys in their 70s planing timber to within a couple of thou - absolutely amazing.

I played bass (both electric and steam) professionally for many years but these days play mostly guitars and mandolins, although I'm learning 5-string banjo (very trendy these days, it seems!)

I'm stupidly busy next week so I may not be able to answer questions as quickly as I would normally like, but I will try to check in at least once a day.

As for your leading tabs: \t works within double-quotes but, as you want double quotes in the string, you're probably best using one of the more exotic techniques described in QUOTE AND QUOTE-LIKE OPERATORS under perlop. I'd use
Code:
use constant PRE => [red]qq{\t[/red]<expression casesensitive="no" type="regex" onmatch="score += 1">[red]}[/red];
adding tabs to taste.

There's good info about file locking in both the flock section of perlfunc and in How can I lock a file? and subsequent sections in perlfaq5.
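
As a starting point, a minimal sketch of the read-modify-write cycle under an exclusive lock. add_line_locked is a made-up helper, and this simply appends a line rather than looking for your marker; the point is only where flock goes.

```perl
use strict;
use warnings;
use Fcntl qw(:flock :seek);

# Hypothetical helper: append one line while holding an exclusive lock.
# Returns the number of lines in the file afterwards.
sub add_line_locked {
    my ($path, $new_line) = @_;
    open(my $fh, '+<', $path) or die "$0: $path: $!";
    flock($fh, LOCK_EX)       or die "$0: cannot lock $path: $!";   # waits for other holders
    my @lines = <$fh>;                    # read only after the lock is held
    seek($fh, 0, SEEK_SET);               # back to the top before writing
    print {$fh} @lines, $new_line;
    truncate($fh, tell($fh));             # harmless here, needed if the file could shrink
    close($fh) or die "$0: $path: $!";    # closing the handle releases the lock
    return scalar(@lines) + 1;
}
```

The important design point is that the lock is taken before the file is read, so a second copy of the script cannot read stale contents while the first is still writing.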

Good luck,

fish


 
Once again, Fish, I thank you heartily (did that in every post you made here).

I have the form and the script up and running. It's performing well. Lots of musical notes to compare, I suppose. I also am going to be the busiest this week of any other time of the year. Been so for a week or so - plus the other family affairs. Good to be sitting for a change.

Keep a lip upper stiff.

Brian
 