Tek-Tips is the largest IT community on the Internet today!


Difficulty With LWP Automated Access to Password Protected WebSite 1


New3Perl (Programmer) · Sep 7, 2009 · US
I am trying to write a Perl application to monitor data on a password-protected web site, using the Perl & LWP book by Sean Burke as a reference. The site uses an HTTPS login page to collect a user id and PIN. Once logged in, one can navigate around and get various kinds of information that is occasionally updated. The problem is that I am not getting past the login page. Let's say the address of the login page is "https://www.thesite.com/login2.asp". The HTML for this page contains a form specification that includes

<form name="loginform" method="post" action="https://www.thesite.com/login.asp">

It also includes JavaScript, which I am not too familiar with, plus one text element and one password element for the user id and PIN. Note the difference between the name of the page itself (".../login2.asp") and the page listed in the action (".../login.asp"). When I send a POST request to the login2.asp page, I just get that page's HTML back as a response. I am clearly not logging in successfully, because I get the same result whether I supply a valid or an invalid user id (and I know that an invalid user id, when entered manually in a browser, causes the site to send back different-looking HTML).

One of the things the Burke book says is that there can be problems when working with a site that uses JavaScript, because it isn't always clear what the proper form data to submit is. Maybe that is what is going on in my case. What would be helpful is (1) some sort of eavesdropping program that could monitor the traffic between my normal (non-Perl) browser and the site, so that I could see what a 'correct' POST request really looks like; and (2) the ability to get at the cookies that the site uses to keep track of a user session; that way I could log in manually and then start Perl with the proper cookie file to keep the session going. I currently use either IE or Firefox as a browser. However, the Burke book says that Perl can only read Netscape cookies (though the book was written a while ago).
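For reference, a minimal sketch of what such a form submission looks like in LWP, with a cookie jar attached. The URL and the `id`/`pin` field names are placeholders taken from this thread, not verified against the real site; the request is built but not actually sent:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(POST);

# Placeholder URL and field names -- substitute the real form action
# and the input names found in the login page's HTML.
my $req = POST 'https://www.thesite.com/login.asp',
    [ id => 'myid', pin => 'mypin' ];

print $req->content, "\n";    # the URL-encoded body: id=myid&pin=mypin

# A cookie jar lets the UserAgent carry session cookies between requests.
my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new(file => 'cookies.lwp', autosave => 1));
# my $response = $ua->request($req);    # uncomment to actually send it
```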

I'm fairly new to this sort of thing, so there may be other ways to handle this better. I'd appreciate any ideas on what to do.
 
New3Perl said:
When I try to send a post request to the login2.asp site, ...

I think, judging by that form, you should be sending your POST request to login.asp (as advised by the action attribute), not login2.asp.

Annihilannic.
 
Yes. I have tried it both ways. If I send the POST to login.asp (instead of login2.asp), then I get a response header with status line '302 Object Moved'. The content is a short HTML page with a link pointing back to login2.asp. So something is not quite right here. That's why I'm imagining that some sort of eavesdropping program that I could use to "listen in" on the messages between a browser and the site would be quite helpful.
 
Try Wireshark. It's free, and works well - bear in mind that it is a full-on tech-head's tool so it gives you every last bit of line traffic on the interface, but with suitable filtering on the output it will give you what you need. On Linux you might have to run as root though.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
If it's over SSL (https), Wireshark won't be of much help; you'll just see the encrypted SSL and not the HTTP requests underneath.

Take a look at oSpy. It hooks into the Winsock API so you can see, on a per-app basis, what data your applications are sending over the network. Because it sits at a level pretty close to the app itself, you see the HTTP traffic going over SSL before it is encrypted.

Besides that, getting a 302 header isn't that unusual; some sites do that after you log in so that you are automatically redirected to a new page, and hitting "Refresh" doesn't make your browser prompt you to re-submit your form data and log in again. Just make your code follow the URL it sends you to. If cookies are involved, make sure you set up a cookie jar; then the next page should give you the content you would get for being logged in.
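One hedged note on the LWP side of this: by default, LWP::UserAgent only follows redirects for GET and HEAD requests, so a 302 answer to a login POST is handed straight back to your code. A sketch of making it follow the Location header automatically:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Out of the box, requests_redirectable is ('GET', 'HEAD'); a 302 reply
# to a POST is returned to the caller instead of being followed. Adding
# POST to the list makes $ua chase the Location header itself:
push @{ $ua->requests_redirectable }, 'POST';

print join(', ', @{ $ua->requests_redirectable }), "\n";
```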

Cuvou.com | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
I'll definitely look into oSpy and see what it can do - it sounds like just what I need.

Regarding the 302 message - when I get it, I am always redirected to the logon page, not the page that I know I should reach after a successful logon. Also, as in the case where I POST to login2.asp, the 302 response is independent of whether or not a valid user id is provided. So I definitely don't think I am getting past the login. The site does use cookies to keep track of who is logged in, and I have the LWP cookie jar set up, so hopefully there's no problem there.

I'm not very familiar with JavaScript and how it works. Could it be that JavaScript is executed when the user hits the enter button after filling out the userid/pin form (as part of the form validation process?), such that the POST parameters actually submitted to the site are different from what one might guess by looking at the HTML for the page? Can you tell me how JavaScript works? My current understanding is that when some triggering event occurs (e.g., the user submits a form), a script runs to do something with the input. But I thought this script resides on the server, which would mean that non-HTML information has to be transmitted by the client to the server, the script runs, the results go back to the client, and then the (possibly modified) POST gets sent from the client back to the server. Does that sound correct, or am I not understanding something?

On a separate but related subject, I have found out that Firefox now keeps its cookies in a database file, as opposed to earlier Netscape versions that kept them in a text file. I'm thinking that one (non-optimal) way around this login problem is to log in to the site manually with Firefox, then export the site's cookies to a file that Perl could use. That way, I could avoid having Perl do the logging in. But I would have to log in manually each time I wanted to run, because the cookies change every time. I'd rather get Perl to do the login, but this might be a temporary workaround until I figure that out. Does anyone know how to get Firefox to export its cookies to a text file?
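On the cookie-file question: LWP ships an HTTP::Cookies::Netscape subclass that reads and writes the old Netscape cookies.txt text format, so if an add-on can export Firefox's SQLite cookie store to that format, the jar can be loaded directly. A sketch, where 'cookies.txt' is a hypothetical exported file and the cookie values are made up:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies::Netscape;

# 'cookies.txt' stands in for a file exported from the browser in the
# classic Netscape format (one tab-separated cookie per line). A
# missing file is silently treated as an empty jar.
my $jar = HTTP::Cookies::Netscape->new(file => 'cookies.txt');

# Hand the jar to the UserAgent; every request now carries its cookies.
my $ua = LWP::UserAgent->new;
$ua->cookie_jar($jar);

# Cookies can also be added to the jar programmatically:
# (version, key, value, path, domain, port, path_spec, secure, maxage, discard)
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.thesite.com',
                 undef, 0, 0, 3600, 0);
print $jar->as_string;
```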
 
It's been a while since I posted my original message on this topic. I've made some progress since then, but I am still having some problems. I initially focused on the second, non-optimal way of dealing with the login problem: manually logging in and then (also manually) copying the cookies from Firefox (Options/Show Cookies) into Perl's cookies.lwp file. This approach was successful, and I now have the basic monitoring capability that I wanted up and running.

However, I would still like to get the automated login working, and I remain stuck on that. I have tried using oSpy to spy on what goes on during a login, but most of the data being sent back and forth looks encrypted and I can hardly see any HTML. So I guess my current problem is that I can't get oSpy to work the way it is supposed to. Perhaps someone could recommend something to help me get it working. I'm afraid I don't have many clues as to what is wrong, but I'll describe what I'm seeing.

There are actually two ways to log in to the website. One is through what I will call the "main page", which has a lot of information on it in addition to the UserID/PIN field; this is a non-HTTPS page. If one logs in and then logs out, one is sent to what I'll call the "alternate login" page, which just has a place for the UserID/PIN. This is the page I referred to in my earlier posting ("https://www.thesite.com/login2.asp").
I've tried using oSpy to capture data while logging in either way. In either case, almost all of what I see going back and forth in the oSpy window is unintelligible. If I start from the main page, I do see a little HTML initially. Some of the outgoing headers have the line "Accept-Encoding: gzip,deflate". Does this mean that the data is zipped, and that's why I can't see it? There are also some incoming chunks that contain what look like parts of the HTML headers that I have captured using Perl to get these login pages - but only parts of the headers; the rest of those chunks just look like garbage. I've also navigated (once logged in) to certain pages whose HTML I know exactly, since the program I developed to monitor the site uses them. I can't see anything that looks like that HTML. Just garbage.

Is it possible that some kind of encoding is turned on that oSpy can't handle? I'd appreciate any help that you could provide...
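On the gzip question: "Accept-Encoding: gzip,deflate" does tell the server it may compress the response body, which would explain why a raw capture looks like garbage. On the LWP side, decoded_content() undoes such encodings. A self-contained sketch with a hand-built gzipped response (no network involved):

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Compress some HTML by hand to mimic what the server sends on the wire.
my $plain = '<html>hello</html>';
my $squeezed;
gzip(\$plain, \$squeezed) or die "gzip failed: $GzipError";

my $response = HTTP::Response->new(200, 'OK');
$response->header('Content-Type'     => 'text/html');
$response->header('Content-Encoding' => 'gzip');
$response->content($squeezed);

# content() is the compressed bytes; decoded_content() is readable HTML.
print "wire bytes: ", length($response->content), "\n";
print "decoded:    ", $response->decoded_content, "\n";
```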
 
AFAIK oSpy hooks into the Winsock DLL, so it only works on apps that use Winsock. Firefox, being a cross-platform app, probably uses something else. Internet Explorer most likely uses Winsock, so you might want to try that.

Anyway, what I would do to solve this problem is look at the form on the login page, see what the form's action is and all the input fields that get submitted to it, and then just make an LWP::UserAgent request to the form action with all the fields filled in by your code... and see how the server replies, specifically by looking at its headers. Most likely it will include a "Set-Cookie" header along with the Location header; see if the cookies it sends look anything like the ones you get in Firefox. Then follow the Location header, get that page, see what you get and what the headers are, and so on.

Basically just make your app pretend like it's a web browser; instead of worrying too much about what's going on between you and the server, just make your app do what Firefox does: submit the form and see the results the server sends back.
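A sketch of inspecting those headers. The response here is built by hand with values echoing this thread, so the snippet runs offline; in the real script, $response comes back from $browser->post(...):

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for the server's reply to the login POST.
my $response = HTTP::Response->new(302, 'Object moved');
$response->header('Location'   => 'login2.asp?x=x');
$response->header('Set-Cookie' => 'ASPSESSIONID=abc123; path=/');

printf "Status:   %s\n", $response->status_line;
printf "Location: %s\n", $response->header('Location');

# In list context, header() returns every Set-Cookie value.
for my $cookie ($response->header('Set-Cookie')) {
    print "Cookie:   $cookie\n";
}
```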

Kirsle.net | My personal homepage
 
OK. I'll try what you suggested regarding following the Location header and let you know how things go. Looking over the notes I made when I was first trying to get things working, your suggestion makes a lot of sense. I know barely enough to do useful things in Perl, and I'm not too knowledgeable about "acting like a web browser". The book I'm going by didn't say much about the Location header (at least in the parts I was reading).

I did try oSpy with Internet Explorer, but I can't get it to work at all. oSpy just hangs "Waiting for logging agent" when I try to get it to attach to a process. I also tried oSpy on a random purely non-https site with FireFox and that seems to work fine. I can see the HTML exchange that is going on clearly. So I must be doing something wrong with that program...
 
I have now tried what (I think) you suggested, but am still having problems. Here are the details. Let's call the login page 'https://www.thesite.com/login2.asp'. If I get this page using

my $response = $browser->get('https://www.thesite.com/login2.asp');
then two Set-Cookie lines come back in the response header and the HTML from this page contains the following form:

...
<form name="loginform" method="post" action="login.asp?x=x&&pswd=">
<input type=hidden name="location" value="">
<input type=hidden name="qstring" value="">
<input type=hidden name="absr_ID" value="">
<input type=hidden name="foil" value="">
<div style="margin: auto; text-align:center;">
<div id="loginContainer" style="text-align: left;">
...
<table>
...
<td><input type="text" class="loginFormText" ID="txtLoginID" name="id" value=""> </td>
...
<td><input type="password" class="loginFormText" ID="txtPassword" name="pin"> </td>
...
<input type="image" src="images/button_login.png" ID="btnLogin" alt="Login" name="submit" />
...
</table>
...
<!--endLoginContainer-->
...
</form>

In the above, '...' indicates omitted text. Based on this, I modified my Perl script to (1) get this page as above, and (2) send a POST command of the following form:

$response = $browser->post('https://www.thesite.com/login.asp', ['id' => 'myid', 'pin' => 'mypin', location => '', qstring => '', absr_ID => '', foil => '']);

Before running this I deleted the cookies.lwp file so as to start with an empty cookie jar (cookies are set when the initial page is gotten, as indicated above). The response header for the POST contains

Status line is 302 Object moved
Location: login2.asp?x=x&&pswd=

There are no cookies in this header. The content is the following short piece of HTML

<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="login2.asp?x=x&amp;&amp;pswd=">here</a>.</body>

If I now add

$response = $browser->get('https://www.thesite.com/login2.asp?x=x&&pswd=');
to the Perl script, I get a 200 OK response header with no cookies or Location: lines, and the content is exactly the same as in the initial GET of this page. So I've gone around in a loop between the login2.asp and login.asp pages! Tell me what I'm doing wrong here. From the syntax '?x=x&&pswd=' in the action line, it looks like I might be supposed to add something like '?id=myid&&pswd=mypin' to the URL, but that would be treating it like a GET instead of a POST. Note that, as I mentioned at the start of this whole thread, there is also JavaScript on the login2.asp page. Could this be doing something to the POST request somehow? Perhaps you can see why I wanted to get something like oSpy working. Any further suggestions would be much appreciated.
 
You get cookies from the server at first, but then you clear the cookie jar before submitting the form?

I would guess that the cookies you get the first time might be session cookies that correspond to a session file kept on the server, and the browser would be expected to keep that cookie and send it on all its subsequent requests. So, clearing the cookie jar wouldn't be a good idea if this is the case; the server might accept your login info and be all set to mark you "logged in" on its side, but if your client lost its cookies it isn't logged in anymore... it needs to send the cookies back to the server with each request so the server can identify your client as being the one who just logged in.

Kirsle.net | My personal homepage
 
No. You misunderstood me. I ran three tests. Before each test I cleared the cookie jar so that the "starting state" of the test would be the same. I did not clear the cookie jar in the middle of a test. The tests were as follows:

(1) Get the login2.asp page:

my $response = $browser->get('https://www.thesite.com/login2.asp');

(2) Get the login2.asp page, then submit the POST request:

my $response = $browser->get('https://www.thesite.com/login2.asp');
$response = $browser->post('https://www.thesite.com/login.asp', ['id' => 'myid', 'pin' => 'mypin', location => '', qstring => '', absr_ID => '', foil => '']);

(3) Get the login2.asp page, submit the POST request, and then get the page that the Location line in the header points to:

my $response = $browser->get('https://www.thesite.com/login2.asp');
$response = $browser->post('https://www.thesite.com/login.asp', ['id' => 'myid', 'pin' => 'mypin', location => '', qstring => '', absr_ID => '', foil => '']);
$response = $browser->get('https://www.thesite.com/login2.asp?x=x&&pswd=');

So you see that even though I cleared out the cookie jar before each test, I always "fill the jar again" at the beginning of each test by getting the login2.asp page. I just did the testing this way so that I could see what was going on at each step of the process, given the same starting point. I believe that the third test does exactly what you were suggesting, if I understood you correctly, but as I said above, it seems to take me in a circle. Can you see if I am missing something?

Thanks for your help in advance...
 
Try using a request sniffing add-on for Firefox.
Live HTTP Headers or one of the other add-ons here should work pretty well. I've used one in the past which logs request/response headers as well as bodies (so you see the sources of pages sent by the server, and form data posted to it, etc.)

Then compare what Firefox is doing with the server to what your script is doing and see if your script is doing something different.

Kirsle.net | My personal homepage
 
Success! After looking over the available add-ons I downloaded HttpFox. Very helpful. The main problem ended up being the sort of thing that usually trips me up - a stupid syntax error. I had left out the quotes around the names of the hidden input fields in the POST. The corrected code looks like:

my $response = $browser->get('https://www.thesite.com/login2.asp');
$response = $browser->post('https://www.thesite.com/login.asp', ['id' => 'myid', 'pin' => 'mypin', 'location' => '', 'qstring' => '', 'absr_ID' => '', 'foil' => '']);
$response = $browser->get('https://www.thesite.com/homepage.asp?x=x&setcookie=true&');

HttpFox didn't actually show me all of the details of what was going on, but it helped me focus on the exact step where my problem was. I'm still not sure that I fully understand what actually goes on in the "normal" login process. What one sees in HttpFox when the login button is clicked is a single POST, with the appropriate arguments and cookies, whose Result is listed as "(Aborted)". However, it's clear that something is happening, because HttpFox's details on the response header indicate "200 OK" and the content is HTML for a page that looks similar to the initial login page (it has the usual form for logging in) but is no longer identical. Included in this HTML is some code that begins

<meta http-equiv="Refresh" content="0; url=https://www.thesite.com/homepage.asp?x=x&setcookie=true&"><script>location.replace('...')</script> ... <font size=2 face=verdana><b>You have successfully logged in, click <a href="...">here</a> to continue.</b></font>

But in spite of what the above looks like, one never actually sees a message on the screen saying "You have successfully logged in, click here to continue". What HttpFox shows after the POST is a bunch of GETs for JavaScript and CSS files referenced on the new page, followed by a GET for 'https://www.thesite.com/homepage.asp?x=x&setcookie=true&', which is the user's home page on the site. So apparently there is something in the HTML that directs the browser to jump to the home page without any additional response required from the user.
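For what it's worth, that jump is something an LWP script has to do by hand, since LWP neither executes JavaScript nor honors meta refreshes. A sketch that pulls the target URL out of the returned HTML (the HTML string here is a stand-in built from the fragment quoted above):

```perl
use strict;
use warnings;

# Stand-in for the page the server returns after a successful login.
my $html = '<meta http-equiv="Refresh" content="0; url=https://www.thesite.com/homepage.asp?x=x&setcookie=true&">';

# Extract the url= target from the meta Refresh tag.
my $next;
if ($html =~ /http-equiv="Refresh"\s+content="\d+;\s*url=([^"]+)"/i) {
    $next = $1;
    print "Follow-up GET should go to: $next\n";
    # $response = $browser->get($next);   # fetch it with the same cookie jar
}
```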

In Perl, with the proper quotes around the hidden input field names, the header of the POST response no longer comes back as 302 Object Moved but as 200 OK, so there is no more redirection going on.

At any rate, the Perl code listed above works now. Thanks for all your advice. Much appreciated. I know enough about programming to get things done, but I don't do this sort of work very often, so what I really need is a pointer in the right direction from an expert every now and then. Just what you gave me...

One question on a different subject: does Perl have any kind of interprocess communication capability? What I want to do is start a Perl program that runs continuously, and then be able to start some other program to tell the first one when to stop, or to give it updated information related to what it is doing on the web. One thing that works is to update a file with the necessary info and have the Perl program regularly check whether the file has been updated, but that seems a bit hokey. Is there some way I could implement this between two different Perl processes, or between Perl and some other language? Even better, could the second process have a graphical user interface, so that the user could click on things that would effectively issue commands to the running Perl task?
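Perl does have several IPC options in the standard toolbox: anonymous pipes, named pipes (POSIX::mkfifo), sockets (IO::Socket), signals, and shared files; a Tk (or similar) GUI front end could send commands over any of them. A minimal sketch using an anonymous pipe between a parent and a forked child; two unrelated processes would use a named pipe or a socket instead:

```perl
use strict;
use warnings;

# Anonymous pipe: one read end, one write end.
pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                  # child: plays the "controller" role
    close $reader;
    print {$writer} "STOP\n";     # send a command to the monitor
    close $writer;
    exit 0;
}

close $writer;                    # parent: the long-running monitor
my $command = <$reader>;          # would normally sit in a polling loop
chomp $command;
print "monitor received command: $command\n";
close $reader;
waitpid($pid, 0);
```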
 