Program Randomly Stops

chvol · May 29, 2004

Who knows how to do any of the following (1-3)?

1. “You're going to want to run it not through the web server, but via the command line.”

2. “You can use a combination of PHP with client side code to keep the script running. Set a particular number of pages to be processed, a number lower than when you experience the timeout. As long as there are pages to be processed print a client side script, e.g. JavaScript, to reload the page. This will reload until all pages are done - then don't output the reloader code.”

3. “You are running the script from a browser. The browser receives output from your PHP script, or waits for it. Send output to the browser that triggers the reloading of the page. This makes sense when you invoke the class (whatever Snoopy is) a finite number of times, e.g. 100 URLs at a time. Keep track of the number of processed URLs via a session variable. Once a batch of URLs is processed, output some JavaScript to the browser that reloads the page. How do you "print" a client-side script? Here's an example: print "<script>document.location.reload();</script>"; Alternatively, you can do it in PHP if you make sure there is no output sent before redirecting. You could use the header() function to redirect the script to itself: header("Location: ".$_SERVER['PHP_SELF']);”
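For readers wondering what answers 2 and 3 might look like in practice, here is a minimal sketch, not a definitive implementation. It assumes PHP 7.1 or later. The names BATCH_SIZE, process_batch(), reloader_html(), the 'offset' session key, and the file name searchvalues.txt are all hypothetical and do not come from the answers. The handler passed to process_batch() is a placeholder for whatever work is done per search value.

```php
<?php
// Sketch of the "process a batch, then reload" idea from answers 2 and 3.
// All names below are hypothetical; adapt them to the real script.

const BATCH_SIZE = 100; // assumed to be below the count at which the script stalls

/**
 * Run $handler on up to $batchSize items starting at $offset.
 * Returns the offset to resume from on the next request.
 */
function process_batch(array $items, int $offset, int $batchSize, callable $handler): int
{
    $end = min($offset + $batchSize, count($items));
    for ($i = $offset; $i < $end; $i++) {
        $handler($items[$i]);
    }
    return $end;
}

/** Return the reloader snippet while work remains, or an empty string when done. */
function reloader_html(int $offset, int $total): string
{
    return $offset < $total
        ? '<script>document.location.reload();</script>'
        : '';
}

// Web entry point: skipped on the command line, where there is no browser to reload.
if (PHP_SAPI !== 'cli') {
    session_start();

    $path  = 'searchvalues.txt'; // hypothetical input file, one value per line
    $items = is_readable($path)
        ? file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : [];

    // Resume where the previous request stopped.
    $offset = $_SESSION['offset'] ?? 0;

    $offset = process_batch($items, $offset, BATCH_SIZE, function (string $value): void {
        // Placeholder: fetch and handle one search value here.
    });
    $_SESSION['offset'] = $offset;

    echo "Processed $offset of " . count($items) . " values.";
    // Emits the reload script only while values remain.
    echo reloader_html($offset, count($items));
}
```

Each request handles one batch, stores its position in the session, and prints the reload script only while values remain, so the page stops reloading by itself once the list is exhausted. The header("Location: ...") variant from answer 3 could replace the script tag, but only if nothing (including the progress line above) is echoed before it.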

Explanation:

I have written a PHP program to retrieve the HTML of the first page of search results from Google for a large file of search values. My program uses the Snoopy class (http://sourceforge.net/projects/snoopy) to fetch the HTML for each URL. After about 100-300 calls to Snoopy, which takes about 5-15 minutes, it stops.

1. I start it using Internet Explorer.
2. There is no error message.
3. The program has no reference to set_time_limit.
4. max_execution_time = 0 in php.ini.
5. A program that simply executes an endless for loop, periodically printing out the current time, runs indefinitely. I stopped it manually after 2 hours.
6. I am running it on my single PC, on Windows 98, with the Apache web server from PHPTriad (http://sourceforge.net/projects/phptriad).
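One way to rule out points 2 to 4 is to have the script report the limits PHP is actually using at run time, rather than relying on what php.ini appears to say. The following is only a sketch under assumptions: it assumes PHP 7 or later, and describe_limits() is a hypothetical helper name. The functions set_time_limit(), ignore_user_abort(), error_reporting(), ini_set() and ini_get() are standard PHP.

```php
<?php
// Sketch: lift the usual limits for a long run, then print the effective values.
// describe_limits() is a hypothetical helper, not part of PHP or Snoopy.

/** Collect the effective values of settings that can silently end a long script. */
function describe_limits(): array
{
    return [
        'max_execution_time'     => ini_get('max_execution_time'),
        'default_socket_timeout' => ini_get('default_socket_timeout'),
        'ignore_user_abort'      => ini_get('ignore_user_abort'),
    ];
}

set_time_limit(0);              // no execution time limit for this request
ignore_user_abort(true);        // keep running if the browser disconnects
error_reporting(E_ALL);         // report every error level
ini_set('display_errors', '1'); // show errors in the output instead of hiding them

foreach (describe_limits() as $name => $value) {
    echo $name, ' = ', var_export($value, true), PHP_EOL;
}
```

If the printed max_execution_time is not 0 before the call to set_time_limit(0), the php.ini setting is probably not being read as expected. This does not by itself explain a stop after several minutes; it only narrows down where to look.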

I received the above three answers, but I don’t know how to do those things. I program only in PHP (and HTML). I can start up the Apache web server, write PHP programs using a file editor, and start them using Internet Explorer. I couldn’t get any further explanation.

Charlie chvol@aol.com

jimoblak · May 29, 2004

If you want Google data, contact Google corporate offices. While Google offers links to freely available content on the internet, the information provided by Google and the method that Google uses to present these links is Google's own valuable asset, protected by copyright law. You are apparently violating copyright with your program. It would not be surprising if Google has created physical safeguards on its system from leeches. This may be why your program is stopping.

Based on thread434-775917, it seems that you are not really creating a web spider. Google is 'Google' because they created a real web spider. You appear to be creating a leeching program by stealing data from other web spiders. This is not how legitimate spiders work and you should really contact an attorney before you go any further with this program.

I suggest you contact domain registrars to seek domains to crawl. You are violating copyright law by leeching Google.

- - picklefish - -
Why is everyone in this forum responding to me as picklefish?

chvol · May 29, 2004

Wrong. You are making some very bad assumptions. The other thread is a separate issue. As it says, I was asking about writing my own search engine, not using google's.

The current post has to do with a program that checks how many finds that Google reports, in order to sort a list by how often it occurs on the internet. I merely draw conclusions. I don't display anything from google. (I can get the same results from any search engine, of course.)

jimoblak · May 29, 2004

The assumption that you are leeching copyrighted content from Google is not wrong. You admitted to doing this. Your application (whether it is being discussed in this thread or another) is leeching copyrighted data from Google, even if you allege to do additional processing (derivative work) on this data.

This is still where you should directly contact Google and/or consult an attorney as you are treading on feeble legal ground. Your program, based on what you have stated in this thread, is apparently leeching Google's proprietary ranking data. You did not personally determine what site is ranked higher or appears on Google's first page. This is what can cause you legal trouble.

Even if you do not see anything legally wrong with what you are doing, Google's servers might: and this may be why your program dies. Servers can be programmed to deny leechers, even obstinate ones that do not understand help offered by the Tek-tips community.

If...

"I don't display anything from google."

...then stop using Google. Use another search engine that does not copyright its results and does not have the intelligent servers that block leechers. Review http://dmoz.org/license.html for a good alternative.

Just because snoopy is open source and hosted on sourceforge.net does not mean that everything that you do with it is legal. This is why you really need to speak with an attorney before you go further with this. I am confused why you are not asking your questions on the snoopy forum at sourceforge.net.

Tek-tips users are not here to aid in illegal (or legally questionable) activity.


chvol · May 29, 2004

"You did not personally determine what site is ranked higher or appears on Google's first page. This is what can cause you legal trouble."

I am glad to hear that, because I don't look at the rank of the URLs returned, only the single number on the page that is the total number of finds, to sort some phrases by frequency of use in English.

But thanks for the link anyway.

jimoblak · May 29, 2004

Yes, you are making use of the rank of URLs returned by Google. This is why you are searching only the first page of Google's results:

"retrieve the HTML of the first page of search results from Google"

I do not work for Google and I do not own stock in their company. I'm simply offering this information to help others in this forum realize that what you are doing is not entirely legal, no matter how you try to justify it. Taking content from any other site in a manner not intended by the site owner is infringement of their copyright. You can very simply contact Google's offices and ask if you may obtain certain data from them. You may ask them if they prevent leeching through some mechanism. I trust that the reason your program is not working is because of the technological safeguards Google put in place for leeching and hope that dmoz may be more inviting to your snoopy application.


chvol · May 30, 2004

"Yes, you are making use of the rank of URLs returned by Google. This is why you are searching only the first page of Google's results."

That's not true. If I were concerned with the URLs and their rank, I WOULDN'T stop at the first page. It is the opposite.

I don't look at the URLs, much less their order. It doesn't matter what their order is. I look at the total number of finds (NOT any URLs or THEIR order) to decide the order in which to store the information that I already have in my own database. What I'm doing has nothing to do with the URLs or their order. Do you understand what I am doing (as I described)?

I am not allowed to copy their information or store it anywhere. And I don't do that. I don't display their information anywhere. This doesn't violate their Terms of Service.

You should be careful to understand what is going on before you start throwing around charges against someone. Making false accusations against someone is also not legal.

Moonspell · May 30, 2004

wow, 2 threads on the same subject on the same forum. cool, 2 responses on the same subject on the same forum. chvol give up and do something better like learn javascript

jimoblak · May 30, 2004

If...

"I am not allowed to copy their information or store it anywhere. And I don't do that"

...then why are you accessing Google? Leeching content from other sites is exactly what snoopy does. Why are you being so obtuse on this matter?

It is true that I have little idea what you are doing with Google's data and 'ranking' but the fact remains that you are taking data from Google. You have stated that you are leeching with snoopy over and over again.

Snoopy does not come with documentation or any legal warnings so I can understand why you do not realize why what you are doing is wrong. Allow me to give you a legal explanation of your issue: If you use snoopy to access a web site and then display it for your own use, there should be no issue. This would be the same as viewing the web site in a regular web browser. However, if you use snoopy to grab a portion of a web page using regular expressions or some other means to edit and extract certain data, then you begin to walk a very fine line with 'fair use'. This gets even more questionable when you create a program that bombards Google with more requests than what the average Google user would make. You are not viewing their data in the format that they approved for the general public. You are not viewing the advertisements that fund Google's services. You are most likely violating their copyright. I am pretty sure that you did not really read their TOS because any use of a Google product requires written permission (http://www.google.com/permissions/index.html).

You have been given a wealth of options on this matter. Even if you want to be ignorant about copyright law, you should have at least taken the advice that Google may have set up physical safeguards and that is why your program dies. Try another web site to leech from.

I am not throwing around charges. I have provided the tek-tip that you should consult an attorney for what you are doing because it is most definitely legally questionable. Making these warnings is not making a false allegation against you. Calling you obtuse is not a false allegation either. You have demonstrated this in your posts.


chvol · May 30, 2004

"you are making use of the rank of URLs returned by Google" is a lie. This is expressly prohibited by their TOS and, as I said, I don't make use of that.

"You are most likely violating their copyright."

No, because I am not publishing the data. One is allowed to read copyrighted material and learn from it. You just can't write it down (store it) and give it to someone else. And that is what I do and don't do.

"I am pretty sure that you did not really read their TOS."

That's 3 out of 3.

I tried 40 other search engines. A number have no copyrights and no TOS, and my program works fine with them. Apparently it was Google. One of the alternatives was written by a guy in his 20's who believes in new forms of Democracy and apparently does not believe in keeping a bunch of lawyers well-fed.

Sounds like a good guy to me - live and let live.

Adios,

C.

jimoblak · May 30, 2004

"3 out of 3"

What? Are you silly enough to think this is some sort of a game?

I am not willfully presenting false information (aka 'lie'). My statements may be in error based on what little information you have provided. The only truth that has been revealed from this thread is that you are swiping data and Google does not like it. This is why Google has physical safeguards from leeches.

"Apparently it was Google."

Why did you waste so much time trying to excuse your abuse of copyright when the comment about Google's physical safeguards appeared in my first post? If this is how you reason, it is no wonder logic does not work in your programming.

A guy in his 20's with different ideas is not going to come to the defense of another and provide a reason why you should not be civilly fined for copyright infringement. A court will not care who this guy is. The copyright information was not provided to feed lawyers. It was presented so that you and anyone equally foolish enough to steal content does not lose money when you become a defendant in court. If you keep your code clean, you won't ever need a lawyer. Even if you are just some high school kid doing a project in your own time for your own benefit, I felt obligated to legally question your project since others learn from posts on Tek-Tips.

Do you realize how easy it would have been to solve your problem in any of your threads if you posted all of your code? You would probably not do this because you have a very skewed idea about copyright. You believe it is okay to steal data from others but you apparently would never share your own data.


jimoblak · May 30, 2004

Whoops! My sincerest apologies... Had I thoroughly appreciated your comment:

"And that is what I do and don't do."

...I would not have tried to reason with you earlier. Now I feel so foolish for carrying on this conversation. I'll stop now.

