Track Web Page Changes

Status
Not open for further replies.

Dronealone

IS-IT--Management
Mar 13, 2002
64
GB
Hi,

Can anybody recommend a tool, or give me some guidance on how to track changes to a remote page?

There is content relating to my company on other websites, and I want to know when it changes or is dropped from the site altogether.

Thanks!
 
You could build a tool in PHP that checks the other URLs and compares the content to a cached local copy.
If you come back with more specific questions about implementing this in PHP, we can follow up.

 
Hi,

thanks for that, this is what I've thought up so far:

1) A user will select some text from the page they want to monitor and paste this into a textbox which will upload this into the database.

2) Whenever the check occurs, the remote file will be opened and read into a variable, using fopen() and fread(). I would then use strip_tags() to remove all the HTML markup from the page (this is necessary because the user will have selected the text from the browser, so the stored text will have no markup).

3) I will then use strpos() to look for the user-supplied text within the retrieved page content.

4) If it is not found I will conclude the page has changed and flag it as such.
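
Steps 2) to 4) could be sketched roughly as below. This is only a minimal illustration under stated assumptions: the function name snippet_present() and the sample HTML are made up for the example, and the page is passed in as a string so the fetching step is left out.

```php
<?php
// Hypothetical helper (name invented for this sketch): returns true when
// the stored snippet still appears in the text of the retrieved page.
function snippet_present(string $html, string $snippet): bool
{
    // Drop the markup, because the stored snippet was copied from rendered text.
    $text = strip_tags($html);

    // strpos() returns false when the needle is absent. Position 0 is a valid
    // match, so the comparison has to be strict (!== false).
    return strpos($text, $snippet) !== false;
}

// Made-up sample page for illustration.
$html = '<html><body><p>Acme Ltd supplies <b>widgets</b> worldwide.</p></body></html>';

var_dump(snippet_present($html, 'Acme Ltd supplies widgets worldwide.')); // bool(true)
var_dump(snippet_present($html, 'Acme Ltd has closed.'));                 // bool(false)
```

Note that this plain comparison is exact, so any difference in spacing or character encoding between the stored text and the page counts as "not found".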

Does this solution make sense / will it work in all cases? Any suggestions?

Thanks very much.

Then

 
Some thoughts and questions:

ad 1) The copy and paste will work fine if you are not concerned about any links etc that might be in there.

ad 2) I would use fsockopen() rather than fopen(), since URL wrappers might be disabled on your next host.
Removing the markup is OK if you only want to check the text. You will also have to handle whitespace.

Does the solution make sense?
If that's what you need to do, yes.

Will it work in all cases?
No. I can see some cases where it will not work:
1. The copied/pasted text has been interpreted by a browser. That means if there are any HTML entities in the source code there will be a difference.
2. When a URL can't be contacted (host down etc.), will that flag the page as changed?
 
Yeah, I'm not concerned about the formatting of the text on the site or any links contained within the text, basically just whether it is still there.

Point taken about fsockopen(), that's a better solution.

In response to your points:

1) When will the text be interpreted? I thought this would only occur when pasting into a more 'intelligent' bit of software e.g. Dreamweaver. I thought if just pasting into a web page text box, you would just get the text as is laid out on screen.

2) Another good point. I will check whether the fsockopen() connection works and receives data; if not, I will flag the page as unverified.
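
A rough sketch of how the "unverified" case might be kept apart from "changed" is shown below. The names fetch_page() and check_status() and the three status strings are assumptions made for this example. The fetch uses plain HTTP/1.0 on port 80 only, with no HTTPS or redirect handling.

```php
<?php
// Hypothetical helper: fetch a page body over plain HTTP with fsockopen().
// Returns null when the host cannot be reached or does not answer with 200.
function fetch_page(string $host, string $path = '/', int $timeout = 10): ?string
{
    $fp = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null; // connection failed: host down, DNS error, timeout
    }
    stream_set_timeout($fp, $timeout);

    // HTTP/1.0 keeps the response simple (no chunked transfer encoding).
    fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $response = stream_get_contents($fp);
    fclose($fp);

    // Split the headers from the body and require a 200 status line.
    $parts = explode("\r\n\r\n", (string) $response, 2);
    if (count($parts) < 2 || !preg_match('#^HTTP/\d\.\d 200\b#', $parts[0])) {
        return null;
    }
    return $parts[1];
}

// Hypothetical helper: map a fetch result to one of three made-up statuses.
function check_status(?string $html, string $snippet): string
{
    if ($html === null || trim($html) === '') {
        return 'unverified'; // nothing usable came back, so no conclusion
    }
    return strpos(strip_tags($html), $snippet) !== false ? 'present' : 'changed';
}

// Usage without touching the network:
echo check_status(null, 'Acme Ltd'), "\n";                     // unverified
echo check_status('<p>Acme Ltd rocks</p>', 'Acme Ltd'), "\n";  // present
echo check_status('<p>Other content</p>', 'Acme Ltd'), "\n";   // changed

// With a real host it would be something like (placeholder host):
// echo check_status(fetch_page('www.example.com', '/'), 'Acme Ltd');
```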

Cheers
 
By the time you see the text in the browser and copy it, it has already been interpreted.
Example
The source code says:
Code:
The company is worth &#36;5,000,000
It shows as:
The company is worth $5,000,000

Whitespace is also an issue. A single space on the screen can be 50 spaces and a newline in the source.
When you retrieve the page you will have the source, not the interpreted text that the user pasted into the db.
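
To make the mismatch concrete, here is a small made-up case. The strings are invented for illustration; the point is only that a plain search of the stripped source does not find what the user copied.

```php
<?php
// What the server sends: an entity, a line break and indentation.
$source = "<p>Johnson &amp; Sons\n        was founded in 1950.</p>";

// What a user would copy from the rendered page.
$pasted = 'Johnson & Sons was founded in 1950.';

// After strip_tags() the entity and the extra whitespace are still there.
$text  = strip_tags($source);
$found = strpos($text, $pasted) !== false;

var_dump($found); // bool(false)
```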

Ok?
 
I'm with you. To get round the interpreted code, after I have stripped the HTML tags, can I not just run html_entity_decode() on the result and compare that against the pasted text?

Cheers
 
I would recommend writing a function that:
1. converts all HTML entities back to characters (html_entity_decode())
2. collapses each run of consecutive whitespace into a single space

The function would be applied to the text that the user pastes before it's put into the table.
It will also be applied after strip_tags() to the retrieved source code.

That should cover the majority of cases.
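
Such a function might look something like the sketch below. The names normalize_text() and snippet_matches() are invented for this example, and UTF-8 is assumed as the character set. Non-breaking spaces produced by &nbsp; are one case it does not try to handle and may need extra treatment.

```php
<?php
// Hypothetical helper: bring pasted text and retrieved text into the same form.
function normalize_text(string $text): string
{
    // 1. Turn entities such as &amp; or &#36; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Collapse every run of spaces, tabs and newlines into one space.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

// Hypothetical helper: compare after normalizing both sides.
// strip_tags() runs before decoding, so a decoded "&lt;" is not taken for markup.
function snippet_matches(string $html, string $pasted): bool
{
    $haystack = normalize_text(strip_tags($html));
    $needle   = normalize_text($pasted);
    return strpos($haystack, $needle) !== false;
}

// Made-up sample data for illustration.
$source = "<p>Johnson &amp; Sons\n        was founded in 1950.</p>";
$pasted = "Johnson & Sons  was founded in 1950. ";

var_dump(snippet_matches($source, $pasted)); // bool(true)
```

In practice normalize_text() would be applied once to the pasted text before it is stored, and again to the retrieved page on every check.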
 