Library needs help - Extracting ISBN from web page

Status
Not open for further replies.

sromine

Technical User
Apr 21, 2006
38
US
Hello,

I work for a library and as a side project I am trying to put together a project to make our library catalog more useful by mashing it up with Amazon.com. I have posted this into several forums not knowing which would be the best forum or the best language to use. Thank you to anyone who can get me started on this script. I know just enough programming to modify what others have created. If you need more info, please let me know.

What I am looking for is help with code that takes a URL, isolates the ISBN into a variable, and then appends it to the end of another URL.

Here is a sample URL where the ISBN would need to be extracted.

Here is an example of that ISBN at the end of a URL to access info on Amazon.
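
The "append the ISBN to a URL" step can be sketched in PHP roughly as follows. The function name `amazon_url_for_isbn` is hypothetical, and the `https://www.amazon.com/dp/` URL pattern is an assumption that is not taken from this thread, so check it against a real Amazon link before relying on it.

```php
<?php
// Hypothetical helper. The Amazon URL pattern is an assumption, not from the thread.
function amazon_url_for_isbn(string $isbn): string
{
    // Drop hyphens, spaces, and anything else that is not a digit or X,
    // so "0-307-27690-2" and "0307276902" produce the same link.
    $clean = strtoupper(preg_replace('/[^0-9Xx]/', '', $isbn));

    // rawurlencode() is a no-op for digits and X, but keeps the URL safe
    // if the cleaning rule above is ever loosened.
    return 'https://www.amazon.com/dp/' . rawurlencode($clean);
}

echo amazon_url_for_isbn('0307276902'), "\n";
```

The ISBN is cleaned before it is appended so that hyphenated and unhyphenated forms from the catalog end up at the same address.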
 
Uh, what are you trying to match in the first URL? I don't see any pattern repeating in the second URL to amazon.

D.E.R. Management - IT Project Management Consulting
 
Thanks for your quick response. I appreciate the help.

I see that the link I used would not work properly because it had a session ID. The following should illustrate. If you click on this link, you will see an ISBN. I am trying to isolate this ISBN into a variable and then attach it to the Amazon URL, maybe with some type of include statement. If you have any ideas or know of a better language to accomplish this in, please let me know. No ego here, I'd just be glad to get some type of code I can work with to better serve the patrons.

 
What is the page originally written in? It might be useful if we were able to see the original (server side) source. If it is written in php already then it would be pretty simple to accomplish your task.
 
Regex should help you isolate the ISBN from any old page. check out
Assuming the pages on the site you gave above are all formatted identically (and dynamically generated), the best answer would surely be for you to grab the ISBN from the database that serves the page. But if you must screen scrape, then the following might work for you:

Code:
<?php
// Label that precedes the ISBN in the catalog page's HTML.
$searchtext = "<span class=\"text\">ISBN&nbsp;:</span>";
$webpage = "http://web2.co.douglas.or.us/web2/tramp2.exe/do_keyword_search/guest?setting_key=English&servers=1home&index=default&query=0307276902";
$contents = file_get_contents($webpage);
$pos = strpos($contents, $searchtext);
if ($pos === false) {
    echo "ISBN label not found";
} else {
    // 55 = characters of markup counted between the label and the ISBN on this page.
    echo "ISBN is " . substr($contents, $pos + 55 + strlen($searchtext), 10);
}
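
A fixed offset breaks as soon as the whitespace or markup after the label changes, so one possible variation is to search for the first ISBN-10-looking token after the label instead. This is only a sketch: the function name `extract_isbn10` is hypothetical, and the `$html` sample is made up for illustration, since the real page source is not shown in this thread.

```php
<?php
// Sketch: find the first 10-character ISBN-like token after the "ISBN" label.
function extract_isbn10(string $html): ?string
{
    $label = '<span class="text">ISBN&nbsp;:</span>';
    $pos = strpos($html, $label);
    if ($pos === false) {
        return null; // label not present on this page
    }

    // Remove the tags that follow the label, leaving text and whitespace.
    $tail = strip_tags(substr($html, $pos + strlen($label)));

    // Nine digits followed by a digit or X, standing alone as a token.
    if (preg_match('/\b(\d{9}[\dXx])\b/', $tail, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Made-up fragment standing in for the real catalog page.
$html = "<tr><td><span class=\"text\">ISBN&nbsp;:</span></td>\r\n\t\t"
      . "<td class=\"data\">0307276902</td></tr>";

echo extract_isbn10($html) ?? 'not found', "\n";
```

In real use, `$html` would be the string returned by `file_get_contents($webpage)`.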
 
Thanks everyone for the quick response.

The original catalog is proprietary and written with a combination of JavaScript and HTML. There is some opportunity to tap into the database, but to be quite honest with you, they are not all that forthcoming with the documentation to do so, mainly because they sell their own software, costing thousands of dollars, that fills out the catalog with some of the stuff Amazon.com does.

What I am going to try to do is put together an alternative web page that attempts to mash our catalog with Amazon.com data.

I don't know much about PHP, except at one point I noticed it had a very powerful include statement that allowed you to easily include other web pages. I am under the impression it will even allow you to include full URLs on other servers, rather than just local files on the local web server.

Do you think I should post this message in the JavaScript area to get a response? Do you know if JavaScript is capable of doing something like what I am trying to accomplish?
 
jpadie,
I was going to suggest regex as well but was unsure of the actual comparison to use to pick it out. I am slightly confused about your solution though; maybe you could explain the last line.

the substr pulls out part of the $contents (webpage), then there's an offset, which is the start of the <span class..., plus 55 (?), + length of the <span class..., then read the next 10. I guess I'm just confused about the 55. Is that the HTML between the </span> and the start of the ISBN? I'm counting 50, but mine show up as tabs.

** Quick google search found this
So, the whole pattern reads as:

^ISBN\s(?=[-0-9xX ]{13}$)(?:[0-9]+[- ]){3}[0-9]*[xX0-9]$
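
As a rough illustration, the quoted pattern can be tried from PHP with `preg_match`. The function name `looks_like_isbn_line` and the sample strings are assumptions for the example. Note that the lookahead requires exactly 13 characters after "ISBN ", so the pattern expects a 10-digit ISBN written with three separators and does not match a bare 10-digit number like the one in the catalog URL.

```php
<?php
// The pattern quoted above, wrapped in PHP regex delimiters.
// Single quotes keep the backslash and the dollar signs literal.
$pattern = '/^ISBN\s(?=[-0-9xX ]{13}$)(?:[0-9]+[- ]){3}[0-9]*[xX0-9]$/';

// Hypothetical helper: true when the whole line matches the pattern.
function looks_like_isbn_line(string $line, string $pattern): bool
{
    return preg_match($pattern, $line) === 1;
}

var_dump(looks_like_isbn_line('ISBN 0-307-27690-2', $pattern)); // bool(true)
var_dump(looks_like_isbn_line('ISBN 0307276902', $pattern));    // bool(false): no separators
```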

Importantly, you cannot validate an ISBN using regex alone as the last number is a checksum which is governed by special validation logic - as laid out on the ISBN site. That logic was adequately summarized by James Hart during the course of the discussion on the regex list:


---{James Hart}---
For each of the first 9 digits, multiply the digit by its position in the ISBN - so, multiply digit 1 by 1, digit 2 by 2, digit 3 by 3, and so on to digit 9 * 9. Add the lot of them up. Take the result mod 11. This gives you your last digit - with the caveat that if the result mod 11 is 10, the last digit is X.
from site:
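
The check-digit rule described above could be written in PHP along these lines. This is a sketch for 10-digit ISBNs only, and the function names `isbn10_check_digit` and `is_valid_isbn10` are made up for the example.

```php
<?php
// Check digit as described above: sum of (position * digit) over the
// first nine digits, taken mod 11, with a result of 10 written as "X".
function isbn10_check_digit(string $firstNine): string
{
    $sum = 0;
    for ($i = 0; $i < 9; $i++) {
        $sum += ($i + 1) * (int) $firstNine[$i];
    }
    $remainder = $sum % 11;
    return $remainder === 10 ? 'X' : (string) $remainder;
}

// Hypothetical wrapper: strips separators, checks the shape, then the check digit.
function is_valid_isbn10(string $isbn): bool
{
    $clean = strtoupper(str_replace(['-', ' '], '', $isbn));
    if (!preg_match('/^\d{9}[\dX]$/', $clean)) {
        return false; // wrong length or unexpected characters
    }
    return isbn10_check_digit(substr($clean, 0, 9)) === $clean[9];
}

var_dump(is_valid_isbn10('0307276902')); // bool(true)
var_dump(is_valid_isbn10('0307276903')); // bool(false)
```

For the ISBN in the catalog URL, the weighted sum of the first nine digits is 200, and 200 mod 11 is 2, which matches the final digit.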
 
Same URL as I posted ;-) Good to see that Google is consistent!

I didn't spend a great deal of time on the above, but I counted 55 chars of source code in the HTML. Remember that a tab, in the source code, is two characters "\t".
 
lol... Sorry about that, I didn't even see the link (and I looked like 10 times). Good call on '\t' being 2 chars.
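
For anyone reusing a fixed offset: in the string returned by `file_get_contents`, a real tab is a single byte, and the two-character form only exists where the escape is spelled out literally. A Windows line ending (`\r\n`) is two bytes, which may account for some of the difference between counts. A small check, with a made-up fragment standing in for the real markup between the label and the ISBN:

```php
<?php
// Made-up fragment; the real markup on the catalog page may differ.
$between = "</td>\r\n\t\t<td>";

var_dump(strlen("\t"));   // int(1): an actual tab character
var_dump(strlen('\t'));   // int(2): backslash plus "t", single quotes do not expand it
var_dump(strlen("\r\n")); // int(2): carriage return plus line feed

// Measuring the gap in code avoids counting characters by eye.
echo strlen($between), "\n";
```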
 