Perl - Google Reader


pdupreez (Technical User)
I use Google Reader (an RSS aggregator) a lot to capture interesting links and store them as "starred" items for later processing. What I am looking for is a way to get all the links in the starred items, create a new HTML page (if necessary), spider through all the links within that page, and extract the links to specific hosts into a text or HTML file.

How can this be done, and is Perl the right tool for the job?
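
(For illustration, the first part of this, pulling the entry links out of a public starred-items feed, only takes a few lines of Perl. This is a minimal sketch, assuming the XML::Feed module is installed (it uses LWP to fetch URLs) and that you pass your public starred-feed URL on the command line; the output file name is just an example.)

#!/usr/bin/perl
# Minimal sketch: collect the entry links from a public Google Reader
# starred-items feed given on the command line.
use strict;
use warnings;
use URI;
use XML::Feed;

my $feed_url = shift @ARGV
    or die "usage: $0 <public starred-items feed URL>\n";

my $feed = XML::Feed->parse( URI->new($feed_url) )
    or die "Could not parse feed: " . XML::Feed->errstr . "\n";

# Append one starred-item URL per line for later processing.
open my $out, '>>', 'GoogleReaderStarredLinks.txt' or die $!;
print {$out} $_->link, "\n" for $feed->entries;
close $out;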
 
This is probably not going to help you very much, but there is already a module written to interface with Google Reader:


You can peek into the source code and see how the author has written the script.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Thanks, Kevin, for responding. That did not give me what I was looking for, but further searching on Google did the trick. I had to mash a number of different scripts together, but I can now efficiently process hundreds of Google Reader starred links and dump the selected links to a text file for further processing. It is in Ruby, and you may be able to convert it to perl for us?

Your Google starred items must be set to Public access. You can find the [bold]xxxxxxxxxxxxxxxxxxxxx[/bold] Google Reader account number in the public web link once you have changed the starred view to public (under Settings -> Folder & Tags in Google Reader).

I start the script with:

[bold]ruby GRS.rb [skip][/bold]

[skip] is optional: use it if you have already parsed the Google Reader links and only want to redo the download-link extraction.

#############################
# START OF SCRIPT
#############################


require 'net/http'
require 'uri'
require 'open-uri'
require 'rubygems'
require 'hpricot'
require 'simple-rss'

STARRED_FILE  = 'GoogleReaderStarredLinks.txt'
DOWNLOAD_FILE = 'DownloadLinks.txt'

# The online file storage hosts I have accounts with; links to these are collected.
WANTED_HOSTS = ['abc.com/files', 'pqr.com', 'xyz.com']

# Parse the public Google Reader starred-items feed with SimpleRSS and append
# every entry link to a text file for further processing.
def parse_google_reader
  feed = "[bold]xxxxxxxxxxxxxxxxxxxxx[/bold]/state/com.google/starred?n=500"
  rss  = SimpleRSS.parse open(feed)
  open(STARRED_FILE, 'a') do |f|
    rss.entries.each { |item| f.puts item.link }
  end
end

# Passing "skip" on the command line reuses the previously saved starred links
# (if the file exists) instead of parsing Google Reader again.
if ARGV[0] == 'skip' && File.exist?(STARRED_FILE)
  puts 'Skipping Google Reader parsing'
else
  parse_google_reader
end

# Push all the URLs in the file into an array. This could have been done directly,
# but keeping the file allows a re-run without parsing Google Reader again, which
# takes time and bandwidth.
urls = File.readlines(STARRED_FILE).map { |line| line.chomp }

# Loop through each of the starred-item URLs.
urls.each do |url|
  puts 'Google Reader Link : ' + url

  # Open the URL and check for errors (timeouts and HTTP).
  # If any occur, skip to the next URL.
  begin
    url_object = open(url)
  rescue Timeout::Error
    puts "The request for a page at #{url} timed out...skipping."
    next
  rescue OpenURI::HTTPError
    puts "The request for a page at #{url} returned an error...skipping."
    next
  end
  next if url_object.nil?

  # Parse the page with Hpricot, i.e. read the linked webpage into doc,
  # which holds the equivalent of the webpage source code.
  doc = Hpricot(url_object)

  # Look at every link (href) in the page. If it points to one of the file
  # storage hosts above, append it to the download list for further processing.
  doc.search('a[@href]').each do |x|
    new_url = x['href'].split('#')[0]
    next if new_url.nil?

    if WANTED_HOSTS.any? { |host| new_url.include?(host) }
      open(DOWNLOAD_FILE, 'a') { |f| f.puts new_url }
      puts ' Download link : ' + new_url
    end
  end
end

#############################
# END OF SCRIPT
#############################
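
(For reference, not a full conversion: the link-extraction half of the script above could be sketched in Perl roughly as follows, assuming LWP::UserAgent and HTML::LinkExtor are installed. The host names are the same placeholders used in the Ruby version, and the file names match the ones above.)

#!/usr/bin/perl
# Rough sketch of the second half of the Ruby script: read the saved
# starred-item URLs, fetch each page, and append any href that points
# at one of the wanted hosts to DownloadLinks.txt.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my @wanted_hosts = ( 'abc.com/files', 'pqr.com', 'xyz.com' );   # placeholder hosts

open my $in,  '<',  'GoogleReaderStarredLinks.txt' or die $!;
open my $out, '>>', 'DownloadLinks.txt'            or die $!;

my $ua = LWP::UserAgent->new( timeout => 15 );

while ( my $url = <$in> ) {
    chomp $url;
    print "Google Reader Link : $url\n";

    # Fetch the page; on any failure just move on to the next URL.
    my $res = $ua->get($url);
    unless ( $res->is_success ) {
        print "The request for a page at $url failed...skipping.\n";
        next;
    }

    # Collect every link found in the page (href values as written).
    my $extractor = HTML::LinkExtor->new;
    $extractor->parse( $res->decoded_content );
    $extractor->eof;

    for my $link ( $extractor->links ) {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' and defined $attr{href};

        my ($href) = split /#/, $attr{href};   # drop any #fragment
        next unless defined $href;

        # Keep only links that point at one of the wanted hosts.
        if ( grep { index( $href, $_ ) >= 0 } @wanted_hosts ) {
            print {$out} "$href\n";
            print "  Download link : $href\n";
        }
    }
}

close $in;
close $out;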
 
It is in Ruby, and you may be able to convert it to perl for us?

Sorry, I can't. Maybe someone else can.

Kevin



------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Just shopping around. But no, seriously, there is no need to convert; I just know this is a Perl forum, so I thought people might want it in Perl. It works great in Ruby, which on Windows seems to have less overhead (no Cygwin) and integrates better into SciTE in any case.
 
You don't need Cygwin unless you want to use Unix commands on Windows. You can run ActivePerl or Strawberry Perl directly on Windows.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Besides, anyone who can actually understand Ruby well enough to convert it will probably declare that it is a far superior language anyway, so why would you want to?? :)

Annihilannic.
 
Thanks for the input. I have only started using Perl and Ruby in the last few days, and I am learning a lot. You are right, I can run Perl outside of Cygwin, which is nice to know. Time to go fix the path!
 