Need print out to go to *.txt file instead of screen 1

Status
Not open for further replies.

DrMingle

Technical User
May 24, 2009
116
US
Need help with printing the results to a *.txt file.

Would I need to use the writelines() method?
or
Would I need to use the f.write(string) method?

Any help would be appreciated.
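For reference, a minimal sketch of the two methods asked about, written for Python 3 rather than the Python 2 used in this thread. The sample lines are made up, and io.StringIO only stands in for a file opened with open(name, 'w'). Either method works; neither adds line endings on its own.

```python
import io

# Made-up sample results; a crawler would produce these line by line.
lines = ["http://example.com/", "Example Domain"]

# Stand-in for: out = open("results.txt", "w")
out = io.StringIO()

# write() takes a single string; the newline has to be added by hand.
for line in lines:
    out.write(line + "\n")

# writelines() takes an iterable of strings; it adds no newlines either.
out.writelines(line + "\n" for line in lines)

result = out.getvalue()
```

write() is usually the more natural fit when results are produced one at a time inside a loop, as they are in the crawler.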

Code:
#================================#
#File Name: Crawler.py                 
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
	try:
		crawling = tocrawl.pop()
		print crawling
	except KeyError:
		raise StopIteration
	url = urlparse.urlparse(crawling)
	try:
		response = urllib2.urlopen(crawling)
	except:
		continue
	msg = response.read()
	startPos = msg.find('<title>')
	if startPos != -1:
		endPos = msg.find('</title>', startPos+7)
		if endPos != -1:
			title = msg[startPos+7:endPos]
			print title
	keywordlist = keywordregex.findall(msg)
	if len(keywordlist) > 0:
		keywordlist = keywordlist[0]
		keywordlist = keywordlist.split(", ")
		print keywordlist
	links = linkregex.findall(msg)
	crawled.add(crawling)
	for link in (links.pop(0) for _ in xrange(len(links))):
		if link.startswith('/'):
			link = 'http://' + url[1] + link
		elif link.startswith('#'):
			link = 'http://' + url[1] + url[2] + link
		elif not link.startswith('http'):
			link = 'http://' + url[1] + '/' + link
		if link not in crawled:
			tocrawl.add(link)
 
Hi

I would prefer to be able to choose whether to write to file or the standard output.
Code:
import sys
import getopt

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
  if key=='-o':
    outfile=val

tocrawl=arg

if outfile=='-':
  out=sys.stdout
else:
  out=open(outfile,'w')

out.write("I'm just writing.\n")
out.write("I don't care where.\n")

# your crawling would come here

if outfile!='-':
  out.close()
Sample usage:
Crawler.py # print to standard output
Crawler.py -o - # print to standard output
Crawler.py -o writehere.txt # print to file writehere.txt

Note that I suggest doing some proper option parsing instead of tocrawl = set([sys.argv[1]]). My code is just a sample kept simple.
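As one possible reading of "proper option parsing", here is a minimal sketch using the standard argparse module, written for Python 3 rather than the Python 2 of this thread. The function names parse_args and open_output and the long option --output are assumptions made for illustration; only -o and the '-' convention come from the sample above.

```python
import argparse
import sys

def parse_args(argv):
    """Parse the crawler's command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Crawl the given start URLs.")
    parser.add_argument("-o", "--output", default="-",
                        help="file to write to; '-' means standard output")
    parser.add_argument("urls", nargs="+", help="one or more start URLs")
    return parser.parse_args(argv)

def open_output(name):
    """Return sys.stdout for '-', otherwise a file opened for writing."""
    return sys.stdout if name == "-" else open(name, "w")

# Equivalent of: Crawler.py -o writehere.txt http://example.com/
args = parse_args(["-o", "writehere.txt", "http://example.com/"])
```

With nargs="+", argparse itself reports an error when no URL is given, so the crawl never starts with an empty or wrong start list.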

Note that your regular expressions are a bit naive. You are assuming that
  • tags contain no line breaks
  • tags, attributes and values are written all lowercase
  • attribute values are always surrounded with quotes
  • attribute values do not contain quotes
  • all documents are XHTML
  • the title tag has no attributes
  • the meta tag's first attribute is name and its second is content
  • the meta tag has no other attributes besides name and content
  • the a tag's first attribute is href
Each of these assumptions can be violated even by valid documents. Moreover, the vast majority of HTML documents are invalid, so they offer even more ways for the expressions to fail.

It is better to look for a suitable module to parse HTML.
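As an illustration of what such a module can look like, here is a minimal sketch built on html.parser from the Python 3 standard library (the thread itself uses Python 2). The class name PageInfo and the sample markup are made up for the example; the sample deliberately breaks several of the assumptions listed above (uppercase names, unquoted values, swapped attribute order, a line break inside a tag).

```python
from html.parser import HTMLParser

class PageInfo(HTMLParser):
    """Collects the title, meta keywords and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data

sample = """<HTML><HEAD><TITLE lang="en">Demo</TITLE>
<META content="spider, crawler" NAME=keywords>
</HEAD><BODY><A class=x
 HREF='/next.html'>next</A></BODY></HTML>"""

info = PageInfo()
info.feed(sample)
```

The parser takes care of case, quoting and attribute order, so the crawler only has to decide what to do with info.title, info.keywords and info.links.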


Feherke.
 
Feherke:

I am running into StopIteration, line 45

If I go with this option in the command prompt:
Crawler.py -o writehere.txt
Is the code below what you are suggesting?

Code:
##feherke beginning code
import sys
import getopt

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
  if key=='-o':
    outfile=val

tocrawl=arg

if outfile=='-':
  out=sys.stdout
else:
  out=open(outfile,'w')

out.write("I'm just writing.\n")
out.write("I don't care where.\n")

###My crawl as it was

#================================#
#File Name: Crawler.py                 
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
	try:
		crawling = tocrawl.pop()
		print crawling
	except KeyError:
		raise StopIteration
	url = urlparse.urlparse(crawling)
	try:
		response = urllib2.urlopen(crawling)
	except:
		continue
	msg = response.read()
	startPos = msg.find('<title>')
	if startPos != -1:
		endPos = msg.find('</title>', startPos+7)
		if endPos != -1:
			title = msg[startPos+7:endPos]
			print title
	keywordlist = keywordregex.findall(msg)
	if len(keywordlist) > 0:
		keywordlist = keywordlist[0]
		keywordlist = keywordlist.split(", ")
		print keywordlist
	links = linkregex.findall(msg)
	crawled.add(crawling)
	for link in (links.pop(0) for _ in xrange(len(links))):
		if link.startswith('/'):
			link = 'http://' + url[1] + link
		elif link.startswith('#'):
			link = 'http://' + url[1] + url[2] + link
		elif not link.startswith('http'):
			link = 'http://' + url[1] + '/' + link
		if link not in crawled:
			tocrawl.add(link)
##feherke end code                        
if outfile!='-':
  out.close()
 
Hi

At the end I forgot to mention that my code contains its own assignment to tocrawl. That was necessary because using getopt changed the situation a bit.

Additionally, tocrawl is now a list, not a set. (Why a set anyway?) So I also changed crawled to a list.
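To see why the second assignment matters, here is a small sketch of what getopt returns for a command line like the one in the question, with a made-up start URL appended. It is written for Python 3, where getopt behaves the same way; the argv list simply imitates sys.argv.

```python
import getopt

# Imitates sys.argv for: Crawler.py -o writehere.txt http://example.com/
argv = ["Crawler.py", "-o", "writehere.txt", "http://example.com/"]

# getopt splits the command line into recognised options and the rest.
opts, args = getopt.getopt(argv[1:], "o:")

first_raw = argv[1]                    # "-o": the option flag, not a URL
outfile = dict(opts).get("-o", "-")    # "writehere.txt"
tocrawl = args                         # only the remaining arguments are URLs
```

This is the most likely source of the StopIteration reported above: once -o is used, sys.argv[1] is '-o', so tocrawl = set([sys.argv[1]]) seeds the crawl with '-o'. urlopen() cannot open that, the bare except skips it, and the next pop() on the now empty set raises KeyError, which the script turns into StopIteration.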

This works for me:
Code:
#================================#
#File Name: Crawler.py
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import getopt
import re
import urllib2
import urlparse

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
    if key=='-o':
        outfile=val

tocrawl=arg

if outfile=='-':
    out=sys.stdout
else:
    out=open(outfile,'w')

crawled = []
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while tocrawl:
    # the loop condition guarantees the list is not empty, so pop() cannot fail
    crawling = tocrawl.pop()
    out.write(crawling+"\n")
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            out.write(title+"\n")
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        # keywordlist is a list here, so convert it before appending the newline
        out.write(str(keywordlist)+"\n")
    links = linkregex.findall(msg)
    crawled.append(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        # a list does not drop duplicates the way a set did, so check both lists
        if link not in crawled and link not in tocrawl:
            tocrawl.append(link)

if outfile!='-':
    out.close()
Some more notes for your TODO list:
  • check the protocol so that you do not try to follow ftp://, mailto:, javascript: and similar URLs
  • check for a base tag with an href attribute and use it when composing URLs from link hrefs
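Both TODO items can be sketched with urllib.parse from the Python 3 standard library (urlparse in the Python 2 used above). The function name resolve_link, the FOLLOWED_SCHEMES set and the example URLs are assumptions made for illustration; extracting the base href from the page is left to whatever parser is in use.

```python
from urllib.parse import urljoin, urlparse

# Assumption: only these protocols are worth following.
FOLLOWED_SCHEMES = {"http", "https"}

def resolve_link(page_url, href, base_href=None):
    """Return an absolute URL worth crawling, or None to skip the link."""
    # A <base href="..."> value, if the page has one, replaces the page URL
    # as the reference point for relative links.
    base = urljoin(page_url, base_href) if base_href else page_url
    absolute = urljoin(base, href)
    if urlparse(absolute).scheme not in FOLLOWED_SCHEMES:
        return None
    return absolute

page = "http://example.com/a/b.html"
relative = resolve_link(page, "c.html")
skipped = resolve_link(page, "mailto:me@example.com")
```

urljoin also resolves relative links against the directory of the current page, which the hand-built 'http://' + url[1] + '/' + link concatenation does not.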


Feherke.
 
Excellent, Feherke. I appreciate your knowledge and the fact that you took the time to educate me a bit more...

I hope we cross paths again.

Any references you could post to address the TODO list you created for me would be mighty helpful...

Best of luck!
 
Hi

I spent some time with similar tasks, and my conclusion was that using an existing tool is the best way.

For productivity, I would take a look at twill.

For fun, I would try to use a generic approach :
Code:
import re
import urllib2

# re.S lets '.' match newlines, so multi-line script and style blocks are removed too
scriptre=re.compile(r'<script\b[\w\W]*?>.*?</script\s*>',re.I|re.S)
stylere=re.compile(r'<style\b[\w\W]*?>.*?</style\s*>',re.I|re.S)
tagre=re.compile(r'<(\w+)[\w\W]*?>',re.I)
attrre=re.compile(r'(\w+)=(?:(["\'])(.*?)\2|(\w*))',re.I)

response=urllib2.urlopen('http://tek-tips.com/')
html=response.read()
html=re.sub(stylere,'',re.sub(scriptre,'',html))

for tag in re.finditer(tagre,html):

  if tag.group(1).lower()=='meta':
    name=content=''
    for attr in re.finditer(attrre,tag.group()):
      # group 3 is a quoted value, group 4 an unquoted one; fall back to '' if both are empty
      if attr.group(1).lower()=='name':
        name=attr.group(3) or attr.group(4) or ''
      elif attr.group(1).lower()=='content':
        content=attr.group(3) or attr.group(4) or ''
    if name.lower()=='keywords':
      print 'keywords\t= '+content

  if tag.group(1).lower()=='a':
    for attr in re.finditer(attrre,tag.group()):
      if attr.group(1).lower()=='href':
        href=attr.group(3) or attr.group(4) or ''
        print 'href\t= '+href
The bad part of this approach is that the tags' innerHTML cannot be obtained. Also, the attribute regular expression currently does not match minimized attributes. But for playing around, I would continue this way.
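One way the attribute expression could be extended to cover minimized attributes (such as checked in <input checked>) is sketched below for Python 3. The names tagre and attrre are reused from the code above, but both patterns are changed, and the attributes helper is made up for the example. It is still a regular expression toy and, like the original, gives up on a '>' inside a quoted value.

```python
import re

# Tag name in group 1, everything up to the closing '>' in group 2.
tagre = re.compile(r'<(\w+)([^>]*)>')
# A name, optionally followed by '=' and a quoted (groups 2, 3)
# or unquoted (group 4) value.
attrre = re.compile(r'''([\w-]+)(?:\s*=\s*(?:(["'])(.*?)\2|([^\s"'>]+)))?''', re.S)

def attributes(tag_text):
    """Map lowercased attribute names to values; minimized attributes map to None."""
    tag = tagre.match(tag_text)
    if tag is None:
        return {}
    found = {}
    # Only the part after the tag name is scanned, so the tag name
    # itself is not mistaken for a minimized attribute.
    for attr in attrre.finditer(tag.group(2)):
        quoted = attr.group(2) is not None
        found[attr.group(1).lower()] = attr.group(3) if quoted else attr.group(4)
    return found

attrs = attributes('<input type=checkbox checked disabled="disabled">')
```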

Feherke.
 
Feherke:

Thanks again...wonderful information.
 