Stripping HTML Tags

medreda · Nov 25, 2010

thread278-1604666

Hello everybody.
This is really the best programming forum I've ever seen.

I've just taken a look at this topic:

http://www.tek-tips.com/viewthread.cfm?qid=1604666&page=1

I want to strip HTML tags from a web site, so I simply copied the code and changed the URL, but nothing happened.

Could you please retest the code or add some comments to it?

Thanks in advance.

mikrom · Nov 25, 2010

medreda said:
i've just simply copied the code and changed the url, but nothing happened

Maybe you forgot to download the Python module html2text first.

Another issue could be the encoding of your web page. The example code you mentioned uses encoding = 'utf-8'.
You should check the encoding of your web page; otherwise you can get an error like:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 528-530: invalid data
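
To illustrate the encoding point, here is a minimal sketch of decoding downloaded bytes with an explicit encoding. It is written for Python 3 rather than the Python 2.5 used in this thread, and the helper name decode_page is hypothetical, not part of html2text.

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Decode page bytes, raising a clearer error that names the encoding tried."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        raise ValueError(
            "page is not valid %s (byte offset %d); check its charset"
            % (encoding, exc.start)
        ) from exc


# "caf\u00e9" stored as ISO-8859-1 is not valid UTF-8, so the encoding
# passed in has to match the one the page really uses.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
text = decode_page(latin1_bytes, encoding="iso-8859-1")
```

Calling decode_page(latin1_bytes) with the default 'utf-8' fails, which is the same kind of failure as the UnicodeDecodeError quoted above.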

medreda · Nov 25, 2010

Thanks a lot for the quick response.

I have installed and imported html2text properly.
I've also tested the example on a web site that is encoded in UTF-8 (the same web site as in the example), but again nothing happened.

Could you explain more or add some comments to the source code?

Thanks again.

mikrom · Nov 25, 2010

I tested the example given in the thread you mentioned, and other URLs too, and everything worked fine for me.

Have you tried testing the web site on the package home page?
If not, go to

http://www.aaronsw.com/2002/html2text/

type a URL into the URL: input box, and press the Convert button.

medreda · Nov 26, 2010

Thanks a lot, mikrom.

The code worked very well once I changed the IDE, so it was an indentation problem caused by using IDLE.

I'd also like to say that this piece of code works only with web sites that are encoded in UTF-8.

Could you give me a clue about how to make it work with web sites that are not encoded in UTF-8?

Thank you again.

mikrom · Nov 26, 2010

medreda said:
could you give me a clue about how to make it workful with non coded UTF-8 web sites

I already wrote that in my first post. In the example code given at

http://www.tek-tips.com/viewthread.cfm?qid=1604666

you should change the line
encoding = 'utf-8'
to
encoding = 'your_web_page_encoding'
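
If you don't know the page's encoding, the server often declares it in the Content-Type response header. Here is a rough sketch (Python 3) that pulls the charset out of such a header value. The function name charset_from_content_type and the fallback to 'utf-8' are my assumptions, not something from the linked example.

```python
import re


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, lowercased.

    Falls back to `default` when the header carries no charset parameter.
    """
    match = re.search(
        r"charset\s*=\s*[\"']?([\w.:-]+)", header_value or "", re.IGNORECASE
    )
    return match.group(1).lower() if match else default


# The result can then be used in place of the hard-coded 'utf-8'.
encoding = charset_from_content_type("text/html; charset=ISO-8859-1")
```

Some servers send no charset at all, or a wrong one, so treat the result as a hint rather than a guarantee.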

medreda · Nov 28, 2010

Thanks, mikrom.

What I meant was to transform a page that is not encoded in UTF-8 into UTF-8 first, and then apply this great piece of code directly.

I did a little searching and found that this is possible with BeautifulSoup,

so I first add this code to convert the content of the web page:

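from BeautifulSoup import BeautifulSoup, UnicodeDammit

def decode_html(html_string):
    # Let UnicodeDammit detect the encoding and decode the page to unicode.
    converted = UnicodeDammit(html_string, isHTML=True)

    if not converted.unicode:
        # ValueError, because UnicodeDecodeError requires five constructor arguments.
        raise ValueError(
            "Failed to detect encoding, tried [%s]"
            % ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

fin = decode_html("content_of_non_utf_web_page")
soup = BeautifulSoup(fin)  # fin is decoded (unicode) text at this point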

Then I apply your piece of code.

What do you think?
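
For comparison, here is a simplified sketch of the same idea without BeautifulSoup, written for Python 3. It is not UnicodeDammit's detection algorithm: it just tries a fixed list of candidate encodings in order. The function name to_utf8 and the candidate list are assumptions made for illustration.

```python
def to_utf8(raw_bytes, candidates=("utf-8", "windows-1252", "iso-8859-1")):
    """Decode with the first candidate that fits, then re-encode as UTF-8.

    Returns a (utf8_bytes, encoding_used) pair.
    """
    for name in candidates:
        try:
            text = raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return text.encode("utf-8"), name
    raise ValueError(
        "none of the candidate encodings fit: %s" % ", ".join(candidates)
    )


# b"caf\xe9" is not valid UTF-8, so the second candidate is used.
utf8_bytes, used = to_utf8(b"caf\xe9")
```

Note that iso-8859-1 accepts any byte sequence, so with the default list the ValueError is never reached. A wrong guess yields garbled text rather than an error, which is why a real detector such as UnicodeDammit is the safer choice.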

medreda · Nov 28, 2010

I tested the code I told you about many times and it worked every time, so I have added something useful to your code, lol.

I want to ask you one last question:
when I save the result to a text file, it is displayed like this, for example:

[36]: modules.php?name=Top
[37]: modules.php?name=Surveys

Can we strip the number displayed before each string?

Thank you again.

mikrom · Nov 28, 2010

Yes, you can strip that, for example, using string operations:

Code:

>>> my_string = "[36]: modules.php?name=Top"
>>> my_string[my_string.index("]:")+3:]
'modules.php?name=Top'
>>> my_string ="[37]: modules.php?name=Surveys"
>>> my_string[my_string.index("]:")+3:]
'modules.php?name=Surveys'

or using a regular expression:

Code:

>>> import re
>>> my_string1 = "[36]: modules.php?name=Top"
>>> my_string2 ="[37]: modules.php?name=Surveys"
>>> re.sub(r"\[\d+\]:\s*","", my_string1)
'modules.php?name=Top'
>>> re.sub(r"\[\d+\]:\s*","", my_string2)
'modules.php?name=Surveys'
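
To apply the same regular expression to a whole saved text instead of one string at a time, something like the following sketch might work (Python 3). The name strip_reference_numbers is hypothetical, and the pattern is slightly adapted to tolerate leading spaces and to stay within a single line.

```python
import re

# ^ with re.MULTILINE anchors at the start of every line; [ \t]* is used
# instead of \s* so that a match can never swallow a line break.
REF_PREFIX = re.compile(r"^[ \t]*\[\d+\]:[ \t]*", re.MULTILINE)


def strip_reference_numbers(text):
    """Remove a leading "[N]: " marker from every line of text."""
    return REF_PREFIX.sub("", text)


sample = "[36]: modules.php?name=Top\n[37]: modules.php?name=Surveys\n"
cleaned = strip_reference_numbers(sample)
```

Only markers at the start of a line are removed, so a "[36]" that appears in the middle of a sentence is left alone.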

medreda · Nov 29, 2010

I appreciate all the help you have given me; I can get through Python without any scares now.

I have tested this last one, and it was a masterpiece too.

But when I put all the code on another PC and ran it, I got this stupid error:

AttributeError: 'module' object has no attribute 'wrapwrite'

Is it linked to the version of html2text? I don't use the same one as on the first computer.

Thanks a lot.

mikrom · Nov 29, 2010

When you look in the source of the module html2text.py, you will find the function

Code:

def wrapwrite(text): sys.stdout.write(text.encode('utf8'))
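
If the error persists, it may help to check which copy of the module was actually imported and whether it defines wrapwrite. Here is a small diagnostic sketch (Python 3). The helper names describe_module and get_writer are hypothetical, and a stand-in module object is used so the snippet runs without html2text installed.

```python
import sys
import types


def describe_module(module):
    """Report where a module was loaded from and whether it has wrapwrite."""
    return {
        "file": getattr(module, "__file__", None),
        "has_wrapwrite": hasattr(module, "wrapwrite"),
    }


def get_writer(module):
    """Return module.wrapwrite if it exists, else plain sys.stdout.write."""
    return getattr(module, "wrapwrite", sys.stdout.write)


# Stand-in for an html2text copy that lacks wrapwrite (hypothetical).
old_copy = types.ModuleType("html2text")
info = describe_module(old_copy)
writer = get_writer(old_copy)
```

With the real module you would pass html2text itself. If the reported file is not the one you expect, another file with the same name is probably shadowing it on the import path.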

medreda · Nov 29, 2010

Yes, I saw it. I can run it with Python 2.5 but not with Python 2.6.

mikrom · Nov 29, 2010

I'm only using Python 2.5, but I thought that 2.6 should be backward compatible.

medreda · Nov 29, 2010

Thank you.
Every time, you give me a clue.

The problem was with WingIDE (I don't know if you've heard of it); when I simply run the script (with Notepad++),
it works fine.
