
Stripping HTML Tags

Status
Not open for further replies.

medreda

Programmer
Nov 25, 2010
10
FR
thread278-1604666

Hello everybody.
This is really the best programming forum I've ever seen.

I've just taken a look at this topic:

I want to strip HTML tags from a web site, so I simply copied the code and changed the URL, but nothing happened.

Could you please retest the code or add some comments to it?

Thanks in advance
 
medreda said:
I simply copied the code and changed the URL, but nothing happened
Maybe you forgot to download the Python module html2text first.

Another issue could be the encoding of your web page. The example code you mentioned uses encoding = 'utf-8'.
You should check the encoding of your web page, otherwise you can get an error like:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 528-530: invalid data
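
The error above can be reproduced and handled without any network access. The following is a minimal sketch written for current Python 3 (not the Python 2.5 used in this thread); the helper name decode_page and the latin-1 fallback are assumptions chosen for illustration and are not part of html2text.

```python
def decode_page(raw_bytes, declared_encoding="utf-8", fallback="latin-1"):
    """Decode page bytes, falling back when the declared encoding is wrong."""
    try:
        return raw_bytes.decode(declared_encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises,
        # although the text may be wrong if the page uses another encoding.
        return raw_bytes.decode(fallback)


# Bytes of a page that was saved as latin-1, not UTF-8 (hypothetical content).
page = "caf\u00e9".encode("latin-1")
text = decode_page(page)
```

Decoding the same bytes with page.decode("utf-8") directly would raise a UnicodeDecodeError much like the one quoted above.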


 
Thanks a lot for the quick response.

I have installed and imported html2text properly.
I've also tested the example on a web site that is encoded in UTF-8 (the same web site as in the example), but again nothing happened.

Could you explain more or add some comments to the source code?

Thanks again.
 
I tested the example given in the thread you mentioned, and other URLs too, and everything worked fine for me.

Have you tried testing the web site on the package home page?
If not, go there, type a URL into the URL: input box and press the Convert button.
 
Thanks a lot, mikrom.

The code worked very well once I changed the IDE, so it was an indentation problem caused by using IDLE.

I'd also like to point out that this piece of code works only with web sites that are encoded in UTF-8.

Could you give me a clue about how to make it work with web sites that are not encoded in UTF-8?

Thank you again

 
Thanks, mikrom.

I mean converting a non-UTF-8 page to UTF-8 and then applying this great piece of code directly.

I did a little searching and found that this is possible with BeautifulSoup.

So I first add this code to convert the content of the web page to UTF-8:

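from BeautifulSoup import BeautifulSoup, UnicodeDammit  # BeautifulSoup 3, Python 2

def decode_html(html_string):
    # Let UnicodeDammit guess the encoding and convert the page to unicode
    converted = UnicodeDammit(html_string, isHTML=True)

    if not converted.unicode:
        # UnicodeDecodeError needs five arguments, so raise ValueError instead
        raise ValueError(
            "Failed to detect encoding, tried [%s]"
            % ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

fin = decode_html("content_of_non_utf_web_page")
soup = BeautifulSoup(fin)  # fin is now a unicode string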

Then I apply your piece of code.

What do you think?
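
For comparison, the same idea (work out the page's encoding, then convert to UTF-8) can be sketched with the standard library only. This is a rough illustration in Python 3 syntax, not what UnicodeDammit does internally and far cruder than it; the function name to_utf8, the regular expression and the candidate list are all assumptions made for this example.

```python
import re

# Looks for charset=... inside a <meta> tag; a simplification, real pages vary.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def to_utf8(raw_html, candidates=("utf-8", "cp1252")):
    """Return raw_html re-encoded as UTF-8 bytes, guessing the source encoding."""
    match = META_CHARSET.search(raw_html)
    declared = [match.group(1).decode("ascii")] if match else []
    tried = declared + list(candidates)
    for encoding in tried:
        try:
            text = raw_html.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next candidate.
            continue
        return text.encode("utf-8")
    raise ValueError("Failed to detect encoding, tried [%s]" % ", ".join(tried))


# Hypothetical page that declares and uses ISO-8859-1.
page = '<html><head><meta charset="iso-8859-1"></head><body>caf\u00e9</body></html>'
utf8_page = to_utf8(page.encode("iso-8859-1"))
```

The declared charset is tried first because a successful decode with the wrong encoding gives silently wrong text, which is the main weakness of any try-in-order approach.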
 
I tested the code I told you about many times and it worked every time, so I have added something useful to your code, lol.


I want to ask you one last question:
when saving the result to a text file, it is displayed like this, for example:

[36]: modules.php?name=Top
[37]: modules.php?name=Surveys

So can we strip the number displayed before each string?

Thank you again
 
Yes, you can strip that, for example using string operations:
Code:
>>> my_string = "[36]: modules.php?name=Top"
>>> my_string[my_string.index("]:")+3:]
'modules.php?name=Top'
>>> my_string = "[37]: modules.php?name=Surveys"
>>> my_string[my_string.index("]:")+3:]
'modules.php?name=Surveys'
or using a regular expression:
Code:
>>> import re
>>> my_string1 = "[36]: modules.php?name=Top"
>>> my_string2 = "[37]: modules.php?name=Surveys"
>>> re.sub(r"\[\d+\]:\s*", "", my_string1)
'modules.php?name=Top'
>>> re.sub(r"\[\d+\]:\s*", "", my_string2)
'modules.php?name=Surveys'
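
To apply the regular expression to the whole saved text rather than one string at a time, something like the following could work. It is a sketch in Python 3 syntax; the function name strip_reference_numbers is made up for this example, and it assumes every numbered entry starts at the beginning of a line, possibly after some spaces.

```python
import re

# Matches "[36]: " at the start of a line; [ \t] is used instead of \s
# so that the pattern never swallows a line break.
REFERENCE_PREFIX = re.compile(r"^[ \t]*\[\d+\]:[ \t]*", re.MULTILINE)


def strip_reference_numbers(text):
    """Remove the leading "[n]: " marker from every line of text."""
    return REFERENCE_PREFIX.sub("", text)


sample = "[36]: modules.php?name=Top\n[37]: modules.php?name=Surveys\n"
print(strip_reference_numbers(sample))
```

Lines that do not start with a bracketed number are left unchanged.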
 
I appreciate all the help you have given me; I can get through Python without any scares now.

I tested this last one and it was a masterpiece too.

But when I put all the code on another PC and executed it, I got this stupid error:

AttributeError: 'module' object has no attribute 'wrapwrite'

Is it linked to the version of html2text? I'm not using the same one as on the first computer.

Thanks a lot
 
When you look in the source of the module html2text.py you will find the function:
Code:
def wrapwrite(text): sys.stdout.write(text.encode('utf8'))
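
One way to narrow down an AttributeError like this is to check which copy of the module was actually imported and whether it defines the function. The block below is only a sketch in Python 3 syntax: describe_module is a hypothetical helper, and a stub module with a made-up path stands in for html2text so that the code runs without the package installed.

```python
import types


def describe_module(module, attribute):
    """Return (file the module was loaded from, whether it has the attribute)."""
    location = getattr(module, "__file__", "<no file>")
    return location, hasattr(module, attribute)


# Stand-in for an html2text copy that lacks wrapwrite (hypothetical path).
stub = types.ModuleType("html2text_stub")
stub.__file__ = "/hypothetical/path/html2text_stub.py"
before = describe_module(stub, "wrapwrite")

# The same stub once the function exists.
stub.wrapwrite = lambda text: None
after = describe_module(stub, "wrapwrite")
```

Calling describe_module(html2text, "wrapwrite") on the real import would show the file that was loaded; an unexpected path would suggest that a different copy or version of the module is being picked up.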
 
Yes, I saw it. I can run it with Python 2.5 but not with Python 2.6.
 
I'm only using Python 2.5, but I thought that 2.6 should be backward compatible.
 
Thank you,
every time you give me a clue.

The problem was with Wing IDE (I don't know if you have heard of it); when I simply execute the script (with Notepad++), it works fine.

 