converting HTML markup code to text

Status
Not open for further replies.

santanudas

Technical User
Mar 18, 2002
Greetings all,

I'm a Python newbie, so my apologies if this is a silly question.

I was trying to convert HTML markup code to human-readable text. The sample line is taken from the iTunes music library, which is an .xml file.

Code:
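import urllib  # Python 2: unquote() lives directly in the urllib module

# Sample <string> line taken from the iTunes library .xml file
xmlString = "<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/George%20Michael/Ladies%20&%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Father%20Figure.m4a</string>"

# After splitting on "/": [-1] is "string>", [-2] the file name, [-3] the album directory
String = xmlString.split("/")
iX = urllib.unquote(String[-3])  # decodes the %XX escapes only
print "Album name: " + iX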

I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael"; instead it returns "Ladies & Gentlemen_ The Best Of George Michael", i.e. it converts the %XX escapes but not the & and similar characters. Does anyone know what I'm doing wrong?

Thanks in advance for your help. Cheers!!!
 
Hi

santanudas said:
I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael"; instead it returns "Ladies & Gentlemen_ The Best Of George Michael"
As you can see, those two strings appear identical.

(Note that the TGML parser currently has a bug: it transforms certain character entities. It is tricky, because the message appears correctly in the preview and is altered only afterwards.)

Please post again without previewing.


Feherke.
 
Silly me, I should have realized the HTML code would be converted into normal characters by the browser anyway.

So, this is the sample line, taken from the .xml file:
Code:
<key>Location</key><string>file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/iTunes%20Music/George%20Michael/Ladies%20&+#+38+;%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Jesus%20To%20A%20Child.m4a</string>
(omit the + signs between & and ;)
After the conversion, I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael" but I got "Ladies &+#+38+; Gentlemen_ The Best Of George Michael" instead (again, ignore the plus signs). Did I post it the right way this time? Cheers!!!
 
I see, I still didn't get it right. In short: using urllib.unquote(), "&#38;" is not being converted to "&". What am I missing?
(Fingers crossed!!! Hopefully this time it will come up correctly.)
 
Can anyone help, please? Is it too tough to do?
Cheers!!!
 
Hi

That is because urllib.unquote() handles only URL encoding (those %XX things). But your string also contains character entities (those &#XX; things), which have to be handled separately.

Personally I would use the unescape() function from Fredrik Lundh's article, Unescape HTML Entities. Just add the import and def as shown there, then change this line:
Code:
iX = unescape(urllib.unquote(String[-3]))

Feherke.
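
For readers on a current interpreter, the two decoding steps described above can be sketched as follows. This is a minimal Python 3 sketch, not the code from the thread: it assumes html.unescape() as a stand-in for the unescape() function from the article, and urllib.parse.unquote() in place of Python 2's urllib.unquote(). The helper name album_from_location is hypothetical, and the sample line is the first post's path with the &#38; entity written out.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
from html import unescape          # stand-in for the article's unescape()
from urllib.parse import unquote   # Python 3 home of urllib.unquote()

xml_line = (
    "<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/"
    "George%20Michael/Ladies%20&#38;%20Gentlemen_%20The%20Best%20Of%20"
    "George%20Michael/Father%20Figure.m4a</string>"
)


def album_from_location(line):
    """Return the decoded album directory name (hypothetical helper)."""
    parts = line.split("/")
    # parts[-1] is "string>", parts[-2] the file name, parts[-3] the album.
    raw_album = parts[-3]
    # unquote() turns %20 into a space; unescape() turns &#38; into "&".
    return unescape(unquote(raw_album))


album = album_from_location(xml_line)
print("Album name: " + album)
```

With this input the script prints "Album name: Ladies & Gentlemen_ The Best Of George Michael".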
 
Hi there,
Thanks for the link. The unescape() function did solve the problem for the "&#XX;" entities, but it is creating a problem for strings like Rai%CC%88 (Raï) or Beyonc%CC%81 (Beyoncé). This is what I get:

Code:
Traceback (most recent call last):
  File "./metadata.py", line 150, in <module>
    artist_dir="%s/%s/%s" % (media_dir, genre, artist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

Any solution to this issue? Cheers!!!
 
Just to mention that I already have # -*- coding: ISO-8859-15 -*- at the beginning of the script, but it doesn't help. Cheers!!!
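
A possible explanation, not confirmed anywhere in this thread: in Python 2 the coding line only tells the interpreter how to read literals in the source file, so it does not change how byte strings are converted at run time. If urllib.unquote() returns a byte string for one path component while unescape() returns a unicode string for another, combining them with "%s/%s/%s" forces an implicit ASCII decode, which fails on the first non-ASCII byte. The sketch below shows the explicit alternative in Python 3 (an assumption, since the thread uses Python 2): decode the %XX bytes as UTF-8 and compose the combining accent. The helper name decode_component and the directory names are hypothetical.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
import unicodedata
from urllib.parse import unquote


def decode_component(raw):
    """Percent-decode one path component as UTF-8 (hypothetical helper)."""
    # The %XX bytes are interpreted as UTF-8, never as ASCII.
    text = unquote(raw, encoding="utf-8")
    # "Rai%CC%88" is "i" followed by a combining diaeresis; NFC
    # normalization merges the pair into the single character "ï".
    return unicodedata.normalize("NFC", text)


artist = decode_component("Rai%CC%88")
# Hypothetical values standing in for media_dir and genre from the traceback.
artist_dir = "%s/%s/%s" % ("media", "World", artist)
print(artist_dir)
```

In Python 2 the closest equivalent would be to call .decode("utf-8") on the result of urllib.unquote() before mixing it with unicode strings.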
 
Hi

No idea what happens there. Anyway, reversing the function calls seems to solve something here (not sure if this is your problem too):
Code:
iX = urllib.unquote(unescape(String[-3]))

Feherke.
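
Whether the order of the two calls matters depends on the input. The sketch below uses a contrived string (not from the iTunes file) in which the entity &#37; stands for a percent sign, so the two orders give different results. It is written for Python 3 and assumes html.unescape() and urllib.parse.unquote() as stand-ins for the functions discussed in the thread.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
from html import unescape
from urllib.parse import unquote

sample = "&#37;41"  # contrived input: &#37; is the entity for "%"

# Entity first: "&#37;41" becomes "%41", which unquote() then decodes.
entity_first = unquote(unescape(sample))

# URL first: there is no %XX sequence yet, so unquote() changes nothing,
# and unescape() afterwards leaves a literal "%41" behind.
url_first = unescape(unquote(sample))

print(entity_first, url_first)
```

For the album names in this thread both orders produce the same text, so swapping the calls would not be expected to change the UnicodeDecodeError.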
 
Hi feherke,
I already tried reversing the function calls as you said; needless to say, that didn't fix the problem here. Cheers!!!
 
No problem feherke, at least you tried to help. Many thanks for that. Cheers!!!
 