Convert HTML to Plain Text

Status
Not open for further replies.

DrMingle

Technical User
May 24, 2009
116
US
This works well in pulling down the HTML/Text from the requested page:

Code:
# -*- coding: utf-8 -*-
# Python

from urllib import urlopen
print urlopen('http://www.fsmb.org').read()

However, I need help converting the output of urlopen().read() to plain text rather than raw HTML.

Your help is appreciated.
 
JustinEzequiel:

Thanks for the response.

I have downloaded html2text.py. When I run it, a GUI window pops up with the phrase PYTHONWIN. My only options are to fill it in (with what, I don't know), select OK, or select CANCEL.

Any ideas?
 
If you open the html2text.py file in your favorite editor, you'll see at the bottom how you can use it:

Code:
if __name__ == "__main__":
    baseurl = ''
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))
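
Reading the block above, the script picks its input in one of three ways: a URL given as the first argument, a file name (with an optional encoding as the second argument), or standard input when there are no arguments at all. With no arguments it waits on standard input, which is probably what the pop-up window is asking for. Here is a rough sketch of that decision on its own, written for Python 3; the function name pick_input_source and its return shape are made up for illustration and are not part of html2text.

```python
def pick_input_source(argv):
    """Return (kind, target, encoding) following the same branches as the script."""
    args = argv[1:]
    if not args:
        # No arguments: the script reads everything from standard input
        return ("stdin", None, "utf8")
    if args[0].startswith("http://"):
        # URL argument: the encoding is detected later from the response
        return ("url", args[0], None)
    # Otherwise the first argument is a file name; the second is an optional encoding
    encoding = args[1] if len(args) > 1 else "utf8"
    return ("file", args[0], encoding)


if __name__ == "__main__":
    print(pick_input_source(["html2text.py"]))
    print(pick_input_source(["html2text.py", "http://www.fsmb.org"]))
    print(pick_input_source(["html2text.py", "page.html", "latin-1"]))
```

So running the file with the URL as a command-line argument should take the URL branch, while running it with nothing takes the standard-input branch.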
 
Code:
import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text
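
The code above is Python 2 and captures the output by swapping sys.stdout for a StringIO by hand, since wrapwrite prints instead of returning a string. In case it helps anyone on Python 3, the same capture idea can be sketched with contextlib.redirect_stdout. The names capture_output and fake_wrapwrite are hypothetical; fake_wrapwrite only stands in for any function that prints its result.

```python
import io
from contextlib import redirect_stdout


def capture_output(func, *args, **kwargs):
    """Run func and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    # redirect_stdout restores the original sys.stdout on exit, even on error
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def fake_wrapwrite(text):
    # Hypothetical stand-in for a function that prints instead of returning
    print(text)


if __name__ == "__main__":
    text = capture_output(fake_wrapwrite, "Hello, plain text")
    print(repr(text))
```

The with block plays the role of the try/finally pair that restores sys.stdout in the original.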
 
This is what I went with (this is a snippet of the whole):

Code:
if __name__ == "__main__":
    baseurl = 'http://www.fsmb.org'
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))

I am still being prompted, through a pop-up window, to input some sort of data. Any ideas? Should I be tweaking other elements of this code?
 
Justin:

Touchdown...I was a day late and a dollar short. Your code worked wonderfully:

Code:
import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text

Thank you.
 
Try my previous post. Do not modify the html2text.py file; instead, incorporate the code from my post into your own code.
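
For comparison only: if html2text cannot be installed, a much rougher plain-text extraction is possible with just the Python 3 standard library. This is a different technique from html2text (it produces no Markdown-style formatting, only the text nodes), and the names TextExtractor and html_to_text are made up for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse all runs of whitespace to single spaces
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>"
    print(html_to_text(sample))
```

Because every text node is joined with a space, inline markup in the middle of a word will split that word, so treat this as a fallback and not a replacement for html2text.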
 