Convert HTML to Plain Text

DrMingle · May 18, 2010

This works well in pulling down the HTML/Text from the requested page:

Code:

# -*- coding: utf-8 -*-
# Python

from urllib import urlopen
print urlopen('http://www.fsmb.org').read()

However, I need help converting the output of urlopen().read() to plain text rather than raw HTML.

Your help is appreciated.

JustinEzequiel · May 18, 2010

perhaps

http://www.aaronsw.com/2002/html2text/

DrMingle · May 18, 2010

JustinEzequiel:

Thanks for the response.

I have downloaded html2text.py. When I run it, a GUI window pops up with the phrase PYTHONWIN. My only options are to fill it in (with what, I don't know), select OK, or select CANCEL.

Any ideas?

JustinEzequiel · May 18, 2010

If you open the html2text.py file in your favorite editor, you'll see at the bottom how you can use it:

Code:

if __name__ == "__main__":
    baseurl = ''
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))
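
Reading the block above, the script seems to choose its input from the command line: a first argument starting with http:// is fetched, any other argument is opened as a local file, and with no arguments at all it reads from standard input. That last branch is presumably why running the file with no arguments ends up waiting for input. A minimal sketch of that dispatch follows; pick_source is a hypothetical helper used only for illustration and is not part of html2text.

```python
def pick_source(argv):
    """Return which input branch the __main__ block above would take.

    pick_source is a hypothetical helper for illustration only;
    it is not part of html2text.
    """
    args = argv[1:]
    if not args:
        return 'stdin'   # no arguments: the script reads sys.stdin
    if args[0].startswith('http://'):
        return 'url'     # first argument is fetched with urllib
    return 'file'        # anything else is opened as a local file


print(pick_source(['html2text.py']))
print(pick_source(['html2text.py', 'http://www.fsmb.org']))
print(pick_source(['html2text.py', 'page.html']))
```

So passing the URL on the command line, or importing the module from your own script, should avoid the stdin branch.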

JustinEzequiel · May 18, 2010

Code:

import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text
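
The code above swaps sys.stdout for a StringIO buffer because wrapwrite appears to print its result rather than return it. The same capture idea can be sketched on its own, here for Python 3 (an assumption, since the thread uses Python 2), where contextlib.redirect_stdout does the swap and restore. noisy_writer and capture_output are hypothetical names standing in for any function that prints.

```python
import io
from contextlib import redirect_stdout


def noisy_writer(text):
    # Hypothetical stand-in for a function that prints instead of returning.
    print(text.upper())


def capture_output(func, *args):
    """Run func(*args) and return whatever it printed as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):   # sys.stdout is restored when the block exits
        func(*args)
    return buf.getvalue()


captured = capture_output(noisy_writer, 'plain text')
print(repr(captured))
```

The with block plays the role of the try/finally pair in the Python 2 version: stdout is put back even if the wrapped call raises.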

DrMingle · May 18, 2010

This is what I went with (this is a snippet of the whole):

Code:

if __name__ == "__main__":
    baseurl = 'http://www.fsmb.org'
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))

I am still being prompted to input some sort of data through a pop-up window. Any ideas? Should I be tweaking other elements of this code?

DrMingle · May 18, 2010

Justin:

Touchdown...I was a day late and a dollar short. Your code worked wonderfully:

Code:

import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text

Thank you.

JustinEzequiel · May 18, 2010

Try my previous post. Do not modify the html2text.py file; instead, incorporate my post into your own code.
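
If installing html2text is not an option, a rougher alternative is to strip tags with the standard library. The sketch below is for Python 3 (an assumption, since the thread uses Python 2) and uses html.parser.HTMLParser. It is not what html2text does: it produces no Markdown-style formatting and just collects the text nodes. TextExtractor and html_to_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.parts)


sample = ('<html><head><style>p{}</style></head>'
          '<body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>')
print(html_to_text(sample))
```

For real pages you would feed it the decoded string read from the URL, as in the earlier posts.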
