Need help on HTML tag remover...

Status
Not open for further replies.

eyesrglazed

Programmer
Jul 14, 2005
20
US
This program is supposed to remove the HTML tags from a downloaded HTML file. Here is the code:
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib")

using namespace std;

int main()
{
	ifstream inFile;
	ofstream outFile;
	ifstream inFiletemp;
	ofstream outFiletemp;
	char fileName[100];
	char fileName2[100];
	char temp;
	
	inFile.open("sites.txt");
	inFile >> fileName >> fileName2;

	//***Write HTML File****************************************************************************

	outFile.open("temphtml.txt");
	HINTERNET Initialize,Connection,File;
	DWORD dwBytes;
	char ch;
	Initialize = InternetOpen("HTTPGET", INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
	Connection = InternetConnect(Initialize, fileName, INTERNET_DEFAULT_HTTP_PORT,
								 NULL , NULL, INTERNET_SERVICE_HTTP, 0, 0);
	File = HttpOpenRequest(Connection, NULL, fileName2, NULL, NULL, NULL, 0, 0);
	if(HttpSendRequest(File, NULL, 0, NULL, 0))
	{
		while(InternetReadFile(File, &ch, 1, &dwBytes))
		{
			if(dwBytes != 1)
				break;
			outFile << ch;
		}
		cout << "Connected successfully." << endl;
	}
	InternetCloseHandle(File);
	InternetCloseHandle(Connection);
	InternetCloseHandle(Initialize);
	outFile.close();

	//***Write "No tags" File***********************************************************************

	inFiletemp.open("temphtml.txt");
	outFiletemp.open("temp.txt");
	inFiletemp >> temp;
	while (! inFiletemp.eof())
	{
		if (temp == '<')
		{
			while (temp != '>')
				inFiletemp >> temp;
			inFiletemp >> temp;
		}
		outFiletemp << temp;
		inFiletemp >> temp;
	}
	return 0;
}

In the "no tags" section, it reads a character from the file and checks if it is a '<'. If so, it continues reading until it finds a '>'. Then, it reads the next character and writes that one. I know this method of tag removing won't work, but I was wondering what is wrong with it.

Help would be appreciated.
 
Just thinking of your algorithm, what happens when there are back to back tags (e.g. <b><u>)? When it finds a '<', it should continue reading until it finds a '>', then continue where it left off instead of writing the next letter.

Other than checking for read errors inside the while (temp != '>') loop, all it would take to fix the algorithm is one more line of code.
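
For anyone reading along, here is a rough sketch of what that fix might look like. It assumes the "one more line" is a continue placed after the character following the tag has been read, so the outer loop re-checks that character instead of writing it. The function name stripTagsSkipLoop and the use of string streams in place of the files are made up for the example. The inner loop also stops if the input runs out before a '>' turns up.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): the same loop shape as the
// posted "no tags" section, with a continue added after a tag has been skipped.
std::string stripTagsSkipLoop(const std::string& html)
{
    std::istringstream in(html);
    std::ostringstream out;
    char temp;

    in >> temp;
    while (in)
    {
        if (temp == '<')
        {
            // Skip to the closing '>', giving up if the input ends first.
            while (temp != '>' && in >> temp)
                ;
            in >> temp;   // first character after the tag
            continue;     // re-check it, since it may be another '<'
        }
        out << temp;
        in >> temp;
    }
    return out.str();
}
```

Calling stripTagsSkipLoop("<b><u>HI!</u></b>") should give "HI!". Because it still reads with operator>>, whitespace is dropped just like in the posted code.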
 
Here's another version that uses a counter to fix the problem with tags inside of tags. The only problem it has is how it deals with the '>'s. Just for kicks, compile this and look at the "temp.txt" file it creates. BTW: "sites.txt" contains two strings that are something like " and "/index.html".

Code:
#include <iostream>
#include <fstream>
#include <string>
#include <windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib")

using namespace std;

int main()
{
	ifstream inFile;
	ofstream outFile;
	ifstream inFiletemp;
	ofstream outFiletemp;
	char fileName[100];
	char fileName2[100];
	
	inFile.open("sites.txt");
	inFile >> fileName >> fileName2;

	//***Write HTML File****************************************************************************

	outFile.open("temphtml.txt");
	HINTERNET Initialize,Connection,File;
	DWORD dwBytes;
	char ch;
	Initialize = InternetOpen("HTTPGET", INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
	Connection = InternetConnect(Initialize, fileName, INTERNET_DEFAULT_HTTP_PORT,
								 NULL , NULL, INTERNET_SERVICE_HTTP, 0, 0);
	File = HttpOpenRequest(Connection, NULL, fileName2, NULL, NULL, NULL, 0, 0);
	if(HttpSendRequest(File, NULL, 0, NULL, 0))
	{
		while(InternetReadFile(File, &ch, 1, &dwBytes))
		{
			if(dwBytes != 1)
				break;
			outFile << ch;
		}
		cout << "Connected successfully." << endl;
	}
	InternetCloseHandle(File);
	InternetCloseHandle(Connection);
	InternetCloseHandle(Initialize);
	outFile.close();

	//***Write "No tags" File***********************************************************************

	char temp;
	int num = 0;

	inFiletemp.open("temphtml.txt");
	outFiletemp.open("temp.txt");
	inFiletemp >> temp;
	
	while (! inFiletemp.eof())
	{
		if (temp == '<')
			num++;
		if (temp == '>')
			num--;
		if (num == 0)
			outFiletemp << temp;
		inFiletemp >> temp;
	}

	return 0;
}

By the way, I do realize that the problem lies within the "if (temp == '>')" segment. I just don't know how to fix it.
 
I didn't run or test the code, but I think if you just make those else if's instead of if's it will work better for you. For example, if temp is '>', then num is decremented by one, but then you check num immediately after that. So with a simple single tag, after you find the '>' you decrement to zero and the third if is run and then the '>' is output. Make those else ifs and that won't happen.

Also, an easier way to do the loop that also checks for other errors besides eof is this:
Code:
    inFiletemp.open("temphtml.txt");
    outFiletemp.open("temp.txt");
    
    while (inFiletemp >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
            num--;
        else if (num == 0)
            outFiletemp << temp;
    }
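
To make the difference concrete, here is a small side-by-side sketch. The function names stripWithIfs and stripWithElseIfs are invented for the comparison, and string streams stand in for the files so it is easy to try on a short test string.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: three independent ifs, as in the earlier post.
std::string stripWithIfs(const std::string& html)
{
    std::istringstream in(html);
    std::ostringstream out;
    char temp;
    int num = 0;
    while (in >> temp)
    {
        if (temp == '<')
            num++;
        if (temp == '>')
            num--;
        if (num == 0)          // also runs for the '>' that just closed a tag
            out << temp;
    }
    return out.str();
}

// Hypothetical helper: the same loop with else ifs.
std::string stripWithElseIfs(const std::string& html)
{
    std::istringstream in(html);
    std::ostringstream out;
    char temp;
    int num = 0;
    while (in >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
            num--;
        else if (num == 0)     // only reached for characters that are not brackets
            out << temp;
    }
    return out.str();
}
```

With "<b>HI!</b>" as input, the first version should return ">HI!>" and the second "HI!".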
 
So, do you think that if I did it like this it would work?

Code:
    while (inFiletemp >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
        {
            num--;
            if (num == 0)
                inFiletemp >> temp;
        }
        else if (num == 0)
            outFiletemp << temp;
    }
 
Well, I compiled the new code, and it works, but it ONLY removes the '<'s and the '>'s, along with the spaces. Can we get it to work correctly?
 
Code:
    while (inFiletemp >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
        {
            num--;
			if (num == 0)
				inFiletemp >> temp;
		}
        else if (num == 0)
            outFiletemp << temp;
    }

This one's a laugh. If you compile it, it works, but in the wrong direction. Instead of removing the brackets and the characters between them, like "<b>HI!</b>" to "HI!", it turns it around and writes "b/b".
 
What was wrong with the code I posted? Assuming you initialize num to 0 it should work. Here is an example program to test with. Just change data to whatever text you want to test:
Code:
#include <iostream>
#include <string>
#include <sstream>

int main()
{
    std::string data("<b>HI!</B>");
    std::istringstream inFiletemp(data);
    std::ostringstream outFiletemp;

    char temp;
    int num = 0;
    while (inFiletemp >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
            num--;
        else if (num == 0)
            outFiletemp << temp;
    }

    std::cout << outFiletemp.str() << std::endl;
}
By the way, when I put your latest code into that test program it outputs "I!", not "b/b".
 
Is there a difference between me using ifstream and ofstream and you using istringstream and ostringstream?

The input you're using in that demo program isn't the same as what I'm using. I'm just wondering if that also makes a difference or not.
 
No, there shouldn't be any difference between the stringstreams and file streams (except of course for how they are initialized and stuff). If you want to test with file streams use the code below. Again, that works, except for the part about the whitespace. I forgot about that: operator >> skips over whitespace. To print out the whitespace as well you can use noskipws or inFiletemp.get(). Either one will work.
Code:
#include <iostream>
#include <string>
#include <fstream>

int main()
{
    std::ifstream inFiletemp("in.txt");
    std::ofstream outFiletemp("out.txt");

    char temp;
    int num = 0;
    while (inFiletemp.get(temp))
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
            num--;
        else if (num == 0)
            outFiletemp << temp;
    }
}
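
The noskipws route would look something like this. It is only a sketch: the function name stripTagsKeepSpaces is made up, and a string stream is used instead of the file so it can be tested quickly. The same manipulator can be applied to an ifstream.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: same counter loop, but operator>> is told not to skip
// whitespace, so spaces and newlines outside of tags are kept.
std::string stripTagsKeepSpaces(const std::string& html)
{
    std::istringstream in(html);
    std::ostringstream out;
    in >> std::noskipws;       // stop operator>> from discarding whitespace

    char temp;
    int num = 0;
    while (in >> temp)
    {
        if (temp == '<')
            num++;
        else if (temp == '>')
            num--;
        else if (num == 0)
            out << temp;
    }
    return out.str();
}
```

For example, stripTagsKeepSpaces("<b>Hello, world!</b> Bye") should return "Hello, world! Bye" with the spaces intact.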
 