awk and html parsing

Tarquin · Jun 18, 2001

Hi

I've got a script which retrieves a web page (using cURL) and then splits the resultant file into sections (using csplit), but I'm having difficulty getting awk to strip the HTML tags out so that I am left with the required data.

I've tried using the following:

{ BEGIN {charchk = "[<&]"}
i = 1
while i <= NF
{
if ((substr($i,1,1) !~ charchk )
{
print $i
}
}
++i
}

(with various combinations of brackets and loops), but it will not run:
"syntax error near unexpected token `((substr($i,1,1)"

I tried the recent related post but could not get the link to freefriends to work.

Any advice appreciated.

Krunek · Jun 19, 2001

Hi, Tarquin!

Maybe you are using the BEGIN pattern the wrong way. The BEGIN pattern in awk lets you specify commands that run before the first line is processed. For example, you can set the field separator to a colon:

BEGIN { FS = ":" }

# other
{ print $1, $2}

An awk script consists of patterns and procedures:

pattern { procedure }
pattern { procedure }
...

If the pattern is missing, the procedure is applied to all input lines. So you can put the counter, the while loop and the rest of the code in their own pair of braces { }.
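
As a rough illustration of the pattern { procedure } layout, here is a small sketch wrapped in a shell command so it can be pasted and run. The sample input and the "count" variable are made up for this example and are not from the original script.

```shell
# Hypothetical colon-separated input, piped into awk.
result=$(printf 'root:x:0\ndaemon:x:1\n' | awk '
BEGIN { FS = ":" }        # runs once, before the first line is read
$3 > 0 { print $1 }       # pattern present: only lines whose third field is > 0
{ count++ }               # pattern missing: runs for every input line
END { print count " lines" }
')
printf '%s\n' "$result"
```

Each rule sits at the top level of the script. BEGIN is a pattern in its own right, so it cannot be nested inside another rule's braces.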

I hope this helps.

Bye!

KP.

Tarquin · Jun 19, 2001

Thanks for the reply.

Unfortunately I was having the same problem before I started using the BEGIN statement so I don't think it's the issue.

vgersh99 · Jun 19, 2001

I don't know _why_ you do it THAT way, but here's your answer: you had unbalanced parentheses.

BEGIN {
charchk="[<&]"
}

{
i=1
while (i <= NF) {
if (substr($i,1,1) !~ charchk)
print $i ;
i++;
}
}

Tarquin · Jun 19, 2001

Thanks

I have re-coded as detailed and now get the following errors:

atoxx.awk: BEGIN{: command not found
atoxx.awk: syntax error near unexpected token `}'
atoxx.awk: ato03.awk: line 3: `}'

Any further advice?

vgersh99 · Jun 20, 2001

OK - how do you run your script?

You should save the code from my previous post in a file, say "foo.awk", and run awk as follows:

nawk -f foo.awk yourDataFile.txt

vlad
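
The "BEGIN{: command not found" message reads like something a shell prints rather than awk, which would fit the file having been executed as a shell script instead of being passed to awk with -f. That is only a guess; the thread does not say how the script was first run. A minimal sketch of the -f approach follows, using plain awk on the assumption that nawk may not be installed everywhere. The data file contents are invented for illustration.

```shell
# Work in a scratch directory so nothing is left behind.
tmpdir=$(mktemp -d)

# Save the awk program to a file (foo.awk is the name suggested above).
cat > "$tmpdir/foo.awk" <<'EOF'
BEGIN { charchk = "[<&]" }
{
    i = 1
    while (i <= NF) {
        # keep only fields that do not start with < or &
        if (substr($i, 1, 1) !~ charchk)
            print $i
        i++
    }
}
EOF

# Hypothetical input standing in for one csplit section.
printf '<td> 42 </td>\n&nbsp; widget\n' > "$tmpdir/data.txt"

# Hand the program file to awk with -f; the shell never interprets it.
result=$(awk -f "$tmpdir/foo.awk" "$tmpdir/data.txt")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

With this input the script prints the two fields that do not begin with < or &.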

Tarquin · Jun 20, 2001

I am running awk as:

"awk -f atoxx.awk atoxx > atoxxout"

I've dropped the use of BEGIN and the program now looks like this:

{
i=1
while(i<=NF)
{
s=substr($i,1,1)
if s !~ /[<&]/)
{
print $i;
}
++i;
}
}

And it runs, thanks.

Tarquin · Jun 20, 2001

At least it does when I add the missing bracket
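
One limitation of the field loop is that it only drops fields that begin with < or &, so a tag glued to its data (for example <td>Widget</td>) is thrown away along with the data. A possible alternative, not the method used in this thread, is to delete the tags themselves with gsub. The sketch below uses an invented input line and assumes every tag opens and closes on the same line.

```shell
# Hypothetical one-line HTML fragment.
result=$(printf '<tr><td>Widget</td><td>42</td></tr>\n' | awk '
{
    gsub(/<[^>]*>/, " ")        # replace every <...> tag with a space
    gsub(/&[a-zA-Z]+;/, " ")    # replace named entities such as &nbsp;
    if (NF) {                   # skip lines that held nothing but markup
        $1 = $1                 # rebuild the line so runs of spaces collapse
        print
    }
}')
printf '%s\n' "$result"
```

Tags that span several lines, comments and script blocks would need more than this.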
