
awk and html parsing

Status
Not open for further replies.

Tarquin

Technical User
Jun 17, 2001
38
AU
Hi

I've got a script which retrieves a web page (using cURL) and then splits the resultant file into sections (using csplit) but I'm having difficulty getting awk to strip the html tags out so that I am left with the required data.

I've tried using the following:

{ BEGIN {charchk = "[<&]"}
i = 1
while i <= NF
{
if ((substr($i,1,1) !~ charchk )
{
print $i
}
}
++i
}

(with various combinations of brackets and loops), but it will not run:
"syntax error near unexpected token `((substr($i,1,1)"

I tried the recent related post but could not get the link to freefriends to work.

Any advice appreciated.
 

Hi, Tarquin!

Maybe you are using the BEGIN pattern in the wrong way. The BEGIN pattern in awk lets you specify commands that run before the first line is processed. For example, you can set the field separator to a colon:

BEGIN { FS = ":" }

# other
{ print $1, $2}


An awk script consists of patterns and procedures:

pattern { procedure }
pattern { procedure }
...

If the pattern is missing, the procedure is applied to all input lines. So you can put the counter, the while loop and the rest of your code in braces { } as a procedure with no pattern.
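
To make the pattern { procedure } layout concrete, here is a minimal sketch run from the shell. The function name show_rules and the sample input are made up for illustration; only the BEGIN rule and the pattern-less rule come from the post above.

```shell
# show_rules is a hypothetical helper name; it reads colon-separated lines on stdin.
show_rules() {
  awk '
    BEGIN { FS = ":" }                      # runs once, before any input is read
    /^d/  { print "starts with d:", $0 }    # pattern present: only matching lines
          { print $1, $2 }                  # pattern missing: every input line
  '
}

# Made-up sample input with two records.
printf 'a:b:c\nd:e:f\n' | show_rules
```

Each input line is tested against every rule in order, so a line can trigger more than one procedure.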

I hope this helps.

Bye!

KP.
 
Thanks for the reply.

Unfortunately I was having the same problem before I started using the BEGIN statement so I don't think it's the issue.
 

I don't know _why_ you do it THAT way, but here's your
answer - you had unbalanced parentheses

BEGIN {
    charchk = "[<&]"
}

{
    i = 1
    while (i <= NF) {
        if (substr($i, 1, 1) !~ charchk)
            print $i;
        i++;
    }
}
 

Thanks

I have re-coded as detailed and now get the following errors:

atoxx.awk: BEGIN{: command not found
atoxx.awk: syntax error near unexpected token `}'
atoxx.awk: ato03.awk: line 3: `}'

Any further advice?
 
OK - how do you run your script?


You should save the code from my previous post
in, say, a file "foo.awk" and run awk as follows:

nawk -f foo.awk yourDataFile.txt

vlad
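
For what it's worth, "command not found" and "syntax error near unexpected token" have the form of shell messages rather than awk messages, so one possibility (an assumption, nothing in the thread confirms it) is that the file was being read by the shell instead of by awk. The sketch below shows the -f invocation end to end. The temporary directory, sample.txt and its contents are made up, plain awk stands in for nawk, and the loop is written with for instead of while.

```shell
# Hypothetical scratch directory and file names, used only for this sketch.
workdir=$(mktemp -d)

# Save the awk program to a file, as suggested above.
cat > "$workdir/foo.awk" <<'EOF'
BEGIN { charchk = "[<&]" }
{
    for (i = 1; i <= NF; i++)
        if (substr($i, 1, 1) !~ charchk)
            print $i
}
EOF

# Made-up sample line: two tags, two data words and one entity.
printf '<b> price 42 &nbsp; </b>\n' > "$workdir/sample.txt"

# awk reads its program from the file named after -f.
awk -f "$workdir/foo.awk" "$workdir/sample.txt"

# By contrast, "sh foo.awk sample.txt" would hand the awk source to the shell,
# which cannot parse it.
```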
 

I am running the awk as :

"awk -f atoxx.awk atoxx > atoxxout"

I've dropped the use of BEGIN and the program now looks like this:

{
i=1
while(i<=NF)
{
s=substr($i,1,1)
if s !~ /[<&]/)
{
print $i;
}
++i;
}
}

And it runs, thanks.
 

At least it does when I add the missing bracket ;).
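
For anyone landing here later, this is a sketch of the final program with the missing bracket put back, wrapped in a shell function so it can be tried on a pipe. The function names and sample input are made up. Note that this approach drops any whole field that starts with < or &, so data glued to a tag (for example <td>42</td> with no spaces) is lost. The second function is a different technique that is not from this thread: it deletes anything of the form <...> with gsub before printing the fields, and it leaves entities such as &nbsp; alone.

```shell
# strip_fields: the program from the thread, with the parenthesis restored.
strip_fields() {
  awk '{
    i = 1
    while (i <= NF) {
      s = substr($i, 1, 1)
      if (s !~ /[<&]/)      # opening parenthesis added after "if"
        print $i
      ++i
    }
  }'
}

# strip_tags: alternative sketch, not the method used in the thread.
# Replaces every <...> run with a space, then prints the remaining fields.
strip_tags() {
  awk '{
    gsub(/<[^>]*>/, " ")    # modifying $0 makes awk split the fields again
    for (i = 1; i <= NF; i++)
      print $i
  }'
}

# Made-up sample line.
printf '<td>42</td> <b> ok </b>\n' | strip_fields
printf '<td>42</td> <b> ok </b>\n' | strip_tags
```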
 