print all lines between "start" and "stop" strings (inclusive)

Status
Not open for further replies.

Ternion

MIS
Aug 13, 2003
16
US
Hi everyone,
I have an HTML log file from which I would like to strip the HTML tags, then search for a "begin" string and print that line and all subsequent lines until the "end" string is reached. Kind of awking out a block of text. I've written a little C program that strips the HTML markup (although doing it with awk would be nice), but I now need to extract the block from the remaining text file.
Thank you for any help!
Ternion
 
Hi

I do not clearly understand what you did with the HTML tags. Anyway, the cutting-out part of the problem could be solved more easily with [tt]sed[/tt]:

Code:
sed -n '/start/,/stop/p' inputfile

But since you asked in the [tt]awk[/tt] forum:

Code:
awk '/start/{ok=1} ok{print} /stop/{ok=0}' inputfile

If this is not what you want, please post a sample of the input text.

Feherke.
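
For anyone who wants to try the two commands above side by side, here is a minimal sketch on made-up input. The sample() helper and its five lines are assumptions for illustration only; "start" and "stop" stand in for the real marker strings.

```shell
# Hypothetical sample input, not taken from the real log.
sample() {
  printf '%s\n' 'before' 'start here' 'inside' 'stop here' 'after'
}

# sed: print from the first line matching /start/ through the next line matching /stop/.
sample | sed -n '/start/,/stop/p'

# awk: the flag is set before the print rule and cleared after it,
# so both marker lines are included in the output.
sample | awk '/start/{ok=1} ok{print} /stop/{ok=0}'
```

Both commands should print the three lines from "start here" through "stop here". One difference worth knowing: if a single line contains both markers, the awk version stops at that line, while sed only begins looking for the end pattern on the line after the start match.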
 
Thank you Feherke. Sed would be fine, too, but it didn't seem to work (probably because my "start" and "stop" strings have spaces and backslashes). Here is a clipping from the input text. I want to pull out all the lines between:

Backup of "\\PCABC\C: "

and the next occurrence of:

Server -

Basically, I want to be able to pull out the information about each system's backups and email it to the user.

****BEGIN INPUT TEXT SAMPLE (after HTML stripping):
Server -

PCABC
Network control connection is established between 10.0.0.15:1379 <--> 10.0.0.78:10000

Network data connection is established between 10.0.0.15:1383 <--> 10.0.0.78:2106

Set

Information - \\PCABC\C:


Backup

Set Information

Family Name: "Media created on 6/23/2005 at 2:00:01 PM on system PCBAK"
Family Name: "Media created on 6/23/2005 at 2:00:01 PM on system PCBAK"

Backup of "\\PCABC\C: "

Backup set #19 on storage media #8

Backup set description: "Incr. Backup of All Systems since the last backup"

Backup Type: Incremental - Changed Files - Reset Archive Bit

Backup started on 7/20/2005 at 7:14:50 PM.
Backup

Set Detail Information

Unable to open the item \\PCABC\C:\Documents and Settings\gbw.TERNION\Application Data\Adobe\Acrobat\7.0\Updater\udlog.txt - skipped.

WARNING: "\\PCABC\C:\Documents and Settings\abc.TERNION\Local Settings\Application Data\MDaemon GroupWare\Mailboxes\Mailbox.pst" is a corrupt file.

This file cannot verify.

WARNING: "\\PCABC\C:\Documents and Settings\abc.TERNION\Local Settings\Application Data\Microsoft\Outlook\Outlook.pst" is a corrupt file.

This file cannot verify.

Backup completed on 7/20/2005 at 7:15:07 PM.

Backup

Set Summary

Backed up 532 files in 541 directories.

2 corrupt files were backed up

1 item was skipped.

Processed 180505415 bytes in 17 seconds.

Throughput rate: 608 MB/min

Server -

PCDEF

Network control connect.....

****END INPUT TEXT SAMPLE

Thank you again for any help.
Ternion
 
Hi

Ternion said:
it didn't seem to work (probably because my "start" and "stop" strings have spaces and backslashes)

No problem with the spaces, they have no special meaning in regular expressions. But the backslashes must be escaped, even if you search for a substring instead of using regular expressions:

Code:
awk 'index($0,"start"){ok=1} ok{print} index($0,"stop"){ok=0}' inputfile

Could you post the code which will do this job?

By the way, have you seen [tt]csplit[/tt]?

Feherke.
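
As a sketch of the substring approach with the real marker strings: one way to sidestep the escaping is to pass the markers through the environment, since awk does not process backslash escapes in ENVIRON values the way it does in string constants. The backup_section name, the environment variable names start and stop, and the six sample lines are assumptions for illustration.

```shell
# backup_section is a hypothetical helper: it reads the stripped log on stdin
# and prints the block for one host, from its "Backup of" line through the
# next "Server -" line.
backup_section() {
  # Single quotes keep the backslashes literal in the shell, and ENVIRON
  # hands them to awk unchanged, so nothing has to be doubled.
  start='Backup of "\\'"$1"'\C: "' stop='Server -' awk '
    index($0, ENVIRON["start"])      { ok = 1 }
    ok                               { print }
    ok && index($0, ENVIRON["stop"]) { ok = 0 }
  '
}

# Usage with a few lines shaped like the sample posted above.
printf '%s\n' 'Server -' 'PCABC' 'Backup of "\\PCABC\C: "' \
  'Backup set #19 on storage media #8' 'Server -' 'PCDEF' |
  backup_section PCABC
```

This should print the "Backup of" line, the "Backup set #19" line and the closing "Server -" line. The host name is only pasted into a plain string, never into a regular expression, so dots or other special characters in it need no extra care.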
 
Code:
# Remove html tags.
{ gsub(/<[^>]+>/, "") }

# Print range of lines.
/Backup of "\\\\PCABC\\C: "/,  /^[ \t]*Server -/

Save as "block.awk" and run with

nawk -f block.awk infile >outfile

Under Solaris, use /usr/xpg4/bin/awk.
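
A quick way to check the two rules above without a real log is to feed them a few hand-made lines. This is only a sketch: the strip_and_extract name and the HTML sample are assumptions, and plain awk is used here where the post uses nawk.

```shell
# strip_and_extract is a hypothetical wrapper around the two rules above.
strip_and_extract() {
  awk '
    # Rule 1 runs on every line and rewrites $0 with the tags removed.
    { gsub(/<[^>]+>/, "") }
    # Rule 2 is a range pattern with no action, so matching lines are printed.
    /Backup of "\\\\PCABC\\C: "/, /^[ \t]*Server -/
  '
}

printf '%s\n' \
  '<p>Set Information</p>' \
  '<b>Backup of "\\PCABC\C: "</b>' \
  '<p>Backup set #19 on storage media #8</p>' \
  '<h1>Server -</h1>' \
  '<p>PCDEF</p>' |
  strip_and_extract
```

The order of the rules matters: the range pattern is tested against the line after gsub() has already removed the tags, which is why the anchored /^[ \t]*Server -/ can match a line that started with a tag in the raw input.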
 
Hi

futurelet said:
# Remove html tags.
{ gsub(/<[^>]+>/, "") }

But HTML tags can be wrapped across multiple lines (for example, in the PHP documentation). Do you have a simple solution for this?

Anyway, good tip to use pattern ranges in [tt]awk[/tt]; I had forgotten about them completely.

Feherke.
 
Thanks everyone for all your help! Feherke, the sed ended up being the simplest way to do what I need. You're right, I just had to escape the backslashes. Here's what I used:
Code:
sed -n '/Backup of "\\\\PCABC/,/Server -/p' logfile.txt
This pulls out just the "PCABC" section.
The logfile.txt was the result of running this little 'C' program to strip the html markup from the BackupExec output logfile:
Code:
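#include <stdio.h>

int main(int argc, char* argv[])
{
	int bIgnore = 0;	/* nonzero while inside a <...> tag */
	int ch;			/* int, not char, so EOF can be told apart from data */

	/* Test for EOF before writing, so the EOF value is never output. */
	while ((ch = getc(stdin)) != EOF)
	{
		if (ch == '<')
			bIgnore = 1;
		else if (ch == '>')
			bIgnore = 0;
		else if (bIgnore == 0)
			putc(ch, stdout);
	}
	return 0;
}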
I apologize for all this non-awk stuff in this forum, but I wanted to let everyone know what eventually worked for me.
Thanks again for everyone's help!
Ternion
 
feherke said:
But HTML tags can be wrapped across multiple lines (for example, in the PHP documentation). Do you have a simple solution for this?
Code:
# Presumes there won't be an empty line in tag.
# Records will be delimited by empty lines.
BEGIN { RS="" }
# Remove html tags.
{ gsub(/<[^>]+>/, "") }
I think the following will work in any case:
Code:
{
  gsub(/<[^>]+>/, "")
  if (tagopen && (tagopen = !sub(/^.*>/, "")))
    next
  tagopen = sub(/<.*$/, "")
  print
}
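
To see how the tagopen flag behaves, the snippet above can be run on a tag that spans three lines. This is a sketch only: the strip_multiline_tags name and the sample markup are made up for illustration, and the awk body is the code above with comments added.

```shell
# strip_multiline_tags is a hypothetical wrapper name.
strip_multiline_tags() {
  awk '
    {
      gsub(/<[^>]+>/, "")          # drop tags that open and close on this line
      if (tagopen && (tagopen = !sub(/^.*>/, "")))
        next                       # still inside a tag from an earlier line: skip
      tagopen = sub(/<.*$/, "")    # a tag opens here without closing: cut it, remember it
      print
    }
  '
}

printf '%s\n' \
  '<b>Backup</b><font' \
  '  size="2"' \
  '  color="red">Set Summary</font>' \
  'Backed up 532 files' |
  strip_multiline_tags
```

This should print "Backup", "Set Summary" and "Backed up 532 files" on separate lines. The middle input line produces no output because it lies entirely inside the open font tag.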
 
Hi

In fact, when I wrote "simple solution" I was thinking of something like the [tt]m[/tt] (multiline) modifier in [tt]Perl[/tt]'s regex substitution with [tt]s///[/tt].

But I like your second snippet, more precisely the direct use of [tt]sub()[/tt]'s result. My approach would have needed a lot of [tt]if[/tt]s. Thanks.

Feherke.
 
If you want to remove html tags without doing any other processing:
Code:
BEGIN {RS="<[^>]*>"; ORS=""}
# Any nonzero constant is an always-true pattern; with no action, each record is printed.
8
You must use an Awk that lets RS be more than one character; e.g., mawk or gawk.
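
A small sketch of that one-liner on a tag that wraps onto a second line. The strip_tags_rs name and the sample input are assumptions for illustration, and the awk on your PATH is assumed to accept a regular expression in RS.

```shell
# strip_tags_rs is a hypothetical wrapper name.
strip_tags_rs() {
  # Every tag is a record separator, and ORS="" glues the remaining
  # text back together without adding anything.
  awk 'BEGIN { RS = "<[^>]*>"; ORS = "" } 8'
}

printf '<b>Backup</b> of <a\n href="x.html">PCABC</a>\n' | strip_tags_rs
```

With gawk or mawk this should print "Backup of PCABC". Because [^>] also matches a newline, the tag that wraps onto the second line is removed as a single separator, so no per-line bookkeeping is needed.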
 