print all lines between "start" and "stop" strings (inclusiv 1

Ternion · Jul 21, 2005

Hi everyone,
I have an HTML log file from which I would like to strip the HTML tags, then search for a "begin" string and print that line and all subsequent lines until the "end" string is reached; in other words, awk out a block of text. I've written a little 'C' program that strips the HTML markup (although doing it with awk would be nice), but I still need to extract the block from the remaining text file.
Thank you for any help!
Ternion

feherke · Jul 21, 2005

Hi

I do not clearly understand what you did with the HTML tags. Anyway, the cutting-out part of the problem could be solved more easily with [tt]sed[/tt]:

Code:

sed -n '/start/,/stop/p' inputfile

But since you asked on the [tt]awk[/tt] forum:

Code:

awk '/start/{ok=1} ok{print} /stop/{ok=0}' inputfile

If this is not what you want, please post a sample of the input text.
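To compare the two one-liners above, here is a minimal sketch that runs both on a small made-up input. The sample text and the function names extract_awk and extract_sed are assumptions for illustration only; "start" and "stop" are the placeholder markers from the post.

```shell
# Hypothetical sample input; "start" and "stop" are the placeholder markers.
sample='before
start here
middle
stop here
after'

# awk flag approach: switch printing on at "start", off once "stop" has been printed.
extract_awk() {
  awk '/start/{ok=1} ok{print} /stop/{ok=0}'
}

# sed range approach: print only the lines from "start" through "stop".
extract_sed() {
  sed -n '/start/,/stop/p'
}

printf '%s\n' "$sample" | extract_awk
printf '%s\n' "$sample" | extract_sed
```

Both calls print the three lines from "start here" through "stop here". One difference to keep in mind: when both markers occur on the same line, the awk version stops at that line, while sed only starts looking for the end marker on the following line.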

Feherke.

http://rootshell.be/~feherke/

Ternion · Jul 21, 2005

Thank you Feherke. Sed would be fine, too, but it didn't seem to work (probably because my "start" and "stop" strings have spaces and backslashes). Here is a clipping from the input text and I want to pull out all the lines between the:

Backup of "\\PCABC\C: "

and the next occurrence of:

Server -

Basically, I want to be able to pull out the information about each system's backups and email it to the user.

****BEGIN INPUT TEXT SAMPLE (after HTML stripping):
Server -

PCABC
Network control connection is established between 10.0.0.15:1379 <--> 10.0.0.78:10000

Network data connection is established between 10.0.0.15:1383 <--> 10.0.0.78:2106

Set

Information - \\PCABC\C:

Backup

Set Information

Family Name: "Media created on 6/23/2005 at 2:00:01 PM on system PCBAK"
Family Name: "Media created on 6/23/2005 at 2:00:01 PM on system PCBAK"

Backup of "\\PCABC\C: "

Backup set #19 on storage media #8

Backup set description: "Incr. Backup of All Systems since the last backup"

Backup Type: Incremental - Changed Files - Reset Archive Bit

Backup started on 7/20/2005 at 7:14:50 PM.
Backup

Set Detail Information

Unable to open the item \\PCABC\C:\Documents and Settings\gbw.TERNION\Application Data\Adobe\Acrobat\7.0\Updater\udlog.txt - skipped.

WARNING: "\\PCABC\C:\Documents and Settings\abc.TERNION\Local Settings\Application Data\MDaemon GroupWare\Mailboxes\Mailbox.pst" is a corrupt file.

This file cannot verify.

WARNING: "\\PCABC\C:\Documents and Settings\abc.TERNION\Local Settings\Application Data\Microsoft\Outlook\Outlook.pst" is a corrupt file.

This file cannot verify.

Backup completed on 7/20/2005 at 7:15:07 PM.

Backup

Set Summary

Backed up 532 files in 541 directories.

2 corrupt files were backed up

1 item was skipped.

Processed 180505415 bytes in 17 seconds.

Throughput rate: 608 MB/min

Server -

PCDEF

Network control connect.....

****END INPUT TEXT SAMPLE

Thank you again for any help.
Ternion

feherke · Jul 21, 2005

Hi

Ternion said:
it didn't seem to work (probably because my "start" and "stop" strings have spaces and backslashes)

Spaces are no problem; they have no special meaning in regular expressions. But the backslashes must be escaped, even if you search for a substring instead of using a regular expression:

Code:

awk 'index($0,"start"){ok=1} ok{print} index($0,"stop"){ok=0}' inputfile

Could you post the code that does this job?

By the way, have you seen [tt]csplit[/tt]?
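On the escaping point: backslashes need doubling both in awk regular expressions and in awk string literals, and values passed with -v get the same escape processing. One way to sidestep this, sketched below, is to hand the markers to awk through the environment and match them literally with index(). The function name extract_literal and the variable names START and STOP are assumptions, not something from the thread.

```shell
# Print from the first line containing $1 through the next line containing $2.
# The markers are read via ENVIRON, so awk applies no escape processing to them.
extract_literal() {
  START=$1 STOP=$2 awk '
    index($0, ENVIRON["START"]) { ok = 1 }   # literal substring match, no regex
    ok                          { print }
    index($0, ENVIRON["STOP"])  { ok = 0 }   # the stop line itself is still printed
  '
}

# Usage with markers taken from the posted log sample (single quotes keep the backslashes).
printf '%s\n' 'Server -' 'Backup of "\\PCABC\C: "' 'Backup set #19' 'Server -' 'PCDEF' |
  extract_literal 'Backup of "\\PCABC\C: "' 'Server -'
```

With this input the function prints the "Backup of" line, the "Backup set #19" line, and the closing "Server -" line.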

Feherke.

http://rootshell.be/~feherke/

futurelet · Jul 21, 2005

Code:

# Remove html tags.
{ gsub(/<[^>]+>/, "") }

# Print range of lines.
/Backup of "\\\\PCABC\\C: "/,  /^[ \t]*Server -/

Save as "block.awk" and run with

nawk -f block.awk infile >outfile

Under Solaris, use /usr/xpg4/bin/awk.
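As a quick check of the script above, the sketch below wraps the same two rules in a shell function and feeds it a few made-up HTML lines. The function name strip_and_extract and the sample markup are assumptions; the patterns are the ones from the post. Plain awk is used here; substitute nawk or /usr/xpg4/bin/awk as noted above.

```shell
strip_and_extract() {
  awk '
    # Remove tags that open and close on the same line.
    { gsub(/<[^>]+>/, "") }

    # Range pattern with no action: print every line of the range, both ends included.
    /Backup of "\\\\PCABC\\C: "/, /^[ \t]*Server -/
  '
}

# Hypothetical HTML input resembling the log before tag stripping.
printf '%s\n' '<p>Server -</p>' '<b>Backup of "\\PCABC\C: "</b>' \
  '<p>Backup set #19</p>' '<p>Server -</p>' '<p>PCDEF</p>' | strip_and_extract
```

Because gsub() rewrites $0 before the range pattern is tested, the range is matched against the already stripped text.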

feherke · Jul 21, 2005

Hi

futurelet said:
# Remove html tags.
{ gsub(/<[^>]+>/, "") }

But HTML tags can wrap across multiple lines (for example, in the PHP documentation). Do you have a simple solution for this?

Anyway, good tip to use pattern ranges in [tt]awk[/tt]; I had forgotten about them completely.

Feherke.

http://rootshell.be/~feherke/

Ternion · Jul 22, 2005

Thanks everyone for all your help! Feherke, the sed ended up being the simplest way to do what I need. You're right, I just had to escape the backslashes. Here's what I used:

Code:

sed -n '/Backup of "\\\\PCABC/,/Server -/p' logfile.txt

This pulls out just the "PCABC" section.
The logfile.txt was the result of running this little 'C' program to strip the html markup from the BackupExec output logfile:

Code:

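#include <stdio.h>

int main(void)
{
	int bIgnore = 0;	/* nonzero while inside a <...> tag */
	int ch;			/* int, not char, so EOF stays distinct from data bytes */

	/* Test for EOF before processing, so EOF itself is never written out. */
	while ((ch = getc(stdin)) != EOF)
	{
		if (ch == '<')
			bIgnore = 1;
		else if (ch == '>')
			bIgnore = 0;
		else if (bIgnore == 0)
			putc(ch, stdout);
	}
	return 0;
}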

I apologize for all this non-awk stuff in this forum, but I wanted to let everyone know what eventually worked for me.
Thanks again for everyone's help!
Ternion

futurelet · Jul 22, 2005

feherke said:
But HTML tags can wrap across multiple lines (for example, in the PHP documentation). Do you have a simple solution for this?

Code:

# Presumes there won't be an empty line in tag.
# Records will be delimited by empty lines.
BEGIN { RS="" }
# Remove html tags.
{ gsub(/<[^>]+>/, "") }

I think the following will work in any case:

Code:

{
  gsub(/<[^>]+>/, "")
  if (tagopen && (tagopen = !sub(/^.*>/, "")))
    next
  tagopen = sub(/<.*$/, "")
  print
}
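A small way to try the second script on a tag that wraps across lines is sketched below. The function name strip_wrapped_tags and the two-line sample are assumptions; the awk body is the script above with comments added.

```shell
strip_wrapped_tags() {
  awk '
    {
      gsub(/<[^>]+>/, "")                          # tags that open and close on this line
      if (tagopen && (tagopen = !sub(/^.*>/, "")))
        next                                       # still inside a tag: drop the whole line
      tagopen = sub(/<.*$/, "")                    # a tag opens here and continues below
      print
    }
  '
}

# Hypothetical input: an <a ...> tag split over two lines.
printf '%s\n' 'Hello <a' '  href="x">world</a>' | strip_wrapped_tags
```

This prints "Hello " (with a trailing space) and "world" on separate lines: the tag is removed, but the line break that fell inside it is kept.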

feherke · Jul 23, 2005

Hi

In fact, when I wrote "simple solution" I was thinking of something like the [tt]m[/tt] (multiline) modifier in [tt]Perl[/tt]'s regex substitution with [tt]s///[/tt].

But I like your second script, or more precisely its direct use of [tt]sub()[/tt]'s result. My approach would have needed a lot of [tt]if[/tt]s. Thanks.

Feherke.

http://rootshell.be/~feherke/

futurelet · Jul 23, 2005

If you want to remove html tags without doing any other processing:

Code:

BEGIN {RS="<[^>]*>"; ORS=""}
# A constant nonzero pattern with no action is always true, so every record is printed.
8

You must use an Awk that lets RS be more than one character; e.g., mawk or gawk.
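For an awk that only accepts a single-character RS, one possible alternative is to collect the whole input in a variable and remove the tags in one gsub() at the end. This is a sketch under that assumption, not the method from the post; the function name strip_tags_portable is made up, and the whole file is held in memory.

```shell
strip_tags_portable() {
  awk '
    { buf = buf $0 "\n" }            # collect the input, restoring each newline
    END {
      gsub(/<[^>]*>/, "", buf)       # [^>] also matches a newline, so wrapped tags go too
      printf "%s", buf
    }
  '
}

# Hypothetical input: an <a ...> tag split over two lines.
printf '%s\n' 'Hello <a' '  href="x">world</a>' | strip_tags_portable
```

Unlike a line-by-line script, this joins the text on either side of a wrapped tag, so the sample comes out as the single line "Hello world".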

feherke · Jul 23, 2005

Hi

Wow! This is really simple. Good answer, futurelet. [medal]

Feherke.

http://rootshell.be/~feherke/
