Parsing Report with multiple blocks and columns of data

knowlgo · Dec 27, 2004

I have a report that I've been beating my head against a wall trying to parse correctly.

It has variable-length data after colons, and the data is in blocks.

Name: John         Addr1: x way        PBL: Y      ON CD: N
Exp DATE: 000000   Enr Date: 000000    Ded Dtn: 40402
Addr2:             LNAME: Van Risen

etc. . .

Then a new block begins.

I can parse it into separate files using grep or awk (easier with grep), then concatenate and take out the unwanted data. Or I can get an all-inclusive solution with awk, something I've not been able to do given the variable length of spaces AND field names AND data. So $1 and $2 and $3 have little meaning until I can do something with this file.

Any help is appreciated.

kHz · Dec 27, 2004

Example output would be extremely useful.

futurelet · Dec 27, 2004

[tt]
BEGIN {
    _fpos = "1 20 40 52 99"
    split( _fpos, fpos )
}

function f( n, to_end ,s )
{
    if ( to_end )
        s = substr( $0, fpos[n] )
    else
        s = substr( $0, fpos[n], fpos[n+1] - fpos[n] )
    gsub( /[^:]*: | +$/, "", s )
    return s
}

1==NR%3 { name=f(1); addr1=f(2); pbl=f(3); on_cd=f(4) }
2==NR%3 { exp_date=f(1); enr_date=f(2); ded_dtn=f(3,1) }
0==NR%3 { addr2=f(1); lname=f(2)
    print name,lname,addr1,addr2,ded_dtn
}
[/tt]
Let me know whether or not this helps.

If you have nawk, use it instead of awk because on some systems awk is very old and lacks many useful features. Under Solaris, use /usr/xpg4/bin/awk.

For an introduction to Awk, see faq271-5564.

knowlgo · Dec 27, 2004

Sorry,

Desired output is columnar:

Name Address City State
data data data data

futurelet,

I will give that a go!

Thank you.

knowlgo · Dec 27, 2004

I'm not quite sure I understand what that does, especially the fpos function. I ran it exactly the way you posted it, just to see what the output was and modify from there. I changed what appear to be the positional parameters (fpos "1 15. . .") and it seems to repeat the first word from the file. Also, I had to grep the row before I ran it or the output came out garbled. I think this is on the right track; I just don't seem to be moving with the train enough to figure out what I need to change. Output from the file:

CLIENT GRP NBR GRP NBR CLIENT B
CLIENT GRP NBR GRP NBR CLIENT B
CLIENT GRP NBR GRP NBR CLIENT B
CLIENT GRP NBR GRP NBR CLIENT B

That second client is a repeat of the first client, as no other "client" exists in the file. The B is from the very end of each record.

I understand it's definitely outputting good data, I just need to reposition a little. I counted positions in regular and octal bits using od -c but for some reason I'm not positioning correctly.

I hate to ask, but can you walk me through that a little? Specifically the fpos since I wasn't able to reference the function anywhere.

Sorry to take this time from your day.

-knowlgo

futurelet · Dec 27, 2004

This is the input data that you posted:
[tt]
Name: John         Addr1: x way        PBL: Y      ON CD: N
Exp DATE: 000000   Enr Date: 000000    Ded Dtn: 40402
Addr2:             LNAME: Van Risen
[/tt]
We see that "Addr1:" begins in the same column as "Enr Date:" (column 20). _fpos is a string that holds the starting columns of the field names. It is split into array fpos, so that fpos[2] is 20.
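One way to double-check a starting column without counting by hand is awk's index() function, which returns the 1-based position of a substring. The sketch below is only an illustration: it assumes the sample line is padded so that the labels fall at columns 1, 20, 40 and 52, and the variable names line and cols are made up for the example.

```shell
# Rebuild the first sample line with assumed padding (labels at 1, 20, 40, 52).
line=$(printf '%-19s%-20s%-12s%s' 'Name: John' 'Addr1: x way' 'PBL: Y' 'ON CD: N')

# index() reports where each label starts, counting from 1.
cols=$(printf '%s\n' "$line" | awk '{
    print index($0, "Name:"), index($0, "Addr1:"), index($0, "PBL:"), index($0, "ON CD:")
}')
printf '%s\n' "$cols"
```

Running the same index() calls against a line of the real report should show whether the numbers in _fpos need to change.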

The output of
[tt]
awk -f block.awk input.txt
[/tt]
is
[tt]
John Van Risen x way 40402
[/tt]
If the data file is formatted as you showed it, there was no reason to change the 20 to 15.
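To see what f() does with those positions, here is a small sketch that applies the same substr() and gsub() calls to one line and prints each slice in brackets. The padded sample line is an assumption (labels at columns 1, 20, 40 and 52), and the names line and slices exist only for this example.

```shell
# Assumed layout: labels start at columns 1, 20, 40 and 52.
line=$(printf '%-19s%-20s%-12s%s' 'Name: John' 'Addr1: x way' 'PBL: Y' 'ON CD: N')

slices=$(printf '%s\n' "$line" | awk '
BEGIN { split("1 20 40 52 99", fpos) }
{
    for (n = 1; n <= 4; n++) {
        # Cut the text between this start column and the next one.
        s = substr($0, fpos[n], fpos[n+1] - fpos[n])
        # Drop the "Label: " prefix and any trailing spaces.
        gsub(/[^:]*: | +$/, "", s)
        printf "%d:[%s]\n", n, s
    }
}')
printf '%s\n' "$slices"
```

If a start column is off by a few characters, the slice begins in the middle of the previous value or label, which would explain output that repeats fragments of the first field.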

You did not show us a complete block. Therefore, I can't write a program for you that parses a complete block. To get a program that handles your file, you've got to give a complete picture of the file layout; show the first two blocks of the file (with fictional, not actual data); if there is no blank line between blocks, show it that way; if there is 1 blank line show that; if there are 2 blank lines, show them.

[tt]
BEGIN {
    # How many lines per block (record)?
    BL = 3
    _fpos = "1 20 40 52 99"
    split( _fpos, fpos )
    # How to print the data.
    format = "%-25s%-30s%-20s%s\n"
}

function f( n, to_end ,s )
{
    if ( to_end )
        s = substr( $0, fpos[n] )
    else
        s = substr( $0, fpos[n], fpos[n+1] - fpos[n] )
    gsub( /[^:]*: | +$/, "", s )
    return s
}

1==NR%BL { name=f(1); addr1=f(2); pbl=f(3); on_cd=f(4) }
2==NR%BL { exp_date=f(1); enr_date=f(2); ded_dtn=f(3,1) }
0==NR%BL { addr2=f(1); lname=f(2)
    # Last line of block. Print.
    # city and state are not in the 3-line sample, so they print
    # as empty strings until the full block layout is known.
    printf( format, lname ", " name,
        addr1 (addr2? "; " addr2 : ""), city, state )
}
[/tt]

knowlgo · Dec 28, 2004

Gotcha.

I have a cleaned-up version, but the font is coming out too large for this screen. There don't seem to be any size tags in TGML. Suggestions?

PHV · Dec 28, 2004

Simply prefix each line with its line number or suffix each line with \n.
Example1:[tt]
1_Name: John Addr1: x way PBL: Y ON CD: N
2_Exp DATE: 000000 Enr Date: 000000 Ded Dtn: 40402
3_Addr2: LNAME: Van Risen[/tt]
Example2:[tt]
Name: John Addr1: x way PBL: Y ON CD: N\n
Exp DATE: 000000 Enr Date: 000000 Ded Dtn: 40402 \n
Addr2: LNAME: Van Risen \n[/tt]

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

knowlgo · Dec 28, 2004

Ok! Sorry for being such a rookie, guys. There are three header rows followed by a blank line. Everything is marked and numbered. As you may or may not be able to tell, line 9 doesn't line up with the rest of the rows. Thanks a million.

Header
H
H
Space
1 CLIENT XXX XXX: XXXXXXX EFFECTIVE DATE: 20041001 XXX XXX: XXXXX XXX XXX NBR: XXXXXXX PAYSUB REIMB: B
2 NAME: XXXXXXXX XXXXXXX XXXXXXXXXXX EXPIRATION DATE: 00000000 PLAN TYPE: 0000002 RX COVERXXX: Y PAR RATE: 100
3 ADR1: XXX XXXXX XXXXXXX CONTR EFF DATE: 00000000 CALENDAR YEAR: Y DENTAL RX CVG: Y NON-PAR RATE: 100
4 ADR2: SUITE 501 ADMIN HOLD EFF: 00000000 XXX MAX EDIT: 0 ADULT DEP CVG: Y SEND CARD TO: S
5 ADR3: ADMIN HOLD EXP: 00000000 DEPENDENT XXX: 00 STUDENT XXX: 00 NBR OF CARDS: 00
6 ADR4: DED DTN: 164387 4TH QTR ROLLOVER: N DEPENDENTS ON FILE: Y PRINT IND:
7 CITY: CHICAGO MOPS DTN: 164387 MASTER GRP: 01 AUTO ELIG: AID CAT NO: 000
8 STATE: IL ZIP: 60610 ZIP EXT: 0000 PSL DTN: 164387 CLIENT XXX: AUTO DAYS: 0000
9 STD XXX FL: 0 DIS DEP FL: N GRP CVG FL: 03
Space
Next Block Repeating

Ygor · Dec 28, 2004

Seems that each line is labelled and has a fixed format, so you could try something like...

Code:

BEGIN {
          OFS = sprintf("\011")
}
/^NAME/ {
          Name = substr($0, 7, 30)
}
/^ADR1/ {
          Adr1 = substr($0, 7, 30)
}
/^STATE/ {
          State = substr($0, 8, 2)
          Zip = substr($0, 17, 5)
}
/^STD/ {
          print Name, Adr1, State, Zip
}

futurelet · Dec 29, 2004

With this input

CLIENT XXX XXX: XXXXXXX EFFECTIVE DATE: 20041001 XXX XXX: XXXXX XXX XXX NBR: XXXXXXX PAYSUB REIMB: B\n
NAME: THOMAS ROBERT EDWARDS, JR. EXPIRATION DATE: 00000000 PLAN TYPE: 0000002 RX COVERXXX: Y PAR RATE: 100\n
ADR1: 222 SOUTH CENTRAL BLVD. CONTR EFF DATE: 00000000 CALENDAR YEAR: Y DENTAL RX CVG: Y NON-PAR RATE: 100\n
ADR2: SUITE 501 ADMIN HOLD EFF: 00000000 XXX MAX EDIT: 0 ADULT DEP CVG: Y SEND CARD TO: S\n
ADR3: ADMIN HOLD EXP: 00000000 DEPENDENT XXX: 00 STUDENT XXX: 00 NBR OF CARDS: 00\n
ADR4: DED DTN: 164387 4TH QTR ROLLOVER: N DEPENDENTS ON FILE: Y PRINT IND:\n
CITY: CHICAGO MOPS DTN: 164387 MASTER GRP: 01 AUTO ELIG: AID CAT NO: 000\n
STATE: IL ZIP: 60610 ZIP EXT: 0000 PSL DTN: 164387 CLIENT XXX: AUTO DAYS: 0000\n
STD XXX FL: 0 DIS DEP FL: N GRP CVG FL: 03 \n

the output is
[tt]
Name                     Address                       City               State
THOMAS ROBERT EDWARDS, J 222 SOUTH CENTRAL BLVD.;SUITE CHICAGO            IL
[/tt]

Program:
[tt]
BEGIN {
    # How to print the data.
    format = "%-24.24s %-29.29s %-18.18s %s\n"
    printf format, "Name", "Address", "City", "State"
}

/^NAME:/ { name=get() }

/^ADR[1-4]:/ { addr = addr (addr?";":"") get() }

/^CITY:/ { city=get() }

/^STATE:/ { state=substr($0,8,2)
    zip=substr($0,17,5)
    printf format, name, addr, city, state
}

function get()
{
    return trim( substr($0,7,32) )
}
function trim(s)
{
    sub( /[ \t]+$/, "", s )
    return s
}
[/tt]
If you want output more than 80 columns wide, you can change the numbers in the line

format = "%-24.24s %-29.29s %-18.18s %s\n"
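As a rough illustration of what those numbers do: in "%-24.24s" the first number is the minimum field width (left-justified because of the minus sign) and the number after the dot is the maximum number of characters printed. The widths 10 and 30 and the names narrow and wide below are arbitrary choices for this sketch.

```shell
# Width 10, precision 10: the value is cut off after 10 characters.
narrow=$(awk 'BEGIN { printf "%-10.10s|%s\n", "THOMAS ROBERT EDWARDS", "IL" }')

# Width 30, precision 30: the whole value fits and is padded with spaces.
wide=$(awk 'BEGIN { printf "%-30.30s|%s\n", "THOMAS ROBERT EDWARDS", "IL" }')

printf '%s\n%s\n' "$narrow" "$wide"
```

Raising both numbers for a column widens it and stops long names or addresses from being truncated.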

futurelet · Dec 29, 2004

Correction:
[tt]
BEGIN {
    # How to print the data.
    format = "%-24.24s %-29.29s %-18.18s %s\n"
    printf format, "Name", "Address", "City", "State"
}

/^NAME:/ { name=get(); addr="" }

/^ADR[1-4]:/ {
    if ( get() )
        addr = addr (addr?";":"") get()
}

/^CITY:/ { city=get() }

/^STATE:/ { state=substr($0,8,2)
    zip=substr($0,17,5)
    printf format, name, addr, city, state
}

function get()
{
    return trim( substr($0,7,32) )
}
function trim(s)
{
    sub( /[ \t]+$/, "", s )
    return s
}
[/tt]
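The two changes are that addr is cleared at each NAME: line, so one block's address does not carry into the next, and that empty ADR lines are skipped, so no stray semicolons are appended. A cut-down sketch of just that logic follows; the two blocks of input are invented, values are assumed to start in column 7, and the "|" separator and the names report and joined are for illustration only.

```shell
# Two made-up blocks; every value starts in column 7.
report=$(printf '%s\n' \
  'NAME: ANN EXAMPLE' \
  'ADR1: 1 FIRST ST' \
  'ADR2: SUITE 5' \
  'ADR3:' \
  'NAME: BOB SAMPLE' \
  'ADR1: 2 SECOND AVE' \
  'ADR2:')

joined=$(printf '%s\n' "$report" | awk '
function get(   s) { s = substr($0, 7, 32); sub(/[ \t]+$/, "", s); return s }
/^NAME:/     { if (name != "") print name "|" addr
               name = get(); addr = "" }     # reset the address per block
/^ADR[1-4]:/ { v = get()
               if (v != "") addr = addr (addr ? ";" : "") v }  # skip empty lines
END          { if (name != "") print name "|" addr }
')
printf '%s\n' "$joined"
```

Without the reset, the second line of output would start with the first block's address; without the empty check, each blank ADR line would add a trailing semicolon.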

knowlgo · Dec 31, 2004

Thanks for your assistance on this guys. I haven't had a chance to run it, I got sidetracked on another project. As soon as I know if it works I'll let you know.
