awk I think!!

unixguy303 · Jan 29, 2007

I have a large file with multiple header, detail, and trailer records.

I need to split the file into smaller files, say 4 parts.

Each file must start with a header and end with a trailer.

There can be 1 or more detail records between a header and its trailer.

e.g.:

h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321

any ideas?

Thanks
Brandt

PHV · Jan 29, 2007

What have you tried so far?

Hope This Helps, PH.

unixguy303 · Jan 29, 2007

olded · Jan 29, 2007

Use the split command to break your file into the required files:

man split

For example, if bigfile is 8 lines long:

Code:

split -l 2 bigfile

This creates four files: xaa, xab, xac, and xad.

Then, for each file append a header to a temp file, append the file in question to the temp file, and, finally, append the trailer to the temp file.

unixguy303 · Jan 29, 2007

split won't work, because it would not know which record to split on. Each record group can be variable length (multiple detail records). A split must occur between a trailer record and a header record.

h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
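
To make the constraint concrete, here is a minimal sketch that lists the line numbers after which a split is allowed, i.e. every trailer line that is immediately followed by a header. It assumes, as in the sample, that headers start with "h" and trailers with "t". The file name infile and the one-letter records are stand-ins for illustration, laid out in the same 3/4/5/3/4/5 line groups as the sample above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Stand-in data: same group sizes as the sample, with one-letter records.
printf '%s\n' h d t  h d d t  h d d d t  h d t  h d d t  h d d d t > infile

# Remember the previous line; when it was a trailer and the current line
# is a header, the previous line number is a legal place to split.
cuts=$(awk 'prev ~ /^t/ && /^h/ { print NR - 1 } { prev = $0 }' infile)
printf '%s\n' "$cuts"
```

For this layout the reported line numbers should be 3, 7, 12, 15 and 19. The final trailer (line 24) is not listed because no header follows it.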

feherke · Jan 30, 2007

Hi

It is not clear to me how you know which line is a trailer. Here I assumed that it begins with the letter "t". You can give the number of parts as the value of the s parameter. You must test it before use; I have only done some basic tests.

Code:

awk -v s=parts 'BEGIN{f=0}NR==1{c="wc -l<"FILENAME;c|getline l;close(c)}{b=int(NR/(l/s));print $0>FILENAME"."f;if(f!=b){while($0!~/^t/){getline;print $0>FILENAME"."f};close(FILENAME"."f);f=b}}' /input/file

Tested with gawk.

Note that all of that is one line.

Feherke.

http://rootshell.be/~feherke/

unixguy303 · Jan 31, 2007

Feherke,

I don't have a clue what you are attempting to show me here.
Could you please explain so I might be able to get something working? Do you have any questions? I really need to get this!!

Thanks in advance
Brandt

Ygor · Jan 31, 2007

All you need to do is match your header records to a pattern, e.g. assuming all header records begin with "h"...

Code:

$ awk '/^h/{close(f); f=sprintf("outfile.%03d",++n)}{print $0 > f}' infile
$ head outfile.*
==> outfile.001 <==
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321

==> outfile.002 <==
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
:
etc.

... or use csplit instead.
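
If one file per record group is too fine-grained, the same awk idea can be extended to put several groups into each output file. This is only a rough sketch, assuming headers start with "h" and the input begins with a header. The variable per (groups per output file), the counter groups, and the sample data are made up for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Build six record groups with 1, 2, 3, 1, 2, 3 detail records.
for n in 1 2 3 1 2 3; do
  echo "h1234567890 1234567890 1234567890"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "dqwertyuiop qwertyuiop qwertyuiop"
    i=$((i + 1))
  done
  echo "t0987654321 0987654321 0987654321"
done > infile

awk -v per=2 '
  /^h/ {                            # a header starts a new record group
    if (groups % per == 0) {        # every "per" groups, switch output file
      if (f != "") close(f)
      f = sprintf("outfile.%03d", ++n)
    }
    groups++
  }
  { print $0 > f }                  # f is a single variable, so no parentheses are needed
' infile

wc -l outfile.*
```

With per=2 and six groups this should leave three files, each starting with a header and ending with a trailer.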

Ygor · Jan 31, 2007

Syntax for csplit is...

Code:

$ csplit -f 'outfile.' -n 3 infile '/^h/' '{*}'
0
102
136
170
102
136
170
$ head outfile.*
==> outfile.000 <==

==> outfile.001 <==
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321

==> outfile.002 <==
h1234567890 1234567890 1234567890
dqwertyuiop qwertyuiop qwertyuiop
dqwertyuiop qwertyuiop qwertyuiop
t0987654321 0987654321 0987654321
:
etc
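
Note that outfile.000 is empty here because the input starts with a header, so the piece before the first match has no lines. Below is a small sketch of cleaning that up, assuming a csplit that accepts '{*}' as in the command above. GNU csplit also has a -z option that skips empty pieces. The short sample records are illustrative only.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Three record groups with short stand-in records.
printf '%s\n' h1 d1 t1  h2 d2 d2 t2  h3 d3 t3 > infile

# -s suppresses the byte counts that csplit normally prints.
csplit -s -f 'outfile.' -n 3 infile '/^h/' '{*}'

# Drop the first piece if it turned out empty.
[ -s outfile.000 ] || rm -f outfile.000

ls outfile.*
```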

feherke · Feb 1, 2007

Hi

Brandt said:
Could you please explain so I might be able to get something working.

Code:

awk -v s=parts '            # set the number of desired parts

BEGIN {                     # do it before input processing
  f=0                       # initialize the part number
}

NR==1 {                     # when processing the first line
  c="wc -l<"FILENAME        # compose shell command to get line count
  c|getline l               # execute the command and store the result
  close(c)                  # close the pipe to the command
}

{
  b=int(NR/(l/s))           # calculate to which part belongs this line
  print $0>FILENAME"."f     # write the line to the fth part file
  if (f!=b) {               # if the calculated part is not the real one
    while ($0!~/^t/) {      # repeat while the line is not a trailer
      getline               # read the next input line
      print $0>FILENAME"."f # write the line to the fth part file
    }
    close(FILENAME"."f)     # close the fth part file
    f=b                     # step forward to the calculated part
  }
}

' /input/file

### variables :
# b - calculated part number of the line
# c - command to get the number of input lines
# f - currently writing part file's number
# l - total number of lines in the input file
# s - desired number of resulted parts
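
For comparison, here is a different way to get roughly equal parts that only break after a trailer. It is not the script above but a sketch of an alternative: the input file is named twice, so the first pass can count lines without calling wc and the second pass writes the parts. The names parts, total, target, part and out, and the one-letter sample records, are assumptions for illustration. Trailers are assumed to start with "t".

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Stand-in data: six record groups of 3, 4, 5, 3, 4 and 5 lines.
printf '%s\n' h d t  h d d t  h d d d t  h d t  h d d t  h d d d t > infile

awk -v parts=4 '
  NR == FNR { total++; next }       # first pass: only count the lines
  FNR == 1 {                        # second pass begins
    target = total / parts          # ideal number of lines per part
    part = 1
    out = FILENAME "." part
  }
  { print $0 > out }                # out is a single variable
  /^t/ && part < parts && FNR >= target * part {
    close(out)                      # this trailer ends the current part
    part++
    out = FILENAME "." part
  }
' infile infile

wc -l infile.*
```

A part is closed only on a trailer line, so each part ends with a trailer and the next one starts with the following header. The parts are only approximately equal in size, since a record group is never cut in the middle.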

Feherke.


unixguy303 · Feb 1, 2007

Feherke,

Thanks. This looks great, and I believe your logic is what I need.

My question now is about your use of "FILENAME": is this supposed to be a variable?

How does this line work if FILENAME is not a variable? Or are you looking for the literal file name of my file?

c="wc -l<"FILENAME # compose shell command to get line count

Thanks again!!!!

Brandt

feherke · Feb 1, 2007

Hi

The FILENAME built-in variable holds the name of the current input file. It is maintained by awk just like the other built-in variables. I think this will explain it better:

Code:

master # cat letter.txt
a
b
c

master # cat number.txt
1
2
3

master # awk '{print "line "FNR" of file "FILENAME" : "$0}' letter.txt number.txt
line 1 of file letter.txt : a
line 2 of file letter.txt : b
line 3 of file letter.txt : c
line 1 of file number.txt : 1
line 2 of file number.txt : 2
line 3 of file number.txt : 3
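
A related detail: NR keeps counting across all input files, while FNR restarts at 1 for each one. Here is a small sketch using the same two sample files, recreated in a scratch directory so the snippet stands alone. The variable name report is just for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c > letter.txt
printf '%s\n' 1 2 3 > number.txt

# FNR == 1 is true on the first line of every input file.
report=$(awk 'FNR == 1 { print "now reading " FILENAME " (NR=" NR ")" }' letter.txt number.txt)
printf '%s\n' "$report"
# now reading letter.txt (NR=1)
# now reading number.txt (NR=4)
```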

Feherke.


unixguy303 · Feb 1, 2007

Thanks!! I did not know that one!!!

unixguy303 · Feb 1, 2007

Here is what I get when I run it.

Again Thanks!!

# ./foo
awk: Syntax error
at line 15 of program << # se ... >>
context is
print >>> $0>FILENAME"." <<< f # write the line to the fth part file
awk: illegal statement
at line 15 of program << # se ... >>
#
# pg foo
### variables :
# b - calculated part number of the line
# c - command to get the number of input lines
# f - currently writing part file's number
# l - total number of lines in the input file
# s - desired number of resulted parts

awk -v s=4 ' # set the number of desired parts

BEGIN { # do it before input processing
f=0 # initialize the part number
}

NR==1 { # when processing the first line
c="wc -l<"FILENAME # compose shell command to get line count
c|getline l # execute the command and store the result
close(c) # close the pipe to the command
}

{
b=int(NR/(l/s)) # calculate to which part belongs this line
print $0>FILENAME"."f # write the line to the fth part file
if (f!=b) { # if the calculated part is not the real one
while ($0!~/^t/) { # repeat while the line is not a trailer
getline # read the next input line
print $0>FILENAME"."f # write the line to the fth part file
}
close(FILENAME"."f) # close the fth part file
f=b # step forward to the calculated part
}
}

' ./mis07026_pad.txt

feherke · Feb 1, 2007

Hi

Try enclosing the filename expressions in parentheses ( () ).

Code:

print $0>(FILENAME"."f)
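
The likely reason is that an unparenthesized concatenation after > is ambiguous in the awk grammar, and not every awk parses it the same way; gawk accepts it, while the awk used above reports a syntax error. Parenthesizing the target avoids the question. A minimal sketch with a made-up three-line input:

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' h1 d1 t1 > infile

awk '{
  f = 1
  print $0 > (FILENAME "." f)   # parentheses make the whole expression the file name
}' infile

cat infile.1
```

The same applies to the close() call only in the sense that its argument must be the identical string; inside close(...) the concatenation is already enclosed by the call's own parentheses.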

Feherke.

