Break a file into multiple files

mrr · May 26, 2012

I have a file with some header records at the top, each starting with "H". I want to split the file into several output files, named after field 2 of the non-H records, and write the header records to the top of every output file.
Everything works the way I want except for printing the H records to each file.

Any help is appreciated.

Here's the data file:
H1
H 2
H 3
H
H 44
GROUP1 AA 1
GROUP1 AA 2
GROUP1 AA 10
GROUP1 BB 1
GROUP1 CC 1
GROUP2 AA 3
GROUP2 AA 4
GROUP3 AA 5
GROUP10 AA 5

and here's my script so far:
BEGIN {
# if (substr($0,1,1) == "H"){print $0 > $2".DAT"}
old=$2
}
{
new=$2
#if (substr($0,1,1) != "H")
{
if (new != old) { print $0 >> $2".DAT"}
if (new == old) { print $0 >> $2".DAT"}
old=new
}
}

feherke · May 26, 2012

Hi

For that sample input how many output files should be created, how should they be called and what should they contain ?

Feherke.
http://feherke.github.com/

mrr · May 26, 2012

Thanks Feherke for responding.
The script would create 3 files - AA.DAT, BB.DAT AND CC.DAT.
Each file contains the records that relate to the field 2 values.

Thanks

feherke · May 26, 2012

Hi

You mean something like this ?

Code:

awk '!/^H/{print>$2".DAT"}' /input/file

Tested with gawk and mawk.
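
For anyone following along, here is a small sketch of what that one-liner does with the sample data from the first post. The directory name split_demo1 and the file name input.txt are made up for the demo. Inside awk, print > file truncates the file only the first time it is opened during the run; later prints to the same name append, which is why every record of a group ends up in its file.

```shell
# Demo only: split_demo1 and input.txt are hypothetical names.
mkdir -p split_demo1
cat > split_demo1/input.txt <<'EOF'
H1
H 2
H 3
H
H 44
GROUP1 AA 1
GROUP1 AA 2
GROUP1 AA 10
GROUP1 BB 1
GROUP1 CC 1
GROUP2 AA 3
GROUP2 AA 4
GROUP3 AA 5
GROUP10 AA 5
EOF

# Skip lines starting with H; write every other line to <field 2>.DAT.
# The parentheses around the file name avoid parsing ambiguity in some awks.
( cd split_demo1 && awk '!/^H/ { print > ($2 ".DAT") }' input.txt )

ls split_demo1
```

With this input the run should leave AA.DAT (seven records), BB.DAT and CC.DAT (one record each) next to the input file, with no H lines in any of them.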

Feherke.

mrr · May 26, 2012

Thanks Feherke.
This works, except I would still like to pass the H records at the beginning of the input file to each of the output files.

Thanks again.

mrr · May 26, 2012

Feherke/others
I changed the code and can now get the "H" records printed at the top of each output file, but now I only get the last record of each group (based on field 2 of the data).
Here is the input file I am using:
H1
H2
H3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 bb 1
GROUP1 bb 2
GROUP2 cc 3
GROUP2 cc 4
GROUP3 dd 5
GROUP10 ee 1
GROUP10 ee 2
GROUP10 ee 3
GROUP1 aa 11
GROUP1 aa 22
GROUP1 aa 100
and here is my script:

FNR==1 {
hdr_count = 0;
while (substr($1,1,1) == "H") {
header[++hdr_count] = $0;
getline;
}
}
substr($1,1,1) != "H" {
if (filename != "") close(filename);
filename = $2 ".dat";
for (h=1; h<=hdr_count; h++)
print header[h] > filename;

}
{
print $0 > filename;
}

Here is the output I am currently getting for aa.dat: only the last record, rather than all records whose field 2 is aa.
H1
H2
H3
H
H 44
GROUP1 aa 100

feherke · May 27, 2012

Hi

mrr said:
The script would create 3 files - AA.DAT, BB.DAT AND CC.DAT.
Each file contains the records that relate to the field 2 values.

From this I understand that you do not want the header lines anywhere.

mrr said:
This works, except I would still like to pass the H records at the beginning of the input file to each of the output files.

From this I understand that you want the header lines everywhere.

To correct your latest code, just remove the call of the close() function.
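
The reason is how awk's > redirection behaves: the file is truncated when it is opened, and close() followed by another print > file opens it again, wiping what was written before. A minimal sketch of the effect (the directory close_demo and the file names trunc.out and keep.out are made up for the demo):

```shell
# Demo only: close_demo, trunc.out and keep.out are hypothetical names.
mkdir -p close_demo
( cd close_demo &&
  printf 'a\nb\nc\n' | awk '
    {
      close("trunc.out")     # closing forces a fresh open on the next print
      print > "trunc.out"    # ">" truncates on every fresh open
      print > "keep.out"     # never closed: truncated once, then appended to
    }'
)
cat close_demo/trunc.out     # only the last record survives
cat close_demo/keep.out      # all three records
```

If a script really does need close(), for example to stay under the open-file limit with very many groups, appending with >> after the first write is the usual way around the truncation.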

But I am not sure whether it will give what you want. I suppose you want the header lines at the beginning of each file, while your code will insert the header lines before every record. So here is what I would do :

Code:

awk '/^H/{h=h$0ORS;next}!f[$2]{printf"%s",h>$2".DAT";f[$2]=1}{print>$2".DAT"}' /input/file

Tested with gawk and mawk.

Feherke.

mrr · May 27, 2012

Thanks Feherke,

This part of my code works properly, with the H records being passed to all output files, but I can't get all of the non-H records to print to each output file:

FNR==1 {
hdr_count = 0;
while (substr($1,1,1) == "H")
{
header[++hdr_count] = $0;
getline;
}
}
substr($1,1,1) != "H"
{
if (filename != "") close(filename);
filename = $2 ".dat";
for (h=1; h<=hdr_count; h++)
print header[h] > filename;
}

taking the input of:
H1
H2
H3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 bb 1
GROUP1 bb 1
ROUP1 aa 11
GROUP1 aa 22
GROUP1 aa 100

I want the output to have 2 files and they would look like:
file aa.dat:
H1
H2
H3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 aa 22
GROUP1 aa 100

and file bb.dat would look like:
H1
H2
H3
H
H 44
GROUP1 bb 1
GROUP1 bb 1

I just can't seem to get the code right to print the data records for each output file.

Thanks again.

feherke · May 27, 2012

Hi

mrr said:
I can't get all of the non-H records to print to each output file

That is why I wrote in my previous post :

Feherke said:
To correct your latest code, just remove the call of the close() function.

Thanks to the output you posted, we got the answer to my next doubt :

Feherke said:
I suppose you want the header lines at the beginning of each file, while your code will insert the header lines before every record.

So your code has one more glitch. It needs an additional array that records which output files have already been opened, so that the headers are written only once, before the first record of each file. (That is what the f array in my latest code is for.)
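
Applied to mrr's own script, the two changes (no close(), plus an array of files already started) could look roughly like this. It is only a sketch: the seen array, the file name split.awk, the directory split_demo3 and the shortened input are names and data picked here, not something from the posts above.

```shell
# Demo only: split_demo3, split.awk and input.txt are hypothetical names.
mkdir -p split_demo3

cat > split_demo3/split.awk <<'EOF'
# Collect the header lines (those starting with H) and move on.
/^H/ {
    header[++hdr_count] = $0
    next
}

# Every other record goes to <field 2>.dat.
{
    filename = $2 ".dat"
    if (!(filename in seen)) {          # first record for this file
        for (h = 1; h <= hdr_count; h++)
            print header[h] > filename  # headers are written once only
        seen[filename] = 1
    }
    print > filename                    # no close(), so nothing is truncated
}
EOF

cat > split_demo3/input.txt <<'EOF'
H1
H2
GROUP1 aa 1
GROUP1 bb 1
GROUP1 aa 2
EOF

( cd split_demo3 && awk -f split.awk input.txt )
cat split_demo3/aa.dat
```

Here aa.dat should hold the two header lines followed by both aa records, and bb.dat the two header lines followed by the single bb record, even though the aa records are not adjacent in the input.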

Feherke.

FlorianAwk · May 27, 2012

Bash:

awk '/GROUP/{print>$2".DAT"}' /input/file
for f in *.DAT; do grep '^H' /input/file | cat - "$f" > tmp; cat tmp > "$f"; done
rm -f tmp

mrr · May 28, 2012

I would like to have this code in a file so I can re-run it with the -f option rather than typing it on the command line.
I now have the code printing the H records properly, but they only appear in the first output file and not in the following ones.

Here's my current code:
{
hdr_count = 0;
while (substr($1,1,1) == "H")
{ header[++hdr_count] = $0
getline
}
{ if (substr($1,1,1) != "H")
if (filename != "") close(filename)
filename = $2 ".dat"
for (h=1; h<=hdr_count; h++)
print header[h] > filename; print $0 >> filename}
}

I can't seem to get it to work if I remove the close function...
Here's my latest test data:
H1
H2
H 3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 bb 1
GROUP1 bb 10
GROUP1 aa 11
GROUP2 aa 22
GROUP2 aa 100

Here's my current output for file aa.dat and it looks good....
H1
H2
H 3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 aa 11
GROUP2 aa 22
GROUP2 aa 100

and here is bb.dat, without the H header records I want to include:

GROUP1 bb 1
GROUP1 bb 10

Thanks to all for the assistance on this...

feherke · May 29, 2012

Hi

mrr said:
I would like to have this code in a file so I can re-run it with the -f option rather than typing it on the command line.

Then put it in a file :

Code:

/^H/{h=h$0ORS;next}!f[$2]{printf"%s",h>$2".dat";f[$2]=1}{print>$2".dat"}

And run it :

Code:

awk -f mrr.awk /input/file

Or add a shebang line and give the file execute permission :

Code:

#!/usr/bin/awk -f

/^H/ {
  h=h $0 ORS
  next
}

! f[$2] {
  printf "%s",h > ($2 ".dat")
  f[$2]=1
}

{
  print > ($2 ".dat")
}

And run it :

Code:

./mrr.awk /input/file

Note that you may need to edit the shebang line in case your awk executable is installed elsewhere.
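
One way to sidestep the path question is to let the shell fill in the shebang. The sketch below is an assumption-laden demo, not part of the original answer: the directory split_demo4 and the file name input.txt are made up, and it assumes awk is on the PATH and the directory allows executing scripts. It builds mrr.awk with whatever path command -v awk reports, makes it executable, and runs it on mrr's latest test data.

```shell
# Demo only: split_demo4 and input.txt are hypothetical names.
mkdir -p split_demo4

# Write the script, using the awk path found on this machine for the shebang.
{
  printf '#!%s -f\n' "$(command -v awk)"
  cat <<'EOF'
/^H/ { h = h $0 ORS; next }                             # collect header lines
!f[$2] { printf "%s", h > ($2 ".dat"); f[$2] = 1 }      # headers once per file
{ print > ($2 ".dat") }                                 # then the record itself
EOF
} > split_demo4/mrr.awk
chmod +x split_demo4/mrr.awk

cat > split_demo4/input.txt <<'EOF'
H1
H2
H 3
H
H 44
GROUP1 aa 1
GROUP1 aa 2
GROUP1 aa 10
GROUP1 bb 1
GROUP1 bb 10
GROUP1 aa 11
GROUP2 aa 22
GROUP2 aa 100
EOF

( cd split_demo4 && ./mrr.awk input.txt )
cat split_demo4/bb.dat
```

With this data bb.dat should now start with the five H lines followed by its two bb records, and aa.dat should hold the same five H lines followed by its six aa records. Running awk -f mrr.awk input.txt instead ignores the shebang line entirely, so it works wherever awk is installed.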

Feherke.
