Hi,
I'm new to awk, but confident it can do what I want. I just have to tell it the right thing. I hope I can get your help.
I'm trying to load a file with multiple record layouts into my database. I'm using Greenplum (a variant of Postgres), and the loader processes its input based on a specific column layout. A loader can only specify one layout, so I'm trying to use awk to split the file into different streams - one for each record type - that will feed concurrently executing loaders. The solution works as long as I'm using regular files, but because of the variable file size I'd prefer to use named pipes.
When using named pipes, the solution stalls: the loaders run indefinitely and no data actually gets loaded. I think my issue is that the awk program is not closing the named pipes when all of the file data has been read.
Here are the details.
The file has 3 record types, indicated by the first byte of the record. The values are H (header), E (trailer/ending), and 0 (data).
The loaders use a feature called an external table, which specifies the location and layout of the source data. For each named pipe I've created an external table definition; the named pipe is referenced in the location parameter, e.g.
create external table ex_header (record_cd varchar(1), id varchar(20), agency varchar(40), filedate date, filetime time)
location ('gpfdist://localhost:9001/BN.H');
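The detail and trailer external tables are analogous (I've abridged the column lists here; the real defs match each record layout):
create external table ex_detail (record_cd varchar(1))   -- plus the detail columns
location ('gpfdist://localhost:9001/BN.0');
create external table ex_trailer (record_cd varchar(1))  -- plus the trailer columns
location ('gpfdist://localhost:9001/BN.E');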
And the commands:
mkfifo /tmp/BN.H
mkfifo /tmp/BN.0
mkfifo /tmp/BN.E
#start gpfdist
gpfdist -d /tmp/ -p 9001 -l gpfdist.log &  # gpfdist reads whatever file the external table def specifies; it can read different files concurrently
awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN.H"); close("/tmp/BN.0"); close("/tmp/BN.E") }' filename
# for each record in filename, print the record to a file named after the first byte of the record; at END, close the three pipes by their explicit names
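As a sanity check outside the database, I can drain the pipes with cat instead of gpfdist; the awk should write everything and exit cleanly (just a test sketch, assuming all 3 record types are present):
cat /tmp/BN.H > /dev/null &   # stand-in readers for the three loaders
cat /tmp/BN.0 > /dev/null &
cat /tmp/BN.E > /dev/null &
awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN.H"); close("/tmp/BN.0"); close("/tmp/BN.E") }' filename
wait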
In SQL I have:
insert into header select * from ex_header;
insert into detail select * from ex_detail;
insert into trailer select * from ex_trailer;
The 3 inserts run concurrently from different db sessions.
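(Roughly the equivalent of this, assuming psql with default connection settings:)
psql -c "insert into header  select * from ex_header;"  &
psql -c "insert into detail  select * from ex_detail;"  &
psql -c "insert into trailer select * from ex_trailer;" &
wait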
The inserts just sit there and run forever. Eventually I have to kill the inserts, and the awk.
In the awk guide, the entry for close(file|command) says that
"Its value must exactly match the string that was used to open the file or start the command"
so I also tried awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN." substr($0,1,1)) }' filename
with the same results. In this case, I believe the substr won't work, since it executes in END when $0 no longer holds a record. Even if it did, it would only close the file for the last record type.
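One variant I'm considering (untested) is to remember each file name as it's opened and close them all in END, so the close() strings are guaranteed to match what was opened:
awk '{ f = "/tmp/BN." substr($0,1,1); print $0 > f; seen[f] = 1 }   # track every file opened
     END { for (f in seen) close(f) }' filename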
Also, the file is sorted, so the H record is first, followed by the 0 records, then finally the E record.
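Given that ordering, I could even close each pipe as soon as the record type changes instead of holding all three open until END (also untested):
awk '{ t = substr($0,1,1)
       if (t != prev && prev != "") close("/tmp/BN." prev)   # record type changed: close the previous pipe
       print $0 > ("/tmp/BN." t)
       prev = t }
     END { if (prev != "") close("/tmp/BN." prev) }' filename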
Any suggestions would really be appreciated.
doug L