Hi,
I'm new to awk, but confident it can do what I want. I just have to tell it the right thing. I hope I can get your help.
I'm trying to load a file with multiple record layouts into my database. I'm using Greenplum (a variant of Postgres), and the loader processes its input based on a specific column layout. A loader can only specify one layout, so I'm trying to use awk to split the file into different streams - one for each record type - that will feed concurrently executing loaders. The solution works as long as I'm using regular files, but because of the variable file size I'd prefer to use named pipes.
When using named pipes, the solution stalls: the loaders run indefinitely and no data actually gets loaded. I think my issue is that the awk program is not closing the named pipes when all of the file data has been read.
Here are the details.
The file has 3 record types, indicated by the first byte of the record. The values are H (header), E (trailer/ending), and 0 (data).
The loaders use a feature called an external table, which specifies the location and layout of the source data. For each named pipe I've created an external table definition; the named pipe is referenced in the location parameter, e.g.
create external table ex_header (record_cd varchar(1), id varchar(20), agency varchar(40), filedate date, filetime time)
location ('gpfdist://localhost:9001/BN.H');
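The detail and trailer external tables are analogous (I've abridged the column lists here; the real defs match each record layout):
create external table ex_detail (record_cd varchar(1))   -- plus the detail columns
location ('gpfdist://localhost:9001/BN.0');
create external table ex_trailer (record_cd varchar(1))  -- plus the trailer columns
location ('gpfdist://localhost:9001/BN.E');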
And the commands:
mkfifo /tmp/BN.H
mkfifo /tmp/BN.0
mkfifo /tmp/BN.E
#start gpfdist
gpfdist -d /tmp/ -p 9001 -l gpfdist.log &  # gpfdist reads whatever file the external table def specifies; it can read different files concurrently
awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN.H"); close("/tmp/BN.0"); close("/tmp/BN.E") }' filename
# for each record in filename, print the record to a file named after the first byte of the record; at END, close the three pipes by their explicit names
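As a sanity check outside the database, I can drain the pipes with cat instead of gpfdist; the awk should write everything and exit cleanly (just a test sketch, assuming all 3 record types are present):
cat /tmp/BN.H > /dev/null &   # stand-in readers for the three loaders
cat /tmp/BN.0 > /dev/null &
cat /tmp/BN.E > /dev/null &
awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN.H"); close("/tmp/BN.0"); close("/tmp/BN.E") }' filename
wait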
In SQL I have:
insert into header select * from ex_header;
insert into detail select * from ex_detail;
insert into trailer select * from ex_trailer;
The 3 inserts run concurrently from different db sessions.
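(Roughly the equivalent of this, assuming psql with default connection settings:)
psql -c "insert into header  select * from ex_header;"  &
psql -c "insert into detail  select * from ex_detail;"  &
psql -c "insert into trailer select * from ex_trailer;" &
wait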
The inserts just sit there and run forever. Eventually I have to kill the inserts, and the awk.
In the awk guide, the entry for close(file|command) says that
"Its value must exactly match the string that was used to open the file or start the command"
so I also tried awk '{ print $0 > ("/tmp/BN." substr($0,1,1)) } END { close("/tmp/BN." substr($0,1,1)) }' filename
with the same results. In this case, I believe the substr won't work, since it executes in END when $0 no longer holds a record. Even if it did, it would only close the file for the last record type.
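One variant I'm considering (untested) is to remember each file name as it's opened and close them all in END, so the close() strings are guaranteed to match what was opened:
awk '{ f = "/tmp/BN." substr($0,1,1); print $0 > f; seen[f] = 1 }   # track every file opened
     END { for (f in seen) close(f) }' filename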
Also, the file is sorted, so the H record is first, followed by the 0 records, then finally the E record.
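Given that ordering, I could even close each pipe as soon as the record type changes instead of holding all three open until END (also untested):
awk '{ t = substr($0,1,1)
       if (t != prev && prev != "") close("/tmp/BN." prev)   # record type changed: close the previous pipe
       print $0 > ("/tmp/BN." t)
       prev = t }
     END { if (prev != "") close("/tmp/BN." prev) }' filename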
Any suggestions would really be appreciated.
doug L