NAWK Performance


Dagon

I'm trying to replace a convoluted piece of processing, which involves reading the same file 6 times, with a single nawk script.

However, I'm having severe performance problems with the first stage of the processing. This involves cutting a set of fields from the file using a comma-delimited list supplied as a parameter.

Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    outputline=""
    for (i=1; i <= no_fields; i++)
    {
       outputline=outputline $columnList[i] FS
    }
    printf("%s\n", outputline) >> outputFile
}

I call the command using:

Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"

nawk -F"|" -f decomposeFile.awk -v inColumnList=$_columnList -v outputFile=$_outputFile $_inputFile

The script takes a good two hours to run on a million row file (compared to about 10 minutes using something like "cut" to achieve the same effect). If I hard-code the column list using:

Code:
{
   printf("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", $3,$16,$17,$18,$19,$20,$21,$22,$25,$26,$29,$30,$33,$41,$
44,$45,$48,$52,$53,$54,$55,$57,$58,$59,$61,$63,$64,$66,$67,$70,$71,$72,$73,$74,$75,$76,$77,$81,$82,$84,$85,$86,$87,$88,$
89,$90,$91,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$
118,$119,$120,$121,$122,$123,$124,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141,$142,$
143,$144,$145,$146,$147,$148,$149,$150,$151,$152,$153,$154,$155,$156,$157,$163,$164,$165,$166,$167,$168,$170,$171,$172,$
174,$175,$176,$177,$181,$184,$191,$192,$196,$198,$205,$206,$208,$220,$223,$225,$226,$228) >> outputFile
}

it takes around ten minutes. So it seems to be the looping and variable concatenation which is slow.

Anyone got any ideas how I can speed this up? Or would I be better off using something like Perl to do this? I am on Solaris 5.8.
 
Hi

Try removing the outputline variable. Doing that certainly speeds up JavaScript, but I am not sure about [tt]awk[/tt] (even less about [tt]nawk[/tt]).
Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    ORS=","
    for (i=1; i <= no_fields; i++)
    {
       print $columnList[i] >> outputFile
    }
    ORS="\n"
    print "" >> outputFile
}
Another thing I would try is to let the shell handle the file writing:
Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    ORS=","
    for (i=1; i <= no_fields; i++)
    {
       print $columnList[i]
    }
    ORS="\n"
    print ""
}
And run it as
Code:
nawk -F"|" -f decomposeFile.awk -v inColumnList=$_columnList $_inputFile > $_outputFile
Perl compiles the code and runs the resulting bytecode, which speeds it up considerably. I read that some [tt]awk[/tt] interpreters do the same, but I do not remember which. :-(

Feherke.
 
String concatenation is one of those actions which is easy to code at a high level but actually quite difficult at the microcode level. I strongly suspect that this is where your problem lies. As such, changing the language to, for example, Perl may not make that much difference. Having said that, you could try something like
Code:
#!/usr/bin/perl -w
use strict;
my @cols = split /,/, $ENV{'_columnList'};
while (<>)
  {
  my @bits = split /\s+/;
  foreach (@cols) { print $bits[$_], " "; }
  print "\n";
  }
And then run
Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"
./perlscriptname input_file > output_file
Note that in Perl the fields from split are indexed from 0, NOT 1, so the numbers in your column list (which match awk's 1-based $N) will be off by one in this script. I don't honestly know if the repeated calls to print will be any faster than the string concatenation.
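A minimal sketch of that adjustment (an editorial illustration, keeping the whitespace split from the script above): shifting each number by one when the list is split lets the same 1-based column list drive Perl's 0-based arrays.
Code:
#!/usr/bin/perl -w
use strict;
# Convert the awk-style 1-based column numbers into 0-based Perl indices once.
my @cols = map { $_ - 1 } split /,/, $ENV{'_columnList'};
while (<>)
  {
  my @bits = split /\s+/;
  print join(" ", @bits[@cols]), "\n";
  }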

Ceci n'est pas une signature
Columb Healy
 
Thanks for your help, but I already tried both of those things and the performance was every bit as bad. The original version of the script printed out each element separately (although I used printf rather than print and "ORS"). I also tried removing the output file from the script and putting redirection on the command line (although that's not really an option, because I need to write out more than one file).

 
Hmm... It looks like feherke and I both came to the same general solution - replacing the string concatenation with repeated calls to print. feherke, as ever, is the awk guru - I lean towards Perl.

Ceci n'est pas une signature
Columb Healy
 
You say that 'cut' works quickly. Is there a reason why you can't use
Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"
cut -d " " -f $_columnList < inFile > outFile

Ceci n'est pas une signature
Columb Healy
 
This is only the first stage of the processing the job has to do. It then has to break the file into 4 different files using search strings and one of those then has to be subpartitioned into 6 separate files so that it can be loaded into Oracle using parallel streams. At the moment this is all being done using a series of different utilities like cut, egrep and split. Each one incurs a new read of the file, so my plan was to merge it into a single program that would do it all in one pass.

Unfortunately, nawk doesn't seem to be fast enough for the job. I'm presently trying out Perl, but it's a bit painful because my experience of it is limited to some exercises from an online course about four years ago.
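As a rough sketch of that one-pass idea: the file names and the _outputDir variable below are made-up placeholders (the real search strings and target files aren't shown in the thread), and the "H"/"T" tests follow the record-type convention that appears further down. Perl can keep all the output filehandles open and route each record as it is read:
Code:
#!/usr/bin/perl -w
use strict;
# Sketch only: header.dat, trailer.dat, body.dat and _outputDir are
# hypothetical placeholders for the real target files.
my @cols = map { $_ - 1 } split /,/, $ENV{'_columnList'};
my $dir  = $ENV{'_outputDir'} || ".";

open(HDR,  "> $dir/header.dat")  or die "header.dat: $!";
open(TRL,  "> $dir/trailer.dat") or die "trailer.dat: $!";
open(BODY, "> $dir/body.dat")    or die "body.dat: $!";

while (my $line = <>)
  {
  chomp $line;
  my @bits = split /\|/, $line;

  # route whole header/trailer records to their own files
  if    ($bits[0] eq "H") { print HDR "$line\n"; next; }
  elsif ($bits[0] eq "T") { print TRL "$line\n"; next; }

  # cut the wanted columns from detail records in the same pass
  print BODY join("|", map { defined $_ ? $_ : "" } @bits[@cols]), "\n";
  }

close(HDR); close(TRL); close(BODY);
With _columnList exported as before, it would be run as something like ./onepass.pl $_inputFile (the script name is hypothetical), so the whole split happens in a single read of the input.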
 
Can you help me out with this? I can't get the split of the file on the "|" (pipe) delimiter to work. I've written it out more fully so I can understand what is going on better. So far I've got:

Code:
#!/usr/bin/perl -w
open(INPUT, $ENV{'_inputFile'});
@cols = split(/,/, $ENV{'_columnList'});

while ($line = <INPUT>)
{
  print $line;
  @bits = split(/|/, $line);
  if ($bits[0] eq "H")
  {

  }
  elsif ($bits[0] eq "T")
  {
  }
  else
  {
    foreach $col (@cols)
    { 
      $field=$col-1;
      print $bits[$field], "|"; 
    } 
    print "\n";
  }
}
close(INPUT);

But the second split command just isn't working. It puts everything in $bits[0]. Do I need to escape the pipe or something like that?
 
Just tried it and it seems to be working:

@bits = split(/\|/, $line);
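For reference, the backslash matters because the vertical bar is a regex alternation metacharacter; escaping it, or wrapping it in a character class, makes split treat it literally (a tiny self-contained illustration with a made-up record):
Code:
my $line = "H|2|3";
my @bits = split(/\|/, $line);     # backslash-escape the metacharacter
my @same = split(/[|]/, $line);    # or neutralise it in a character class
print "@bits\n";                   # prints: H 2 3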
 
It seems to be running better than nawk. It got about half-way through in 10 minutes before falling over with:

Use of uninitialized value at ./perlvers.pl line 18, <INPUT> chunk 943793.

I think it may be because some of the lines don't have the full complement of fields, e.g. they've been truncated. Is there a way to check whether an array element exists before printing it out, and print out nothing if it doesn't, i.e. something like the ${var} syntax in shell?
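Perl's defined() is the usual way to do that: it tests whether the element exists before it is used. A tiny illustration with a made-up, truncated record:
Code:
#!/usr/bin/perl -w
# defined() guards against the "uninitialized value" warning when a
# truncated line has fewer fields than the column list expects.
my @bits = split(/\|/, "H|only|three");
foreach my $i (0 .. 5)
  {
  print defined($bits[$i]) ? $bits[$i] : "", "|";
  }
print "\n";    # prints: H|only|three||||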
 
Try
Code:
#!/usr/bin/perl -w
open(INPUT, $ENV{'_inputFile'});
@cols = split(/,/, $ENV{'_columnList'});

while (<INPUT>)
{
  print;
  @bits = split(/\|/);
   
  $bits[0] eq 'H' and next;
  $bits[0] eq 'T' and next;
  
  foreach $col (@cols)
    {
    defined $bits[$col-1] and print $bits[$col-1], "|";
    }
  print "\n";
  }
close(INPUT);

Ceci n'est pas une signature
Columb Healy
 