NAWK Performance


Dagon

I'm trying to replace a convoluted piece of processing, which involves reading the same file 6 times, with a single nawk script.

However, I'm having severe performance problems with the first stage of the processing. This involves cutting a set of fields from the file using a comma-delimited list supplied as a parameter.

Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    outputline=""
    for (i=1; i <= no_fields; i++)
    {
       outputline=outputline $columnList[i] FS
    }
    printf("%s\n", outputline) >> outputFile
}

I call the command using:

Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"

nawk -F"|" -f decomposeFile.awk -v inColumnList=$_columnList -v outputFile=$_outputFile $_inputFile

The script takes a good two hours to run on a million row file (compared to about 10 minutes using something like "cut" to achieve the same effect). If I hard-code the column list using:

Code:
{
   printf("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,
%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", $3,$16,$17,$18,$19,$20,$21,$22,$25,$26,$29,$30,$33,$41,$
44,$45,$48,$52,$53,$54,$55,$57,$58,$59,$61,$63,$64,$66,$67,$70,$71,$72,$73,$74,$75,$76,$77,$81,$82,$84,$85,$86,$87,$88,$
89,$90,$91,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$
118,$119,$120,$121,$122,$123,$124,$126,$127,$128,$129,$130,$131,$132,$133,$134,$135,$136,$137,$138,$139,$140,$141,$142,$
143,$144,$145,$146,$147,$148,$149,$150,$151,$152,$153,$154,$155,$156,$157,$163,$164,$165,$166,$167,$168,$170,$171,$172,$
174,$175,$176,$177,$181,$184,$191,$192,$196,$198,$205,$206,$208,$220,$223,$225,$226,$228) >> outputFile
}

it takes around ten minutes. So it seems to be the looping and variable concatenation which is slow.

Anyone got any ideas how I can speed this up? Or would I be better off using something like Perl to do this? I am on Solaris 5.8.
 
Hi

Try removing the outputline variable. Doing that certainly speeds up JavaScript, but I am not sure about [tt]awk[/tt] (even less about [tt]nawk[/tt]).
Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    ORS=","
    for (i=1; i <= no_fields; i++)
    {
       print $columnList[i] >> outputFile
    }
    ORS="\n"
    print "" >> outputFile
}
Another thing I would try is to let the shell handle the file writing:
Code:
BEGIN {
    no_fields=split(inColumnList, columnList, ",")
}

{
    ORS=","
    for (i=1; i <= no_fields; i++)
    {
       print $columnList[i]
    }
    ORS="\n"
    print ""
}
And run it as
Code:
nawk -F"|" -f decomposeFile.awk -v inColumnList=$_columnList $_inputFile > $_outputFile
Perl compiles the code and runs the resulting bytecode, which speeds it up considerably. I read that some [tt]awk[/tt] interpreters do the same, but I do not remember which. :-(

Feherke.
 
String concatenation is one of those actions which is easy to code at a high level but actually quite difficult at the microcode level. I strongly suspect that this is where your problem lies. As such, changing the language to, for example, Perl may not make that much difference. Having said that, you could try something like
Code:
#!/usr/bin/perl -w
use strict;
my @cols = split /,/, $ENV{'_columnList'};
while (<>)
  {
  my @bits = split /\s+/;
  foreach (@cols) { print $bits[$_], " "; }
  print "\n";
  }
And then run
Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"
./perlscriptname input_file > output_file
Note that in Perl the fields from split are indexed from 0, NOT 1, so the numbers in your column list (which match awk's 1-based $N) will be off by one in this script. I don't honestly know if the repeated calls to print will be any faster than the string concatenation.
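A minimal sketch of that adjustment (an editorial illustration, keeping the whitespace split from the script above): shifting each number by one when the list is split lets the same 1-based column list drive Perl's 0-based arrays.
Code:
#!/usr/bin/perl -w
use strict;
# Convert the awk-style 1-based column numbers into 0-based Perl indices once.
my @cols = map { $_ - 1 } split /,/, $ENV{'_columnList'};
while (<>)
  {
  my @bits = split /\s+/;
  print join(" ", @bits[@cols]), "\n";
  }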

Ceci n'est pas une signature
Columb Healy
 
Thanks for your help, but I already tried both of those things and the performance was every bit as bad. The original version of the script printed out each element separately (although I used printf rather than print and "ORS"). I also tried removing the output file from the script and putting redirection on the command line (although that's not really an option, because I need to write out more than one file).

 
Hmm... It looks like feherke and I both came to the same general solution - replacing the string concatenation with repeated calls to print. feherke, as ever, is the awk guru - I lean towards Perl.

Ceci n'est pas une signature
Columb Healy
 
You say that 'cut' works quickly. Is there a reason why you can't use
Code:
export _columnList="3,16,17,18,19,20,21,22,25,26,29,30,33,41,44,45,48,52,53,54,55,57,58,59,61,63,64,66,67,70,71,72,73,74
,75,76,77,81,82,84,85,86,87,88,89,90,91,93,94,95,96,97,98,99,100,101,102,103,104,107,108,109,110,111,112,113,114,115,116
,117,118,119,120,121,122,123,124,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147
,148,149,150,151,152,153,154,155,156,157,163,164,165,166,167,168,170,171,172,174,175,176,177,181,184,191,192,196,198,205
,206,208,220,223,225,226,228"
cut -d " " -f $_columnList < inFile > outFile

Ceci n'est pas une signature
Columb Healy
 
This is only the first stage of the processing the job has to do. It then has to break the file into 4 different files using search strings and one of those then has to be subpartitioned into 6 separate files so that it can be loaded into Oracle using parallel streams. At the moment this is all being done using a series of different utilities like cut, egrep and split. Each one incurs a new read of the file, so my plan was to merge it into a single program that would do it all in one pass.

Unfortunately, nawk doesn't seem to be fast enough for the job. I'm presently trying out Perl, but it's a bit painful because my experience of it is limited to some exercises from an online course about four years ago.
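As a rough sketch of that one-pass idea: the file names and the _outputDir variable below are made-up placeholders (the real search strings and target files aren't shown in the thread), and the "H"/"T" tests follow the record-type convention that appears further down. Perl can keep all the output filehandles open and route each record as it is read:
Code:
#!/usr/bin/perl -w
use strict;
# Sketch only: header.dat, trailer.dat, body.dat and _outputDir are
# hypothetical placeholders for the real target files.
my @cols = map { $_ - 1 } split /,/, $ENV{'_columnList'};
my $dir  = $ENV{'_outputDir'} || ".";

open(HDR,  "> $dir/header.dat")  or die "header.dat: $!";
open(TRL,  "> $dir/trailer.dat") or die "trailer.dat: $!";
open(BODY, "> $dir/body.dat")    or die "body.dat: $!";

while (my $line = <>)
  {
  chomp $line;
  my @bits = split /\|/, $line;

  # route whole header/trailer records to their own files
  if    ($bits[0] eq "H") { print HDR "$line\n"; next; }
  elsif ($bits[0] eq "T") { print TRL "$line\n"; next; }

  # cut the wanted columns from detail records in the same pass
  print BODY join("|", map { defined $_ ? $_ : "" } @bits[@cols]), "\n";
  }

close(HDR); close(TRL); close(BODY);
With _columnList exported as before, it would be run as something like ./onepass.pl $_inputFile (the script name is hypothetical), so the whole split happens in a single read of the input.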
 
Can you help me out with this? I can't get the split of the file on the "|" (pipe) delimiter to work. I've written it out more fully so I can understand what is going on better. So far I've got:

Code:
#!/usr/bin/perl -w
open(INPUT, $ENV{'_inputFile'});
@cols = split(/,/, $ENV{'_columnList'});

while ($line = <INPUT>)
{
  print $line;
  @bits = split(/|/, $line);
  if ($bits[0] eq "H")
  {

  }
  elsif ($bits[0] eq "T")
  {
  }
  else
  {
    foreach $col (@cols)
    { 
      $field=$col-1;
      print $bits[$field], "|"; 
    } 
    print "\n";
  }
}
close(INPUT);

But the second split command just isn't working. It puts everything in $bits[0]. Do I need to escape the pipe or something like that?
 
Just tried it and it seems to be working:

@bits = split(/\|/, $line);
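For reference, the backslash matters because the vertical bar is a regex alternation metacharacter; escaping it, or wrapping it in a character class, makes split treat it literally (a tiny self-contained illustration with a made-up record):
Code:
my $line = "H|2|3";
my @bits = split(/\|/, $line);     # backslash-escape the metacharacter
my @same = split(/[|]/, $line);    # or neutralise it in a character class
print "@bits\n";                   # prints: H 2 3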
 
It seems to be running better than nawk. It got about half-way through in 10 minutes before falling over with:

Use of uninitialized value at ./perlvers.pl line 18, <INPUT> chunk 943793.

I think it may be because some of the lines don't have the full complement of fields, e.g. they've been truncated. Is there a way to check whether an array element exists before printing it out, and print out nothing if it doesn't, i.e. something like the ${var} syntax in shell?
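Perl's defined() is the usual way to do that: it tests whether the element exists before it is used. A tiny illustration with a made-up, truncated record:
Code:
#!/usr/bin/perl -w
# defined() guards against the "uninitialized value" warning when a
# truncated line has fewer fields than the column list expects.
my @bits = split(/\|/, "H|only|three");
foreach my $i (0 .. 5)
  {
  print defined($bits[$i]) ? $bits[$i] : "", "|";
  }
print "\n";    # prints: H|only|three||||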
 
Try
Code:
#!/usr/bin/perl -w
open(INPUT, $ENV{'_inputFile'});
@cols = split(/,/, $ENV{'_columnList'});

while (<INPUT>)
{
  print;
  @bits = split(/\|/);
   
  $bits[0] eq 'H' and next;
  $bits[0] eq 'T' and next;
  
  foreach $col (@cols)
    {
    defined $bits[$col-1] and print $bits[$col-1], "|";
    }
  print "\n";
  }
close(INPUT);

Ceci n'est pas une signature
Columb Healy
 