
Extraction problem with performance in mind


maxtektips6 (Programmer)
Jan 25, 2007
Hi all,

With a routine producing a flow like this

header 01
0 X 52
0 G 78
0 T 44
0 B 42
0 Q 70
..
0 A 77
..
0 C 48
0 M 67
0 F 12
..
trailer 01
header 02
0 M 24
0 R 45
..
0 A 45
..
0 C 36
..
trailer 02
etc.

the desired output is like this:

77,42,48
45,,36
etc.

It's basically an extraction of the values in the A, B and C records.

The first solution I tried was:

Code:
routine_producing_data | egrep "A|B|C" | sed 's/^0 //g' | paste -d, - - - | 
sed "s/[A-Z]*//g; s/ //g"

But then I noticed the records can come in any order and that sometimes the A and B records can be missing.
In such a situation it produced output like this:

42,77,48
45,36,

Therefore I've tried both of the solutions below, but the performance is seriously impacted.
Can these be fine-tuned, or the first solution corrected, so as to return the correct output?

Code:
routine_producing_data | awk 'BEGIN{a="";b=""}
/A /{a=$3;next}
/B /{b=$3;next}
/C /{print a","b","$3;a="";b=""}
'
Code:
routine_producing_data | perl -ane '$,=",";
$a[0] = $F[2] if ($F[1] eq "A") ;
$a[1] = $F[2] if ($F[1] eq "B") ;
if ($F[1] eq "C") { $a[2] = $F[2] ; print @a ; print "\n" ; @a = () } ;
'
 
I can't think of any reason why those scripts would affect the performance; if anything they should be more efficient. How many lines of data does the original routine produce?

What is the difference if you use the time command to time the first version and the awk and perl versions?
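A minimal sketch of that timing comparison, using a hypothetical `sample.txt` as a stand-in for the routine's output (the file name and sample data are assumptions for illustration):

```shell
# Build a tiny sample stream (stand-in for routine_producing_data's output).
printf 'header 01\n0 A 77\n0 B 42\n0 C 48\ntrailer 01\n' > sample.txt

# Time the awk version; redirect the data away so only the timing shows.
time awk '/A /{a=$3;next}
/B /{b=$3;next}
/C /{print a","b","$3; a=""; b=""}' sample.txt > /dev/null
```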

Annihilannic.
 
Without regex:
Code:
routine_producing_data | awk 'BEGIN{a="";b=""}
$2=="A"{a=$3;next}
$2=="B"{b=$3;next}
$2=="C"{print a","b","$3;a="";b=""}
'
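Because the C record is the trigger, this version also copes with reordered or missing A/B records; a quick check with inline sample data taken from the second block in the question:

```shell
# B is missing and the A record is interleaved with other record types;
# C still triggers the print, and the empty b yields "45,,36".
printf '0 M 24\n0 R 45\n0 A 45\n0 C 36\n' |
awk 'BEGIN{a="";b=""}
$2=="A"{a=$3;next}
$2=="B"{b=$3;next}
$2=="C"{print a","b","$3; a=""; b=""}'
# → 45,,36
```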

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Hi,

The input to be filtered can be considered a stream. Well, in reality the routine produces files which have over 1M lines on average. And there are always lots of them (hundreds of thousands at a time). Benchmarking with a single file gives the following:

grep+paste -> 30 sec
awk with regexp -> 5 min 30 sec
awk without regexp -> approx. 5 min
perl -> approx. 10 min

The current solution I'm considering is rather ugly, but it gives output in 45 sec.:

Code:
routine_producing_data | egrep "A|B|C" |
awk 'BEGIN{a="";b=""}
/A /{a=$3;next}
/B /{b=$3;next}
/C /{print a","b","$3; a="";b=""}
'
 
And this?
Code:
routine_producing_data | awk 'BEGIN{a="";b=""}
$2~/^[ABC]/{
  if($2=="A"){a=$3;next}
  if($2=="B"){b=$3;next}
  print a","b","$3;a="";b=""
}'

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Something seriously strange is going on here; what operating system and hardware are you on?

I created a sample data file using your example data above, but concatenated it about 90,000 times to create a 1M+ line file.
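A sketch of one way to build such a test file (the doubling loop is my shortcut, not necessarily how it was actually done; `sample.txt` is assumed to hold the small example data):

```shell
# Double the file repeatedly until it passes 1M lines; this needs roughly
# 17 doublings instead of 90,000 separate concatenations.
cp sample.txt big.txt
while [ "$(wc -l < big.txt)" -lt 1000000 ]; do
    cat big.txt big.txt > tmp.txt && mv tmp.txt big.txt
done
wc -l big.txt
```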

Your solution takes 4.3 seconds to run for me. PHV's takes 8 seconds or so. And changing your solution to use grep "^[ABC]" instead of egrep shortens it to 1.5 seconds. So why does it take so long on your system?

I'm testing with the built-in utilities on an HP-UX 11.11 system, model RP5740 running at 750MHz.

Annihilannic.
 
I guess it must be the amount of data to filter out that makes the difference. The results above were obtained with "real" data, containing lots of unwanted rows. I've run tests similar to yours (test file obtained by concatenating the first record until over 1M lines were reached); here are the results:

1 second for both your solution with grep "^0 [ABC]" and the egrep one

4 seconds for PHV's with only awk

For the hardware bits: SPARC SUNW,Sun-Fire-880 running at 1200 MHz. No other heavy processes were running during the tests; the 8 processors were pretty much idle.

Thanks all for helping

 
The results are nearly equivalent (yes, it's Solaris). Here's what a record really looks like, if you'd like to give it a try. A would be "calling_number", B "dialled_number" and C "EL_ANM_TIMEPOINT_RECEIVED".

Code:
RECORD
#input_id 1202842387x001_23
#output_id
#input_type end_of_call_cdb
#output_type B2B_CDR
#addkey
#source_id C02
#filename cdr_20071107111257_033006.bin
F cdb_timepoint 1194426908
F call_reference_id 473181FD00653E66
F unique_call_correlator_id 075A9FA68DDB11DCBE660003BA127B8A
F iam_timepoint_received 1194426877
F iam_timepoint_received_ms 500
F anm_timepoint_received 1194426885
F anm_timepoint_received_ms 292
F first_rel_timepoint 1194426908
F first_rel_timepoint_ms 194
F rlc_timepoint_received 1194426908
F rlc_timepoint_received_ms 295
F originating_trunk_group 9001
F calling_number 0123456789
F dialled_number 1234567890
F called_number 234567890
F terminating_trunk_group 1102
F ingress_origination_point_code 0
F egress_destination_point_code 56
F calling_number_noa 2
F called_number_noa 3
F reason_code 32912
F EL_ANM_TIMEPOINT_RECEIVED 20071107111445
F EL_DURATION_MS 22902
F EL_INCOMING_TRUNK_TYPE TT
F EL_OUTGOING_TRUNK_TYPE IC
F EL_SWITCH_ID C02
F REJECTED Duration_LT_1000_or_Call_incoming_on_IMT
.
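For that record layout the same approach applies with the real field names swapped in for A/B/C; a sketch with a trimmed-down inline record (the three-line sample is an assumption for illustration):

```shell
# a = calling_number, b = dialled_number; EL_ANM_TIMEPOINT_RECEIVED plays
# the role of the C record and triggers the print.
printf 'F calling_number 0123456789\nF dialled_number 1234567890\nF EL_ANM_TIMEPOINT_RECEIVED 20071107111445\n' |
awk '$2=="calling_number"{a=$3;next}
$2=="dialled_number"{b=$3;next}
$2=="EL_ANM_TIMEPOINT_RECEIVED"{print a","b","$3; a=""; b=""}'
# → 0123456789,1234567890,20071107111445
```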
 
I tried that and had very similar results to yours, max. So where did the 4-5 minute timings come from? Was that from processing lots of files or something?

Annihilannic.
 
In order to compare the different solutions I've been using a big file which is well above the average: it has 39,359,150 lines (1.1 GB / 1,132,930,175 bytes). Using a smaller test file may not have provided a clear-cut answer about the performance.
 