Finding values in a hash that match but returning the keys 3


travs69

MIS
Dec 21, 2006
1,431
US
I know there is a lot going on here, so I'm sorry for the huge amount to read, but I appreciate any help.

I have upwards of 20M lines of data that I have to read in. The current code reads it in and creates 2 hashes of epoch times (start and stop), keyed by a common ID (this can match any number of other records) and a unique ID (should only ever match this record), via the code $start{$common}{$unique} = $start_epoch; $stop{$common}{$unique} = $stop_epoch;

The code then loops through the file a second time, comparing every line in the file against all the unique start/stop times that match the common ID, and then a set of rules tries to tell if the current line is close in start/stop time to the other lines that match the common ID.

My issue with the code is that even though the common ID matches, there might not ever be a chance of the uniques matching, because their time stamps might be way different. I'd like to figure out a setup where I only compare the lines on the second pass to the common IDs, and only if the start/stop times are within a 360 second window of the common ID's start/stop times.

I know I can do this with a simple next, but it still requires going through 20M combinations and saying next, which takes up a lot of time. I also can't have duplicate data around, because the current script ends up using about 12G of RAM between the original hashes it builds and the output hashes it needs to finish up the code.


I would like something like this where I loop through the file once and build the hashes and then have something like
while ($line = <FILE>) {
$common = 'blahblah';
$start_epoch = '12345';
for my $unique (keys %{%start{$common} if $start{$common}{$a} <= $start_epoch - 60)

now I know that really doesn't work that way, but I'm trying to get back only the keys of that hash that could possibly match, rather than looping through every key and nexting when the condition isn't met.
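
A minimal sketch of what I'm after, assuming %start is built as in the first pass and using placeholder names (FILE, the 360 second window) rather than the real ones:
Code:
while (my $line = <FILE>) {
	# $common and $start_epoch would be parsed out of $line in the real script
	my $common      = 'blahblah';
	my $start_epoch = 12345;

	# keep only the unique keys whose stored start time is within the window
	my @candidates = grep {
		abs($start{$common}{$_} - $start_epoch) <= 360
	} keys %{ $start{$common} };

	for my $unique (@candidates) {
		# ... apply the existing start/stop comparison rules here ...
	}
}
Note that grep still visits every unique key under that common ID, so this mainly tidies the code; the real speedup would need the keys bucketed by time so most of them are never looked at.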

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Travis

I appreciate that you want to change the minimum, but it seems awfully complicated. What exactly are you trying to get out of the analysis? Maybe there is a simpler way. Rather than describing what it does now, can you post a few lines of data (desensitised if necessary) and show what you want to get out of the other end?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Thanks Steve.. I will later today.

Trust me, dealing with this huge data set is enough to drive you crazy (make a change, wait 2 hrs to see how it did). I didn't have this issue before because I actually get the data in smaller files, but the script was written to do a day at a time, and they want to see my numbers match their numbers. After that I can make whatever changes I want :) So I'm starting to plan for the future.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Here is a script that works and has some fake data in it. What I would like to avoid is ever looping through the 1000-series IDs when dealing with the lower IDs, as there is no chance they will ever match. Currently everything works because of the IF statement on the time stamp math, but when dealing with 20M rows, doing 20M ifs per row starts to take a while. The source data can be in any order (and I guess it could be ordered if needed; currently it is 1 file, but will eventually be broken up into 5 min files). Thanks!!

Code:
use strict;
use warnings;
use Time::Local;
#Example data
my @data = (
"1,abc,01/01/2010,00:00:00.5",
"2,cba,01/01/2010,00:00:10.5",
"3,bac,01/01/2010,00:00:11.5",
"4,abc,01/01/2010,00:00:10.5",
"5,cba,01/01/2010,00:00:11.5",
"6,bac,01/01/2010,00:00:12.5",
"7,cab,01/01/2010,00:00:10.5",
"8,cab,01/01/2010,00:00:10.6",
"9,acb,01/01/2010,00:00:10.5",
"10,cba,01/01/2010,00:00:9.5",
"1000,cba,01/01/2010,01:00:00.5",
"1001,bac,01/01/2010,01:00:00.5",
"1002,abc,01/01/2010,01:00:00.5",
"1003,cab,01/01/2010,01:00:00.5",
);

#1 is a finish
#2 is a second attempt for 10
#3 is a first attempt for 6
#4 is a finish
#5 is a finish
#6 is a finish
#7 is a first attempt for 8
#8 is a finish
#9 is a finish
#10 is a first attempt for 10
#1000 is a finish
#1001 is a finish
#1002 is a finish
#1003 is a finish

#Example code
#First pass
my (%start);
for my $line (@data) {
	my @tmp = split(/\,/, $line);
	my ($month, $day, $year) = split(/\//, $tmp[2]);
	my ($hour, $min, $sec, $fsec) = $tmp[3] =~  /(.*):(.*):(.*)(\..*)/;
	my $start_time = &revert_to_epoch($sec, $min, $hour, $day, $month, $year) + $fsec;
	#Hash is common key then unique key
	$start{$tmp[1]}{$tmp[0]} = $start_time;

}

#Second pass

for my $line (@data) {
	my @tmp = split(/\,/, $line);
	my ($month, $day, $year) = split(/\//, $tmp[2]);
	my ($hour, $min, $sec, $fsec) = $tmp[3] =~  /(.*):(.*):(.*)(\..*)/;
	my $start_time = &revert_to_epoch($sec, $min, $hour, $day, $month, $year) + $fsec;

	#We only want to check against the unique keys stored under our common key
	my $bad = 0;
	for my $unique (keys %{$start{$tmp[1]}}){
		#We don't want to compare against ourself
		next if $unique == $tmp[0];
		#print "$unique\n";
		if ($start_time - $start{$tmp[1]}{$unique} >= -2 && $start_time - $start{$tmp[1]}{$unique} <= 0) {
			$bad++;
		}

	}
	if ($bad == 0) {
		print "$tmp[0] is a finish\n";
	}
	else {
		print "$tmp[0] is an attempt\n";
	}

}

sub revert_to_epoch {
	my ($sec, $min, $hour, $day, $month, $year) = @_;
	$month--;
	$year -= 1900;
	my $output = timegm($sec,$min,$hour,$day,$month,$year);
	return($output);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
First let me say that your code can be made faster at various points. For example, this line

[tt]if($start_time-$start{$tmp[1]}{$unique}>=-2&&$start_time-$start{$tmp[1]}{$unique}<=0){[/tt]

could be written as (since -2 <= x <= 0 is the same as abs(x + 1) <= 1)

[tt]my$tmpval=abs($start_time-$start{$tmp[1]}{$unique}+1);
if($tmpval<=1){[/tt]

Second remark is that I can't really understand the logic behind your data. Take 'cba': you have three records (2, 5, 10) at 1 second intervals: how can you tell which is the start and which is the finish? And where is the start for 1000? And how can you classify as a finish an event that has no attempts before it? In the following I'll assume that all events in a sequence that are less than 2 seconds from the preceding one should be squashed into a single event with the time of the last one. Also that the first event is an attempt, the subsequent one a finish, and so on.

Third, and much more important, observation: as you have in your hash the full information contained in the file, you can do your task in a single pass!

By using an auxiliary hash, I would do that this way:
Code:
my (%start, %close, $bad, $tmpval, $unique);
for (@data) {
  my @tmp = split /\,/;
  my ($month, $day, $year) = split(/\//, $tmp[2]);
  my ($hour, $min, $sec, $fsec) = $tmp[3] =~ /(.*):(.*):(.*)(\..*)/;
  my $start_time = &revert_to_epoch($sec, $min, $hour, $day, $month, $year) + $fsec;
  $bad = 0;
  #compare this record against the ones already stored for its common key
  for $unique (keys %{$start{$tmp[1]}}) {
    $tmpval = $start_time - $start{$tmp[1]}{$unique};
    if (abs $tmpval <= 2) {
      #mark the earlier of the two records as "close", i.e. superseded
      if ($tmpval < 0) {
        $close{$tmp[1]}{$tmp[0]} = 1;
      } else {
        $close{$tmp[1]}{$unique} = 1;
      }
    }
  }
  $start{$tmp[1]}{$tmp[0]} = $start_time;
}
#report the surviving records of each common key in time order
for $unique (keys %start) {
  print "$unique:\n";
  my $tog = 0;
  for (sort { $start{$unique}{$a} <=> $start{$unique}{$b} } keys %{$start{$unique}}) {
    unless (exists $close{$unique}{$_}) {
      if ($tog) {
        print "  $_:finish at $start{$unique}{$_}\n";
      } else {
        print "  $_:attempt at $start{$unique}{$_}\n";
      }
      $tog = !$tog;
    }
  }
}
This should require less than half the time of your solution for comparing the times, and saves the time of the second pass over the file.
Other tricks can be devised to save more computing time. As an example, you can store in auxiliary hashes the minimum and maximum time met so far for each common key, then compare against all times for that key only when the actual time is within that interval.
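
A sketch of that last idea, assuming the same %start structure and the 2 second window (%min and %max are names made up for the example):
Code:
my (%min, %max);
# ... inside the per-record loop, before scanning the stored unique keys ...
if (exists $min{$tmp[1]}
    && $start_time >= $min{$tmp[1]} - 2
    && $start_time <= $max{$tmp[1]} + 2) {
  # only a time inside [min-2, max+2] can be within 2 s of a stored time,
  # so only then is the inner comparison loop worth running
  for $unique (keys %{$start{$tmp[1]}}) {
    # ... existing comparison against $start{$tmp[1]}{$unique} ...
  }
}
# keep the running minimum and maximum up to date for this common key
$min{$tmp[1]} = $start_time if !exists $min{$tmp[1]} || $start_time < $min{$tmp[1]};
$max{$tmp[1]} = $start_time if !exists $max{$tmp[1]} || $start_time > $max{$tmp[1]};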
But of course, if the file was sorted on time, this would be an elementary and very fast task!

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Franco,
Thanks! First, an answer to a few of your questions: I guess attempt and finish maybe wasn't the correct wording; how about "successful attempt" vs "retry attempt"? All records are attempts; sometimes it just takes multiple tries before an attempt is successful, and I am trying to weed out retries that are related (via the time math) and only report the last attempt, the one that has the newest time stamp (within a 2 second window).


I had no idea about a few of those efficiencies.. of course, while minimizing the code to put on the net, I left out a few things that I didn't think were important but that will break with your code.

There is a second time stamp in the data that is used to compare against the first. The @data is a CSV line that has approximately 150 fields in it and contains a lot of data. I should have done this:
Code:
if ($bad == 0) {
	print "$tmp[0] is a finish\n";
	print "$line\n";
}
else {
	print "$tmp[0] is an attempt\n";
	print "$line\n";
}

On the second pass of the original code, when it is decided whether or not a record is a retry attempt, a lot of the original data is needed to figure out other things. I think the two-pass approach was done to keep from trying to store all the original data in memory.

I truthfully don't understand your math changes; I'm going to go sit down and play with it some. When I ran your code, the 100X lines matched even though they are an hour away from the other lines.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Admittedly, I'm not understanding exactly what you're looking for in the time comparisons, but do I understand correctly that you are looping over the same data twice?

Can you determine whether a run was a success or failure on the first pass and only store the values in your hashes if it was a success? That might make the second run faster. If I understand this correctly, you'd only need to see if $hash{'common id'}{'unique id'} existed and, if it did, the record is good.
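
If that turned out to be possible, the second pass check could reduce to something like this (hash and variable names are only illustrative):
Code:
# a record is good if it survived the first-pass filter
if (exists $good{$common}{$unique}) {
	print "$unique is a finish\n";
}
else {
	print "$unique is an attempt\n";
}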
 
I can't do only one pass (I think) because the data is not ordered. If line 1 is a retry record of an attempt that is at line 1,000,000, I won't know that with a one-pass approach without storing the whole record in memory, and I don't know if that is a smart thing to do considering the size of it.

I think I have a speed solution though, as I'm getting the data in 5 min intervals now (still unordered though) and I can narrow my computing down to a 15 min window (the 5 min I'm working on, plus 5 in the future and 5 in the past), while only writing out the records in the middle 5 min each time. I'll try to de-sensitise some more of the math and put it on here.
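
Roughly what I have in mind for the window (file names and the two helper subs are just placeholders, not the real code):
Code:
my @files = sort glob("data_5min_*.csv");	# hypothetical 5 min file naming

for my $i (0 .. $#files) {
	# build the start-time hashes from the previous, current and next 5 min files
	my @window = grep { defined }
		($i > 0       ? $files[$i-1] : undef),
		$files[$i],
		($i < $#files ? $files[$i+1] : undef);

	my %start;
	build_start_hash($_, \%start) for @window;	# stand-in for the first-pass code

	# but only classify and write out the records of the middle (current) file
	classify_file($files[$i], \%start);
}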

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Can I check my understanding of your data? Some process(es) tries to do something. Each time it tries it writes a record. If it fails, it retries at < 2 second intervals until such time that it succeeds. There is no way to determine if it succeeds from the record, apart from the fact that it has stopped retrying. Some time later (more than 2 seconds), it will run again (with the same ID), retrying as required. You want to work out which are successes and which are failures?

Can you sort the data first (like with a system sort, not Perl) and then have a simple Perl script just rip through it, using the sort breaks and examining the timestamps? You could use hashes to collect retry stats as you go through it. This still gives you two passes, but one of them is the sort.
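
Roughly the shape I mean, assuming everything is inside one day and the timestamps are zero-padded so a plain text sort orders them correctly (the sort field positions, the 2 second rule and to_epoch are illustrative, not tested):
Code:
# outside Perl: sort by common ID, then date, then time, e.g. with GNU sort
#   sort -t, -k2,2 -k3,3 -k4,4 data.csv > data.sorted

use strict;
use warnings;

my ($prev_id, $prev_unique, $prev_time);
open my $fh, '<', 'data.sorted' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    my ($unique, $common, $date, $time) = split /,/, $line;
    my $epoch = to_epoch($date, $time);   # same conversion as revert_to_epoch

    if (defined $prev_id) {
        # a new common ID, or a gap of more than 2 seconds, means the
        # previous record was the last try in its group, i.e. the finish
        if ($common ne $prev_id || $epoch - $prev_time > 2) {
            print "$prev_unique is a finish\n";
        }
        else {
            print "$prev_unique is an attempt\n";
        }
    }
    ($prev_id, $prev_unique, $prev_time) = ($common, $unique, $epoch);
}
print "$prev_unique is a finish\n" if defined $prev_unique;   # last record
close $fh;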

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
You are correct in the way the data works.

I've never tried to sort through so many records, but they are in a few different formats and I had to put a lot of exclusions into Perl just to deal with bad data (embedded newlines and such). Gotta love Text::CSV_XS: it handles everything, but even doing a cat | wc -l complains because of the formatting of the lines.

I actually have the data being split up into 5 min files now, and I've thought about doing my first pass on 3 of the files at a time but just the second pass on the middle file.

Again, thanks for all of your help; just talking about it with others has helped me out.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Actually, to add a little to what Steve said, there are 2 IDs: a more common ID that is used to match attempt types (and I only compare attempts that have the same attempt type) and a unique ID that will always be unique in a 30 day period.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 