I know there is a lot going on here, so I'm sorry for the huge amount to read, but I appreciate any help.
I have upwards of 20M lines of data that I have to read in. The current code reads it all in and builds two hashes of epoch times (start and stop), keyed by a common ID (which can match any number of other records) and a unique ID (which should only ever match this one record):

$start{$common}{$unique} = $start_epoch; $stop{$common}{$unique} = $stop_epoch;
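For context, the first pass looks roughly like this (the split and the field layout are made up for illustration; my real parsing is different):

my $file = 'data.txt';   # whatever the input file is
my (%start, %stop);
open my $fh, '<', $file or die "can't open $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    # made-up field layout, just to show the shape of the data
    my ($common, $unique, $start_epoch, $stop_epoch) = split /,/, $line;
    $start{$common}{$unique} = $start_epoch;
    $stop{$common}{$unique}  = $stop_epoch;
}
close $fh;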
The code then loops through the file a second time, comparing every line against all of the unique start/stop times that share its common ID, and a set of rules then tries to tell whether the current line is close in start/stop time to those other lines.
My issue with the code is that even when the common ID matches, some uniques never have a chance of matching because their timestamps are way different. I'd like to figure out a setup where, on the second pass, I only compare a line against the uniques under its common ID whose start/stop times are within a 360-second window of the line's start/stop times.
I know I can do this with a simple next, but that still means walking through 20M combinations just to say next, which eats a lot of time. I also can't afford to keep duplicate data around: the current script already uses about 12G of RAM between the original hashes it builds and the output hashes it needs to finish the job.
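The second-pass filter is essentially this (a sketch; the actual rule code is elided):

# second pass as it works today: visit EVERY unique under the common ID
# and skip the ones whose times are too far off
for my $unique (keys %{ $start{$common} }) {
    next if abs($start{$common}{$unique} - $start_epoch) > 360;
    next if abs($stop{$common}{$unique}  - $stop_epoch)  > 360;
    # ... apply the matching rules ...
}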
What I would like is something where I loop through the file once to build the hashes, and then on the second pass have something like:
while (my $line = <FILE>) {
    my $common      = 'blahblah';  # really parsed from the line
    my $start_epoch = 12345;       # really parsed from the line
    for my $unique (grep { $start{$common}{$_} >= $start_epoch - 60 }
                    keys %{ $start{$common} }) {
        # ... compare against just these candidates ...
    }
}
Now, I know my original syntax didn't really work that way, and even the grep above still walks every key and throws away the ones that fail the condition. What I'm really after is a way to get back only the keys out of that hash that could possibly match, without visiting every key and nexting when the condition isn't met.
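One idea I've been toying with (just a sketch; the %bucket index is hypothetical and would have to be built during the first pass) is to bucket the uniques by int($start_epoch / 360), so the second pass only has to check three buckets instead of every key under the common ID:

# hypothetical extra index, filled in during the first pass:
#   push @{ $bucket{$common}{ int($start_epoch / 360) } }, $unique;
my $b = int($start_epoch / 360);
for my $slot ($b - 1, $b, $b + 1) {   # anything within 360s must land in an adjacent bucket
    for my $unique (@{ $bucket{$common}{$slot} || [] }) {
        next if abs($start{$common}{$unique} - $start_epoch) > 360;
        # ... matching rules ...
    }
}

The extra memory should be modest, since %bucket would only hold the unique IDs again rather than another copy of the times, but I don't know if this is the right direction.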
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;