Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Reference as a Hash value 2

Status
Not open for further replies.

basildon

Programmer
Oct 22, 2001
58
0
0
GB
I have a large file (several million records), with some records having duplicate ID numbers ($idnum).

I want to extract just the unique records, however as the records are very large I can't, for memory reasons, use the $record, in the below script, as a value to the hash.

I have therefore tried using a reference (see below) however this does not dereference as it would do with a normal scalar.

I've not used a reference as a value before and would appreciate any help (if there's a better way please feel free to suggest too.)

Many Thanks


open (INFILE, &quot;<c:\\file1.dat&quot;) or die;

%seen = ();

while ($record = <INFILE>) {
$idnum = substr($record, 8, 8); #position of the unique id
$ref = \$record;
$seen{$idnum} = $ref;
}

foreach $var (values %seen) {
print $$var.&quot;\n&quot;;
}
 
I can't think of a way to use a reference, but the following code will search the file and copy each line out to another file unless the id number has already been processed once before.

Code:
#!perl 
$in=&quot;c:\\file1.dat&quot;;
open (INFILE, &quot;$in&quot;) or die; 

my $out=&quot;c:\\file1_uniqe_ids.dat&quot;;
open OUT, &quot;>$out&quot; or die &quot;Can't open $out: $!\n&quot;;

my @id_numbers=();

$.=0;
while (my $record = <INFILE>) {
	
	my $idnum = substr($record, 8, 8);  #position of the unique id
	my $repeat=0;

	foreach my $id (@id_numbers) {
		if ($id == $idnum) {
			$repeat = 1;
	        last;
		}		
	}	

	push(@id_numbers, $idnum);

	if ($repeat!=1) {
		print OUT &quot;$record&quot;;  
	} 
	else {
		#do nothing and continue to next line
	}

}

close INFILE;

Hope that helps

MattMcGinnis

 
foreach my $id (@id_numbers) {
if ($id == $idnum) {
$repeat = 1;
last;
}
}


This is an extremely inefficient way to do this. For several million records, when you get up to record 1,234,523 you are doing a linear search through your id array of on the order of 1 million other values. If you propose using an array then the array should be kept sorted and a different (more efficient) algorithm should be used to search the array.

basildon is correct in using a hash. You should use Matt's suggestion of printing unique values to a file as you loop through the original value.
Code:
open (INFILE, &quot;<file1.dat&quot;) or die $!;
open (OUT, &quot;> unique.dat&quot;) or die $!;

my %seen = ();

while (my $record = <INFILE>) {

    my $idnum = substr($record, 8, 8);  #position of the unique id
    ++$seen{$idnum} and print OUT $record unless exists $seen{$idnum};

}

jaa
 
jaa,

Thanks for the correction. I'm relatively new to programming and I'm unfamiliar with optimizing code. I didn't know that searching through a zillion hash values is faster than searching through a zillion array values.

Its a little counter-intuitive since a hash has a zillion times two associated values (key and value).

Anyway, I'm admittedly a hack, but I'm learning.
 
The hash lookup is done only on the keys unless you explicitly do a search through the values. But the unique id is the hash key so that's what needs to be looked up. A lookup in an array (an unordered array) is O(n) (i.e. the look up time scales with the number of elements). A hash lookup is effectively O(1) because of the way that the keys are stored internally. So it would take you 10,000 times longer to search for an elemnent in a 1,000,000 element array vs one in a 100 element array; whereas it would take you aproximately the same amount of time to search for an element in two hashes of 100 and 1,000,000 elements.

Caveat: This is only true to the best of my knowledge based on my limited understanding of how Perl hashes work internally. I have yet to find a thorough reference on this. Please correct me if any of this is not correct. Or post a reference if you find one.

jaa
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top