Reference as a Hash value 2

basildon · Sep 24, 2002

I have a large file (several million records), with some records having duplicate ID numbers ($idnum).

I want to extract just the unique records, however as the records are very large I can't, for memory reasons, use the $record, in the below script, as a value to the hash.

I have therefore tried using a reference (see below) however this does not dereference as it would do with a normal scalar.

I've not used a reference as a value before and would appreciate any help (if there's a better way please feel free to suggest too.)

Many Thanks

open (INFILE, "<c:\\file1.dat&quot

or die;

%seen = ();

while ($record = <INFILE>) {
$idnum = substr($record, 8, 8); #position of the unique id
$ref = \$record;
$seen{$idnum} = $ref;
}

foreach $var (values %seen) {
print $$var."\n";
}

MattMcGinnis · Sep 24, 2002

I can't think of a way to use a reference, but the following code will search the file and copy each line out to another file unless the id number has already been processed once before.

Code:

#!perl 
$in=&quot;c:\\file1.dat&quot;;
open (INFILE, &quot;$in&quot;) or die; 

my $out=&quot;c:\\file1_uniqe_ids.dat&quot;;
open OUT, &quot;>$out&quot; or die &quot;Can't open $out: $!\n&quot;;

my @id_numbers=();

$.=0;
while (my $record = <INFILE>) {
	
	my $idnum = substr($record, 8, 8);  #position of the unique id
	my $repeat=0;

	foreach my $id (@id_numbers) {
		if ($id == $idnum) {
			$repeat = 1;
	        last;
		}		
	}	

	push(@id_numbers, $idnum);

	if ($repeat!=1) {
		print OUT &quot;$record&quot;;  
	} 
	else {
		#do nothing and continue to next line
	}

}

close INFILE;

Hope that helps

MattMcGinnis

justice41 · Sep 24, 2002

foreach my $id (@id_numbers) {
if ($id == $idnum) {
$repeat = 1;
last;
}
}

This is an extremely inefficient way to do this. For several million records, when you get up to record 1,234,523 you are doing a linear search through your id array of on the order of 1 million other values. If you propose using an array then the array should be kept sorted and a different (more efficient) algorithm should be used to search the array.

basildon is correct in using a hash. You should use Matt's suggestion of printing unique values to a file as you loop through the original value.

Code:

open (INFILE, &quot;<file1.dat&quot;) or die $!;
open (OUT, &quot;> unique.dat&quot;) or die $!;

my %seen = ();

while (my $record = <INFILE>) {

    my $idnum = substr($record, 8, 8);  #position of the unique id
    ++$seen{$idnum} and print OUT $record unless exists $seen{$idnum};

}

jaa

MattMcGinnis · Sep 24, 2002

jaa,

Thanks for the correction. I'm relatively new to programming and I'm unfamiliar with optimizing code. I didn't know that searching through a zillion hash values is faster than searching through a zillion array values.

Its a little counter-intuitive since a hash has a zillion times two associated values (key and value).

Anyway, I'm admittedly a hack, but I'm learning.

justice41 · Sep 24, 2002

The hash lookup is done only on the keys unless you explicitly do a search through the values. But the unique id is the hash key so that's what needs to be looked up. A lookup in an array (an unordered array) is O

(i.e. the look up time scales with the number of elements). A hash lookup is effectively O(1) because of the way that the keys are stored internally. So it would take you 10,000 times longer to search for an elemnent in a 1,000,000 element array vs one in a 100 element array; whereas it would take you aproximately the same amount of time to search for an element in two hashes of 100 and 1,000,000 elements.

Caveat: This is only true to the best of my knowledge based on my limited understanding of how Perl hashes work internally. I have yet to find a thorough reference on this. Please correct me if any of this is not correct. Or post a reference if you find one.

jaa

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Reference as a Hash value 2

basildon

Programmer

MattMcGinnis

Programmer

justice41

Programmer

MattMcGinnis

Programmer

justice41

Programmer

Similar threads

Part and Inventory Search

Sponsor