Help counting duplicates

Xaqte (IS-IT--Management), Oct 4, 2002

I have a list of links, some of which contain the same content. I'm trying to weed out the duplicates based on the differences in the directory and filename of each link.

Here is my current code:
Code:
my @processed;
my @urls = (<<"END_OF_URLS" =~ m/^\s*(.+)/gm);
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89907/
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89908
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89909
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89910
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89911
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89912
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89913
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89914
	http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89915
	http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54321
	http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54322
	http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54323
	http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54324
END_OF_URLS

my $dupes = 0;
for my $i ( 0 .. $#urls ) { 
	my $match_count = 0;
	my $link = $urls[$i];
	my($scheme, $authority, $path, $query, $fragment) = $link =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
	my $file = substr $link, rindex($link, '/') + 1;
	my $directories = substr $path, 0, rindex($path, '/'); 
	my @dirs = split(/\//, $directories);
	splice(@dirs, 0, 3); # remove everything from the domain down
	@dirs = grep /\S/, @dirs; # remove empties
	my @file_parts = split(/_/, $file);
	my @all_parts = (@dirs, @file_parts);
	if(scalar(@processed)) { # array of possible matches is loaded, check for matches
		for my $poss (@processed) {
			$match_count++ while $path =~ /$poss/g;
		}
		if($match_count) {
			$dupes++;
			my $orig_count = scalar(@all_parts);
			my $dif_count = $match_count - $orig_count;
			print "diff: $dif_count\n";
			if($dif_count <= 0) {
				print "DISTINCT: $link\n";
			} else {
				print "DUPE: $link\n";
			}
		}
	}
	$processed[$i] = join('|', @all_parts);
}

print "Total dupes: $dupes";

Basically, I was just thinking that I could separate each part of each link and store it in an array for matching (separating the filename into its parts as well).
Then I could count the matches found for each link.

Using the array above, I should only have two valid links (one for the red crayon, and one for the blue crayon).
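
To show what I mean by "two valid links", here is a rough sketch of the effect I'm after (untested, and it assumes the trailing digits are the only part that changes between duplicates of the same crayon):
Code:
my %seen;
my @unique;
for my $link (@urls) {
	(my $key = $link) =~ s/_\d+\/?$//; # drop the trailing numeric id (and any trailing slash)
	push @unique, $link unless $seen{$key}++;
}
print "DISTINCT: $_\n" for @unique; # should leave one red and one blue crayon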

If anyone can please assist me with this, I would greatly appreciate it.

X
 
hmmmm...

something like this?

Code:
my @FULL_URL_LIST; # holds all your processed URLs

my $url    = 'http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89909';
my $domain = 'http://www.crayola.com/';

# Take out http://www.crayola.com/
$url =~ s/\Q$domain\E//i;

# split the rest of the url into an array
my @URL_PARTS = split(/\//, $url);

# loop and make sure you're not duplicating
my $URL_TO_ADD = Search_Loop($url, @FULL_URL_LIST);
if ($URL_TO_ADD){
	# add this url to your list
	push @FULL_URL_LIST, $URL_TO_ADD;
}


# and just use Search_Loop to check your other DIR list too

my @Dir;
my $dir = "$URL_PARTS[0]/$URL_PARTS[1]";
my $check_dir = Search_Loop($dir, @Dir);


# returns the value unless it is already in the list
sub Search_Loop {
	my ($URL, @LIST) = @_;
	for my $i (0 .. $#LIST){
		if ($URL eq $LIST[$i]){ return; }
	}
	return $URL;
}
 
It should be easy to come up with a solution for the links you posted, but what if the list of links is different? All the links you listed have similarities, but they also have a unique part, the digits on the end.

- Kevin, perl coder unexceptional!
 
Seems massively complicated for what it needs to do, or am I missing the point entirely? (which is very possible)
Code:
my %realContent;

for (@links) {
   $realContent{$1}++ if (/(\d+(_[A-Za-z]+)+)/);
}

print "$_\n" for (sort keys %realContent);
Note: untested, no perl on this machine...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Thanks for all the responses!

After trying your suggestions, and even a shot using Text::Levenshtein, I realized there is going to be no foolproof way of verifying the duplicates without checking the content of the urls.

The next plan:

Each of the urls may have different HTML markup, but the same images/photos. Build an identifier (a key of sorts) based on the images on each url's page, then insert this identifier into a unique column in my db. This way, the keys will be duplicates only if the images/photos are duplicates.
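
Something along these lines is the shape I have in mind (just a sketch; LWP::Simple and Digest::MD5 are stand-ins for whatever I actually end up using, and pulling the image urls out of the page is hand-waved here):
Code:
use LWP::Simple qw(get);
use Digest::MD5 qw(md5_hex);

# build a page key from the digests of its images, so two pages
# with the same photos end up with the same key
sub page_key {
	my @image_urls = @_; # the image urls would come from parsing the page
	my @digests;
	for my $img (@image_urls) {
		my $data = get($img);
		push @digests, md5_hex($data) if defined $data;
	}
	return md5_hex(join '', sort @digests); # order-independent combined key
}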

I'm in the process of searching for a module that will aid in the key generation.

Any criticism or thoughts are welcomed!

Thanks again,

X
 
>> no fool proof way of verifying the duplicates without checking the content of the urls

Are you saying you need to compare images to images and see if they are the same image or not? I have no idea how that would be done with perl.

- Kevin, perl coder unexceptional!
 
Back to the duplicate checking, you could try the following. This will work if there is uniformity in the addresses. A sort is necessary before removing duplicates. On an Intel P-3 machine, a 108,000 line file was read and processed in 4 seconds.

Code:
open (IN,"<xxx.in");
open (OUT,">xxx.out");
while (chomp($arr[$index]=<IN>)){$index++;}
@arr1=sort @arr;
$prev = 'nothing';
@out = grep($_ ne $prev && (($prev) = $_), @arr1);
foreach $i (@out){printf OUT ("%s\n",$i);}
 
If you need to compare the images, JPEGs may have EXIF-type metadata embedded in them that you could use for comparison. There must be a perl module somewhere on CPAN that can read them and pull this information out.

Another alternative is to read each image file, and create some kind of checksum (MD5 hash would be ideal). You can then use this as a key into a (perl) hash to look for duplicates. This also works for any kind of file, not just JPEGs. The chances of two non-identical images hashing down to the same value are zillions to one...
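
Something like this is all it should take (untested as usual, no perl on this machine, and the images/*.jpg path is made up):
Code:
use Digest::MD5;

my %seen;
for my $file (glob 'images/*.jpg') { # wherever the downloaded images live
	open my $fh, '<', $file or next;
	binmode $fh;
	my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
	close $fh;
	print "DUPLICATE: $file\n" if $seen{$digest}++;
}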

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
I've been away for the holiday, sorry for not posting back any sooner.

stevexff -
Another alternative is to read each image file, and create some kind of checksum (MD5 hash would be ideal).
This is exactly what I was thinking of doing... thanks for the confirmation!
 