I have a list of links, some of which contain the same content. I'm trying to weed out the duplicates based on the differences in the directory and filename of each link.
Here is my current code:
Code:
my @processed;
my @urls = (<<"END_OF_URLS" =~ m/^\s*(.+)/gm);
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89907/
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89908
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89909
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89910
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89911
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89912
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89913
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89914
    http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89915
    http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54321
    http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54322
    http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54323
    http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54324
END_OF_URLS
my $dupes = 0;
for my $i ( 0 .. $#urls ) {
    my $match_count = 0;
    my $link = $urls[$i];
    my ($scheme, $authority, $path, $query, $fragment) = $link =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
    my $file = substr $link, rindex($link, '/') + 1;
    my $directories = substr $path, 0, rindex($path, '/');
    my @dirs = split(/\//, $directories);
    splice(@dirs, 0, 3); # remove everything from the domain down
    @dirs = grep /\S/, @dirs; # remove empties
    my @file_parts = split(/_/, $file);
    my @all_parts = (@dirs, @file_parts);
    if (scalar(@processed)) { # array of possible matches is loaded, check for matches
        for my $poss (@processed) {
            $match_count++ while $path =~ /$poss/g;
        }
        if ($match_count) {
            $dupes++;
            my $orig_count = scalar(@all_parts);
            my $dif_count = $match_count - $orig_count;
            print "diff: $dif_count\n";
            if ($dif_count <= 0) {
                print "DISTINCT: $link\n";
            } else {
                print "DUPE: $link\n";
            }
        }
    }
    $processed[$i] = join('|', @all_parts);
}
print "Total dupes: $dupes";
Basically, my idea was to split each link into its parts (the directories, plus the filename broken up on underscores) and store them in an array for matching.
Then I could count the matches found for each link against the links already processed.
Given the list above, I should end up with only two valid links (one for the red crayon and one for the blue crayon).
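For what it's worth, here is a rough sketch of a simpler direction I have been considering instead of counting matches. It assumes (and this is only a guess based on the sample list) that duplicates differ solely in the trailing numeric ID of the last path segment. The names `dedupe_key` and `unique_links` are made up for illustration, and the sample list is a shortened copy of the one above.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to a comparison key made of its
# directory path plus the filename with the trailing "_<digits>" removed.
# ASSUMPTION: only that trailing numeric ID varies between duplicates.
sub dedupe_key {
    my ($url) = @_;
    (my $path = $url) =~ s{^[^:/?#]+://[^/?#]*}{};   # strip scheme and host
    $path =~ s{[?#].*\z}{}s;                          # strip query and fragment
    $path =~ s{/+\z}{};                               # ignore a trailing slash
    my ($dir, $file) = $path =~ m{\A(.*)/([^/]*)\z}
        or return $path;                              # no filename to split off
    $file =~ s/_\d+\z//;                              # drop the trailing numeric ID
    return "$dir/$file";
}

# Keep the first link seen for each key and discard the rest.
sub unique_links {
    my @urls = @_;
    my %seen;
    return grep { !$seen{ dedupe_key($_) }++ } @urls;
}

my @sample = (
    'http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89907/',
    'http://www.crayola.com/products/classics/crayons/651_big_red_crayon_89908',
    'http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54321',
    'http://www.crayola.com/products/classics/crayons/782_big_blue_crayon_54322',
);

print "$_\n" for unique_links(@sample);
```

With the four sample links this keeps the first red crayon link and the first blue crayon link. If the varying part of a link turns out not to be a trailing number, the substitution in `dedupe_key` would need to change.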
If anyone can please assist me with this, I would greatly appreciate it.