Remove Duplicates Ignore Dates

Status
Not open for further replies.

mte0910

Programmer
Apr 20, 2005
232
US
I have this code that I am using to remove the duplicate entries from a text file. The problem is, I had to remove the date from the entries, otherwise it would make them unique. If I wanted to keep the date, how could I compare each line and remove the duplicates while ignoring the date?

[quote:]
open (IN, 'c:\tmp00.txt');
open (OUT,'>c:\tmp01.txt');
my %hTmp;
while (my $sLine = <IN>) {
print OUT $sLine unless ($hTmp{$sLine}++);
}
close (IN); close (OUT);
[/quote]
 
Can you share your data structure? It's hard to tell without that.
 
I should have posted this code instead...

open (IN, 'C:\tmp00.txt');
open (OUT,'>C:\tmp01.txt');
my %hTmp;
while (my $sLine = <IN>) {
print OUT $sLine unless ($hTmp{lc($sLine)}++);
}
close (IN); close (OUT);
 
Right now, the file looks like...

alphadata,10.10.10.110,server1
betadata,10.10.10.111,server1
charliedata,10.10.10.112,server1


I would like to use a file that looks like this...
9/9/2008,alphadata,10.10.10.110,server1
9/15/2008,betadata,10.10.10.111,server1
9/24/2008,charliedata,10.10.10.112,server1


Assuming there were two lines...
9/24/2008,charliedata,10.10.10.112,server1
10/18/2008,charliedata,10.10.10.112,server1

I would like to delete one of them, and preserve the "newest" for a result of...

10/18/2008,charliedata,10.10.10.112,server1
 
Code:
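open (IN, "<", "infile") or die "Can't read infile: $!";
open (OUT, ">", "outfile") or die "Can't write outfile: $!";
my $dup = {};
while (<IN>) {
   # Everything after the leading date is the key used for comparison.
   my ($uniq) = $_ =~ /^[0-9\/]+\,(.+)$/;
   # Print a line only the first time its key is seen.
   print OUT $_ unless ($dup->{$uniq}++);
}
close (OUT);
close (IN);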

Code:
[kirsle@epsilon ~]$ cat <<EOF > infile
> 9/9/2008,alphadata,10.10.10.110,server1
> 9/15/2008,betadata,10.10.10.111,server1
> 9/24/2008,charliedata,10.10.10.112,server1
> 9/24/2008,charliedata,10.10.10.112,server1
> 10/18/2008,charliedata,10.10.10.112,server1
> EOF
[kirsle@epsilon ~]$ perl
open (IN, "infile");
open (OUT, ">outfile");
my $dup = {};
while (<IN>) {
   my ($uniq) = $_ =~ /^[0-9\/]+\,(.+)$/;
   print OUT $_ unless ($dup->{$uniq}++);
}
close (OUT);
close (IN);
__END__
[kirsle@epsilon ~]$ cat outfile
9/9/2008,alphadata,10.10.10.110,server1
9/15/2008,betadata,10.10.10.111,server1
9/24/2008,charliedata,10.10.10.112,server1

That should get ya a step further. If you wanted the date too:

Code:
my ($date,$uniq) = $_ =~ /^([0-9\/]+)\,(.+)$/;

and then do whatever with the $date.
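For instance, one thing to do with the $date is turn it into a number that sorts chronologically. A minimal sketch, assuming the dates are always M/D/YYYY as in the sample data; date_key is a made-up helper name, not part of the code above:

Code:
```perl
use strict;
use warnings;

# Hypothetical helper: turn an M/D/YYYY date into a sortable YYYYMMDD integer.
# Assumes the month comes first, as in the sample file.
sub date_key {
    my ($date) = @_;
    my ($month, $day, $year) = split m{/}, $date, 3;
    return sprintf('%04d%02d%02d', $year, $month, $day) + 0;
}

my $line = "9/15/2008,betadata,10.10.10.111,server1\n";
if (my ($date, $uniq) = $line =~ m{^([0-9/]+),(.+)$}) {
    my $key = date_key($date);
    print "$key $uniq\n";   # prints "20080915 betadata,10.10.10.111,server1"
}
```

Two keys built this way can be compared with a plain numeric >, which is all you need to decide which of two duplicate lines is newer.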

-------------
Cuvou.com | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
*bored*

Code:
[kirsle@epsilon ~]$ perl
open (IN, "infile");
open (OUT, ">outfile");
my $dup = {};
while (<IN>) {
   my ($date,$uniq) = $_ =~ /^([0-9\/]+)\,(.+)$/;
   my @dates = split(/\//, $date, 3);
   my $int = join("",
      sprintf("%02d", $dates[2]),
      sprintf("%02d", $dates[0]),
      sprintf("%02d", $dates[1]),
   );
   unless ($dup->{$uniq} && $dup->{$uniq}->{int} > $int) {
      $dup->{$uniq}->{value} = $_;
      $dup->{$uniq}->{int} = $int;
   }
}
foreach my $line (sort keys %{$dup}) {
   print OUT $dup->{$line}->{value};
}
close (OUT);
close (IN);
__END__
[kirsle@epsilon ~]$ cat outfile
9/9/2008,alphadata,10.10.10.110,server1
9/15/2008,betadata,10.10.10.111,server1
10/18/2008,charliedata,10.10.10.112,server1

Code explanation:

It reads the lines, separates the date from the unique content, and turns the date into an integer (e.g. 9/15/2008 becomes 20080915) so that dates can be compared as plain numbers. It then files each line away in $dup->{$uniq}, keyed by the unique half of the line, storing the full line as {value} and the integer as {int}.

On later lines with the same unique part, it checks whether the integer from this line is at least as large as the one already stored. If so, it sets {value} and {int} to the values of the new line.

Since it can't predict whether a newer version of the current line is still to come, it can't write lines out as it reads them (if it printed every time it saw a newer date on a line, you'd still have duplicates in the final file), so the output doesn't follow the input order. To that end I just had it sort the results alphabetically by the unique part of each line.
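If the alphabetical order is a problem, one possible variation is to remember the order in which each unique part first shows up and print in that order at the end. This is only a sketch under the same M/D/YYYY assumption; newest_per_entry, %best and @order are illustrative names that don't appear in the script above:

Code:
```perl
use strict;
use warnings;

# Hypothetical helper: keep the newest line per entry, in first-seen order.
sub newest_per_entry {
    my @lines = @_;
    my (%best, @order);
    for my $line (@lines) {
        # Skip lines that don't start with an M/D/YYYY date and a comma.
        my ($m, $d, $y, $uniq) = $line =~ m{^(\d+)/(\d+)/(\d+),(.+)$} or next;
        my $int = sprintf('%04d%02d%02d', $y, $m, $d);
        # Record the entry's position the first time it is seen.
        push @order, $uniq unless exists $best{$uniq};
        # Keep the stored line only if it is strictly newer than this one.
        next if $best{$uniq} && $best{$uniq}{int} > $int;
        $best{$uniq} = { int => $int, value => $line };
    }
    return map { $best{$_}{value} } @order;
}

my @in = (
    "9/24/2008,charliedata,10.10.10.112,server1\n",
    "9/9/2008,alphadata,10.10.10.110,server1\n",
    "10/18/2008,charliedata,10.10.10.112,server1\n",
);
print newest_per_entry(@in);
```

With that input the 10/18/2008 charliedata line comes out first (charliedata was seen first), followed by the alphadata line. The output still has to wait until the whole file has been read.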

Questions? :)

 
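If the dates in the file are always in ascending order, the last occurrence of each entry carries the highest date, so keeping the last occurrence is enough (a filter that prints only the first occurrence would keep the oldest one instead). If the dates are in no particular order, then you will have to use something like kirsle has posted.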
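A rough sketch of the ascending-order case, where no date comparison is needed at all. It assumes the file really is sorted oldest to newest; last_per_entry is a made-up name for illustration:

Code:
```perl
use strict;
use warnings;

# Hypothetical helper: assumes the input is already in ascending date order,
# so the last line seen for an entry is the newest one.
sub last_per_entry {
    my @lines = @_;
    my (%last, @order);
    for my $line (@lines) {
        # Skip lines that don't start with a date and a comma.
        my ($uniq) = $line =~ m{^[0-9/]+,(.+)$} or next;
        push @order, $uniq unless exists $last{$uniq};
        $last{$uniq} = $line;    # later lines overwrite earlier ones
    }
    return map { $last{$_} } @order;
}

print last_per_entry(
    "9/24/2008,charliedata,10.10.10.112,server1\n",
    "10/18/2008,charliedata,10.10.10.112,server1\n",
);
```

If the file ever turns out not to be sorted, this silently keeps whichever line came last, so the date-comparing version is the safer choice.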

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Worked like a charm!
Now, I just have to figure out why:)
 