
Removing duplicate lines in a file

Status
Not open for further replies.

MGoel

Programmer
Jan 26, 2002
17
US
How can I remove duplicate lines in a file? The lines look like this:

203510 x x x x
203510 yyyyyyy
203525 zzzz
203510 x x x x
203510 yyyyyyy
203525 zzzz

I don't want the second 203510 line.

Thanks

Manoj
 
If order of lines isn't a problem then do:
sort -u -o outfile infile
;-)
Dickie Bird
db@dickiebird.freeserve.co.uk
 
Dickie,

The order of lines is important. I just want the duplicate lines to disappear without changing the order of lines.

Thanks very much for your help

Manoj
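For what it's worth, the standard awk idiom for order-preserving deduplication (it was not posted in this thread, so treat it as a sketch) is a one-liner:

```shell
# Print each line only the first time it is seen; order is preserved.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true exactly once per distinct line, triggering the default print.
awk '!seen[$0]++' infile > outfile
```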


 
A Perl script, something like this (it removes consecutive duplicate lines):

use strict;
use warnings;

my $prev_line;

while (<>) {
    print $prev_line if defined $prev_line && $prev_line ne $_;
    $prev_line = $_;
}
print $prev_line if defined $prev_line;
Mike
________________________________________________________________

"Experience is the comb that Nature gives us, after we are bald."

Is that a haiku?
I never could get the hang
of writing those things.
 
You say that you don't want the second 203510 line, but does that mean the "yyyyyyy" line? I'm assuming that there are more duplicate lines than you showed here. A good sample input with expected sample output would be the best way to identify what can be done.

We want to help, but it is hard when the question is so vague.

Einstein47
(How come we never see the headline, "Psychic Wins Lottery"?)
 
Hi,
you could do this with perl


my %junk;
while ( <STDIN> )
{
    chomp;
    if ( !exists $junk{$_} )
    {
        print "$_\n";
    }
    $junk{$_} = 1;
}

For an element it is $junk{$_} (the hash as a whole is %junk). The trick is to use the curly braces {}, which make it an associative array: it remembers every line already seen, so only the first occurrence is printed, and that should do what you want.
 
I would do it in AWK and some piping:
Code:
awk '{lines[$0]=i++}
     END{for(j in lines) print lines[j],j}
    ' file1 | sort -n | cut -d" " -f2-
Jayr
 
Hi,
Actually come to think about it I have done this with sort.

egrep -n '.' <file> | sort -t':' +1 -2 +2 -3 +0n -1 | sed -e 's/:/ /' | uniq -1 | sort +0n -1 | cut -d' ' -f2-

What this does is add a line number to the beginning of each line, then sort by your fields; only when the fields are identical does it fall back to sorting by line number.

Then it UNIQifies your fields

Then resorts them into line number order

Then it removes the line numbers..

-----
 
UNIQify - I like that word.

I also like Integerize, which is the process of making an integer out of a float, double, time, or string value.

I was on a project and we had a number of these new words. They really are needed in the programming vernacular.

Einstein47
(How come we never see the headline, "Psychic Wins Lottery"?)
 
Hi guys,

Thank you very much for your replies.
Somebody asked me to clarify what the output should look like.

original:
203510 x x x x
203510 yyyyyyy
203525 zzzz
203510 x x x x
203510 yyyyyyy
203525 zzzz

Output after applying your logic:
203510 x x x x
203525 zzzz
203510 x x x x
203525 zzzz

The first 203510 line is the one I want. I hope it is clear now.
Sorry, I was on vacation and did not have time to try out your solutions. I will try soon; I am not good at Perl and Unix scripting, but this is an opportunity for me to learn.

Thank you guys once again for excellent solutions and help.

Regards

Manoj


 
Hi,
OK I am lost. why is

203510 yyyyyyy

considered a duplicate line and


203510 x x x x
203525 zzzz


not? There seem to be 2 of each of these lines; what am I missing? Aren't they all duplicate lines?

-----
 
Hello All,

The file used to previously look like this:
203500 abcd abcd
203510 x x x x
203525 zzzz
203510 x x x x
203525 zzzz

After some programming changes from some other system, it became like this:
203500 abcd abcd
203510 x x x x
203510 yyyyyyy
203525 zzzz
203500 abcd abcd
203510 x x x x
203510 yyyyyyy
203525 zzzz
203500 ....

So you see, there was an unwanted second line starting with the number 203510. We needed to remove this second line in each repeated sequence of records.

I hope this is clear to you all now.

Thanks once again.

Manoj

 
What about something like
Code:
open ( DUP, "dups.txt" ) or die;
open ( NODUP, ">nodups.txt" ) or die;
my $line_num;
my $last_line_num = "this is not a line num";
while( <DUP> ) {
    ($line_num) = split;

    if ( $line_num ne $last_line_num ) {
        print NODUP $_;    # copy the whole input line
    }

    $last_line_num = $line_num;
}

close( DUP ) or die;
close( NODUP ) or die;
 
Oops - I just realised I wasn't in the Perl forum. It's not going to be hugely different, except for the file IO.

The logic is the same ...
 


Perl is a perfectly valid language to write UNIX scripts in. You will see that my first post was Perl, and so were a couple of others.


If the forum was called UNIX SHELL SCRIPTING, then I guess there would be issues.

----
 
tdatgod - I agree. Unfortunately, not everyone does.

I often have "disagreements" regarding UNIX scripting. I prefer to use Perl but I understand that option is not available to a lot of people.

Anyway, to keep the peace, here's the (untested) ksh version of the Perl script:

Code:
last_line_num="not valid"

while read -r first rest
do
    if [ "$first" != "$last_line_num" ]; then
        # copy the entire input line to nodups.txt
        echo "$first $rest" >> nodups.txt    # Not sure about this
    fi

    last_line_num=$first
done < dups.txt

Like I said, I can't remember the right way to copy the entire output line. This is probably a dodgy way (if it works) as it loses whitespace. Maybe you don't mind ...

Next step would be to use awk, where we have $1, $2 and $0 for the whole line.
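As a sketch of that awk step (assuming, as described earlier in the thread, that a line is a duplicate when its first field matches the previous line's):

```shell
# Print a line only when its first field differs from the previous line's;
# this drops the unwanted second 203510 line in each repeated sequence
# while leaving the overall order untouched.
awk '$1 != prev { print } { prev = $1 }' dups.txt > nodups.txt
```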
 
Hi,

Thank you very much for the help you have given me.

I was looking to use Unix scripting, but a Perl solution will work too. The main thing is that I got the idea of how to do it using any kind of script.

It was interesting to see how people would approach solving
a problem. I have seen in my own group that people prefer to use
Perl as a quick solution to these kinds of things.

Once again I am very much obliged to you all for helping me with great ideas.

Manoj

 