Removing duplicate lines in a file

MGoel · Jun 26, 2002

How can I remove duplicate lines in a file? The lines look like this:

203510 x x x x
203510 yyyyyyy
203525 zzzz
203510 x x x x
203510 yyyyyyy
203525 zzzz

I don't want the second 203510 line.

Thanks

Manoj

dickiebird · Jun 27, 2002

If order of lines isn't a problem then do:
sort -u -o outfile infile
;-)
Dickie Bird
db@dickiebird.freeserve.co.uk

MGoel · Jun 28, 2002

Dickie,

The order of lines is important. I just want the duplicate lines to disappear without changing the order of lines.

Thanks very much for your help

Manoj

MikeLacey · Jun 29, 2002

A Perl script, something like this:

use strict;
use warnings;

my $prev_line;

# Note: this only drops a line that repeats the line directly before it
while (<>) {
    print unless defined $prev_line && $prev_line eq $_;
    $prev_line = $_;
}
Mike
________________________________________________________________

"Experience is the comb that Nature gives us, after we are bald."

Is that a haiku?
I never could get the hang
of writing those things.

Einstein47 · Jul 1, 2002

You say that you don't want the second 203510 line, but does that mean the "yyyyyy" line? I'm assuming that there are more duplicate lines than you showed here. A good sample input with expected sample output would be the best way to identify what can be done.

We want to help, but it is hard when the question is so vague.

Einstein47
(How come we never see the headline, "Psychic Wins Lottery"?)

tdatgod · Jul 2, 2002

Hi,
you could do this with Perl:

my %junk;

while ( <STDIN> )
{
    chomp;
    # print a line only the first time it is seen
    if ( !defined $junk{$_} )
    {
        print "$_\n";
    }
    $junk{$_} = 1;
}

The element syntax is $junk{$_}, not %junk{$_}. However, the trick is to use the curly braces {}, which make it an associative array, and that should do what you want.
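
For comparison, the same "print it only the first time it is seen" idea can be sketched in awk, which may be easier to drop into a shell script. This is only a minimal sketch and not something posted in this thread: the function name dedupe_keep_order and the array name seen are made up for illustration.

```shell
#!/bin/sh
# dedupe_keep_order (hypothetical name): print each distinct line once,
# in order of first appearance. Reads stdin, writes stdout.
dedupe_keep_order() {
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true only then, and awk's default action prints the line.
    awk '!seen[$0]++'
}

# Example with the sample data from the question
printf '%s\n' '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' \
              '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' |
    dedupe_keep_order
```

Like the Perl version, this keeps every line in memory as an array key, so it suits files that fit comfortably in RAM.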

jmagave · Jul 5, 2002

I would do it in AWK and some piping:

Code:

awk '{lines[$0]=i++}
     END{for(j in lines) print lines[j],j}
    ' file1 | sort -n | cut -d" " -f2-

Jayr

tdatgod · Jul 5, 2002

Hi,
Actually, come to think of it, I have done this with sort.

egrep -n '.' <file> | sort -t':' +1 -2 +2 -3 +0n -1 | sed -e 's/:/ /' | uniq -1 | sort +0n -1 | cut -d' ' -f2-

What this does is add a line number to the beginning of each line, then sort by your fields, and only when they are identical does it sort by line number.

Then it UNIQifies your fields.

Then it re-sorts them into line number order.

Then it removes the line numbers.
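
The +1 -2 and uniq -1 forms above are the old-style option syntax, and newer sort and uniq implementations may not accept them. Here is a rough sketch of the same number, sort, uniq, re-sort, strip pipeline using -k and -f instead. It is an assumption-laden variant rather than a drop-in replacement: it compares whole lines rather than colon-separated fields, uses awk instead of egrep -n to add the line numbers (so empty lines are numbered too), and the function name dedupe_via_sort is made up for illustration.

```shell
#!/bin/sh
# dedupe_via_sort (hypothetical name): stdin -> stdout, keeps the first
# occurrence of every line and preserves the original order.
dedupe_via_sort() {
    tab=$(printf '\t')
    # 1. prefix each line with its line number and a tab
    awk '{ print NR "\t" $0 }' |
    # 2. group identical lines together, lowest line number first
    LC_ALL=C sort -t "$tab" -k2 -k1,1n |
    # 3. keep the first line of each group, ignoring the number field
    LC_ALL=C uniq -f 1 |
    # 4. restore the original order
    sort -n |
    # 5. drop the line numbers again
    cut -f2-
}

# Example with the sample data from the question
printf '%s\n' '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' \
              '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' |
    dedupe_via_sort
```

Unlike the in-memory approaches, sort can spill to temporary files, so this shape may be worth considering for very large inputs.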

-----

Einstein47 · Jul 7, 2002

UNIQify - I like that word.

I also like Integerize which is the process of making an integer out of a float, double, time, or string value.

I was on a project and we had a number of these new words. They really are needed in the programming vernacular.

Einstein47
(How come we never see the headline, "Psychic Wins Lottery"?)

MGoel · Jul 8, 2002

Hi guys,

Thank you very much for your replies.
Somebody asked me to clarify what the output should look like:

original:
203510 x x x x
203510 yyyyyyy
203525 zzzz
203510 x x x x
203510 yyyyyyy
203525 zzzz

Output after applying one of your solutions:
203510 x x x x
203525 zzzz
203510 x x x x
203525 zzzz

The first 203510 is the line I want. I hope it is clear now.
Sorry, I was on vacation and did not have time to try out your solutions. I will try them soon; I am not good at Perl and Unix scripting, but this is an opportunity for me to learn.

Thank you guys once again for excellent solutions and help.

Regards

Manoj

tdatgod · Jul 8, 2002

Hi,
OK, I am lost. Why is

203510 yyyyyyy

considered a duplicate line and

203510 x x x x
203525 zzzz

not? There seem to be 2 of each of these lines. What am I missing? Aren't they all duplicate lines?

-----

MGoel · Jul 8, 2002

Hello All,

The file used to look like this:
203500 abcd abcd
203510 x x x x
203525 zzzz
203510 x x x x
203525 zzzz

After some programming changes in another system, it became like this:
203500 abcd abcd
203510 x x x x
203510 yyyyyyy
203525 zzzz
203500 abcd abcd
203510 x x x x
203510 yyyyyyy
203525 zzzz
203500 ....

So you see, there was an unwanted second line starting with the number 203510. We needed to remove this second line in each repeated sequence of the record set.

I hope this is clear to you all now.

Thanks once again.

Manoj

wardy66 · Jul 9, 2002

What about something like

Code:

open ( DUP, "dups.txt" ) or die;
open ( NODUP, ">nodups.txt" ) or die;
my $line_num;
my $last_line_num = "this is not a line num";
while( <DUP> ) {
    ($line_num) = split;

    if ( $line_num ne $last_line_num ) {
        print NODUP;
    }

    $last_line_num = $line_num;
}

close( DUP ) or die;
close( NODUP ) or die;

wardy66 · Jul 9, 2002

Oops - I just realised I wasn't in the Perl forum. It's not going to be hugely different, except for the file IO.

The logic is the same ...

tdatgod · Jul 9, 2002

Perl is a perfectly valid language to write UNIX scripts in. You will see that my first post was Perl, and so were a couple of others.

If the forum were called UNIX SHELL SCRIPTING, then I guess there would be issues.

----

wardy66 · Jul 9, 2002

tdatgod - I agree. Unfortunately, not everyone does.

I often have "disagreements" regarding UNIX scripting. I prefer to use Perl but I understand that option is not available to a lot of people.

Anyway, to keep the peace, here's the (untested) Perl script in ksh:

Code:

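last_line_num="not valid"

# IFS= and -r keep each input line intact, whitespace included
while IFS= read -r line
do
    # first field = everything before the first space
    line_num=${line%% *}

    if [ "$line_num" != "$last_line_num" ]; then
        # copy the entire input line to nodups.txt
        printf '%s\n' "$line"
    fi

    last_line_num=$line_num
done < dups.txt > nodups.txt

A plain "read" with no variable name does not set $1, so the loop reads each line into a variable and takes the first field from that. Redirecting the output of the whole loop, rather than of each echo, stops nodups.txt being overwritten on every pass, and printf with the quoted variable keeps the whitespace.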

Next step would be to use awk, where we have $1, $2 and $0 for the whole line.
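
A minimal sketch of that awk step, assuming the record number is always the first whitespace-separated field. The function name drop_repeated_key and the variable prev are made up for illustration, and this is untested against the real file.

```shell
#!/bin/sh
# drop_repeated_key (hypothetical name): stdin -> stdout. Drops any line
# whose first field equals the first field of the line right before it.
drop_repeated_key() {
    awk '
        # NR == 1 makes sure the very first line is always printed
        NR == 1 || $1 != prev { print }
        { prev = $1 }
    '
}

# Example with the clarified sample data
printf '%s\n' '203500 abcd abcd' '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' \
              '203500 abcd abcd' '203510 x x x x' '203510 yyyyyyy' '203525 zzzz' |
    drop_repeated_key
```

Because awk prints $0 unchanged, the surviving lines keep their original whitespace, and only the previous key is held in memory.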

MGoel · Jul 9, 2002

Hi,

Thank you very much for the help you have given me.

I was looking to use Unix scripting, but a Perl solution will work too. The main thing is that I got the idea of how to do it using any kind of script.

It was interesting to see how people approach solving a problem. I have seen people in my own group prefer to use Perl as a quick solution for this kind of thing.

Once again I am very much obliged to you all for helping me with great ideas.

Manoj
