
Duplicates in a large file

Status
Not open for further replies.

nabana2

Technical User
Sep 26, 2005
21
ZA
Hi

I have a large file, approx. 10 million records or 600 MB. I would like to remove duplicates based on the first 10 fields. I have a script which builds an array to match against; it works fine on smaller files, but my PC falls over if I try to do the same thing with large files.

Is there a better way to process a large file like this with awk?

thanks
 
Hi

Interesting problem. I would try a file based solution. It will certainly be slower, but it will not fail.
Code:
while read str; do
  grep -q "$str" outputfile || echo "$str" >> outputfile
done < inputfile
The above is just a draft and ignores your requirement to check only the first 10 fields. Sorry for going off-topic.
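The draft could be made to key on just the first 10 fields by keeping the seen keys in a separate file. A minimal sketch, assuming single-space-separated fields; the names inputfile, outputfile and keysfile are placeholders, and the demo data is made up:
Code:
```shell
# Demo input (hypothetical): the first two lines share their first 10 fields
cat > inputfile <<'EOF'
a b c d e f g h i j 10
a b c d e f g h i j 20
a b c d e f g h i k 30
EOF

: > outputfile   # truncate previous results
: > keysfile     # one key (fields 1-10) per kept line

while read -r line; do
  # Build the comparison key from the first 10 fields only
  key=$(printf '%s\n' "$line" | cut -d' ' -f1-10)
  # -x: whole-line match, -F: fixed string (no regex surprises)
  grep -qxF "$key" keysfile || {
    printf '%s\n' "$key"  >> keysfile
    printf '%s\n' "$line" >> outputfile
  }
done < inputfile

cat outputfile
```
Note this runs one grep per input line, so on 10 million records it would be painfully slow; it just avoids holding anything in memory.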

Feherke.
 
man sort (-k and -u options)

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Thanks PH

Had to stick uniq in there to see what the duplicates were. This seems to do the trick.

sort -k 1,10 large.txt | uniq -d > results.txt

I presume there would be a way to use sort and awk to do a similar thing.
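One subtlety: uniq without options compares whole lines, so the pipeline above only reports lines that are entirely identical. After sorting, an awk pass can play the role of uniq -d while keying on just the first 10 fields. A sketch (the demo data is hypothetical; it prints the first line of each duplicate group):
Code:
```shell
# Demo input (hypothetical): the first two lines share their first 10 fields
cat > large.txt <<'EOF'
a b c d e f g h i j 10
a b c d e f g h i j 20
a b c d e f g h i k 30
EOF

# awk stand-in for "uniq -d", keyed on fields 1-10 only
sort large.txt | awk '
{
    key = $1
    for (i = 2; i <= 10; i++) key = key SUBSEP $i
    if (key == prev) {
        if (!dup) print prevline   # report the group once
        dup = 1
    } else dup = 0
    prev = key; prevline = $0
}' > results.txt

cat results.txt
```
Since adjacent lines are compared after sorting, no array of seen keys is needed, so memory use stays flat regardless of file size.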
 
Why not simply this ?
sort -k 1,10 -u large.txt > results.txt

Hope This Helps, PH.
 
Thanks PH

I did do that at first, which was fine, but then I realised I needed to see which lines were duplicated as well as which lines were unique.

Sorry, I deviated from my original question.
 
Hi

I now need to sum the last field for the duplicate rows, so for data like:

aaa bbb ccc 10
aaa ddd ddd 10
aaa bbb ccc 20

I would like to get

aaa bbb ccc 30
aaa ddd ddd 10

Yet again I can't use arrays, so I need to sort and then match the current record with the following one.

My attempt but nowhere close:

sort -k 1,3 | awk '{
rows = $1 "\t" $2 "\t" $3
qty = $4
}
{
if ( rows == $1 "\t" $2 "\t" $3 ) {
qty += $4
print rows, qty
} else {
print rows, qty
}

}' large.txt
 
sort -k 1,3 large.txt | awk '
$1"\t"$2"\t"$3!=rows{
if(NR>1)print rows"\t"qty
rows=$1"\t"$2"\t"$3;qty=0
}
{qty+=$4}
END{print rows"\t"qty}
'
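Running this on the sample data from above as a quick check (output is tab-separated, written to results.txt here just for inspection):
Code:
```shell
# Sample data from the thread
cat > large.txt <<'EOF'
aaa bbb ccc 10
aaa ddd ddd 10
aaa bbb ccc 20
EOF

# Sort on the key fields, then sum field 4 per group of fields 1-3
sort -k 1,3 large.txt | awk '
$1"\t"$2"\t"$3!=rows{
    if(NR>1)print rows"\t"qty
    rows=$1"\t"$2"\t"$3;qty=0
}
{qty+=$4}
END{print rows"\t"qty}
' > results.txt

cat results.txt
# aaa bbb ccc 30   (tab-separated)
# aaa ddd ddd 10
```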

Hope This Helps, PH.
 
Thanks PH
Works perfectly.
That was a silly mistake, giving no file to sort.
 