
Duplicates in a large file

Status
Not open for further replies.

nabana2

Technical User
Sep 26, 2005
21
ZA
Hi

I have a large file, approx. 10 million records or 600 MB. I would like to remove duplicates based on the first 10 fields. I have a script which builds an array to match against; it works fine on smaller files, but my PC falls over if I try to do the same thing with large files.

Is there a better way to process a large file like this with awk?

thanks
 
Hi

Interesting problem. I would try a file based solution. It will certainly be slower, but it will not fail.
Code:
while read str; do
  grep -q "$str" outputfile || echo "$str" >> outputfile
done < inputfile
The above is just a draft and ignores your requirement to check only the first 10 fields. Sorry for going off-topic.
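The draft could be made to key on just the first 10 fields by keeping the seen keys in a separate file. A minimal sketch, assuming single-space-separated fields; the names inputfile, outputfile and keysfile are placeholders, and the demo data is made up:
Code:
```shell
# Demo input (hypothetical): the first two lines share their first 10 fields
cat > inputfile <<'EOF'
a b c d e f g h i j 10
a b c d e f g h i j 20
a b c d e f g h i k 30
EOF

: > outputfile   # truncate previous results
: > keysfile     # one key (fields 1-10) per kept line

while read -r line; do
  # Build the comparison key from the first 10 fields only
  key=$(printf '%s\n' "$line" | cut -d' ' -f1-10)
  # -x: whole-line match, -F: fixed string (no regex surprises)
  grep -qxF "$key" keysfile || {
    printf '%s\n' "$key"  >> keysfile
    printf '%s\n' "$line" >> outputfile
  }
done < inputfile

cat outputfile
```
Note this runs one grep per input line, so on 10 million records it would be painfully slow; it just avoids holding anything in memory.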

Feherke.
 
man sort (-k and -u options)

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Thanks PH

Had to stick uniq in there to see what the duplicates were. This seems to do the trick.

sort -k 1,10 large.txt | uniq -d > results.txt

I presume there would be a way to use sort and awk to do a similar thing.
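One subtlety: uniq without options compares whole lines, so the pipeline above only reports lines that are entirely identical. After sorting, an awk pass can play the role of uniq -d while keying on just the first 10 fields. A sketch (the demo data is hypothetical; it prints the first line of each duplicate group):
Code:
```shell
# Demo input (hypothetical): the first two lines share their first 10 fields
cat > large.txt <<'EOF'
a b c d e f g h i j 10
a b c d e f g h i j 20
a b c d e f g h i k 30
EOF

# awk stand-in for "uniq -d", keyed on fields 1-10 only
sort large.txt | awk '
{
    key = $1
    for (i = 2; i <= 10; i++) key = key SUBSEP $i
    if (key == prev) {
        if (!dup) print prevline   # report the group once
        dup = 1
    } else dup = 0
    prev = key; prevline = $0
}' > results.txt

cat results.txt
```
Since adjacent lines are compared after sorting, no array of seen keys is needed, so memory use stays flat regardless of file size.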
 
Why not simply this ?
sort -k 1,10 -u large.txt > results.txt

Hope This Helps, PH.
 
Thanks PH

I did do that at first, which was fine, but then I realised I needed to see which lines were duplicated as well as which lines were unique.

Sorry, I deviated from my original question.
 
Hi

I now need to sum the last field for the duplicate rows, so for data like:

aaa bbb ccc 10
aaa ddd ddd 10
aaa bbb ccc 20

I would like to get

aaa bbb ccc 30
aaa ddd ddd 10

Yet again I can't use arrays, so I need to sort and then match the current record with the following one.

My attempt but nowhere close:

sort -k 1,3 | awk '{
rows = $1 "\t" $2 "\t" $3
qty = $4
}
{
if ( rows == $1 "\t" $2 "\t" $3 ) {
qty += $4
print rows, qty
} else {
print rows, qty
}

}' large.txt
 
sort -k 1,3 large.txt | awk '
$1"\t"$2"\t"$3!=rows{
if(NR>1)print rows"\t"qty
rows=$1"\t"$2"\t"$3;qty=0
}
{qty+=$4}
END{print rows"\t"qty}
'
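Running this on the sample data from above as a quick check (output is tab-separated, written to results.txt here just for inspection):
Code:
```shell
# Sample data from the thread
cat > large.txt <<'EOF'
aaa bbb ccc 10
aaa ddd ddd 10
aaa bbb ccc 20
EOF

# Sort on the key fields, then sum field 4 per group of fields 1-3
sort -k 1,3 large.txt | awk '
$1"\t"$2"\t"$3!=rows{
    if(NR>1)print rows"\t"qty
    rows=$1"\t"$2"\t"$3;qty=0
}
{qty+=$4}
END{print rows"\t"qty}
' > results.txt

cat results.txt
# aaa bbb ccc 30   (tab-separated)
# aaa ddd ddd 10
```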

Hope This Helps, PH.
 
Thanks PH
Works perfectly.
That was a silly mistake, giving no file to sort.
 