
grep a big file

Status
Not open for further replies.

sasuser2006 (Technical User)
May 8, 2006 · 32 posts · US
I've got a large file that I want to bounce against another large file to see which records exist in both files. The only problem is that one of the files has over half a billion records and the other has 200 million. I'm worried that a grep command would exhaust the memory on the server. Does anybody have a less resource-intensive way to accomplish the same thing as grep, e.g. a shell or Perl script?
 
You could try using the 'comm' utility.
[tt]comm -12 file1 file2[/tt]
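One caveat: [tt]comm[/tt] expects both inputs to already be sorted, so in practice you'd sort them first. A minimal sketch (the .sorted file names are just for illustration):

```shell
# comm requires sorted input; sort each file to a temp copy first
sort file1 > file1.sorted
sort file2 > file2.sorted
# -12 suppresses columns 1 and 2, leaving only lines common to both files
comm -12 file1.sorted file2.sorted
```

With files this size, sort will spill to temp space on disk, so make sure the tmp volume is big enough (or point sort's -T option at a roomier directory).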

--
 
IMHO this is not something to attempt with plain Unix commands; the number of records is too high. Consider loading the files into two database tables and getting your results with an SQL JOIN query.
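For example, with sqlite3 (purely as an illustration; any RDBMS would do, and the database, table, and column names here are invented). It assumes the records contain no '|' character, which is sqlite3's default import separator:

```shell
# Load each file into a one-column table, index the smaller side,
# and let the database do the matching.
sqlite3 match.db <<'SQL'
CREATE TABLE t1 (rec TEXT);
CREATE TABLE t2 (rec TEXT);
.import file1 t1
.import file2 t2
CREATE INDEX idx_t2 ON t2(rec);
SELECT DISTINCT t1.rec FROM t1 JOIN t2 ON t1.rec = t2.rec ORDER BY t1.rec;
SQL
```

The index lets the join probe t2 instead of scanning it for every t1 record, which is the whole point of going the database route at this scale.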


HTH,

p5wizard
 
Could you not merge the files into one and then use uniq? It wouldn't be fast, though.

It's a bit more work, but I like the look of p5's solution, and it would make a useful FAQ for others.

Mike

"A foolproof method for sculpting an elephant: first, get a huge block of marble, then you chip away everything that doesn't look like an elephant."

 
Not just memory: the disk writes could also exceed the tmp volume size(s).

Don't forget to index your database tables, and use something fast like SQL*Loader to populate them.

Normally,
sort file1 file2 | uniq -d

Cheers,
ND [smile]

[small]bigoldbulldog AT hotmail[/small]
 
bigoldbulldog said:
Normally,
sort file1 file2 | uniq -d
Ah, but what if a record exists twice in [tt]file1[/tt], but not at all in [tt]file2[/tt]? That pipeline would report a false duplicate. If such a situation could occur, then, in bash:

Code:
sort -u file1 |cat - <(sort -u file2) |sort |uniq -d

Too bad it needs three [tt]sort[/tt]s.
 
Good point, and one to watch out for if using SQL on the more tenable database route as well. A potentially fun thread.

I see the logic but AFAIK only zsh and bash will run it.
Code:
sort -u file1 |cat - <(sort -u file2) |sort |uniq -d

How about this instead (and to head off the UUOC pundits):
Code:
(sort -u file1; sort -u file2) | sort | uniq -d
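A quick sanity check with toy data (the file contents here are made up) shows it only reports true cross-file matches:

```shell
# "a" appears twice in file1 but never in file2, so it must NOT be
# reported; "b" is in both files, so it must be.
printf 'a\na\nb\n' > file1
printf 'b\nc\n'    > file2
(sort -u file1; sort -u file2) | sort | uniq -d
# prints only: b
```

The sort -u on each file collapses within-file duplicates before the streams are combined, so uniq -d can only fire on records that came from both files.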

Cheers,
ND [smile]

[small]bigoldbulldog AT hotmail[/small]
 
Nice; more portable, more readable, and simpler.

I think I could have talked my way out of a ticket by the UUOC police, but I'd have been in trouble if it were the KISS police.
 
