
grep a big file

Status
Not open for further replies.

sasuser2006 (Technical User)
May 8, 2006 · 32 posts · US
I've got a large file that I want to bounce against another large file to see which records exist in both files. The only problem is that one of the files has over half a billion records and the other has 200 million. I'm worried that a grep command would exhaust the memory on the server. Does anybody have a less resource-intensive way to accomplish the same thing as grep, e.g. a shell or Perl script?
 
You could try using the 'comm' utility.
[tt]comm -12 file1 file2[/tt]
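One caveat: [tt]comm[/tt] expects both inputs to already be sorted, so in practice you'd sort them first. A minimal sketch (the .sorted file names are just for illustration):

```shell
# comm requires sorted input; sort each file to a temp copy first
sort file1 > file1.sorted
sort file2 > file2.sorted
# -12 suppresses columns 1 and 2, leaving only lines common to both files
comm -12 file1.sorted file2.sorted
```

With files this size, sort will spill to temp space on disk, so make sure the tmp volume is big enough (or point sort's -T option at a roomier directory).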

--
 
IMHO this is not something to attempt with plain Unix commands; the number of records is too high. Consider loading the files into two database tables and getting your results with an SQL JOIN query.
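For example, with sqlite3 (purely as an illustration; any RDBMS would do, and the database, table, and column names here are invented). It assumes the records contain no '|' character, which is sqlite3's default import separator:

```shell
# Load each file into a one-column table, index the smaller side,
# and let the database do the matching.
sqlite3 match.db <<'SQL'
CREATE TABLE t1 (rec TEXT);
CREATE TABLE t2 (rec TEXT);
.import file1 t1
.import file2 t2
CREATE INDEX idx_t2 ON t2(rec);
SELECT DISTINCT t1.rec FROM t1 JOIN t2 ON t1.rec = t2.rec ORDER BY t1.rec;
SQL
```

The index lets the join probe t2 instead of scanning it for every t1 record, which is the whole point of going the database route at this scale.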


HTH,

p5wizard
 
Could you not merge the files into one and then use uniq? It wouldn't be fast, though.

It's a bit more work, but I like the look of p5's solution, and it would make a useful FAQ for others.

Mike

"A foolproof method for sculpting an elephant: first, get a huge block of marble, then you chip away everything that doesn't look like an elephant."

 
Not just memory: the disk writes could also exceed the tmp volume size(s).

Don't forget to index your database tables, and use something fast like SQL*Loader to populate them.

Normally,
sort file1 file2 | uniq -d

Cheers,
ND [smile]

[small]bigoldbulldog AT hotmail[/small]
 
bigoldbulldog said:
Normally,
sort file1 file2 | uniq -d
Ah, but what if a record exists twice in [tt]file1[/tt], but not at all in [tt]file2[/tt]? That pipeline would report a false duplicate. If such a situation could occur, then, in bash:

Code:
sort -u file1 |cat - <(sort -u file2) |sort |uniq -d

Too bad it needs three [tt]sort[/tt]s.
 
Good point, and one to watch out for if using SQL on the more tenable database route as well. A potentially fun thread.

I see the logic but AFAIK only zsh and bash will run it.
Code:
sort -u file1 |cat - <(sort -u file2) |sort |uniq -d

How about this instead (and to head off the UUOC pundits):
Code:
(sort -u file1; sort -u file2) | sort | uniq -d
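A quick sanity check with toy data (the file contents here are made up) shows it only reports true cross-file matches:

```shell
# "a" appears twice in file1 but never in file2, so it must NOT be
# reported; "b" is in both files, so it must be.
printf 'a\na\nb\n' > file1
printf 'b\nc\n'    > file2
(sort -u file1; sort -u file2) | sort | uniq -d
# prints only: b
```

The sort -u on each file collapses within-file duplicates before the streams are combined, so uniq -d can only fire on records that came from both files.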

Cheers,
ND [smile]

[small]bigoldbulldog AT hotmail[/small]
 
Nice; more portable, more readable, and simpler.

I think I could have talked my way out of a ticket by the UUOC police, but I'd have been in trouble if it were the KISS police.
 
