Efficient Bash Script Tip for Large Files?

Status
Not open for further replies.

jouell

MIS
Nov 19, 2002
304
US
Hi All,


I have 2 large files (~2GB each).

They both contain a list of URLs.

I believe they should roughly overlap.

So, I have a script that greps for the URLs that exist in file1 but not in file2:


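#!/bin/bash
file1=urls1.txt
file2=urls2.txt
file3=not_found_in_1.txt

# For every URL in file1, scan all of file2 for an exact match.
while IFS= read -r line; do

# -F: treat the URL as a literal string, -x: match the whole line, -q: print nothing
grep -qxF -- "$line" "$file2" || echo "$line not found in file2" >> "$file3"

done < "$file1"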


It works, but it takes 48+ hours, and I am sure there is something faster.

I know perl/python could probably speed this up, but I am more curious about any tips/thoughts on keeping this in bash.

Thanks
-jouell
 
jouell,

You might be able to use diff to show which lines don't exist in both files and then carry out the rest of your logic accordingly.


Steven
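
One way the diff suggestion could be sketched, assuming both files are sorted first so that line order does not produce false differences. The function name only_in_first and the sample URLs are made up for illustration. On multi-gigabyte inputs diff may need a lot of memory, so treat this as a demonstration of the idea rather than a recommendation.

```shell
#!/bin/bash
# Hypothetical helper: print the lines that occur in the first file
# but not in the second one.
only_in_first() {
    workdir=$(mktemp -d)
    # Sort both inputs in byte order and drop duplicates so that diff
    # compares the same ordering on both sides.
    LC_ALL=C sort -u "$1" > "$workdir/a"
    LC_ALL=C sort -u "$2" > "$workdir/b"
    # In diff's default output, lines found only in the first file
    # are prefixed with "< "; keep those and strip the prefix.
    diff "$workdir/a" "$workdir/b" | sed -n 's/^< //p'
    rm -rf "$workdir"
}

# Tiny demonstration with made-up URLs
demo=$(mktemp -d)
printf '%s\n' http://a.example/ http://b.example/ http://c.example/ > "$demo/urls1.txt"
printf '%s\n' http://c.example/ http://a.example/ http://d.example/ > "$demo/urls2.txt"
only_in_first "$demo/urls1.txt" "$demo/urls2.txt"   # prints http://b.example/
```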
 
If the files are sorted, you may want to try the comm command - have a look at the man page first.



HTH,

p5wizard
 
Another vote for [tt]comm[/tt]. That's exactly what it's designed to do and it's VERY fast. You'll have to [tt]sort[/tt] the files first, but you'll be looking at seconds or minutes, not hours.


 
Of course, sorting the files first may also take hours...


HTH,

p5wizard
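
On the cost of sorting: a minimal sketch of one common way to keep it down, assuming plain byte-order comparison is acceptable for a list of URLs. Setting LC_ALL=C avoids locale-aware collation, which is often noticeably slower, and -u drops duplicate lines during the sort. The function name sort_urls is hypothetical. GNU sort also has tuning options such as -S (buffer size), -T (temporary directory) and --parallel; they are left out here because they are not portable.

```shell
#!/bin/bash
# Hypothetical helper: sort a URL list in plain byte order, drop
# duplicate lines, and write the result next to the input file.
sort_urls() {
    LC_ALL=C sort -u "$1" > "$1.sorted"
}
```

Calling sort_urls urls1.txt would produce urls1.txt.sorted. Whatever collation is used for the sort should also be used for the later comm call, for example LC_ALL=C comm.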
 
A simple sort of URLs should be MUCH faster than [tt]grep[/tt]ping for every line of one file in the other, which is the method he's tried: that approach rescans the whole 2GB file once per URL. Even if it's one hour for each sort, then minutes for the [tt]comm[/tt], that total of 2-plus hours is MUCH better than the 48+ hours he's getting now.
Code:
#!/bin/bash

file1=urls1.txt
file2=urls2.txt

file3=in_1_not_in_2.txt
file4=in_2_not_in_1.txt
file5=in_both.txt

# comm needs both inputs sorted in the same collation order
sort "${file1}" > "${file1}.sorted"
sort "${file2}" > "${file2}.sorted"

# -23: lines only in file1, -13: lines only in file2, -12: lines in both
comm -23 "${file1}.sorted" "${file2}.sorted" > "${file3}"
comm -13 "${file1}.sorted" "${file2}.sorted" > "${file4}"
comm -12 "${file1}.sorted" "${file2}.sorted" > "${file5}"
Guaranteed to be much faster than grepping each line in a 2GB file against another 2GB file.
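
To make the three comm calls concrete, here is a small worked example on made-up data. The URLs, the temporary directory and the variable names are assumptions for illustration only. comm prints three columns (lines only in the first file, lines only in the second, lines in both), and the -1, -2 and -3 options suppress the corresponding column.

```shell
#!/bin/bash
# Two tiny, already sorted sample files (made-up URLs)
demo=$(mktemp -d)
printf '%s\n' http://a.example/ http://b.example/ http://c.example/ > "$demo/urls1.sorted"
printf '%s\n' http://b.example/ http://c.example/ http://d.example/ > "$demo/urls2.sorted"

# Suppress columns 2 and 3: keep lines found only in the first file
only_in_1=$(LC_ALL=C comm -23 "$demo/urls1.sorted" "$demo/urls2.sorted")
# Suppress columns 1 and 3: keep lines found only in the second file
only_in_2=$(LC_ALL=C comm -13 "$demo/urls1.sorted" "$demo/urls2.sorted")
# Suppress columns 1 and 2: keep lines found in both files
in_both=$(LC_ALL=C comm -12 "$demo/urls1.sorted" "$demo/urls2.sorted")

echo "only in 1: $only_in_1"   # http://a.example/
echo "only in 2: $only_in_2"   # http://d.example/
echo "in both:"                # followed by the b and c URLs, one per line
echo "$in_both"
```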

 
Thanks All

I think that's a great utility! I had both files sorted and clearly comm is very fast!

The big key seems to be to make sure the files are sorted.

Thanks again!
-jouell
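
Since sorted input is the key requirement, a quick check before running comm can catch a file that was sorted differently or not at all. A minimal sketch, assuming byte-order sorting (LC_ALL=C) was used; the function names is_sorted and compare_sorted are hypothetical. sort -c only checks the order and reports the result through its exit status.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is sorted in byte order.
is_sorted() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

# Hypothetical helper: print the lines only in the first file, but
# refuse to run comm if either input is not sorted.
compare_sorted() {
    for f in "$1" "$2"; do
        if ! is_sorted "$f"; then
            echo "$f is not sorted; try: LC_ALL=C sort -o $f $f" >&2
            return 1
        fi
    done
    LC_ALL=C comm -23 "$1" "$2"
}
```

For example, compare_sorted urls1.txt.sorted urls2.txt.sorted > in_1_not_in_2.txt would either write the result or stop with a message naming the unsorted file.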
 