Efficient Bash Script Tip for Large Files?

jouell · Apr 23, 2009

Hi All,

I have 2 large files (~2GB each).

They both contain a list of URLS.

I believe they should roughly overlap.

So, I have a script that uses grep to find the URLs that exist in file1 but not in file2:

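#!/bin/bash
file1=urls1.txt
file2=urls2.txt
file3=not_found_in_1.txt

# For each URL in file1, log it if file2 has no identical line.
while IFS= read -r line; do

# -q: no output, -x: match the whole line, -F: fixed string, not a regex
grep -qxF -- "$line" "$file2" || echo "$line not found in file2" >> "$file3"

done < "$file1"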

It works, but it takes 48+ hours, and I am sure there is something faster.

I know perl/python could probably speed this up, but I am more curious about tips/thoughts on keeping this in bash.

Thanks
-jouell

StevenR77 · Apr 23, 2009

jouell,

You might be able to use diff to show which lines don't exist in both files and then carry out the rest of your logic accordingly.

Steven

p5wizard · Apr 23, 2009

If the files are sorted, you may want to try the comm command - have a look at the man page first.

HTH,

p5wizard
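
For anyone who hasn't used comm before, here is a minimal sketch of what it prints for two small, already-sorted lists. The example.example-style URLs and the scratch directory are made up for illustration; LC_ALL=C is set on the assumption that both files were sorted in plain byte order.

```shell
#!/bin/bash
# Sketch: comm on two tiny, already-sorted lists (illustrative data only).
export LC_ALL=C   # byte-order collation, so sort and comm agree on ordering

demo_dir=$(mktemp -d)   # hypothetical scratch directory
printf '%s\n' http://a.example http://b.example http://c.example > "$demo_dir/one.sorted"
printf '%s\n' http://b.example http://c.example http://d.example > "$demo_dir/two.sorted"

# With no options comm prints three tab-separated columns:
# column 1 = only in the first file, column 2 = only in the second, column 3 = in both.
comm "$demo_dir/one.sorted" "$demo_dir/two.sorted"

# -2 and -3 suppress columns 2 and 3, leaving the lines unique to the first file.
only_in_one=$(comm -23 "$demo_dir/one.sorted" "$demo_dir/two.sorted")
echo "only in first: $only_in_one"

rm -rf "$demo_dir"
```

The last line should report http://a.example, the one URL that appears only in the first list.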

SamBones · Apr 23, 2009

Another vote for [tt]comm[/tt]. That's exactly what it's designed to do and it's VERY fast. You'll have to [tt]sort[/tt] the files first, but you'll be looking at seconds or minutes, not hours.

p5wizard · Apr 23, 2009

Of course, sorting the files first may also take hours...

HTH,

p5wizard
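
If sort time is the worry, a few options may help. The sketch below assumes GNU sort (the -S, --parallel and -T options are GNU extensions); sort_urls is a hypothetical helper name, and the buffer size and thread count are placeholder values to tune for the machine, not recommendations from this thread.

```shell
#!/bin/bash
# Sketch: sort a large URL list with GNU sort tuning options.
# sort_urls is a hypothetical helper; 512M and 4 are placeholder values.
sort_urls() {
    local input=$1 output=$2
    # LC_ALL=C   compares raw bytes instead of applying locale collation rules
    # -S 512M    caps the in-memory buffer before spilling to temporary files
    # --parallel number of sorts to run concurrently
    # -T         directory for temporary files (needs room for the spill data)
    # -o         writes the result to the named output file
    LC_ALL=C sort -S 512M --parallel=4 -T "${TMPDIR:-/tmp}" -o "$output" -- "$input"
}
```

Usage would look like sort_urls urls1.txt urls1.txt.sorted. Whatever collation is used for sorting should also be in effect when comm runs, otherwise comm may complain that its input is not in sorted order.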

SamBones · Apr 24, 2009

A simple sort of URLs should be MUCH faster than [tt]grep[/tt]ping for every occurrence of one in the other, which is the method he's tried. Even if it's one hour for each sort, then minutes for the [tt]comm[/tt], that's 2 hours plus, which is MUCH better than the 48+ hours he's getting now.

Code:

#!/bin/bash

file1=urls1.txt
file2=urls2.txt

file3=in_1_not_in_2.txt
file4=in_2_not_in_1.txt
file5=in_both.txt

# comm requires both inputs to be sorted the same way
sort "${file1}" > "${file1}.sorted"
sort "${file2}" > "${file2}.sorted"

# -23: lines only in file1, -13: lines only in file2, -12: lines in both
comm -23 "${file1}.sorted" "${file2}.sorted" > "${file3}"
comm -13 "${file1}.sorted" "${file2}.sorted" > "${file4}"
comm -12 "${file1}.sorted" "${file2}.sorted" > "${file5}"

Guaranteed to be much faster than grepping each line in a 2GB file against another 2GB file.
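
A variation on the same idea, for cases where the intermediate .sorted files aren't wanted: bash process substitution can feed the sorted streams straight into comm. This is only a sketch; in_first_only is a hypothetical function name, and the -u flag (drop duplicate lines) is an added assumption that duplicate URLs don't matter.

```shell
#!/bin/bash
# Sketch: lines in the first file but not the second, without temp files.
# in_first_only is a hypothetical name; requires bash for <( ... ).
in_first_only() {
    # Each <( ... ) runs sort in the background and hands comm a readable path.
    # LC_ALL=C is repeated inside because a command-prefix assignment does not
    # apply to the commands run inside a process substitution.
    LC_ALL=C comm -23 <(LC_ALL=C sort -u -- "$1") <(LC_ALL=C sort -u -- "$2")
}
```

Called as in_first_only urls1.txt urls2.txt > in_1_not_in_2.txt. The trade-off is that each call re-sorts both files, so if all three outputs are needed, keeping the sorted copies on disk as in the script above is likely cheaper.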

jouell · Apr 24, 2009

Thanks All

I think that's a great utility! I had both files sorted and clearly comm is very fast!

The big key seems to be to make sure the files are sorted.

Thanks again!
-jouell
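
Since sorted input is the key, it can be worth checking before running comm. A small sketch using sort -c, which only verifies order and writes no sorted output; is_sorted is a hypothetical helper name, and LC_ALL=C assumes the files were sorted in byte order.

```shell
#!/bin/bash
# Sketch: check whether a file is already sorted before handing it to comm.
# is_sorted is a hypothetical helper name.
is_sorted() {
    # sort -c exits 0 if the file is in order and non-zero at the first
    # out-of-order line; its diagnostic message is discarded here.
    LC_ALL=C sort -c -- "$1" 2>/dev/null
}
```

For example: is_sorted urls1.txt || echo "urls1.txt needs sorting first".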
