
search and count 2

Status
Not open for further replies.

demis001

Programmer
Aug 18, 2008
94
US
Hi guys,

Is there an easy way to search and count the following?

File1
TCGTCTGCCGTTTTTT
TCTCTGAGGGTCGGT


File2

TCGTCTGCCGTTTTTT
TCGTCTGCCGTTTTTTCCTTG
TCGTCTGCCGTTTTTTCCTTTTCATCTTAAAAAAAA
TCGTCTGCCGTTTTTTCGTTGGCAACAATAAAGTCT
TCGTCTGCCGTTTTTTG
TCGTCTGCCGTTTTTTG
TCGTCTGCCGTTTTTTG
TCGTCTGCCGTTTTTTGATTTTCTCATGCCGACTTT
TCGTCTGCCGTTTTTTGC
TCGTCTGCCGTTTTTTGCT
TCGTCTGCCGTTTTTTGCTCTACCCCCAAAACCCTA
TCGTCTGCCGTTTTTTGCTT
TCGTCTGCCGTTTTTTGCTTG
TCGTCTGCCGTTTTTTGCTTG
TCGTCTGCCGTTTTTTGCTTG
TCGTCTGCCGTTTTTTGCTTGAAAACACACAAAATC
TCGTCTGCCGTTTTTTGCTTGGAATAAAGTCTTAGC
TCGTCTGCCGTTTTTTGCTTGTAAACATACCATCTT
TCGTCTGCCGTTTTTTGCTTGTAACTAAATAAGTAT
TCGTCTGCCGTTTTTTGCTTGTACACATT
TCGTCTGCCGTTTTTTGCTTT
TCGTCTGCCGTTTTTTGCTTT
TCGTCTGCCGTTTTTTGCTTT
TCGTCTGCCGTTTTTTGCTTTAAATATAA
TCGTCTGCCGTTTTTTGCTTTATAAA.AAAAATATA
TCGTCTGCCGTTTTTTGCTTTT
TCGTCTGCCGTTTTTTGCTTTTAAAAATAAAATTTT
TCGTCTGCCGTTTTTTGGTT
TCGTCTGCCGTTTTTTGGTTGAACAACACTACAAAA
TCTCTGAGGGTCGGT
TCTCTGAGGGTCGGT
TCTCTGAGGGTCGGT
TCTCTGAGGGTCGGTTCTTATGCCGTCTTCTGCTTT

I can count a single entry using grep as follows:
grep "^TCTCTGAGGGTCGGT" file2.txt | wc -l

I need output like:

TCGTCTGCCGTTTTTT count
TCTCTGAGGGTCGGT count


I don't know how to search for every line of File1 in File2 and count the matches, which seems to require some sort of loop. By the way, I did the count using Perl and found some discrepancies compared with the counts from grep, so I want to check all entries using awk.

Thanks.
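
For reference, the loop described above can be sketched with grep alone by reading one pattern per line from the first file. This is only a baseline under stated assumptions: the scratch directory, the file names and the sample reads are hypothetical, and the patterns are assumed to contain no regular expression metacharacters. It rescans the second file once per pattern, so it is unlikely to scale well to many patterns.

```shell
# Hypothetical scratch directory and sample data; real inputs would replace these.
tmp=$(mktemp -d)
printf '%s\n' TCGTCTGCCGTTTTTT TCTCTGAGGGTCGGT > "$tmp/file1.txt"
printf '%s\n' TCGTCTGCCGTTTTTT TCGTCTGCCGTTTTTTG TCTCTGAGGGTCGGT \
    TCTCTGAGGGTCGGTTCTT AATCGTCTGCCGTTTTTT > "$tmp/file2.txt"

# One grep per pattern: -c counts matching lines, ^ anchors the match at line start.
grep_counts=$(
    while IFS= read -r pat; do
        printf '%s %s\n' "$pat" "$(grep -c "^$pat" "$tmp/file2.txt")"
    done < "$tmp/file1.txt"
)
printf '%s\n' "$grep_counts"
rm -rf "$tmp"
```

The last sample read contains the first pattern in the middle of the line, so the anchored grep does not count it.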

 
What about something like this?
Code:
awk 'NR==FNR{a[$0]=0;next}{for(i in a)if($0~"^"i)++a[i]}END{for(i in a)print i,a[i]}' File1 File2

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
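
For readers new to the `NR==FNR` idiom, here is the same one-liner spread over several lines with comments. The logic is meant to be unchanged; the longer variable names, the scratch directory and the sample data are assumptions made for illustration, and the output is piped through sort only because `for (... in ...)` visits keys in no guaranteed order.

```shell
# Hypothetical scratch directory and sample data.
tmp=$(mktemp -d)
printf '%s\n' TCGTCTGCCGTTTTTT TCTCTGAGGGTCGGT > "$tmp/File1"
printf '%s\n' TCGTCTGCCGTTTTTT TCGTCTGCCGTTTTTTG TCGTCTGCCGTTTTTTGCTTG \
    TCTCTGAGGGTCGGT > "$tmp/File2"

phv_counts=$(awk '
    # While reading the first file NR equals FNR: register each pattern with count 0.
    NR == FNR { count[$0] = 0; next }
    # Second file: test every pattern as an anchored regular expression.
    {
        for (pat in count)
            if ($0 ~ "^" pat)
                ++count[pat]
    }
    # Only the pattern table is kept in memory; print it once at the end.
    END { for (pat in count) print pat, count[pat] }
' "$tmp/File1" "$tmp/File2" | LC_ALL=C sort)
printf '%s\n' "$phv_counts"
rm -rf "$tmp"
```

Note that the idiom assumes the first file is not empty; with an empty File1 the lines of File2 would be taken as patterns.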
 
Thanks as usual PHV, it does the job I want...

Demis001
 
Hi All,

The script by PHV works fine but is memory inefficient. If File2 has 10 million lines, the process seems to read everything into memory before printing the final output. Is there a better option, such as printing the result after each loop?

Demis001
 
PHV's solution will only read the first file into memory; I presume that file is quite small?

How do you know it is "memory inefficient"? How much memory is it using, and how much do you expect it to use?

Annihilannic.
 
Hi

My question would be, what awk implementation are you using? I do not think any memory optimization is needed for PHV's code.

For speed optimization you may try to:
  • avoid using regular expressions
  • exit the loop as soon as possible
Code:
awk 'NR==FNR{a[$0]=0;next}{for(i in a)if(substr($0,1,length(i))==i){++a[i];break}}END{for(i in a)print i,a[i]}' File1 File2

# or

awk 'NR==FNR{a[$0]=0;l[$0]=length($0);next}{for(i in a)if(substr($0,1,l[i])==i){++a[i];break}}END{for(i in a)print i,a[i]}' File1 File2

# or

awk 'NR==FNR{a[$0]=0;next}{for(i in a)if(index($0,i)==1){++a[i];break}}END{for(i in a)print i,a[i]}' File1 File2

Feherke.
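
A small check of how the two string functions behave may help when choosing between them. `index()` returns the 1-based position of the first occurrence, or 0 when there is none, so a prefix test needs the result to equal 1; a bare truth test would also count a pattern found in the middle of a read. The read below is a made-up example with two extra leading bases.

```shell
# Hypothetical read that contains the pattern starting at position 3.
read_line=AATCGTCTGCCGTTTTTT
pattern=TCGTCTGCCGTTTTTT

positions=$(awk -v s="$read_line" -v p="$pattern" 'BEGIN {
    print index(s, p)                      # position of the first occurrence
    print (index(s, p) == 1)               # prefix test via index: 1 if true
    print (substr(s, 1, length(p)) == p)   # prefix test via substr: 1 if true
}')
printf '%s\n' "$positions"
```

Here the pattern is found at position 3, and both prefix tests report 0.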
 
Sorry for the confusing term I used, Annihilannic.

What I meant was that it is inefficient in terms of processing time. It took more than 5 hours to count 250 short reads (File1) within more than 10 million sequences (File2) on a Linux machine with 32 GB of memory.

Thank you guys,
It helps a lot.
 
Hi

Me stupid. I forgot another piece of speed optimization advice:
  • if not needed, alter the default behavior of splitting the record into fields
Code:
awk -F '//' 'NR==FNR{a[$0]=0;next}{for(i in a)if(index($0,i)==1){++a[i];break}}END{for(i in a)print i,a[i]}' File1 File2
Or, if you have gawk:
Code:
awk -vFIELDWIDTHS='1' 'NR==FNR{a[$0]=0;next}{for(i in a)if(index($0,i)==1){++a[i];break}}END{for(i in a)print i,a[i]}' File1 File2

Feherke.
 
Hi feherke,

Yours only searches for an exact match. Is there any possibility of changing it to a pattern match similar to PHV's?

file1:

TCGTCTGCCGT
TCGTCTGCCGTTTT
TCGTCTGCCGTTTTT
TCGTCTGCCGTTTTTT
TCTCTGAGGGTCG
TCTCTGAGGGTCGG
TCTCTGAGGGTCGGT

file2
TCGTCTGCCGT
TCGTCTGCCGTTTTTTCCTTG
TCGTCTGCCGTTTTTTCCTTTTCATCTTAAAAAAAA
TCGTCTGCCGTTTTTTCGTTGGCAACAATAAAGTCT
TCGTCTGCCGTTTTTTG
TCGTCTGCCGTTTTTTG
TCTCTGAGGGTCG
TCTCTGAGGGTCGG
TCTCTGAGGGTCGGT

Result:
TCGTCTGCCGT 5
TCGTCTGCCGTTTT 6
TCGTCTGCCGTTTTT 6
TCGTCTGCCGTTTTTT 6
TCTCTGAGGGTCG 3
TCTCTGAGGGTCGG 2
TCTCTGAGGGTCGGT 1


thanks
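
One possible way to count every pattern, including patterns that are prefixes of one another, is to drop the `break` and replace the inner loop over patterns with a lookup per distinct pattern length. This is a sketch rather than a tested replacement for the scripts above: the names `lens`, `head` and `len`, the scratch directory and the reduced sample data are all assumptions for illustration. The work per read then depends on the number of distinct pattern lengths rather than on the number of patterns.

```shell
# Hypothetical scratch directory and a reduced sample with nested prefixes.
tmp=$(mktemp -d)
printf '%s\n' TCGTCTGCCGT TCGTCTGCCGTTTT TCTCTGAGGGTCG TCTCTGAGGGTCGG > "$tmp/file1"
printf '%s\n' TCGTCTGCCGT TCGTCTGCCGTTTTTTG TCTCTGAGGGTCG TCTCTGAGGGTCGG > "$tmp/file2"

all_counts=$(awk '
    # First file: remember each pattern and every distinct pattern length.
    NR == FNR { count[$0] = 0; lens[length($0)] = 1; next }
    # Second file: for each length, look the leading substring up directly.
    {
        for (n in lens) {
            len = n + 0                      # array keys are strings; force a number
            if (length($0) < len) continue   # read too short to hold such a pattern
            head = substr($0, 1, len)
            if (head in count)
                ++count[head]
        }
    }
    END { for (pat in count) print pat, count[pat] }
' "$tmp/file1" "$tmp/file2" | LC_ALL=C sort)
printf '%s\n' "$all_counts"
rm -rf "$tmp"
```

The length guard matters: without it, a read shorter than a pattern length would be looked up whole more than once and counted twice.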
 