Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

a question in my code to calculate the frequency of nucleotides 3

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
US
Hi! I have a txt file which holds 100 sequences and the
width for each sequence is 8 nucleotide long.

Like:

AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA
........

I want to calculate the frequency of nucleotides A,T,C,G
for each column NOT each row and print them out.

My code is here but it doesn't work. Would anybody give
me some suggestions? Thank you very much!

Alex



Code


#!/usr/bin/perl -w
# Determining frequency of nucleotides
my (@short,$x,$position,$base);

open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;

for ( $position = 0 ; $position < 7 ; ++$position ){
for ($x=0; $x<=$#short; $x++){


$count_of_A = 0; # Initialize the counts.
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$base = substr($short[$x], $position, 1);


if ( $base eq 'A' ) {
++$count_of_A;
} elsif ( $base eq 'C' ) {
++$count_of_C;
} elsif ( $base eq 'G' ) {
++$count_of_G;
} elsif ( $base eq 'T' ) {
++$count_of_T;
}

}


print "A = $count_of_A\n\"; # print the results
print "C = $count_of_C\n\";
print "G = $count_of_G\n\";
print "T = $count_of_T\n\";

}
# exit the program
exit;

outfile8.txt holds sequences:

ACGTGACC
TGCAGATT
TAGTTTGA
TTTTCAAA
GACTTCAG
CCAGTTTA
TTACTTTG
TGCCGCGT
TCACGGCG
AGTCGTCA
CCCTAATT
CATTAGCG
TGGTTAAT
TGGGTTAC
GACTACCC
TAGCGCTG
TACTGTTC
AGTGAGCC
CCTTTAAC
GAGCGTCG
GTACAGTT
ATTCTCCA
GACACGCG
TACTAGAA
GTTCATCA
TATAGTCC
CTAATTTT
TACCCATG
CATGTATG
GAACTACG
CGGTTAGT
CACATGCG
TGTGACCT
TTCATCTA
GCCCTGGG
TTGGGAGC
CTACTACC
AGAACCAC
AGTCGACT
TTGATGCT
CCCTCCTA
AGCAAGTT
AGAGCGAA
TACGACAA
CGATTAGC
TACTTAGA
ACCCACCG
CTGAGTTC
TGTCATAA
CTATTGCC
ATCTGTAC
TAACTCAG
CTTGTACC
CATCATTG
GCTTGGAG
TATGAAGC
CAGACTAT
GAATCGTG
AACGATCC
ATTTGTCC
AGGGCCGG
GATATGTC
GATTCTGG
TCCAATCC
CGCGGTAA
TAGGGTCT
TCGGTATT
TCGTTGGT
TGACTGGG
GGTTATTA
ACCATGCG
TTTCAGGC
TACTGAGA
GGTTATAC
CACGCAGT
GCGAACCA
ACCCAGTA
CCGGGCTG
TGGTCTTC
AGATAACA
AAATACTA
TACGGAAG
GGAATAAA
TCTTGATA
TGGAAGGA
AGATGACG
AAACTCAA
TGAAGCGC
CGGTGCGG
TGTCGTCT
AATACGGA
TAAGTGCA
GCTGGTAT
GATCAGAC
CGTTTGGG
AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA
 
i haven't checked it - but something like this?

Code:
[b]#!/usr/bin/perl[/b]

while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    print "$x | $cols[$x]\n";
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}

[blue]__DATA__
ACGTGACC
TGCAGATT
TAGTTTGA
TTTTCAAA
GACTTCAG
CCAGTTTA
TTACTTTG
TGCCGCGT
TCACGGCG
AGTCGTCA
CCCTAATT
CATTAGCG
TGGTTAAT
TGGGTTAC
GACTACCC
TAGCGCTG
TACTGTTC
AGTGAGCC
CCTTTAAC
GAGCGTCG
GTACAGTT
ATTCTCCA
GACACGCG
TACTAGAA
GTTCATCA
TATAGTCC
CTAATTTT
TACCCATG
CATGTATG
GAACTACG
CGGTTAGT
CACATGCG
TGTGACCT
TTCATCTA
GCCCTGGG
TTGGGAGC
CTACTACC
AGAACCAC
AGTCGACT
TTGATGCT
CCCTCCTA
AGCAAGTT
AGAGCGAA
TACGACAA
CGATTAGC
TACTTAGA
ACCCACCG
CTGAGTTC
TGTCATAA
CTATTGCC
ATCTGTAC
TAACTCAG
CTTGTACC
CATCATTG
GCTTGGAG
TATGAAGC
CAGACTAT
GAATCGTG
AACGATCC
ATTTGTCC
AGGGCCGG
GATATGTC
GATTCTGG
TCCAATCC
CGCGGTAA
TAGGGTCT
TCGGTATT
TCGTTGGT
TGACTGGG
GGTTATTA
ACCATGCG
TTTCAGGC
TACTGAGA
GGTTATAC
CACGCAGT
GCGAACCA
ACCCAGTA
CCGGGCTG
TGGTCTTC
AGATAACA
AAATACTA
TACGGAAG
GGAATAAA
TCTTGATA
TGGAAGGA
AGATGACG
AAACTCAA
TGAAGCGC
CGGTGCGG
TGTCGTCT
AATACGGA
TAAGTGCA
GCTGGTAT
GATCAGAC
CGTTTGGG
AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA[/blue]


Kind Regards
Duncan
 
You can easily include a subroutine to tidy the 4 blocks - i know you can do that

the backbone of the script is a very simple hash

Code:
while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    [b]$cols[$x]{$x}++;[/b]
  }
  
  print "\n";
  
}


Kind Regards
Duncan
 
Hi Duncan,

I ran the 1st script. The output is:

A: conclusion
C: conclusion
T: conclusion
G: conclusion

What is the problem?

Thanks,
Alex
 
No

Code:
[b]#!/usr/bin/perl[/b]

open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;

foreach (@short) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    print "$x | $cols[$x]\n";
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}


Kind Regards
Duncan
 
Hi Duncan,

I put the data in the code and got the result like:

0 | C
1 | G
2 | A
3 | T
4 | T
5 | A
6 | G
7 | C

0 | T
1 | A
2 | C
3 | T
4 | T
5 | A
6 | G
7 | A

0 | A
1 | C
2 | C
3 | C
4 | A
5 | C
6 | C
7 | G

0 | C
1 | T
2 | G
3 | A
4 | G
5 | T
6 | T
7 | C

0 | T
1 | G
2 | T
3 | C
4 | A
5 | T
6 | A
7 | A

0 | C
1 | T
2 | A
3 | T
4 | T
5 | G
6 | C
7 | C

0 | A
1 | T
2 | C
3 | T
4 | G
5 | T
6 | A
7 | C

0 | T
1 | A
2 | A
3 | C
4 | T
5 | C
6 | A
7 | G

0 | C
1 | T
2 | T
3 | G
4 | T
5 | A
6 | C
7 | C

0 | C
1 | A
2 | T
3 | C
4 | A
5 | T
6 | T
7 | G

0 | G
1 | C
2 | T
3 | T
4 | G
5 | G
6 | A
7 | G

0 | T
1 | A
2 | T
3 | G
4 | A
5 | A
6 | G
7 | C

0 | C
1 | A
2 | G
3 | A
4 | C
5 | T
6 | A
7 | T

0 | G
1 | A
2 | A
3 | T
4 | C
5 | G
6 | T
7 | G

0 | A
1 | A
2 | C
3 | G
4 | A
5 | T
6 | C
7 | C

0 | A
1 | T
2 | T
3 | T
4 | G
5 | T
6 | C
7 | C

0 | A
1 | G
2 | G
3 | G
4 | C
5 | C
6 | G
7 | G

0 | G
1 | A
2 | T
3 | A
4 | T
5 | G
6 | T
7 | C

0 | G
1 | A
2 | T
3 | T
4 | C
5 | T
6 | G
7 | G

0 | T
1 | C
2 | C
3 | A
4 | A
5 | T
6 | C
7 | C

0 | C
1 | G
2 | C
3 | G
4 | G
5 | T
6 | A
7 | A

0 | T
1 | A
2 | G
3 | G
4 | G
5 | T
6 | C
7 | T

0 | T
1 | C
2 | G
3 | G
4 | T
5 | A
6 | T
7 | T

0 | T
1 | C
2 | G
3 | T
4 | T
5 | G
6 | G
7 | T

0 | T
1 | G
2 | A
3 | C
4 | T
5 | G
6 | G
7 | G

0 | G
1 | G
2 | T
3 | T
4 | A
5 | T
6 | T
7 | A

0 | A
1 | C
2 | C
3 | A
4 | T
5 | G
6 | C
7 | G

0 | T
1 | T
2 | T
3 | C
4 | A
5 | G
6 | G
7 | C

0 | T
1 | A
2 | C
3 | T
4 | G
5 | A
6 | G
7 | A

0 | G
1 | G
2 | T
3 | T
4 | A
5 | T
6 | A
7 | C

0 | C
1 | A
2 | C
3 | G
4 | C
5 | A
6 | G
7 | T

0 | G
1 | C
2 | G
3 | A
4 | A
5 | C
6 | C
7 | A

0 | A
1 | C
2 | C
3 | C
4 | A
5 | G
6 | T
7 | A

0 | C
1 | C
2 | G
3 | G
4 | G
5 | C
6 | T
7 | G

0 | T
1 | G
2 | G
3 | T
4 | C
5 | T
6 | T
7 | C

0 | A
1 | G
2 | A
3 | T
4 | A
5 | A
6 | C
7 | A

0 | A
1 | A
2 | A
3 | T
4 | A
5 | C
6 | T
7 | A

0 | T
1 | A
2 | C
3 | G
4 | G
5 | A
6 | A
7 | G

0 | G
1 | G
2 | A
3 | A
4 | T
5 | A
6 | A
7 | A

0 | T
1 | C
2 | T
3 | T
4 | G
5 | A
6 | T
7 | A

0 | T
1 | G
2 | G
3 | A
4 | A
5 | G
6 | G
7 | A

0 | A
1 | G
2 | A
3 | T
4 | G
5 | A
6 | C
7 | G

0 | A
1 | A
2 | A
3 | C
4 | T
5 | C
6 | A
7 | A

0 | T
1 | G
2 | A
3 | A
4 | G
5 | C
6 | G
7 | C

0 | C
1 | G
2 | G
3 | T
4 | G
5 | C
6 | G
7 | G

0 | T
1 | G
2 | T
3 | C
4 | G
5 | T
6 | C
7 | T

0 | A
1 | A
2 | T
3 | A
4 | C
5 | G
6 | G
7 | A

0 | T
1 | A
2 | A
3 | G
4 | T
5 | G
6 | C
7 | A

0 | G
1 | C
2 | T
3 | G
4 | G
5 | T
6 | A
7 | T

0 | G
1 | A
2 | T
3 | C
4 | A
5 | G
6 | A
7 | C

0 | C
1 | G
2 | T
3 | T
4 | T
5 | G
6 | G
7 | G

0 | A
1 | G
2 | A
3 | T
4 | A
5 | G
6 | C
7 | A

0 | A
1 | C
2 | A
3 | C
4 | A
5 | G
6 | T
7 | A

0 | T
1 | C
2 | C
3 | G
4 | T
5 | G
6 | C
7 | T

0 | A
1 | T
2 | T
3 | G
4 | G
5 | C
6 | T
7 | A

0 | G
1 | A
2 | C
3 | T
4 | T
5 | G
6 | A
7 | A

A: conclusion
6 22
3 19
7 30
2 23
1 34
4 27
0 23
5 25
C: conclusion
6 32
3 25
7 25
2 28
1 20
4 13
0 21
5 20
T: conclusion
6 26
1 17
4 33
0 37
3 34
7 19
2 29
5 26
G: conclusion
6 20
1 29
4 27
3 22
0 19
7 26
2 20
5 29


so the digital number 0,1,2,3,4,5,6,7 means the id of column?

can u change the code for the input because i have many files to process and don't want the data is placed in the code.

Thanks!

Alex
 
Hi Alex

I did change it... did you not get the post?


Kind Regards
Duncan
 
Here it is again:-

Code:
[b]#!/usr/bin/perl[/b]

while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}


Kind Regards
Duncan
 
I see! Thank you!
But can you remove the output like
"0 | G
1 | A
2 | C
3 | T
4 | T
5 | G
6 | A
7 | A"

?
 
I did check a few of the results on a much smaller dataset - and all seems fine. There isn't really much that can go wrong. I guess, by the star (thank you), that you are sorted?


Kind Regards
Duncan
 
The last post i sent is what you need - and i have stripped out the output


Kind Regards
Duncan
 
This is another problem that would be beautifully solved with awk... maybe QJRude is watching


Kind Regards
Duncan
 
another way:

Code:
my %count = ();
my @AoA = map { chomp; [split(//)] } <DATA>;
foreach my $lines (@AoA) {
   my $n = 0;
   $count{$_}{++$n}++ for (@{$lines});
}
foreach my $keys (sort keys %count) {
   print "'$keys'\nColumn\tCount\n";
   foreach my $cols (sort keys %{$count{$keys}}) {
      print "$cols\t$count{$keys}{$cols}\n";
   }
   print "\n\n";
}
 
It's probably not the best use of a regex, but here's one more:
Code:
my %hash;
map {m/\G(\w)(?{ $hash{$1}[$-[0]]++ })/cg} <DATA>;

foreach my $key (sort keys %hash) {
    print "\"$key\"\n";
    foreach (0..$#{$hash{$key}}) {
        print "$_ - ", ($hash{$key}[$_] || 0), "\n";
    }
}
 
Actually, my code is based on at least one bad assumption - this should work better:
Code:
my %hash;
my $max_len = 0;
map {m/\G(\w)(?{ $hash{$1}[$-[0]]++ })/cg;} <DATA>;
map {$max_len = $#{$hash{$_}} if ($max_len < $#{$hash{$_}})} keys %hash;

foreach my $key (sort keys %hash) {
    print "\"$key\"\n";
    foreach (0..$max_len) {
        print "$_ - ", ($hash{$key}[$_] || 0),"\n";
    }
}
 
Dear all,

Thank you for warm help! I prefer

"open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;"

for the input.

BTW, can the code be slightly modified to print the
result like:

37 26 26 56 32 22 36 47

(Note: each number is the highest frequency of the
nucleotide in each column)

Thank you very much!


 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top