a question in my code to calculate the frequency of nucleotides 3

Everwood · Aug 17, 2005

Hi! I have a txt file which holds 100 sequences and the
width for each sequence is 8 nucleotide long.

Like:

AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA
........

I want to calculate the frequency of nucleotides A,T,C,G
for each column NOT each row and print them out.

My code is here but it doesn't work. Would anybody give
me some suggestions? Thank you very much!

Alex

Code

#!/usr/bin/perl -w
# Determining frequency of nucleotides
my (@short,$x,$position,$base);

open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;

for ( $position = 0 ; $position < 7 ; ++$position ){
for ($x=0; $x<=$#short; $x++){

$count_of_A = 0; # Initialize the counts.
$count_of_C = 0;
$count_of_G = 0;
$count_of_T = 0;
$base = substr($short[$x], $position, 1);

if ( $base eq 'A' ) {
++$count_of_A;
} elsif ( $base eq 'C' ) {
++$count_of_C;
} elsif ( $base eq 'G' ) {
++$count_of_G;
} elsif ( $base eq 'T' ) {
++$count_of_T;
}

}

print "A = $count_of_A\n\"; # print the results
print "C = $count_of_C\n\";
print "G = $count_of_G\n\";
print "T = $count_of_T\n\";

}
# exit the program
exit;

outfile8.txt holds sequences:

ACGTGACC
TGCAGATT
TAGTTTGA
TTTTCAAA
GACTTCAG
CCAGTTTA
TTACTTTG
TGCCGCGT
TCACGGCG
AGTCGTCA
CCCTAATT
CATTAGCG
TGGTTAAT
TGGGTTAC
GACTACCC
TAGCGCTG
TACTGTTC
AGTGAGCC
CCTTTAAC
GAGCGTCG
GTACAGTT
ATTCTCCA
GACACGCG
TACTAGAA
GTTCATCA
TATAGTCC
CTAATTTT
TACCCATG
CATGTATG
GAACTACG
CGGTTAGT
CACATGCG
TGTGACCT
TTCATCTA
GCCCTGGG
TTGGGAGC
CTACTACC
AGAACCAC
AGTCGACT
TTGATGCT
CCCTCCTA
AGCAAGTT
AGAGCGAA
TACGACAA
CGATTAGC
TACTTAGA
ACCCACCG
CTGAGTTC
TGTCATAA
CTATTGCC
ATCTGTAC
TAACTCAG
CTTGTACC
CATCATTG
GCTTGGAG
TATGAAGC
CAGACTAT
GAATCGTG
AACGATCC
ATTTGTCC
AGGGCCGG
GATATGTC
GATTCTGG
TCCAATCC
CGCGGTAA
TAGGGTCT
TCGGTATT
TCGTTGGT
TGACTGGG
GGTTATTA
ACCATGCG
TTTCAGGC
TACTGAGA
GGTTATAC
CACGCAGT
GCGAACCA
ACCCAGTA
CCGGGCTG
TGGTCTTC
AGATAACA
AAATACTA
TACGGAAG
GGAATAAA
TCTTGATA
TGGAAGGA
AGATGACG
AAACTCAA
TGAAGCGC
CGGTGCGG
TGTCGTCT
AATACGGA
TAAGTGCA
GCTGGTAT
GATCAGAC
CGTTTGGG
AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA

duncdude · Aug 17, 2005

i haven't checked it - but something like this?

Code:

[b]#!/usr/bin/perl[/b]

while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    print "$x | $cols[$x]\n";
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}

[blue]__DATA__
ACGTGACC
TGCAGATT
TAGTTTGA
TTTTCAAA
GACTTCAG
CCAGTTTA
TTACTTTG
TGCCGCGT
TCACGGCG
AGTCGTCA
CCCTAATT
CATTAGCG
TGGTTAAT
TGGGTTAC
GACTACCC
TAGCGCTG
TACTGTTC
AGTGAGCC
CCTTTAAC
GAGCGTCG
GTACAGTT
ATTCTCCA
GACACGCG
TACTAGAA
GTTCATCA
TATAGTCC
CTAATTTT
TACCCATG
CATGTATG
GAACTACG
CGGTTAGT
CACATGCG
TGTGACCT
TTCATCTA
GCCCTGGG
TTGGGAGC
CTACTACC
AGAACCAC
AGTCGACT
TTGATGCT
CCCTCCTA
AGCAAGTT
AGAGCGAA
TACGACAA
CGATTAGC
TACTTAGA
ACCCACCG
CTGAGTTC
TGTCATAA
CTATTGCC
ATCTGTAC
TAACTCAG
CTTGTACC
CATCATTG
GCTTGGAG
TATGAAGC
CAGACTAT
GAATCGTG
AACGATCC
ATTTGTCC
AGGGCCGG
GATATGTC
GATTCTGG
TCCAATCC
CGCGGTAA
TAGGGTCT
TCGGTATT
TCGTTGGT
TGACTGGG
GGTTATTA
ACCATGCG
TTTCAGGC
TACTGAGA
GGTTATAC
CACGCAGT
GCGAACCA
ACCCAGTA
CCGGGCTG
TGGTCTTC
AGATAACA
AAATACTA
TACGGAAG
GGAATAAA
TCTTGATA
TGGAAGGA
AGATGACG
AAACTCAA
TGAAGCGC
CGGTGCGG
TGTCGTCT
AATACGGA
TAAGTGCA
GCTGGTAT
GATCAGAC
CGTTTGGG
AGATAGCA
ACACAGTA
TCCGTGCT
ATTGGCTA
GACTTGAA[/blue]

Kind Regards
Duncan

duncdude · Aug 17, 2005

You can easily include a subroutine to tidy the 4 blocks - i know you can do that

the backbone of the script is a very simple hash

Code:

while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    [b]$cols[$x]{$x}++;[/b]
  }
  
  print "\n";
  
}

Kind Regards
Duncan

Everwood · Aug 17, 2005

Hi Duncan,

I ran the 1st script. The output is:

A: conclusion
C: conclusion
T: conclusion
G: conclusion

What is the problem?

Thanks,
Alex

duncdude · Aug 17, 2005

did u include the data?

Kind Regards
Duncan

Everwood · Aug 17, 2005

should I include the data in the code?

duncdude · Aug 17, 2005

No

Code:

[b]#!/usr/bin/perl[/b]

open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;

foreach (@short) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    print "$x | $cols[$x]\n";
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}

Kind Regards
Duncan

Everwood · Aug 17, 2005

Hi Duncan,

I put the data in the code and got the result like:

0 | C
1 | G
2 | A
3 | T
4 | T
5 | A
6 | G
7 | C

0 | T
1 | A
2 | C
3 | T
4 | T
5 | A
6 | G
7 | A

0 | A
1 | C
2 | C
3 | C
4 | A
5 | C
6 | C
7 | G

0 | C
1 | T
2 | G
3 | A
4 | G
5 | T
6 | T
7 | C

0 | T
1 | G
2 | T
3 | C
4 | A
5 | T
6 | A
7 | A

0 | C
1 | T
2 | A
3 | T
4 | T
5 | G
6 | C
7 | C

0 | A
1 | T
2 | C
3 | T
4 | G
5 | T
6 | A
7 | C

0 | T
1 | A
2 | A
3 | C
4 | T
5 | C
6 | A
7 | G

0 | C
1 | T
2 | T
3 | G
4 | T
5 | A
6 | C
7 | C

0 | C
1 | A
2 | T
3 | C
4 | A
5 | T
6 | T
7 | G

0 | G
1 | C
2 | T
3 | T
4 | G
5 | G
6 | A
7 | G

0 | T
1 | A
2 | T
3 | G
4 | A
5 | A
6 | G
7 | C

0 | C
1 | A
2 | G
3 | A
4 | C
5 | T
6 | A
7 | T

0 | G
1 | A
2 | A
3 | T
4 | C
5 | G
6 | T
7 | G

0 | A
1 | A
2 | C
3 | G
4 | A
5 | T
6 | C
7 | C

0 | A
1 | T
2 | T
3 | T
4 | G
5 | T
6 | C
7 | C

0 | A
1 | G
2 | G
3 | G
4 | C
5 | C
6 | G
7 | G

0 | G
1 | A
2 | T
3 | A
4 | T
5 | G
6 | T
7 | C

0 | G
1 | A
2 | T
3 | T
4 | C
5 | T
6 | G
7 | G

0 | T
1 | C
2 | C
3 | A
4 | A
5 | T
6 | C
7 | C

0 | C
1 | G
2 | C
3 | G
4 | G
5 | T
6 | A
7 | A

0 | T
1 | A
2 | G
3 | G
4 | G
5 | T
6 | C
7 | T

0 | T
1 | C
2 | G
3 | G
4 | T
5 | A
6 | T
7 | T

0 | T
1 | C
2 | G
3 | T
4 | T
5 | G
6 | G
7 | T

0 | T
1 | G
2 | A
3 | C
4 | T
5 | G
6 | G
7 | G

0 | G
1 | G
2 | T
3 | T
4 | A
5 | T
6 | T
7 | A

0 | A
1 | C
2 | C
3 | A
4 | T
5 | G
6 | C
7 | G

0 | T
1 | T
2 | T
3 | C
4 | A
5 | G
6 | G
7 | C

0 | T
1 | A
2 | C
3 | T
4 | G
5 | A
6 | G
7 | A

0 | G
1 | G
2 | T
3 | T
4 | A
5 | T
6 | A
7 | C

0 | C
1 | A
2 | C
3 | G
4 | C
5 | A
6 | G
7 | T

0 | G
1 | C
2 | G
3 | A
4 | A
5 | C
6 | C
7 | A

0 | A
1 | C
2 | C
3 | C
4 | A
5 | G
6 | T
7 | A

0 | C
1 | C
2 | G
3 | G
4 | G
5 | C
6 | T
7 | G

0 | T
1 | G
2 | G
3 | T
4 | C
5 | T
6 | T
7 | C

0 | A
1 | G
2 | A
3 | T
4 | A
5 | A
6 | C
7 | A

0 | A
1 | A
2 | A
3 | T
4 | A
5 | C
6 | T
7 | A

0 | T
1 | A
2 | C
3 | G
4 | G
5 | A
6 | A
7 | G

0 | G
1 | G
2 | A
3 | A
4 | T
5 | A
6 | A
7 | A

0 | T
1 | C
2 | T
3 | T
4 | G
5 | A
6 | T
7 | A

0 | T
1 | G
2 | G
3 | A
4 | A
5 | G
6 | G
7 | A

0 | A
1 | G
2 | A
3 | T
4 | G
5 | A
6 | C
7 | G

0 | A
1 | A
2 | A
3 | C
4 | T
5 | C
6 | A
7 | A

0 | T
1 | G
2 | A
3 | A
4 | G
5 | C
6 | G
7 | C

0 | C
1 | G
2 | G
3 | T
4 | G
5 | C
6 | G
7 | G

0 | T
1 | G
2 | T
3 | C
4 | G
5 | T
6 | C
7 | T

0 | A
1 | A
2 | T
3 | A
4 | C
5 | G
6 | G
7 | A

0 | T
1 | A
2 | A
3 | G
4 | T
5 | G
6 | C
7 | A

0 | G
1 | C
2 | T
3 | G
4 | G
5 | T
6 | A
7 | T

0 | G
1 | A
2 | T
3 | C
4 | A
5 | G
6 | A
7 | C

0 | C
1 | G
2 | T
3 | T
4 | T
5 | G
6 | G
7 | G

0 | A
1 | G
2 | A
3 | T
4 | A
5 | G
6 | C
7 | A

0 | A
1 | C
2 | A
3 | C
4 | A
5 | G
6 | T
7 | A

0 | T
1 | C
2 | C
3 | G
4 | T
5 | G
6 | C
7 | T

0 | A
1 | T
2 | T
3 | G
4 | G
5 | C
6 | T
7 | A

0 | G
1 | A
2 | C
3 | T
4 | T
5 | G
6 | A
7 | A

A: conclusion
6 22
3 19
7 30
2 23
1 34
4 27
0 23
5 25
C: conclusion
6 32
3 25
7 25
2 28
1 20
4 13
0 21
5 20
T: conclusion
6 26
1 17
4 33
0 37
3 34
7 19
2 29
5 26
G: conclusion
6 20
1 29
4 27
3 22
0 19
7 26
2 20
5 29

so the digital number 0,1,2,3,4,5,6,7 means the id of column?

can u change the code for the input because i have many files to process and don't want the data is placed in the code.

Thanks!

Alex

duncdude · Aug 17, 2005

Hi Alex

I did change it... did you not get the post?

Kind Regards
Duncan

duncdude · Aug 17, 2005

Here it is again:-

Code:

[b]#!/usr/bin/perl[/b]

while (<DATA>) {

  @cols = split(//);
  
  for ($x=0; $x<$#cols; $x++) {
    $cols[$x]{$x}++;
  }
  
  print "\n";
  
}

print "A: conclusion\n";
while (($key, $value) = each %A) {
  print "$key\t$value\n";
}

print "C: conclusion\n";
while (($key, $value) = each %C) {
  print "$key\t$value\n";
}

print "T: conclusion\n";
while (($key, $value) = each %T) {
  print "$key\t$value\n";
}

print "G: conclusion\n";
while (($key, $value) = each %G) {
  print "$key\t$value\n";
}

Kind Regards
Duncan

Everwood · Aug 17, 2005

I see! Thank you!
But can you remove the output like
"0 | G
1 | A
2 | C
3 | T
4 | T
5 | G
6 | A
7 | A"

?

duncdude · Aug 17, 2005

I did check a few of the results on a much smaller dataset - and all seems fine. There isn't really much that can go wrong. I guess, by the star (thank you), that you are sorted?

Kind Regards
Duncan

duncdude · Aug 17, 2005

The last post i sent is what you need - and i have stripped out the output

Kind Regards
Duncan

duncdude · Aug 17, 2005

This is another problem that would be beautifully solved with awk... maybe QJRude is watching

Kind Regards
Duncan

TrojanWarBlade · Aug 17, 2005

Or ruby maybe?

(very injured) Trojan.

duncdude · Aug 17, 2005

oh no!

Kind Regards
Duncan

KevinADC · Aug 18, 2005

another way:

Code:

my %count = ();
my @AoA = map { chomp; [split(//)] } <DATA>;
foreach my $lines (@AoA) {
   my $n = 0;
   $count{$_}{++$n}++ for (@{$lines});
}
foreach my $keys (sort keys %count) {
   print "'$keys'\nColumn\tCount\n";
   foreach my $cols (sort keys %{$count{$keys}}) {
      print "$cols\t$count{$keys}{$cols}\n";
   }
   print "\n\n";
}

rharsh · Aug 18, 2005

It's probably not the best use of a regex, but here's one more:

Code:

my %hash;
map {m/\G(\w)(?{ $hash{$1}[$-[0]]++ })/cg} <DATA>;

foreach my $key (sort keys %hash) {
    print "\"$key\"\n";
    foreach (0..$#{$hash{$key}}) {
        print "$_ - ", ($hash{$key}[$_] || 0), "\n";
    }
}

rharsh · Aug 18, 2005

Actually, my code is based on at least one bad assumption - this should work better:

Code:

my %hash;
my $max_len = 0;
map {m/\G(\w)(?{ $hash{$1}[$-[0]]++ })/cg;} <DATA>;
map {$max_len = $#{$hash{$_}} if ($max_len < $#{$hash{$_}})} keys %hash;

foreach my $key (sort keys %hash) {
    print "\"$key\"\n";
    foreach (0..$max_len) {
        print "$_ - ", ($hash{$key}[$_] || 0),"\n";
    }
}

Everwood · Aug 19, 2005

Dear all,

Thank you for warm help! I prefer

"open (SHORT, "< outfile8.txt");
chomp (@short = <SHORT>);
close SHORT;"

for the input.

BTW, can the code be slightly modified to print the
result like:

37 26 26 56 32 22 36 47

(Note: each number is the highest frequency of the
nucleotide in each column)

Thank you very much!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

a question in my code to calculate the frequency of nucleotides 3

Technical User

Programmer

Programmer

Technical User

Programmer

Technical User

Programmer

Technical User

Programmer

Programmer

Technical User

Programmer

Programmer

Programmer

Programmer

Programmer

Technical User

Technical User

Technical User

Technical User

Similar threads

Log in

Part and Inventory Search

Sponsor