Interesting results (tangent of the 8 base-pairs thread) 4

KevinADC · Jul 22, 2005

This is a tangent to this thread:

http://www.tek-tips.com/viewthread.cfm?qid=1095895&page=1

As I was testing code, I noticed that the pair "AC" seemd to be starting most of the 8 base-pair sequences. I would have thought that since it's random any base-pair could start a sequence. Maybe there is a flaw in my code or the hash to count the frequency that base-paris occur in:

Code:

#!perl
use strict;
#use Benchmark qw(:all);
#use Data::Dump qw(dump);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.851;
my %results = ();
my %pairs = map ($_ => 0) for (0..3);

for (0..99) {
   my %base_pairs = ();
   my @motif = ();
   until (scalar keys %base_pairs == 8) {
      my $index = int(rand(@fullset));
      $base_pairs{$fullset[$index]} = $fullset[$index];
   }
   my @base_pairs = keys %base_pairs;
   my @major = @base_pairs[0..3];
   my @minor = @base_pairs[4..7];
   for my $i (0..3) {
      push @motif, rand(1) < $probab ? $major[$i] : $minor[$i]; 
      $pairs{$i}{$motif[$i]}++;
   }
   my $motif = join('',@motif);
	redo if exists $results{$motif};
   $results{$motif} = $motif;
#   print "major = @major , minor = @minor , results = $motif\n";
}
#print "$_\n" for keys %results;
print "\n\n";
for my $keys (sort{$a <=> $b} keys %pairs) {
   print "Frequency of pairs at position @{[$keys+1]}:\n";
   for my $freq (sort { $pairs{$keys}{$b} <=> $pairs{$keys}{$a} } keys %{$pairs{$keys}}) {
	   print "   $freq = $pairs{$keys}{$freq}\n";
	}
   print "\n";
}

quite literally "AC" is always the most frequent pair in the first position, as in this example print out:

Code:

Frequency of pairs at position 1:
   AC = 47
   CC = 18
   AG = 12
   GG = 8
   GA = 6
   TG = 4
   AA = 3
   CT = 3
   AT = 1
   TT = 1
   CG = 1
   TA = 1

there appear to be other patterns in the frequency of occurences of other pairs too, any thoughts concerning this? I feel like I must be missing something obvious.

Everwood · Jul 22, 2005

Hi Kevin,

Glad that you have started this discussion. Literally,
the probability of each pair in Group A (major group)
should be 4/16 while the probability of each pair in Group B (minor group) 4/12.

Since it is defined the probability of each pair in Group A
being selected in the motif is 85%, so the probability of each position is (4/16)*85%.

But have no idea to explain why AC pair has the highest frequency...

fishiface · Jul 23, 2005

If you change the order of definitions in @fullset does this perturb the result?

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes

KevinADC · Jul 23, 2005

feel free to run the code fish and experiment if you want to. I reversed the order of the fullset array and AC was still always the most frequent base-pair in the first position.

I tried this:

my $index = int(rand(2-16));

so AA and AC could no longer be picked at all, but then CC was always the most frequent base-pair in the first position. In fact it never varies as far as which three base-pairs are most frequent in the first position list:

CC = always highest
TG = always second highest
AT = always third highest

[sadeyes]

KevinADC · Jul 23, 2005

well, if anyone is following this thread, I decided that when the hash keys are converted into an array here:

my @base_pairs = keys %base_pairs;

something is happening that I do not understand. I thought it would make a random list of the 8 unique base-pairs that comprised the hash, but that was not happening. So I decided to shuffle the list first then convert to an array using the List::Util module. This seems to have done the trick and now the motifs (the 8 base-pair sequences) now seem to be random. Can't say I am content with this solution as I do not understand what was happening to cause "AC" and other pairs to be selected more frequently than others.

Code:

#!perl
use strict;
#use Benchmark qw(:all);
[b]use List::Util 'shuffle';[/b]
#use Data::Dump qw(dump);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.851;
my %results = ();

for (0..99) {
   my %base_pairs = ();
   my $motif = '';
   $base_pairs{$fullset[int(rand(16))]}++ until (scalar keys %base_pairs == 8);
   my @base_pairs = [b]shuffle(keys %base_pairs);[/b]
   my @major = @base_pairs[0..3];
   my @minor = @base_pairs[4..7];
   $motif .= rand(1) < $probab ? $major[$_] : $minor[$_] for (0..3);
   redo if exists $results{$motif};
   $results{$motif} = 1;
}
print "$_\n" for keys %results;

a sorted list of the output seems to shows no weighting of certain base-pairs like before:

Code:

AACAATTC
AACAGCGG
AAGCTTAC
AATGGGCA
ACATGCCG
ACCGGTGC
ACCTTTGG
ACGAGTTT
ACTATCTG
AGCCTTAC
AGCTAATG
AGCTTGGA
AGGAACGC
AGGGCACT
AGTAGAGG
AGTCTGTA
AGTGAAAC
ATAGACGG
ATAGGCCT
ATTCTGAC
ATTGACGA
CAAGCCGT
CACCCTGA
CACTCGGC
CATGCCAT
CATTTCGC
CCAGTTGC
CCCATCAC
CCCTAGAT
CCGACGAA
CCTAAGGC
CCTCTGAA
CCTCTTAC
CGACCCGC
CGACGAGG
CGAGACTC
CGCTTGCC
CGTGACGG
CTACAAGC
CTAGACCG
CTAGCGAT
CTAGGTTG
CTCAGCGT
CTCGAACC
CTGTTAGA
CTTAAAGT
GAACTCAA
GAATACCA
GACCAACT
GACCAATT
GACGGGCT
GAGTCTCA
GAGTGGCT
GATCTAAG
GATTACGT
GCAACGCA
GCAATCCA
GCAGGAAC
GCCCAATA
GCCGAAGG
GCGGAACA
GCTCCGAG
GGACGTCA
GGATTTCG
GGCGGCCC
GGCTATGA
GGGTTTGC
GTACGCTG
GTAGTATC
GTATTAAA
GTCATCGA
GTCCGCAT
GTCTTACA
GTGGTTGA
TAAGGAAT
TACAAGCT
TACCCTTG
TACGGGAG
TAGATTAG
TAGTGAAA
TATCAGGA
TCGCGATG
TCGCTTAG
TCGGCTAT
TCTGGGCT
TGACGTTC
TGATCTGA
TGCACGAC
TGCCAGTC
TGGATTTC
TGGGGCCT
TTAAACTG
TTAACGTG
TTATAGTG
TTCCCGTC
TTGTATTG
TTTCAACT
TTTCACCG
TTTGCTTC
TTTGGGGA

of course that is just one list, each time the script runs there seems to be a good blending of all the base-pairs into the random but unique sequences.

fishiface · Jul 25, 2005

that's got to be perl's internal hash function. From perlguts

Code:

The hash algorithm is defined in the PERL_HASH(hash, key, klen) macro: 


    i = klen;
    hash = 0;
    s = key;
    while (i--)
        hash = hash * 33 + *s++;

or, in (my) perl

Code:

sub hashval {
  my $string = shift;
  my $result = 0;
  foreach $char (split //, $string) {
    $result = ( $result * 33 + ord($c) ) % 2**16;
  }
  return $result;
}

This gives

Code:

AC -> 2212
CC -> 2278
AG -> 2216
GG -> 2414
GA -> 2408
TG -> 2843
AA -> 2210
CT -> 2295
AT -> 2229
TT -> 2856
CG -> 2282
TA -> 2837

from which I fail to deduce anything so far.

maybe inspiration will strike later..

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes

duncdude · Jul 26, 2005

Hi Kevin

I ran your first script with no such problems - i.e. AC certainly isn't being picked more often? It's almost as if your random function on your computer is being odd?

Here is some samples from my machine (G4 OSX Panther):-

Code:

Frequency of pairs at position 1:
   CT = 46
   CA = 16
   TA = 13
   TC = 8
   TG = 5
   GT = 4
   AG = 4
   AT = 3
   AC = 3
   CG = 3
   CC = 2
   GA = 1
   GC = 1

Frequency of pairs at position 2:
   TC = 24
   TA = 22
   CA = 20
   TG = 19
   AT = 9
   GC = 5
   GA = 3
   AG = 2
   GT = 1
   AC = 1
   CC = 1
   CG = 1
   AA = 1

Frequency of pairs at position 3:
   TG = 23
   GA = 18
   AT = 16
   TC = 12
   AC = 9
   TA = 8
   GT = 5
   AA = 5
   AG = 4
   CG = 4
   TT = 4
   GC = 1

Frequency of pairs at position 4:
   AC = 22
   GA = 17
   GT = 14
   AT = 12
   CC = 11
   GG = 11
   TG = 9
   TC = 5
   AG = 3
   AA = 2
   GC = 1
   CG = 1
   TT = 1



Frequency of pairs at position 1:
   TG = 40
   CA = 20
   GG = 19
   TC = 5
   GT = 5
   TT = 4
   GC = 4
   CC = 3
   AT = 3
   AG = 3
   CT = 2
   AC = 1

Frequency of pairs at position 2:
   CA = 25
   GG = 22
   TC = 14
   CT = 8
   GT = 7
   AG = 7
   AT = 6
   TT = 5
   CC = 5
   TA = 4
   AC = 3
   GC = 2
   GA = 1

Frequency of pairs at position 3:
   CT = 21
   CC = 19
   TT = 14
   TC = 13
   AT = 9
   GG = 6
   CG = 6
   CA = 5
   AC = 4
   AG = 4
   GA = 3
   GT = 2
   GC = 1
   TA = 1
   AA = 1

Frequency of pairs at position 4:
   TT = 18
   CC = 14
   TC = 13
   GC = 13
   AC = 12
   AT = 11
   CT = 9
   AA = 6
   CG = 6
   GT = 3
   GA = 2
   AG = 1
   TA = 1



Frequency of pairs at position 1:
   CG = 51
   GG = 27
   CA = 9
   GA = 6
   GT = 5
   AA = 3
   AC = 3
   AT = 2
   TT = 2
   GC = 2
   CC = 1
   TC = 1

Frequency of pairs at position 2:
   GG = 25
   CA = 23
   AA = 23
   GA = 15
   CC = 10
   GT = 3
   GC = 3
   AC = 3
   AG = 3
   AT = 1
   TT = 1
   TC = 1
   TG = 1

Frequency of pairs at position 3:
   CC = 24
   GA = 22
   AT = 18
   AA = 11
   TT = 11
   CA = 10
   GC = 3
   TC = 3
   GT = 2
   AC = 2
   AG = 2
   CT = 2
   TG = 1
   TA = 1

Frequency of pairs at position 4:
   CC = 17
   TT = 17
   AT = 16
   GT = 14
   GC = 12
   TA = 10
   CT = 5
   CA = 4
   GA = 4
   AA = 3
   AC = 3
   TC = 3
   AG = 3
   TG = 1

Kind Regards
Duncan

Everwood · Jul 26, 2005

great!

KevinADC · Jul 26, 2005

thanks Duncan, nice to get a sanity check. Maybe my old windows98/Apache test server is the problem. I should have thought to check on another server setup.

duncdude · Jul 26, 2005

no problem Kevin - your code looked fine to me so i gave it a go... and it seems that there is not an issue??? AC came last on one of the test-runs - which i was hoping to see

... which means you've got another issue

PC & Win98 | bin ;-)

... good luck!

Kind Regards
Duncan

KevinADC · Jul 26, 2005

its funny becuase this is the only time I have ever observed this behavior in years of using this old server to test scripts with. It's an old machine, way past its prime, but its super reliable, and I'm pretty much the same way, well at least the past the prime part. [glasses]

Everwood · Jul 29, 2005

Hi all,

Thank you so much for help on the question. After meeting
with my advisor, I realized that there was some problem
with my previous description.

My previous description was:
**************************************************
I am going to generating 100 8 base-pair long dependent motif binding sites using Perl.

The thread is:

(1)First,randomly select 4 pairs of nucleotides as the group A.

For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT

Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.

Any pair in group A should be different from the one in group B.

(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).

Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:

CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position

CACCCGTC----the 8 bp long binding motif site

We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.

Then we can get a random motif like: GGAAATCG

(3)Repeat this to make another 99 binding motif sites.
********************************************************

In the example, we have:

The major group (A)which includes pairs: CA,TG,CG,TC
The minor group (B)which includes pairs: AA,CA,CT,TT

we randomly get:

CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position

What we need to do next is to generate 99 sequences so that we will make a total of 100 sequences including the first sequence: CA-CC-CG-TC.

In all these sequences: 85 1st pairs should be CA,
85 2nd pairs should be TG,
85 3rd pairs should be CG,
85 4th pairs should be TC,

vs.
15 1st pairs should be AA,
15 2nd pairs should be CA,
15 3rd pairs should be CT,
15 4th pairs should be TT,

This wil satisfying that all pairs in major group will be selected with 85% probability while all pairs in minor group with 15% probability.

The original code from Trojan is:

#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab = 0.85;

for(my $j=0; $j<=99; $j++) {
my $i;
my @major = (); # Cluster of 4 nucleotides (A)
my @minor = (); # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp = @fullset;
for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp)));
push @major, $temp[$index];
splice(@temp, $index, 1);
}
#print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
@temp = @fullset;
for($i = 0; $i<4; ++$i) {
my $index = int(rand(scalar(@temp)));
redo unless($major[$i] ne $temp[$index]);
push @minor, $temp[$index];
splice(@temp, $index, 1);
}
#print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print join("", @result), "\n";
}

Can anybody give me some suggestions? Thank you all so much again!

KevinADC · Jul 29, 2005

ahh! I need a strong drink!

Everwood · Jul 29, 2005

hmm,Dr.Pepper?

TrojanWarBlade · Jul 29, 2005

Southern Comfort? ;-)

Trojan.

KevinADC · Jul 29, 2005

trojans more in the ball park.... I'm partial to tequila myself.

rharsh · Jul 31, 2005

How about something like this?

Code:

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);
my ($prob, $records) = (85, 100);

{ my @temp = @fullset;
foreach (0..3) {
    my $r_pair = splice(@temp, int(rand($#temp+1)),1);
    $major[$_] = $r_pair;
}
@temp = @fullset;
foreach (0..3) {
    my $r_index = int(rand($#temp+1));
    my $r_pair = ($temp[$_] ne $major[$_]) ? splice(@temp, $r_index,1) : redo;
    $minor[$_] = $r_pair;
}}

foreach my $pos (0..$#major) {
    my @list;
    foreach (1..$prob) { push(@list, $major[$pos]);}
    foreach (1..($records-$prob)) { push @list, $minor[$pos];}
    foreach (0..$records-1) {
        $combinations[$_][$pos] = splice(@list,int(rand($#list+1)),1);
    }
}

open OUTPUT, "> outfile.txt" or die "Cannot open output file.\n";
foreach (@combinations) {
    print OUTPUT join('',@{$_}), "\n";
}
close OUTPUT;

Everwood · Jul 31, 2005

Thank you,rharsh !!!

rharsh · Jul 31, 2005

I just realized that I didn't go back and fix the foreach (1..$prob) loop so it would work with a number of records other than 100.

Everwood · Jul 31, 2005

so what is your suggestion?

you mean it will generate more than 100 sequences?

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Interesting results (tangent of the 8 base-pairs thread) 4

Technical User

Technical User

IS-IT--Management

Technical User

Technical User

IS-IT--Management

Programmer

Technical User

Technical User

Programmer

Technical User

Technical User

Technical User

Technical User

Programmer

Technical User

Technical User

Technical User

Technical User

Technical User

Similar threads

Log in

Part and Inventory Search

Sponsor