Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Seek help on Generating multiple 8 base-pair long dependent motif 1

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
0
0
US
Hi all![noevil]

I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.

The thread is:

(1)First,randomly select 4 pairs of nucleotides as the group A.

For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT


Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.

Any pair in group A should be different from the one in group B.

(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).

Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:

CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position

CACCCGTC----the 8 bp long binding motif site

We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.

Then we can get a random motif like: GGAAATCG


(3)Repeat this to make another 99 binding motif sites.



I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!

Alex



The previous code is:

## $len = length of the sequence - 1
$len = 2;

## $ar letter in position i with probability $ap
## other letters have the same probability (in our case 0.05)
## you have to build the arrays
$ar[0] = "T";
$ap[0] = 0.85;

$ar[1] = "C";
$ap[1] = 0.85;

$ar[2] = "A";
$ap[2] = 0.85;

for ($i=0;$i<=$len;$i++) {

$ar_other[$i][0] = "A";
$ar_other[$i][1] = "C";
$ar_other[$i][2] = "G";

if ($ar[$i] eq "A") {
$ar_other[$i][0] = "T";
} elsif ($ar[$i] eq "C") {
$ar_other[$i][1] = "T";
} elsif ($ar[$i] eq "G") {
$ar_other[$i][2] = "T";
}

$p = rand(1);
#print "$p";
if ($p<(1-$ap[$i])) {
$frac = (1-$ap[$i])/3;
if ($p<(1-$ap[$i]-(2*$frac))) {
$ran_seq[$i] = $ar_other[$i][0];
} elsif ($p<(1-$ap[$i]-$frac)) {
$ran_seq[$i] = $ar_other[$i][1];
} else {
$ran_seq[$i] = $ar_other[$i][2];
}
} else {

$ran_seq[$i] = $ar[$i];

}

}

## you have your sequence in the array @ran_seq
for ($i=0;$i<=$len;$i++) {
print "$ran_seq[$i] ";
}
print "\n";


A overly longer code is:

#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----
+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================

use strict;
use warnings;


my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...

# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================


for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {

loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}

exit();

# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';

for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}

return();
}

# ==========================================
sub addNoiseToDistribution {

# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.

my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);

for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}

return();
}

# ==========================================
sub convertToIntervals {

my $Sum = 0;

# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...

for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}

return();
}

# ==========================================
sub getRandomCharacter {

my ($RandomNumber) = @_;
my $i=0;

# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.

for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}

return($Alphabet[$i]);
}
 
I don't know if this helps or answers any of your questions but I had a go anyway:

Code:
#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  push @major, $temp[int(rand(scalar(@temp)))];
}
print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
}
print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print "@result\n";
You can obviously loop for 99 times if you want to.

Trojan.
 
Hi Trojan,

Thank you very much first of all.

What I mentioned the probabilty of 0.85 of pair
in the group A (major group in your case)means
that it can be randomly selected 85 times out of
100 times vs. the pair in group B 15 times out of
100 times.

I found you assigned 0.85 to the $probab in the code.
Can you clarifying this ?

Thank you for help!

Alex
 
Slightly modified version here that eliminates the chance that two pairs can be identical in one nucleotide:

Code:
#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  push @major, $temp[$index];
  splice(@temp, $index, 1);
}
print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
  splice(@temp, $index, 1);
}
print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print "@result\n";



Trojan.
 
Yes, no problem.
The rand(1.0) will generate a random number between 0.0000 and 0.99999 (ish).
Therefore when my code says:
Code:
push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
If the rand(1.0) prduces something above 0.85 (say) 0.92364 then yuo get the minor pair, otherwise you get the major pair.
This means that 85% of the time (generally) you get the major pair.

Hope that helps explain.


Trojan.
 
did not get you here.

"two pairs can be identical in one nucleotide"?

"one nucleotide" means "a position" in a binding site?
 
Trojan,

Can u show a generated motif here?
 
Sorry, genetics is not my thing so I'm trying.

The "splice" that I added was to remove the chance of getting something like this:

AA CG TA [red]CG[/red]

It removes the "CG" from the "@temp" list when it's been used so that it cannot be selected again.

I did not know if duplicate pairs were allowed so I gave you both options.


Trojan.
 
It's at the end "print @result"

(I hope I'm understanding this enough and not making a fool of myself here!!!)


Trojan.
 
Hi Trojan,

I would say "duplicate pairs" are not allowed
to appear.

We will have two groups A and B (major and minor here).
Each of them is different from another pair. So the ideal
result should never be like AA CG TA CG.

Regards,
Alex
 
That's what I thought and that's why I gave you the second version. It's removes that possibility.

The motif is in @result and therefore you may only need to print that.


Trojan.
 
Here's a sample run of 100:
CAAGACAA
ACAGTTCA
CCGAATTC
GATACTAA
GTTTCACC
TCATCTCA
CCGCCGAT
TATTGTCA
ACGTAACG
TGAGGCTT
CAAATTAC
TCGACGGG
GGTCAGGT
ATAAGGCA
ACGGGCTG
CTCCTCGT
GTAATCAT
GACAAGAA
AACCGTGA
GTCCACGG
AAGCCGGG
CTTGTCCT
GTTCTTAC
AATGCCGG
TAGCAGGA
GTCTTGAG
AGATCTGA
ATTTGATG
TGTTGGCG
CACTGGTT
AGTGAAGG
CTATGATC
GCGATCTA
AATCCAAT
TGATCGCA
TTTGAGAA
TGGTATCT
TCTGGACG
AACCCTGA
TTCCACCT
CGCTCCTG
GACTAGGG
CCAAGCAT
CAACGTTA
TCCTTAAA
TCCCGCAT
TCGCCATT
AAGCTCTC
GTACCGTT
CCAATCCT
CAATCGTC
AGTCGCTA
TACACGAG
TTCGGTAT
ACCAATCG
TGGTGATA
ACCTATTG
TTCCGTAA
CGGCATCT
AAACATGA
CTTATAAT
CGAAGTGT
CCCCCATG
GACTAGAC
CAAAGTTA
AAGTAGTC
TTACGTCA
GACTGGAT
AATATTGG
CTGGGTAT
CTTCAGGT
TCTGTTGA
AAGTGCGT
ACCTTCGT
TAGAGACG
ACAGTCGA
AACGTCTT
GGCATGAT
TGGTTTAT
TTAACTAG
GTAATGGC
CTTGTCAT
TCTGCACA
ACTCAGAA
CTGACATG
ATTCAACC
TAGTCAAC
GACACGGT
AAGACTGC
ACGAGGCC
GTACGGTC
AGCCAGTG
CGTTGATC
CTTCAAAT
ACGTTAGA
CCAACGTC
TTCCGTGC
TTCGGTTC
TCCAATAC


And here's the final code that created it:

Code:
#!/usr/bin/perl -w
use strict;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;

for(my $j=0; $j<99; $j++) {
my $i;
my @major   = ();    # Cluster of 4 nucleotides (A)
my @minor   = ();    # Cluster of 4 nucleotides (B)

# Pick 4 random pairs
my @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  push @major, $temp[$index];
  splice(@temp, $index, 1);
}
#print "@major\n";

# Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
for($i = 0; $i<4; ++$i) {
  my $index = int(rand(scalar(@temp)));
  redo unless($major[$i] ne $temp[$index]);
  push @minor, $temp[$index];
  splice(@temp, $index, 1);
}
#print "@minor\n";

# Generate resultant set based on previously defined probability
my @result = ();
for($i = 0; $i<4; ++$i) {
  push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
}
print join("", @result), "\n";
}
Is this what you wanted?

Trojan.
 
yeah,that's it! Wonderful! Appreciated!

Alex
 
Cool, glad to help, sorry I didn't understand the terminology properly. Maybe I'll learn in time.

Good luck. :)


Trojan.
 
this was only an excersize for fun:

Code:
#!perl
use strict;
use Benchmark qw(:all);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.85;
my $reps = 1000;
my $r = timethese($reps, {
   'Kevin' => \&Kevin,
   'Trojan' => \&Trojan,
});
cmpthese($r);

sub Kevin {	
   my %results = ();
   for (0..99) {
      my %base_pairs = ();
      my $motif = '';
      until (scalar keys %base_pairs == 8) {
         my $index = int rand @fullset;
         $base_pairs{$fullset[$index]} = $fullset[$index];
      }
      my @base_pairs = keys %base_pairs;
      my @major = @base_pairs[0..3];
      my @minor = @base_pairs[4..7];
      $motif .= rand(1) > $probab ? $minor[$_] : $major[$_] for (0..3);
      redo if exists $results{$motif};
      $results{$motif} = $motif;
   }
   print "$_\n" for keys %results;
}

sub Trojan {
for(my $j=0; $j<99; $j++) {
   my $i;
   my @major   = ();    # Cluster of 4 nucleotides (A)
   my @minor   = ();    # Cluster of 4 nucleotides (B)
  
   # Pick 4 random pairs
   my @temp    = @fullset;
   for($i = 0; $i<4; ++$i) {
     my $index = int(rand(scalar(@temp)));
     push @major, $temp[$index];
     splice(@temp, $index, 1);
   }
   #print "@major\n";
   
   # Pick 4 random pairs (excluding matches to @major)
   @temp    = @fullset;
   for($i = 0; $i<4; ++$i) {
     my $index = int(rand(scalar(@temp)));
     redo unless($major[$i] ne $temp[$index]);
     push @minor, $temp[$index];
     splice(@temp, $index, 1);
   }
   #print "@minor\n";
   
   # Generate resultant set based on previously defined probability
   my @result = ();
   for($i = 0; $i<4; ++$i) {
     push @result, rand(1.0) > $probab ? $major[$i] : $minor[$i];
   }
   print join("", @result), "\n";
}
}

with the results printed, Trojans code was typically a bit slower than my code (25% approx), but with the results not printed Trojans code was faster than mine (by about the same percentage).
 
I think you have this reversed Trojan:

Code:
push @result, rand(1.0) [COLOR=#ff0000]>[/color] $probab ? $major[$i] : $minor[$i];

should be:

Code:
push @result, rand(1.0) [COLOR=#ff0000]<=[/color] $probab ? $major[$i] : $minor[$i];


if the rand() function returns a value greater than the probability factor the minor sequence should be used, not the major sequence, which is what is happening in the first line above.
 
Well spotted Kevin, I think you're right there.

Not quite so sure about your algorithm for selecting the pairs though. Would it be fair to say that with your algorithm, @minor could not possibly contain any pairs that appear anywhere in @major and vice versa?
If that's the case, you'll see from the original supplied example that it should be possible to have the same values, as long as they do not share the same index position.


Trojan.
 
Trojan,

I admit I was (still am) confused if identical base-pairs could be in major and minor as long they were not in same index. But since there could never be duplicated base-pairs in the motif (I think) it seemed easier just to make sure both arrays contained no duplicate base-pairs to begin with. That might make a difference in the final outcome of the generated list, but honestly I am not sure.

Also, the implication of my benchmarking test is that the join() function is extremely slow and looks like it can drag down the run time of an entire script. You might want to look into that if you use join() often in code.

I might play around with my code a little to see if I can speed it up.

Regards,
Kevin
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top