Hi all!![[noevil] [noevil] [noevil]](/data/assets/smilies/noevil.gif)
I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.
The thread is:
(1)First,randomly select 4 pairs of nucleotides as the group A.
For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :
AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT
Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.
Any pair in group A should be different from the one in group B.
(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).
Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:
CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position
CACCCGTC----the 8 bp long binding motif site
We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.
Then we can get a random motif like: GGAAATCG
(3)Repeat this to make another 99 binding motif sites.
I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!
Alex
The previous code is:
## $len = length of the sequence - 1
$len = 2;
## $ar letter in position i with probability $ap
## other letters have the same probability (in our case 0.05)
## you have to build the arrays
$ar[0] = "T";
$ap[0] = 0.85;
$ar[1] = "C";
$ap[1] = 0.85;
$ar[2] = "A";
$ap[2] = 0.85;
for ($i=0;$i<=$len;$i++) {
$ar_other[$i][0] = "A";
$ar_other[$i][1] = "C";
$ar_other[$i][2] = "G";
if ($ar[$i] eq "A") {
$ar_other[$i][0] = "T";
} elsif ($ar[$i] eq "C") {
$ar_other[$i][1] = "T";
} elsif ($ar[$i] eq "G") {
$ar_other[$i][2] = "T";
}
$p = rand(1);
#print "$p";
if ($p<(1-$ap[$i])) {
$frac = (1-$ap[$i])/3;
if ($p<(1-$ap[$i]-(2*$frac))) {
$ran_seq[$i] = $ar_other[$i][0];
} elsif ($p<(1-$ap[$i]-$frac)) {
$ran_seq[$i] = $ar_other[$i][1];
} else {
$ran_seq[$i] = $ar_other[$i][2];
}
} else {
$ran_seq[$i] = $ar[$i];
}
}
## you have your sequence in the array @ran_seq
for ($i=0;$i<=$len;$i++) {
print "$ran_seq[$i] ";
}
print "\n";
A overly longer code is:
#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----
+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict;
use warnings;
my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...
# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================
for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {
loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}
exit();
# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';
for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}
return();
}
# ==========================================
sub addNoiseToDistribution {
# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.
my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}
return();
}
# ==========================================
sub convertToIntervals {
my $Sum = 0;
# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...
for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}
return();
}
# ==========================================
sub getRandomCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.
for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}
return($Alphabet[$i]);
}
![[noevil] [noevil] [noevil]](/data/assets/smilies/noevil.gif)
I am going to generating multiple 8 base-pair long dependent motif binding sites using Perl.
The thread is:
(1)First,randomly select 4 pairs of nucleotides as the group A.
For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleoties :
AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT
Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CA”,”CT” and “TT” as the group B.
Any pair in group A should be different from the one in group B.
(2)To create the motif binding site. Choose “CA”, the first pair from the group A with 85% probability or “AA” from the group B with 15% probability to occupy the 1st position in the binding site, which means that each pair in the group A will be the dominant pair compared to the one in the group B. The sum of probabilities of corresponding pairs in group A and B will be 1.00 (0.85+0.15=1.00).
Repeat this for the next 3 positions and put 4 pairs of nucleotides together, we will end up with a random binding motif like:
CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position
CACCCGTC----the 8 bp long binding motif site
We also can form another group A like GG,GT,AT,CG and group
B like GA,AA,AC,CC.
Then we can get a random motif like: GGAAATCG
(3)Repeat this to make another 99 binding motif sites.
I had experience of generating 8 base-pair long indepent motif binding sites with the probability of dominant nucleotide as 0.85. But I don't know how to generate two groups of pairs of nucleotides which are not identical. Thank you very much for suggestions!
Alex
The previous code is:
## $len = length of the sequence - 1
$len = 2;
## $ar letter in position i with probability $ap
## other letters have the same probability (in our case 0.05)
## you have to build the arrays
$ar[0] = "T";
$ap[0] = 0.85;
$ar[1] = "C";
$ap[1] = 0.85;
$ar[2] = "A";
$ap[2] = 0.85;
for ($i=0;$i<=$len;$i++) {
$ar_other[$i][0] = "A";
$ar_other[$i][1] = "C";
$ar_other[$i][2] = "G";
if ($ar[$i] eq "A") {
$ar_other[$i][0] = "T";
} elsif ($ar[$i] eq "C") {
$ar_other[$i][1] = "T";
} elsif ($ar[$i] eq "G") {
$ar_other[$i][2] = "T";
}
$p = rand(1);
#print "$p";
if ($p<(1-$ap[$i])) {
$frac = (1-$ap[$i])/3;
if ($p<(1-$ap[$i]-(2*$frac))) {
$ran_seq[$i] = $ar_other[$i][0];
} elsif ($p<(1-$ap[$i]-$frac)) {
$ran_seq[$i] = $ar_other[$i][1];
} else {
$ran_seq[$i] = $ar_other[$i][2];
}
} else {
$ran_seq[$i] = $ar[$i];
}
}
## you have your sequence in the array @ran_seq
for ($i=0;$i<=$len;$i++) {
print "$ran_seq[$i] ";
}
print "\n";
A overly longer code is:
#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----
+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict;
use warnings;
my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...
# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================
for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {
loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}
exit();
# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';
for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}
return();
}
# ==========================================
sub addNoiseToDistribution {
# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.
my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}
return();
}
# ==========================================
sub convertToIntervals {
my $Sum = 0;
# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...
for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}
return();
}
# ==========================================
sub getRandomCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.
for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}
return($Alphabet[$i]);
}