Hi all,
I want to generate a random motif of width 8. Each position will have a dominant letter with probability around 0.85. Repeat this to generate 100 motifs.
For example, I have a motif like:
ATCGTGCC
The first nucleotide is A and its probability is 0.85, which means the probability of other letters T or C or G which can be selected to appear on the first position is 0.05.
My code for this is given below. I borrowed it from and made a slight change. However, it resulted in some errors:
********************************
"use" not allowed in expression at generate_random_motif.pl line 52, at end of line
syntax error at generate_random_motif.pl line 52, near "to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict"
BEGIN not safe after errors--compilation aborted at generate_random_motif.pl line 53.
********************************
I need some suggestions on this code or do you have
some other ideas?
Thanks a lot!
Alex
The code
*************************************************
#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----
+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict;
use warnings;
my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTTATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...
# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================
for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {
loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}
exit();
# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';
for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}
return();
}
# ==========================================
sub addNoiseToDistribution {
# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.
my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}
return();
}
# ==========================================
sub convertToIntervals {
my $Sum = 0;
# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...
for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}
return();
}
# ==========================================
sub getRandomCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.
for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}
return($Alphabet[$i]);
}
*************************************************
I want to generate a random motif of width 8. Each position will have a dominant letter with probability around 0.85. Repeat this to generate 100 motifs.
For example, I have a motif like:
ATCGTGCC
The first nucleotide is A and its probability is 0.85, which means the probability of other letters T or C or G which can be selected to appear on the first position is 0.05.
My code for this is given below. I borrowed it from and made a slight change. However, it resulted in some errors:
********************************
"use" not allowed in expression at generate_random_motif.pl line 52, at end of line
syntax error at generate_random_motif.pl line 52, near "to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict"
BEGIN not safe after errors--compilation aborted at generate_random_motif.pl line 53.
********************************
I need some suggestions on this code or do you have
some other ideas?
Thanks a lot!
Alex
The code
*************************************************
#!/usr/bin/perl
# randommotifs.pl
#
# This program generates randomized instances of pseudomotifs, i.e.
# sequences that are obtained by adding "noise", according to a
specified
# model to a consensus sequence. To keep this reasonably general, we
# proceed through the following steps:
#
# Define consensus motif and alphabet
# For the required number of repetitions
# For each position in motif
# Define an appropriate probability distribution over the
alphabet
# Produce a character according to the distribution
# ...
# ...
#
# The probability distributions are generated to produce a consensus
character
# with a particular "consensus probability" and all other characters
# with equal probabilites (the noise). This is defined in a single
subroutine
# so it is straightforward to change this for a different model of noise
# (e.g. increase the probabilities for "similar" characters). It's easy
# to become creative here and incorporate (biological) knowledge into
the
# noise-model.
#
# Probablities are then converted into cumulative intervals to chose a
# particular character:
#
# Eg: Probabilities A: 0.4, C: 0.3, G: 0.2, T: 0.1
# Intervals A: 0.4 C: 0.7, G: 0.9, T: 1.0
#
# This example can be pictured as
#
# 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# +----+----+----+----+----+----
+----+----+----+----+
# | A | C | G | T |
# +----+----+----+----+----+----+----+----+----+----+
#
# .. now all we need to do is generate a random number between 0 and 1
# and see in which interval it falls. This then points to the character
to be
# written to output.
#
# B. Steipe, July 2005. Public domain code.
# =====================================================================
use strict;
use warnings;
my $N_Sequences = 100; # Number of sequences to be
generated
my @Motif = split(//,'TTTATAAT'); # This example is simply a
concatenation
# of the -35 and -10 consensus
elements
# of prokaryotic promoters;
replace with
# whatever ...
my @Alphabet = split(//,'ACGT'); # Nucleotides; replace with
whatever ...
my $P_Consensus = 0.85; # Replace with whatever ...
# ====== Globals ==========================
my @Probabilities; # Stores the probability of each
character
# ====== Program ==========================
for (my $i=0; $i < $N_Sequences; $i++) {
for (my $j=0; $j < scalar(@Motif); $j++) {
loadConsensusCharacter($Motif[$j]); # Load the character for
this position
addNoiseToDistribution(); # Modify probabilities
according to noise model
convertToIntervals();
print(getRandomCharacter(rand(1.0))); # Output character
}
print "\n";
}
exit();
# ====== Subroutines =======================
#
sub loadConsensusCharacter {
my ($char) = @_;
my $Found = 'FALSE';
for (my $i=0; $i < scalar(@Alphabet); $i++) {
if ( $char eq $Alphabet[$i]) { # found character in i_th
position in Alphabet
$Probabilities[$i] = 1.0;
$Found = 'TRUE';
} else {
$Probabilities[$i] = 0.0;
}
}
if ($Found eq 'FALSE') {
die("Panic: Motif-Character\"$char\" was not found in Alphabet.
Aborting.\n");
}
return();
}
# ==========================================
sub addNoiseToDistribution {
# This implements a very simple noise model:
# We work on an array in which one element is 1.0 and
# all others are 0.0. The element with 1.0 has the same
# index as the consensus character in the Alphabet.
#
# We change that value to the consensus probability and
# we distribute the remaining probability equally among
# all other elements.
#
# It should be straightforward to implement more elaborate
# noise models, or use a position specific scoring matrix
# or something else here.
my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
for (my $i=0; $i < scalar(@Probabilities); $i++) {
if ( $Probabilities[$i] == 1.0 ) { # ... this is the consensus
character
$Probabilities[$i] = $P_Consensus;
} else {
$Probabilities[$i] = $P_NonConsensus;
}
}
return();
}
# ==========================================
sub convertToIntervals {
my $Sum = 0;
# Convert the values of the probabilities array to the top
boundaries
# of intervals having widths proportional to their relative
frequency.
# Numbers range from 0 to 1.0 ...
for (my $i=1; $i < scalar(@Probabilities); $i++) {
$Probabilities[$i] += $Probabilities[$i-1];
}
return();
}
# ==========================================
sub getRandomCharacter {
my ($RandomNumber) = @_;
my $i=0;
# Test which interval contains the RandomNumber ...
# The index of the interval points to the Event we should choose.
for ($i=0; $i < scalar(@Probabilities); $i++) {
if ($Probabilities[$i] > $RandomNumber) { last; }
}
return($Alphabet[$i]);
}
*************************************************