Hi all! I am trying to generate some negatively dependent
nucleotide pairs and need your help.
I have included a description of the problem and
my code below. Any suggestions are highly appreciated.
***********************************
The description
To generate binding sites for a motif that contain negatively dependent pairs:
1. We can get 16 combinations of any 2 nucleotides. They are:
AA, AT, AC, AG,
TT, TC, TG, TA,
CC, CT, CG, CA,
GG, GC, GT and GA
For example, if we say the pair “AA” is a positively dependent pair, it means that “A” comes together with another “A” across many sequences with probability x%. In other words, it looks like:
…………………
……AA………..
……AA……….
……AA……….
……AA……….
……AA……….
……AA……….
………………..
In contrast to a positive pair, the negative pair “AG” looks like this in some sequences:
……………….
……A………..
……A………..
……A………..
……..G……….
……..G……….
……..G……….
……………….
This means that “A” is less likely to be followed by “G” across these sequences than the other nucleotides (G, T, C) are. But if we count the frequency of each nucleotide along each column, “A” and “G” still have the highest frequencies in their respective columns. By generating 4 negative pairs, we end up with motif binding sites of length 8. In total, we are going to make 100 binding sites.
2. (1) Randomly pick 4 pairs from the 16 combinations which will be used as “negative pairs” in the sequences. For example, we get pairs AG, CT, CT, GG.
(2) Suppose the probability for each negative pair is 70%. In the 100 binding sites, we let the 1st nucleotide be A with probability 70%. In other words, there are 70 As in the 1st position across the 100 binding sites.
If the 1st position is A, then the 2nd position will be G with probability 57%, and A, C or T with probability (1-0.57)/3 each;
If 1st position is not A, then let 2nd position be G automatically;
(3) Repeat this for other three negative pairs.
3. Generally speaking, we have negative pair XY.
(a) let the 1st nucleotide in the 100 sites be X with probability 70% and each of the other three nucleotides with probability 10%
(b) if 1st nucleotide = X, then let the 2nd nucleotide be Y with probability 57% and each of the other three nucleotides with probability (1-57%)/3;
(c) Else, let 2nd nucleotide in 100 sites be Y automatically;
(d) Repeat (a) (b) (c) for other three pairs.
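As a sanity check on the numbers above, here is my own arithmetic. It assumes the two positions of a pair are drawn exactly as in (a)-(c), and it is only a rough check, not a formal definition of negative dependence:

```latex
\begin{align*}
P(\text{2nd}=Y) &= P(\text{1st}=X)\,P(\text{2nd}=Y \mid \text{1st}=X) + P(\text{1st}\neq X)\cdot 1 \\
                &= 0.70 \times 0.57 + 0.30 = 0.699 \\
P(XY)           &= 0.70 \times 0.57 = 0.399 \\
P(\text{1st}=X)\,P(\text{2nd}=Y) &= 0.70 \times 0.699 \approx 0.489
\end{align*}
```

So Y still shows up in about 70% of the sites in its column, but X and Y occur together in only about 40% of the sites. That is below the roughly 49% expected if the two columns were independent, which is the negative dependence I am after.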
*************************************
The code:
#!/usr/bin/perl
use strict;
use warnings;

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my @bases   = qw(A C G T);
# $prob:    % of sites whose 1st nucleotide is X
# $prob2:   % chance that the 2nd nucleotide is Y when the 1st is X
# $records: number of binding sites to generate
my ($prob, $prob2, $records) = (70, 57, 100);

# Pick 4 distinct negative pairs at random.
my @temp  = @fullset;
my @major = map { splice(@temp, int(rand(@temp)), 1) } 0..3;

# Number of sites that get X in the 1st position (rounded up).
my $num_major = $prob * $records / 100;
$num_major = int($num_major) + 1 if $num_major != int($num_major);

my @combinations;    # $combinations[$site][$pos] holds one dinucleotide
foreach my $pos (0..$#major) {
    my ($x, $y) = split //, $major[$pos];
    my @not_x = grep { $_ ne $x } @bases;
    my @not_y = grep { $_ ne $y } @bases;

    # Exactly $num_major X's; the remaining sites draw uniformly
    # from the other three nucleotides.
    my @first = (($x) x $num_major,
                 map { $not_x[rand @not_x] } 1 .. $records - $num_major);

    foreach my $site (0..$records-1) {
        # Draw without replacement so the X's land on random sites.
        my $n1 = splice(@first, int(rand(@first)), 1);
        my $n2 = $y;                      # 1st is not X: 2nd is always Y
        if ($n1 eq $x && rand(100) >= $prob2) {
            $n2 = $not_y[rand @not_y];    # 1st is X: non-Y with prob. (100-$prob2)/3 each
        }
        $combinations[$site][$pos] = $n1 . $n2;
    }
}

# Print the binding sites, one 8-nucleotide site per line.
print join('', @$_), "\n" for @combinations;