Hi all,
Thank you so much for your help on this topic. After discussing it
with my advisor, we decided that we have to switch to
another idea because the previous data was not what we really wanted.
My previous description was:
**************************************************
I am going to generate 100 dependent motif binding sites, each 8 base pairs long, using Perl.
The procedure is:
(1) First, randomly select 4 pairs of nucleotides as group A.
For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC”, from the 16 combinations of two nucleotides:
AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT
(2) Then randomly select another 4 pairs of nucleotides, for instance “AA”, “CC”, “CT”, and “TT”, as group B.
Two requirements:
(1) Each pair in group A should be different from the corresponding pair in group B; obviously, no two pairs within group A or within group B are identical.
(2) The 1st pair in group A competes for its position in the final motif against the 1st pair in group B and wins with 85% probability; the same holds for the 2nd, 3rd, and 4th pairs.
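The group selection can be sketched roughly as follows. This is only an illustration, not the code from the earlier replies: it takes the stricter reading of requirement (1), where groups A and B share no pair at all (as in the example), and the helper name pick_groups is made up for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Build all 16 dinucleotides from the four bases.
my @bases = qw(A C G T);
my @fullset;
for my $first (@bases) {
    for my $second (@bases) {
        push @fullset, $first . $second;
    }
}

# Hypothetical helper: shuffle once, then take the first 4 pairs as
# group A and the next 4 as group B, so no pair can occur twice.
sub pick_groups {
    my @shuffled = shuffle(@fullset);
    my @major = @shuffled[0 .. 3];
    my @minor = @shuffled[4 .. 7];
    return (\@major, \@minor);
}

my ($major, $minor) = pick_groups();
print "Group A (major): @$major\n";
print "Group B (minor): @$minor\n";
```

Taking both groups from one shuffled list avoids any retry loop, at the cost of being stricter than a position-by-position comparison.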
In the example, we have:
The major group (A), which includes the pairs: CA, TG, CG, TC
The minor group (B), which includes the pairs: AA, CC, CT, TT
We can randomly get:
CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position
What we need to do next is to generate 99 more sequences so that we have a total of 100 sequences, including the first sequence: CA-CC-CG-TC.
In all these sequences: 85 1st pairs should be CA,
85 2nd pairs should be TG,
85 3rd pairs should be CG,
85 4th pairs should be TC,
vs.
15 1st pairs should be AA,
15 2nd pairs should be CC,
15 3rd pairs should be CT,
15 4th pairs should be TT,
This will ensure that all pairs in the major group are selected with 85% probability, while all pairs in the minor group are selected with 15% probability.
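To check that a generated set really has the 85/15 split at every position, the pairs can be tallied per position. The sketch below is only an illustration; the helper name tally_pairs is made up here, and it assumes every sequence is an 8-base string with no separators.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each pair occurs at each of the
# 4 positions in a list of 8-base sequences.
sub tally_pairs {
    my @seqs = @_;
    my @counts;    # $counts[$pos]{$pair} = number of occurrences
    for my $seq (@seqs) {
        my @pairs = $seq =~ /(..)/g;    # split into 2-base chunks
        $counts[$_]{ $pairs[$_] }++ for 0 .. $#pairs;
    }
    return @counts;
}

# Tiny illustration with two sequences built from the example groups.
my @counts = tally_pairs('CACCCGTC', 'CATGCGTC');
for my $pos (0 .. $#counts) {
    for my $pair (sort keys %{ $counts[$pos] }) {
        printf "position %d: %s x %d\n", $pos + 1, $pair, $counts[$pos]{$pair};
    }
}
```

Run over all 100 generated sequences, each position should report 85 for its major pair and 15 for its minor pair.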
Right now we think it might be better to let group B include all 16 pairs. The 1st pair in group A would then compete for its position in the final motif against any other pair in group B (every pair except the one identical to the 1st pair in group A) and win with 85% probability; the same holds for the 2nd, 3rd, and 4th pairs.
Can anybody suggest how to do this? Many thanks!
The previous code by rharsh is:
use strict;
use warnings;

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);
my ($prob, $records) = (85, 100);

# Number of records that get the major pair at each position (rounded up).
my $num_major = ($prob * $records) / 100;
if ($num_major != int($num_major)) { $num_major = int($num_major) + 1; }
my $num_minor = $records - $num_major;

{
    # Group A: draw 4 distinct pairs without replacement.
    my @temp = @fullset;
    foreach (0..3) {
        my $r_pair = splice(@temp, int(rand(@temp)), 1);
        $major[$_] = $r_pair;
    }
    # Group B: draw 4 distinct pairs; redo the draw if it matches the
    # group A pair at the same position.
    @temp = @fullset;
    foreach (0..3) {
        my $r_index = int(rand(@temp));
        redo if $temp[$r_index] eq $major[$_];
        $minor[$_] = splice(@temp, $r_index, 1);
    }
}

# For each position, build a pool of $num_major major and $num_minor minor
# pairs, then deal the pool out to the records in random order.
foreach my $pos (0..$#major) {
    my @list;
    foreach (1..$num_major) { push(@list, $major[$pos]); }
    foreach (1..$num_minor) { push(@list, $minor[$pos]); }
    foreach (0..$records-1) {
        $combinations[$_][$pos] = splice(@list, int(rand(@list)), 1);
    }
}

open my $out, '>', 'outfile.txt' or die "Cannot open output file: $!\n";
foreach (@combinations) {
    print {$out} join('', @{$_}), "\n";
}
close $out;
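For the revised idea, one possible adaptation is sketched below. It rests on assumptions that are not settled in this thread: exactly 85 of the 100 sequences carry the major pair at each position, and each of the remaining 15 gets a pair drawn uniformly, with replacement, from the other 15 pairs. The helper name build_motifs is made up for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);

# Hypothetical helper for the revised scheme. Assumptions: exactly
# $num_major of the $records sequences carry the major pair at a position,
# and every other sequence gets a pair drawn uniformly (with replacement)
# from the 15 pairs that differ from that position's major pair.
sub build_motifs {
    my ($major, $num_major, $records) = @_;
    my @combinations;
    for my $pos (0 .. $#{$major}) {
        my @others = grep { $_ ne $major->[$pos] } @fullset;
        my @column = ($major->[$pos]) x $num_major;
        push @column, $others[ int(rand(@others)) ] for $num_major + 1 .. $records;
        @column = shuffle(@column);    # randomize which records get which pair
        $combinations[$_][$pos] = $column[$_] for 0 .. $records - 1;
    }
    return map { join '', @$_ } @combinations;
}

my @major  = (shuffle(@fullset))[0 .. 3];    # group A: 4 distinct pairs
my @motifs = build_motifs(\@major, 85, 100);
print "$_\n" for @motifs;
```

If the 15% should instead be split evenly or weighted among the other pairs, only the line that fills the rest of @column would need to change.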