Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

About generating a dependent sequence 2

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
US
Hi all,

Thank you so much for help on this topic. After discussing
with my advisor, we thought that we would have to switch to
another idea because the previous data was not what we really want.

My previous description was:
**************************************************
I am going to generating 100 8 base-pair long dependent motif binding sites using Perl.

The thread is:

(1)First, randomly select 4 pairs of nucleotides as the group A.

For example, we randomly select 4 pairs of nucleotides, “CA”, “TG”, “CG”, and “TC” from 16 combinations of two nucleotides :

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT


Then randomly select another 4 pairs of nucleotides, for instance, like “AA”,”CC”,”CT” and “TT” as the group B.

Two requirements:

(1) Any pair in group A should be different from the one in group B; Obviously, no two pairs in group A or B are identical.
(2) 1st pair in group A will compete the position in the final motif against 1st pair in group B with 85% probability, as well as for the 2nd, 3rd, 4th pairs.

In the example, we have:

The major group (A)which includes pairs: CA,TG,CG,TC
The minor group (B)which includes pairs: AA,CC,CT,TT

We can randomly get:

CA --- 1st position
CC --- 2nd position
CG --- 3rd position
TC --- 4th position

What we need to do next is to generate 99 sequences so that we will make a total of 100 sequences including the first sequence: CA-CC-CG-TC.

In all these sequences: 85 1st pairs should be CA,
85 2nd pairs should be TG,
85 3rd pairs should be CG,
85 4th pairs should be TC,

vs.
15 1st pairs should be AA,
15 2nd pairs should be CC,
15 3rd pairs should be CT,
15 4th pairs should be TT,

This wil satisfying that all pairs in major group will be selected with 85% probability while all pairs in minor group with 15% probability.


Right now we thought that it might be better if we allow that group B include all 16 pairs and 1st pair in group A will compete the position in the final motif against any other pair in group B except the same one as 1st pair in group A with 85% probability, as well as for the 2nd, 3rd, 4th pairs.

Can anybody suggest this? Many thanks!

The previous code by rharsh is:

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);
my ($prob, $records) = (85, 100);

my $num_major = ($prob / 100) * $records;
if ($num_major != int($num_major)) { $num_major++; }
my $num_minor = $records - $num_major;

{ my @temp = @fullset;
foreach (0..3) {
my $r_pair = splice(@temp, int(rand($#temp+1)), 1);
$major[$_] = $r_pair;
}
@temp = @fullset;
foreach (0..3) {
my $r_index = int(rand($#temp+1));
my $r_pair = ($temp[$r_index] ne $major[$_]) ? splice(@temp, $r_index, 1) : redo;
$minor[$_] = $r_pair;
}}

foreach my $pos (0..$#major) {
my @list;
foreach (1..$num_major) { push(@list, $major[$pos]); }
foreach (1..$num_minor) { push(@list, $minor[$pos]); }
foreach (0..$records-1) {
$combinations[$_][$pos] = splice(@list, int(rand($#list+1)), 1);
}
}

open OUTPUT, "> outfile.txt" or die "Cannot open output file.\n";
foreach (@combinations) {
print OUTPUT join('', @{$_}), "\n";
}
close OUTPUT;

 
After the time and effort a few people spent helping you with your previous requests you decide the data wasn't what you wanted?

[sadeyes][sadeyes][sadeyes]
 
Hi Kevin,

Thanks! Since we had to change our ideas, the code should
be changed slightly.

BTW, I fixed another code which used subroutine. Thanks again!
 
I notice two changes, one that I noticed after posting in the first thread, but didn't get to posting. Change:

Code:
if ($num_major != int($num_major)) { [red]$num_major++[/red]; }
to
Code:
if ($num_major != int($num_major)) { [blue]$num_major = int($num_major++)[/blue]; }

Then, on to the point of this post - I think this change will accomplish what you want, if I understand it correctly:
Code:
my $r_pair = [red]($temp[$r_index] ne $major[$_]) ? splice(@temp, $r_index, 1) : redo;[/red]
to
Code:
my $r_pair = [blue]splice(@temp, int(rand($#temp+1)), 1);[/blue]

This would guarantee that each element in an array is unique (no duplicates within the array.) However, there can be duplicate pairs between arrays, including duplicates in the same positions. So, for example, you could get:
@major = ('AC', 'CG', 'TA', 'GT')
@minor = ('AT', 'CG', 'GT', 'CC')

Is this what you're looking for?
 
Hi rharsh,

Thanks for the message! I changed the code as you suggested and run into the result like:

CCAGCGCT
CCAGCGCT
AAATCGCT
CCAGCGCT
CCAGACCT
AAAGCGCT
CCAGACCT
CCAGCGTG
CCAGCGCT
CCAGCGCT
CCATCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCATCGCT
CCAGCGCT
CCAGACCT
CCAGACCT
AAAGCGCT
CCAGCGCT
CCAGCGCT
CCAGCGTG
CCATCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
AAAGCGCT
CCAGCGCT
CCAGACCT
CCAGCGCT
CCAGCGTG
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
CCATCGCT
CCAGCGCT
CCAGCGTG
CCAGCGCT
CCAGCGCT
AAAGCGCT
CCAGCGCT
CCAGCGTG
AAAGCGCT
CCAGCGCT
CCAGCGCT
CCATCGCT
CCAGCGTG
AAAGCGCT
CCATCGCT
CCAGCGCT
CCAGCGTG
CCAGACCT
CCATCGCT
CCAGCGCT
CCAGCGCT
CCAGCGCT
AAAGCGCT
CCATCGCT
CCAGCGCT
CCAGCGTG
CCATACTG
CCAGACTG
CCAGCGCT
CCAGCGCT
CCATCGCT
CCAGCGCT
CCAGCGCT
AAAGCGCT
CCATCGTG
CCAGCGCT
CCAGCGCT
CCAGCGTG
CCATCGTG
AAAGCGCT
CCAGCGCT
CCATCGCT
CCAGACCT
CCAGCGCT
CCAGCGCT
AAAGACCT
CCAGCGCT
AAAGACCT
CCAGACCT
CCAGACCT
AAAGACCT
CCAGACCT
CCAGCGCT
CCAGCGTG
CCAGCGCT
CCAGCGCT
CCAGCGTG
CCATCGCT
CCAGCGCT
CCAGCGCT
AAAGCGCT
AAAGCGCT


For the first pairs, they are CC or AA. CC is from Major Group and AA is from Minor Group. Actually, what I really want is to expand the minor group to the whole 16 recombinations of nucleotides:

AA,AC,AG,AT,
CA,CC,CG,CT,
GA,GC,GG,GT,
TA,TC,TG,TT

So for the first pair, it can be any one in the above group except CC because the requirement is that it can not be the same one as the first pair in Major Group as well as for the second pair, third pair and fouth pair.

I appreciate!
 
Everwood

Accepted wisdom for software projects is to identify requirements first, then (optionally, depending on complexity) do a design, and then to code it. Doing things in any other order just doesn't work.

Have a serious think about exactly what it is you want to achieve. In other words, think carefully about your requirements specification so that you don't have to keep coming back after you've changed your mind.

This isn't a rant - the folks here are a pretty tolerant bunch, but you don't want to wear out your welcome...

... and in any case, on this thread rharsh's cool recursive algorithm alone is worth the price of admission, even if it doesn't do exactly what you want!
 
Everwood, here is a modified script that does what, I think, you're looking for.
Code:
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);
my ($prob, $records) = (85, 100); # Probably of major pair, num records to generate

# This is a basic ceiling function for $num_major
my $num_major = ($prob / 100) * $records;
if ($num_major != int($num_major)) { $num_major = int($num_major++); }
my $num_minor = $records - $num_major;

{ my @temp = @fullset;
foreach (0..3) {
    my $r_pair = splice(@temp, int(rand($#temp+1)), 1);
    $major[$_] = $r_pair;
}}

foreach my $pos (0..$#major) {
    my @list;
    foreach (1..$num_major) { push(@list, $major[$pos]); }
    my @minor_list = grep {$_ ne $major[$pos]} @fullset;
    foreach (1..$num_minor) { push(@list, $minor_list[int(rand($#minor_list+1))]); }
    foreach (0..$records-1) {
        $combinations[$_][$pos] = splice(@list, int(rand($#list+1)), 1);
    }
}

open OUTPUT, "> outfile.txt" or die "Cannot open output file.\n";
foreach (@combinations) {
    print OUTPUT join('', @{$_}), "\n";
}
close OUTPUT;
You've mentioned talking to your advisor a couple times, and changing the requirements for your program. I think the point KevinADC was trying to make is that people volunteered time to help you out with your project, you could make an effort to decide what you want first before getting others involved.

We're all here to help, but this isn't a script writing/programming service. Have you tried your hand at understanding and modifying the code? It's hard to learn unless you jump in and break things. :)
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top