Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

How to generate negative dependent nucleotide pair 1

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
US
Hi,all! I am going to generate some nagative dependent
nucleotide pairs and need your help.

I have enclosed the description of the question and
my code here. Any suggestion is highly appreciated.



***********************************
The description

To generate negative dependent pair binding sites for motif

1. We can get 16 combinations of any 2 nucleotides. They are:
AA, AT, AC, AG,
TT, TC, TG, TA,
CC, CT, CG, CA,
GG, GC, GT and GA

For example, if we say pair “AA” is a positive dependent pair, which means that “A” always comes with another “A” across many sequences with probability x%. In other words, it looks like:
…………………
……AA………..
……AA……….
……AA……….
……AA……….
……AA……….
……AA……….
………………..

In contrast to positive pair, the negative pair “AG” looks like in some sequences:
……………….
……A………..
……A………..
……A………..
……..G……….
……..G……….
……..G……….
……………….

Which means that “A” is less likely to be with “G” across these sequences than other nucleotides G, T, C. But if we count the frequency of each nucleotide along the column, we can find that the “A” and “G” have the highest frequencies in its columns. By generating 4 negative pairs, we can end up with motif binding sites of length 8. Finally we are going to make 100 binding sites.

2. (1) Randomly pick 4 pairs from the 16 combinations which will be used as “negative pairs” in the sequences. For example, we get pairs AG, CT, CT, GG.
(2) Suppose the probability for each negative pair is 70%. In the 100 binding sites, we let the all the 1st nucleotides be A with probability 70%. In other words, there are 70 As in the 100 binding sites on the 1st positions.
If 1st position is A, then 2nd position will be G with probability 57% and A or C or T with probability (1-0.57)/3;
If 1st position is not A, then let 2nd position be G automatically;
(3) Repeat this for other three negative pairs.

3. Generally speaking, we have negative pair XY.
(a) let 1st nucleotides in 100 sites be X with probability 70% and other with probability 10%
(b) if 1st nucleotide = X, then let 2nd nucleotide in 100 sites be Y with probability 57% and other with probability (1-57%)/3;
(c) Else, let 2nd nucleotide in 100 sites be Y automatically;
(d) Repeat (a) (b) (c) for other three pairs.



*************************************
The code:

#!/usr/bin/perl
use strict;
use warnings;
my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major,$y, $i, @combinations);
my ($prob, $prob2, $prob3, $records) = (70, 57, 14,100); # Probably of major pair, num records to generate

# This is a basic ceiling function for $num_major
for ($y=0; $y<=99; $y++) {
my $num_major = ($prob / 100) * $records;
if ($num_major != int($num_major)) { $num_major = int($num_major++); }


{ my @temp = @fullset;
foreach (0..3) {
my $r_pair = splice(@temp, int(rand($#temp+1)), 1);
$major[$_] = $r_pair;
for ($i =0;$i<=7; $i++)
{ if ($major = the first nucleotide of the first negative pair) {my $num_major2 = ($prob2 / 100) * $records;}
else {my $num_major2 = $records;}
}}

foreach my $pos (0..$#major) {
my @list;
foreach (1..$num_major) { push(@list, $major[$pos]); }

my $pair = $major[$pos]; # because array indices don't work in
# a pattern match.

foreach (0..$records-1) {
$combinations[$_][$pos] = splice(@list, int(rand($#list+1)),
1);
}
}

 
the updated code as below:

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);

{ my @temp = @fullset;
foreach (0..3) {
my $r_pair = splice(@temp, int(rand($#temp+1)), 1);
$major[$_] = $r_pair;
}
}

foreach my $pos (0..$#major) {
@s_major=split //,$major[$pos];
@matched_at_least_1_base_against_major = grep { /$s_major[0]\w/ or /\w$s_major[1]/ } @fullset;

foreach (@fullset) { $num{$_} = 0; }
foreach (@matched_at_least_1_base_against_major) { $num{$_} = 10; }
$num{$major[$pos]}=40;

my @list;
foreach (@fullset) { push(@list, ($_) x $num{$_}); }

foreach (0..99) {
$combinations[$_][$pos] = splice(@list, int(rand($#list+1)), 1);
}
}

open OUTPUT, "> outfile.txt" or die "Cannot open output file.\n";
foreach (@combinations) {
print OUTPUT join('', @{$_}), "\n";
}
close OUTPUT;

####### Check
open CHECK,"outfile.txt" or die $!;
foreach (@major) {
@s = split //,$_;
while (<CHECK>) {
$first_base++ if /^$s[0]/;
$second_base++ if /^\w$s[1]/;
$first_pair++ if /^$s[0]$s[1]/;
}
print "@s\t$first_base\t$second_base\t$first_pair\n";
}
close CHECK;
 
This can probably be simplified quite a bit, but it does what you want:
Code:
my (@sites, @pairs);
my @chars = qw/A C G T/;
for (1..8) { push @pairs, $chars[rand($#chars+1)]; } # Populate @pairs w/ 8 chars
my @first_prob = (70,10,10,10);
my @second_prob = (57,14,14,15);
sub returnList($$);

foreach my $pair (0..3) {
    my @prim_chars = $pairs[$pair*2];
    push @prim_chars, randomizeList(grep {$_ ne $prim_chars[0]} @chars);
    my @sec_chars = $pairs[($pair*2)+1];
    push @sec_chars, randomizeList(grep {$_ ne $sec_chars[0]} @chars);
    my @second_count = (0,0,0,0);
    my @temp_list = returnList(\@prim_chars, \@first_prob);
    
    my %index;
    foreach (0..$#sec_chars) { $index{$sec_chars[$_]} = $_; }
    foreach (0..$#temp_list) {
        my $rand_nuc = splice(@temp_list, int(rand($#temp_list+1)),1);
        if ($rand_nuc ne $prim_chars[0]) {
            $sites[$_][$pair*2] = $rand_nuc;
            $sites[$_][($pair*2)+1] = $pairs[($pair*2)+1];
            $second_count[$index{$pairs[($pair*2)+1]}]++;
        } else {
            $sites[$_][$pair*2] = $rand_nuc;
            while (1) {
                my $rand = int(rand($#sec_chars+1));
                if ($second_count[$rand] < $second_prob[$rand]) {
                    $sites[$_][($pair*2)+1] = $sec_chars[$rand];
                    $second_count[$rand]++;
                    last;
                } else {
                    redo;
                }
            }
        }
    }
}


print "Pairs: ",@pairs,"\n";

open TEMP, "> tempfile.txt";
select TEMP;

foreach (@sites) {
    print join("", @{$_}), "\n";
}
select STDOUT;
close TEMP;

sub returnList ($$) {
    my $chars = shift;
    my $prob = shift;
    my @returnList;
    foreach my $char (0..$#{$chars}) {
        foreach (0..${$prob}[$char]-1) {
            push @returnList, ${$chars}[$char];
        }
    }
    return @returnList;
}

sub randomizeList {
    my @temp;
    foreach (0..$#_) { push @temp, splice(@_,int(rand($#_+1)),1); }
    return @temp;
}
Regarding the probablities of the secondary nucleotieds, (100-57)/3 works out to ~ 14.33. To make up for the fraction, my code is biased toward the last of the three remaining nucleotides. It orders the nucleotides randomly, so you shouldn't see the same one always appearing 15% of the time - but one will be a bit more dominant.
 
Hi rharsh,

Thank you so much!

Is "print "Pairs: ",@pairs,"\n";" to print the negative pairs?

Thanks again!
 
Everwood, yeah - that prints the pairs that were selected. You probably don't need it - I used it to verify everything was being generated correctly.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top