Interesting results (tangent of the 8 base-pairs thread) 4

KevinADC · Jul 22, 2005

This is a tangent to this thread:

http://www.tek-tips.com/viewthread.cfm?qid=1095895&page=1

As I was testing code, I noticed that the pair "AC" seemd to be starting most of the 8 base-pair sequences. I would have thought that since it's random any base-pair could start a sequence. Maybe there is a flaw in my code or the hash to count the frequency that base-paris occur in:

Code:

#!perl
use strict;
#use Benchmark qw(:all);
#use Data::Dump qw(dump);

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $probab  = 0.851;
my %results = ();
my %pairs = map ($_ => 0) for (0..3);

for (0..99) {
   my %base_pairs = ();
   my @motif = ();
   until (scalar keys %base_pairs == 8) {
      my $index = int(rand(@fullset));
      $base_pairs{$fullset[$index]} = $fullset[$index];
   }
   my @base_pairs = keys %base_pairs;
   my @major = @base_pairs[0..3];
   my @minor = @base_pairs[4..7];
   for my $i (0..3) {
      push @motif, rand(1) < $probab ? $major[$i] : $minor[$i]; 
      $pairs{$i}{$motif[$i]}++;
   }
   my $motif = join('',@motif);
	redo if exists $results{$motif};
   $results{$motif} = $motif;
#   print "major = @major , minor = @minor , results = $motif\n";
}
#print "$_\n" for keys %results;
print "\n\n";
for my $keys (sort{$a <=> $b} keys %pairs) {
   print "Frequency of pairs at position @{[$keys+1]}:\n";
   for my $freq (sort { $pairs{$keys}{$b} <=> $pairs{$keys}{$a} } keys %{$pairs{$keys}}) {
	   print "   $freq = $pairs{$keys}{$freq}\n";
	}
   print "\n";
}

quite literally "AC" is always the most frequent pair in the first position, as in this example print out:

Code:

Frequency of pairs at position 1:
   AC = 47
   CC = 18
   AG = 12
   GG = 8
   GA = 6
   TG = 4
   AA = 3
   CT = 3
   AT = 1
   TT = 1
   CG = 1
   TA = 1

there appear to be other patterns in the frequency of occurences of other pairs too, any thoughts concerning this? I feel like I must be missing something obvious.

rharsh · Jul 31, 2005

If you generate a number of sequences other than 100 (either higher or lower) the percentages of major/minor nucleotides will be off (not equal to 85%).

My suggestion? If you're going to change the number of sequences that need to be generated, you'll need to fix the problem!

Everwood · Jul 31, 2005

Thanks! In case that I need the number other than 100, I guess I can change the probability accordingly. Right?

Thanks!!!

rharsh · Jul 31, 2005

No problem... actually, use this instead. I fixed the bit I mentioned earlier and, while doing it, found a real problem.

Code:

my @fullset = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my (@major, @minor, @combinations);
my ($prob, $records) = (85, 100);

my $num_major = ($prob / 100) * $records;
if ($num_major != int($num_major)) { $num_major++; }
my $num_minor = $records - $num_major;

{ my @temp = @fullset;
foreach (0..3) {
    my $r_pair = splice(@temp, int(rand($#temp+1)), 1);
    $major[$_] = $r_pair;
}
@temp = @fullset;
foreach (0..3) {
    my $r_index = int(rand($#temp+1));
    my $r_pair = ($temp[$r_index] ne $major[$_]) ? splice(@temp, $r_index, 1) : redo;
    $minor[$_] = $r_pair;
}}

foreach my $pos (0..$#major) {
    my @list;
    foreach (1..$num_major) { push(@list, $major[$pos]); }
    foreach (1..$num_minor) { push(@list, $minor[$pos]); }
    foreach (0..$records-1) {
        $combinations[$_][$pos] = splice(@list, int(rand($#list+1)), 1);
    }
}

open OUTPUT, "> outfile.txt" or die "Cannot open output file.\n";
foreach (@combinations) {
    print OUTPUT join('', @{$_}), "\n";
}
close OUTPUT;

jonadab · Sep 6, 2005

Hi, not sure if this thread still matters to anyone, but for the sake of posterity...

> well, if anyone is following this thread, I decided that
> when the hash keys are converted into an array here:
>
> my @base_pairs = keys %base_pairs;
>
> something is happening that I do not understand. I
> thought it would make a random list

No, keys is not guaranteed to return the keys in random
(or even pseudorandom) order. It returns them in an order
that *looks* random to the casual observer, but you may
find, depending on your version of Perl, that it is the
*same* apparently random order every single time you run
the program. If you want to guarantee at least a
pseudorandom order, you can run it through a Schwartzian
Transform, something like this:

Code:

my @base_pairs = map{ $$_[0] } sort {
  $$a[1] <=> $$b[1] } map { $_ => rand(1000)
} keys %base_pairs;

(Whether that is pseudorandom or a cryptographically-sound random is probably platform dependent and may also be dependent on your Perl version and other factors; for the purposes it's being used for in this thread this probably
does not matter.)

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Interesting results (tangent of the 8 base-pairs thread) 4

KevinADC

Technical User

rharsh

Technical User

Everwood

Technical User

rharsh

Technical User

jonadab

Technical User

Similar threads

Part and Inventory Search

Sponsor