Counting the number of repeat "units" 2

Captainrave · Nov 23, 2008

Hi everyone. I need help again! Basically I have DNA repeats in a csv file with some other information in adjacent columns.

The repeats can be anything like (always in the first column):
AT|AT|AT|AT (4)
ATTA|ATTA|ATTA|ATTA (4)
A|A|A|A|A|A (6) etc....

What I need to do is to output a new column that contains the "number of repeat units". I have broken up the repeats (above) to try and show you what I mean. So something like ATATATAT when broken down would be assigned the number 4...since it is AT repeated 4 times.

I had an idea for this. I was hoping to reuse the regex I originally used to locate the repeats, and use $& to work out the number of repeat units from what the match leaves in the computer's memory. I have no idea how to implement this, and wondered if anyone had a better idea?

Code:

#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file
print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE,$repeat_filename);
open(OUTFILE,">$outfile");

###########################
#      Output Unit Size   #
###########################

#Splits the line into columns
while (my$line = <REPEATFILE>){
  my($firstcol)= split /,/, $line;

#Filters the repeats
if ($firstcol = m/([acgt]+)(\1){3,39}(?!\1)?/xig){
#
      print $& into new empty column;
      }
}
exit;
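
A note on the $& idea: after a successful match, $& holds the whole matched text (the full run of repeats) and $1 holds what the first capture group matched (one unit). Neither is a count, but dividing their lengths gives one. A minimal sketch, assuming a simplified form of the regex from the post; count_units is a hypothetical helper name, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of repeat units in the first
# run matched by the pattern, or 0 if no run of 4 or more copies is found.
sub count_units {
    my ($seq) = @_;
    if ($seq =~ /([acgt]+)\1{3,39}/i) {
        # $& is the whole matched run, $1 is a single unit.
        return length($&) / length($1);
    }
    return 0;
}

print count_units('ATATATAT'), "\n";          # AT x 4   -> 4
print count_units('ATTAATTAATTAATTA'), "\n";  # ATTA x 4 -> 4
print count_units('ACGT'), "\n";              # no run   -> 0
```

The pattern is not anchored here, so the count describes the first run found anywhere in the string, not necessarily the whole string.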

rharsh · Nov 23, 2008

Captainrave said:
So something like ATATATAT when broken down would be assigned the number 4...since it is AT repeated 4 times.

How would you determine that AT is the pattern that's repeated? AT is repeated 4 times, but ATAT is repeated twice. Would you just use the shortest repeating pattern?

Captainrave · Nov 23, 2008

Yes, I would use the shortest repeating pattern. Perl can work it out from the regex: m/([acgt]+)(\1){3,39}(?!\1)?/xig
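
One caveat, sketched below: [acgt]+ is greedy, so the regex engine tries the longest candidate unit first and does not necessarily settle on the shortest one. On a run of a single base, the greedy form can report a two-base unit. The lazy quantifier +? tries the shortest candidate first. The anchors and the open-ended {3,} bound are assumptions made for this demonstration:

```perl
use strict;
use warnings;

my $seq = 'AAAAAAAA';   # eight A's

# Greedy: the longest unit that still leaves room for 3 more copies wins.
my ($greedy) = $seq =~ /^([acgt]+)\1{3,}$/i;

# Lazy: candidate units are tried from one character upwards.
my ($lazy) = $seq =~ /^([acgt]+?)\1{3,}$/i;

print "greedy unit: $greedy\n";   # AA
print "lazy unit:   $lazy\n";     # A
```

So if the shortest unit is what matters, the lazy form is the safer choice.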

rharsh · Nov 23, 2008

I don't really like running two regexes against that string, but this works:

Code:

#Splits the line into columns
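while (my $line = <REPEATFILE>){
  chomp(my @temp = split /,/, $line);

  #Filters the repeats
  my $num_repeats;
  if ($temp[0] =~ m/([acgt]+)(\1){3,39}(?!\1)?/i){
    my $pattern = $1;
    #Counts consecutive copies of the unit, starting at the beginning of the string.
    #No /o here: $pattern changes on every line, so the regex must be rebuilt each time.
    $num_repeats = () = $temp[0] =~ m/\G$pattern/gci;
  } else {
    $num_repeats = 0;
  }
  print OUTFILE join(',', @temp, $num_repeats), "\n";
}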

Depending on what else is in the data you're working with, you may want to consider using a module like Text::CSV_XS to handle the CSV parsing and printing.
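
As a rough illustration of that suggestion, here is a sketch that reads and writes the rows with Text::CSV_XS instead of split and join. It assumes the module is installed from CPAN. The sample rows and the column contents are made up, and the anchored lazy regex is only one possible way to count the units:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up sample input held in a string so the sketch is self-contained.
# The quoted field with an embedded comma would break a plain split /,/.
my $input = qq{ATATATAT,"chr1, region 2",100\nACGT,chr2,200\n};
open my $in, '<', \$input or die "Cannot read sample data: $!";

my $output = '';
open my $out, '>', \$output or die "Cannot open output buffer: $!";

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" });

while (my $row = $csv->getline($in)) {
    my $seq   = $row->[0];
    my $count = 0;
    if ($seq =~ /^([acgt]+?)\1+$/i) {
        $count = length($seq) / length($1);
    }
    # print() quotes fields where needed and appends the eol string.
    $csv->print($out, [ @$row, $count ]);
}
close $out;

print $output;
```

With real files, the two in-memory handles would be replaced by handles opened on the input and output filenames.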

prex1 · Nov 24, 2008

Captainrave, I can't fully understand your regex. Why require a minimum of three repetitions of the subpattern? You also lack the start- and end-of-string anchors: I suppose you don't want anything in your string other than the repeated subpattern?
I would do it this way:

Code:

while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  if($firstcol=m/^([acgt]+?)\1+$/i){
    print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
  }
}

Zero will be printed if there are no repeats in the string, and also if there are extra characters before, between or after the repeating patterns.

Franco
http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

prex1 · Nov 24, 2008

Correction to the above

Code:

while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  $firstcol=m/^([acgt]+?)\1+$/i;
  print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
}

Franco


Captainrave · Nov 24, 2008

The reason I originally set a minimum number of repetitions in the pattern is that short repeats are too frequent in the source files, so I set a higher value to cut some of the smaller ones out. For the analysis now it doesn't matter, since I already have all the repeat sequences I want; I just need to process them.

I tried both methods. Prex1's seems to get closest. I end up with this file (this is a test file I am running):

http://myfreefilehosting.com/f/50ec11e78e_0.1MB

Can you notice anything obvious that is going wrong?

P.S. I am still working on your idea rharsh.

Captainrave · Nov 24, 2008

Modified your idea and it works Prex1!!!

Code:

while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  $firstcol=~m/^([acgt]+?)\1+$/i;
  print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
}

As ever, many thanks!!!!!!
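
For anyone comparing the last two listings: the only change is = to =~. With =, the match runs against $_ (the whole line) and its result is assigned to $firstcol, overwriting the sequence. With =~, the match is bound to $firstcol and the variable is left alone. A small sketch of the difference, using a made-up input line:

```perl
use strict;
use warnings;

my $re = qr/^([acgt]+?)\1+$/i;

# Made-up stand-in for one line read by while(<REPEATFILE>).
$_ = 'ATATATAT,chr1,100';
my ($firstcol) = split /,/;

# '=' : the match runs against $_ (the whole line, commas included), and
# its result replaces the contents of the variable.
my $assigned = $firstcol;
$assigned = m/$re/;

# '=~': the match runs against the variable, which is left untouched.
my $bound = $firstcol;
my $count = 0;
if ($bound =~ m/$re/) {
    $count = length($bound) / length($1);
}

print "after '=' : '$assigned'\n";             # empty string: the match on $_ failed
print "after '=~': '$bound', count $count\n";  # ATATATAT, count 4
```

In this example the anchored match against the whole line fails because of the commas, so the variable ends up holding an empty string instead of the sequence, and any length calculation based on it is wrong.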

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!
