Counting the number of repeat "units" 2

Status
Not open for further replies.

Captainrave

Technical User
Nov 16, 2007
97
GB
Hi everyone. I need help again! Basically, I have DNA repeats in a CSV file, with some other information in adjacent columns.

The repeats are always in the first column and can look like any of these (the number of repeat units is in parentheses):
AT|AT|AT|AT (4)
ATTA|ATTA|ATTA|ATTA (4)
A|A|A|A|A|A (6) etc....

What I need to do is output a new column that contains the "number of repeat units". I have broken up the repeats (above) to try and show you what I mean. So something like ATATATAT, when broken down, would be assigned the number 4, since it is AT repeated 4 times.

I had an idea for this. I was hoping to reuse the regex I originally used to locate the repeats, and to use $& (the matched text that Perl keeps in memory) to work out the number of repeat units. I have no idea how to implement this, and wondered if anyone had a better idea?

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file
print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE, '<', $repeat_filename) or die "Cannot open $repeat_filename: $!";
open(OUTFILE, '>', $outfile) or die "Cannot open $outfile: $!";

###########################
#      Output Unit Size   #
###########################

#Splits the line into columns
while (my$line = <REPEATFILE>){
  my($firstcol)= split /,/, $line;

  #Filters the repeats
  if ($firstcol =~ m/([acgt]+)(\1){3,39}(?!\1)?/xig){
    # Pseudocode, the part I cannot work out:
    # print $& into new empty column;
  }
}
exit;
 
Captainrave said:
So something like ATATATAT when broken down would be assigned the number 4...since it is AT repeated 4 times.
How would you determine that AT is the pattern that's repeated? AT is repeated 4 times, but ATAT is repeated twice. Would you just use the shortest repeating pattern?
 
Yes, I would use the shortest repeating pattern. Perl can work it out from the regex: m/([acgt]+)(\1){3,39}(?!\1)?/xig)
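To make the "shortest repeating pattern" point concrete, here is a minimal sketch that compares a greedy capture with a lazy one on a string made up entirely of one repeated unit. The helper name `repeat_unit` and the anchored patterns are my own assumptions for illustration, not the regex from the original script. A greedy `([acgt]+)` settles on the longest unit that still lets the rest of the pattern match, so it is the lazy `+?` form that gives the shortest unit.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the repeat unit captured
# by a greedy or a lazy quantifier, or undef if the whole string is not a
# tandem repeat of a single unit.
sub repeat_unit {
    my ($seq, $lazy) = @_;
    my $re = $lazy ? qr/^([acgt]+?)\1+$/i    # lazy: tries the shortest unit first
                   : qr/^([acgt]+)\1+$/i;    # greedy: tries the longest unit first
    return $seq =~ $re ? $1 : undef;
}

for my $seq (qw(ATATATAT AAAAAA)) {
    printf "%s greedy=%s lazy=%s\n",
        $seq, repeat_unit($seq, 0), repeat_unit($seq, 1);
}
```

For ATATATAT this prints greedy=ATAT and lazy=AT; for AAAAAA it prints greedy=AAA and lazy=A.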
 
I don't really like running two regexes against that string, but this works:
Code:
#Splits the line into columns
while (my $line = <REPEATFILE>){
  chomp(my @temp = split /,/, $line);

  #Filters the repeats
  my $num_repeats;
  if ($temp[0] =~ m/([acgt]+)(\1){3,39}(?!\1)?/i){
    my $pattern = $1;
    # No /o here: $pattern changes on every line, and /o would freeze the first one.
    $num_repeats = () = $temp[0] =~ m/\G\Q$pattern\E/gi;
  } else {
    $num_repeats = 0;
  }  
  print OUTFILE join(',', @temp, $num_repeats), "\n";
}
Depending on what else is in the data you're working with, you may want to consider using a module like Text::CSV_XS to handle the CSV parsing and printing.
 
Captainrave, I can't fully understand your regex. Why a minimum of three repetitions of the subpattern? You also lack the start-of-string and end-of-string anchors: I suppose you don't want anything in your string other than the repeated subpattern?
I would do it this way:
Code:
while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  if($firstcol=m/^([acgt]+?)\1+$/i){
    print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
  }
}
Zero will be printed if there are no repeats in the string, and also if there are extra characters before, between or after the repeating patterns.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Correction to the above
Code:
while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  $firstcol=m/^([acgt]+?)\1+$/i;
  print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
}

 
The reason I originally went for a minimum number of repetitions in the pattern is that short repeats are too frequent in the source files, so I set a higher value to try and cut some of the smaller repeats out. For the analysis now it doesn't matter, since I already have all the repeat sequences I want; I just need to process them.

I tried both methods. Prex1's seems to get closest. I end up with this file (this is a test file I am running):


Can you notice anything obvious that is going wrong?

P.S. I am still working on your idea rharsh.
 
Modified your idea and it works Prex1!!!

Code:
while(<REPEATFILE>){
  chomp;
  my($firstcol)=split/,/;
  $firstcol=~m/^([acgt]+?)\1+$/i;
  print OUTFILE $_,',',length($1)?length($firstcol)/length($1):0,"\n";
}

As ever, many thanks!!!!!!
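For anyone reusing this, here is a slightly more defensive sketch of the same idea. It only reads $1 when the match actually succeeded, rather than testing length($1) afterwards. The helper name `count_repeat_units` and the sample rows are assumptions made up for illustration; they are not data from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): number of repeat units in
# $seq, or 0 when $seq is not made up entirely of one repeated unit.
sub count_repeat_units {
    my ($seq) = @_;
    return 0 unless defined $seq;
    # The lazy capture finds the shortest unit; $1 is only used on success.
    if ($seq =~ /^([acgt]+?)\1+$/i) {
        return length($seq) / length($1);
    }
    return 0;
}

# Illustrative rows only (assumed layout: repeat in the first column).
my @rows = ('ATATATAT,chr1', 'ATTAATTAATTAATTA,chr2', 'AAAAAA,chr3', 'ACGT,chr4');
for my $row (@rows) {
    my ($firstcol) = split /,/, $row;
    print join(',', $row, count_repeat_units($firstcol)), "\n";
}
```

With these sample rows the appended counts are 4, 4, 6 and 0. In the real script the loop would read from REPEATFILE and print to OUTFILE instead.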
 