Removing entries from a file

Status
Not open for further replies.

Captainrave

Technical User
Nov 16, 2007
97
GB
Hi everyone. So I am completely stuck. Basically I have a csv file with data like:

First column
AAAAAA more data in adjacent columns <keep
ATATAT more data in adjacent columns <keep
AATAAT more data in adjacent columns <delete

I want to delete any line that has three letters repeated over and over again. This should be possible with a reasonably simple regular expression, right? Anyone got any ideas? Does anyone have any experience with the csv Perl module? So far all I have is:

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file

print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE, $repeat_filename);

open(OUTFILE, ">$outfile");

#read the repeats from file and store them
@repeat = <REPEATFILE>;
chomp @repeat;
#close repeat file
close REPEATFILE;

#Split each line of the input file
#IF first column does not equal AAAAAA or ATATAT then delete
for my $line (@repeat) {
    my @data1 = split /,/, $line;
}

exit;

I would appreciate any help/suggestions that you have.
 
In your code you split in a loop and never do anything with the split data. You have to do the split and the comparison inside the same loop:

Code:
#split each line of the input file
for my $line (@repeat) {
     my @data1 = split /,/, $line;
     print OUTFILE if(substr($data1[0],0,3)ne substr($data1[0],-3));
}
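
To see what the first-three-versus-last-three comparison does to the sample rows from the question, here is a minimal sketch. The sub name keep_by_substr and the extra column values are made up for illustration, and the loop prints $line explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the first 3 chars of the first CSV field
# differ from its last 3 chars (the same test as in the loop above).
sub keep_by_substr {
    my ($line) = @_;
    my ($firstcol) = split /,/, $line;
    return substr($firstcol, 0, 3) ne substr($firstcol, -3) ? 1 : 0;
}

# Sample rows modelled on the question; the numeric columns are invented.
for my $line ("AAAAAA,1,2\n", "ATATAT,3,4\n", "AATAAT,5,6\n") {
    print $line if keep_by_substr($line);
}
```

Only the ATATAT row is printed: AATAAT is dropped, but so is AAAAAA, because its first three and last three characters are also equal.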

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
You need to follow Kevin's advice in the last post, but you also need to print $line explicitly: inside a "for my $line" loop the special variable $_ is never set, so a bare print has nothing to write. I have also captured only the first field, for faster execution:
Code:
for my $line (@repeat) {
  my ($firstcol) = split /,/, $line;
  print OUTFILE $line if (substr($firstcol, 0, 3) ne substr($firstcol, -3));
}
Also, don't chomp your array, or the lines will be written out without their line terminators. (You seem to prefer reading the file into an array, but that's unnecessary.)
And oops! I realize now from your file that the substr method doesn't work for you, as it rejects strings made of a single repeated char. And, as you have strings with 5 chars, you also need to decide what to do with a string like ACACA (or ACACACA for that matter): in this one the first 3 chars are the same as the last 3. Will you throw it away or keep it?
Assuming you'll keep those, your set of rules would be:
-accept all strings with an odd number of chars
-in the remaining group, accept all strings where the first 3 chars are not equal to the last three.
-in the remaining group, accept all strings with a single repeated char
This would give the following code (also made case insensitive, though you don't seem to need that):
Code:
while(<REPEATFILE>){
  my($firstcol)=split/,/;
  my$len=length$firstcol;
  if($len>>1<<1!=$len){
      #odd number of chars: accept
    print OUTFILE;
    next;
  }
  $firstcol=uc$firstcol;
  if(substr($firstcol,0,3)ne substr($firstcol,-3)){
      #first 3 chars differ from the last 3: accept
    print OUTFILE;
    next;
  }
  $firstcol=~tr/A-Z//s;
    #squash duplicated chars
  if(length$firstcol==1){
      #all chars were the same: accept
    print OUTFILE;
    next;
  }
}

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
When using this code it still doesn't filter the "OUTFILE". However it is in the right format now (I have removed the chomp):

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file
print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE, $repeat_filename);

open(OUTFILE, ">$outfile");

#read the repeats from file and store them
@repeat = <REPEATFILE>;
#close repeat file
close REPEATFILE;

###########################
#      3mer deletion      #
###########################

while(<REPEATFILE>){
  my($firstcol)=split/,/;
  my$len=length$firstcol;
  if($len>>1<<1==$len){
    print OUTFILE;
    next;
  }
  $firstcol=uc$firstcol;
  if(substr($firstcol,0,3)ne substr($firstcol,-3)){
    print OUTFILE;
    next;
  }
  $firstcol=~tr/A-Z//s;
  if(length$firstcol==1){
    print OUTFILE;
    next;
  }
}

exit;

Your rules have given me a lot more to think about. Extending what you told me Prex1, the universal rules would be:
-delete any line where there are an odd number of characters in the first column EXCEPT those where all the characters are the same e.g. AAAAAA, CCCCC, GGGGG, TTTTT.
-Keep everything else.

Once again thank-you for your help.
 
If I'm reading this correctly (and based on the letters used), you're basically trying to break up a DNA strand looking for repeats. I don't think you're going to achieve this with a regexp (however, Kevin will likely prove me wrong). Anyway, consider the following as a subroutine call:
Code:
$sequence = <a string from col 1> ;
$split_into = 2 ;

$repeats_found = 0 ; # Start w/ 1 here for optB below.
while (1) {
   $segment_size = length($sequence) / $split_into ;
   last if ($segment_size < 3) ; # since AC AC AC is ok.
   if (int($segment_size) != $segment_size) {
      # Don't want fractional strings
      ++$split_into ; next ;
   }
   # If here, we know we can split the string into 
   # "$split_into" equal segments
   @segments = &split_me_into_equal_segments($sequence, $split_into) ;
   $first_segment = shift @segments ;
   for ($i=0 ; $i <= $#segments ; ++$i) {
      # Here I'm not sure what you need.  Do you need to
      # fail the test if:
      #  OptA -any- sub_sequence matches or
      #  OptB -all- sub_sequences match
      # ?? something else ??
      if ($first_segment eq $segments[$i]) {
         # Opt A: claim fail because something matched
         $repeats_found = 1 ;
         return $repeats_found ; # Note 'return' here.
      };
      # ?or?
      if ($first_segment ne $segments[$i]) {
         # Opt B: claim ok because something didn't match
         $repeats_found = 0 ;
         last ; # don't return - need to verify all sub-seqs
      };
   };
   ++$split_into ; # try the next segment count
}
return $repeats_found ;
Please note that this is just off the top of my head. It's not syntactically pretty nor very efficient. It's just a first attempt to get the idea across.
 
Perhaps to explain a little more of what I'm trying to achieve with the above:
- ATTGGATTGG
= (ATTGG)(ATTGG)
= will claim repeats for opts A&B
- AGGAGGAGG
= (AGG)(AGG)(AGG)
= will claim repeats for opts A&B
- AGGGAAAGG
= (AGG)(GAA)(AGG)
= will claim repeats for A but not B
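
The three examples above can be checked with a small runnable sketch. The helper names split_into_equal_segments and has_repeats are hypothetical, as is the mode argument ('any' for Opt A, 'all' for Opt B). Segments shorter than 3 characters are skipped, as in the outline.

```perl
use strict;
use warnings;

# Cut $seq into $n equal-length pieces (assumes length($seq) is divisible by $n).
sub split_into_equal_segments {
    my ($seq, $n) = @_;
    my $size = length($seq) / $n;
    return map { substr($seq, $_ * $size, $size) } 0 .. $n - 1;
}

# Returns 1 if a repeat is found, 0 otherwise.
# $mode 'any': some later segment equals the first one (Opt A).
# $mode 'all': every later segment equals the first one (Opt B).
sub has_repeats {
    my ($seq, $mode) = @_;
    # $n <= length/3 keeps every segment at least 3 characters long.
    for my $n (2 .. int(length($seq) / 3)) {
        next if length($seq) % $n;          # no fractional segments
        my ($first, @rest) = split_into_equal_segments($seq, $n);
        my $matches = grep { $_ eq $first } @rest;
        return 1 if $mode eq 'any' && $matches > 0;
        return 1 if $mode eq 'all' && $matches == @rest;
    }
    return 0;
}

for my $seq (qw(ATTGGATTGG AGGAGGAGG AGGGAAAGG)) {
    printf "%s any=%d all=%d\n", $seq,
        has_repeats($seq, 'any'), has_repeats($seq, 'all');
}
```

Note that under either option a single-base string such as AAAAAA also counts as a repeat (AAA, AAA), so it needs separate handling if those lines are to be kept.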

 
More or less what I am trying to do if I understand correctly. I have found the repeats from the DNA sequences, but now I need to remove trinucleotide repeats from the first column of a csv excel sheet (and the rest of the line they are on).

I will try your suggestion and see how it works.

I do believe that these rules cover what I am trying to do now:
-delete any line where there are an odd number of characters in the first column EXCEPT those where all the characters are the same e.g. AAAAAA, CCCCC, GGGGG, TTTTT.
-Keep everything else.
 
Forgot about the AAAAAA type strings - ie:
($seq =~ /^([ATCG])\1+$/)
This can be added as an additional check at the beginning of the subroutine.
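
As a quick sketch of that check in isolation (the sub name is_homopolymer is made up here, and the /i flag is an added assumption to tolerate lower-case input):

```perl
use strict;
use warnings;

# True when the whole string is one base repeated two or more times.
sub is_homopolymer {
    my ($seq) = @_;
    return $seq =~ /^([ATCG])\1+$/i ? 1 : 0;
}

print "$_: ", is_homopolymer($_), "\n" for qw(AAAAAA CCCCC ATATAT A);
```

Because of the \1+, a one-character string does not match; change it to \1* if a lone base should count as well.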

Your current set of rules does not indicate what to do with 3 character sequences that repeat. Has this goal changed?

The subroutine I tried to put together will compare 3, 4, ... n character sub-sequences (where n = (len(seq) / 2)). So if you're only looking for the first 3 to match the last 3, what I tried is overkill.

However, thinking about the subroutine I submitted for consideration, pass back the $split_into var and not just a 'true/false' value. This could be used downstream to select sequences that have min/max repetitive sequence strings.
 
No, I originally thought the first 3 and last 3 would work as a rule. However, it removes repeats like AAAAAA and CCCCCC etc...

I don't want to keep 3 character sequences that repeat. I want to keep everything else, which is why the following should work:

"delete any line where there are an odd number of characters in the first column EXCEPT those where all the characters are the same e.g. AAAAAA, CCCCC, GGGGG, TTTTT."

 
Just a reminder, in case you didn't see it, PinkeyNBrain. The input file I am filtering is here:


I want to remove all trinucleotide repeats from the first column. So the following rules *SHOULD* work:
-delete (or skip) any line where there are an odd number of characters in the first column EXCEPT those where all the characters are the same e.g. AAAAAA, CCCCC, GGGGG, TTTTT.
-Keep everything else (i.e. write to the outfile).
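
The goal stated above (drop trinucleotide repeats, keep everything else) can also be written directly as a regular expression. This is only a sketch of one possible reading of the requirement, not the code used in this thread: the sub name is_trimer_repeat is hypothetical, and it assumes the first column holds only upper-case A, C, G and T and consists entirely of complete copies of one 3-base unit.

```perl
use strict;
use warnings;

# True when the string is a 3-base unit repeated end to end (e.g. AATAAT),
# excluding single-base runs such as AAAAAA, which should be kept.
sub is_trimer_repeat {
    my ($seq) = @_;
    return 0 if $seq =~ /^([ACGT])\1*$/;           # single-base run: keep
    return $seq =~ /^([ACGT]{3})\1+$/ ? 1 : 0;     # unit repeated 2+ times
}

# Sample rows modelled on the question; the second column is invented.
for my $line ("AAAAAA,x\n", "ATATAT,x\n", "AATAAT,x\n") {
    my ($firstcol) = split /,/, $line;
    print $line unless is_trimer_repeat($firstcol);
}
```

With these sample rows, the AAAAAA and ATATAT lines are printed and the AATAAT line is skipped.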
 
CaptainRave, your code can't work, as you read REPEATFILE into an array first (and close it), then you try to read from the file again. Choose one: either read directly from the file (my preference) or read into an array and then process the array.
Your last rules (hope they won't change once more...) would give the following code:
Code:
open(REPEATFILE,$repeat_filename);
open(OUTFILE,">$outfile");
while(<REPEATFILE>){
  my($firstcol)=split/,/;
  my$len=length$firstcol;
  if($len>>1<<1!=$len){
      #length is an odd number
    $firstcol=~tr/A-Z//s;  
      #squash duplicated chars
    print OUTFILE if length$firstcol==1;
      #this means all chars were the same
  }else{ 
    print OUTFILE;
  }
}
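
The first point, that a filehandle already read into an array has nothing left to give, can be seen in a short sketch. It uses an in-memory filehandle in place of REPEATFILE so that it runs without any input file; the variable names and sample data are made up for illustration.

```perl
use strict;
use warnings;

my $data = "AAAAAA,1\nATATAT,2\nAATAAT,3\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @repeat = <$fh>;          # list context reads every remaining line
my $second_pass = 0;
while (my $line = <$fh>) {   # already at end of file: the body never runs
    $second_pass++;
}
close $fh;

print "lines in array: ", scalar(@repeat), "\n";
print "lines seen by the while loop: $second_pass\n";
```

The array holds all three lines and the while loop sees none, which is why the output file stayed unfiltered.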

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Thank-you to everyone who has helped. I will try these suggestions and post my results here.
 
Have you considered using Bioperl Captainrave?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I have used it in the past, and it is very useful. However, on this occasion I need to do something very, very specific. The biggest problem is that I haven't used Perl in ages and that my boss expects miracles (when he can't even suggest anything himself!).
 
Me again! So I am working on Prex1's code. But I keep getting a blank OUTFILE and I think it is because I am having trouble telling it exactly WHAT to print TO the outfile. I am trying to print the entire line.

This is what I have using Prex1's code (as you can see I am having trouble implementing it)

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file
print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE,$repeat_filename);
open(OUTFILE,">$outfile");

###########################
#      3mer deletion      #
###########################

while(<REPEATFILE>){
  my($firstcol)=split/,/;
  my $firstcol = split /,/, $line;
  my$len=length$firstcol;
  if($len>>1<<1!=$len){
      #if length is an odd number skip (i.e. trimers)
    $firstcol=~tr/A-Z//s;
      #bring duplicated characters together
    print OUTFILE "$line\n" "if length$firstcol==1;
      #prints to outfile if all chars were the same (monomers)
  }else{
    print OUTFILE "$line\n";
  }
}


exit;
 
Correct the while as follows: name the loop variable $line, drop the first split, and put parentheses around $firstcol in the second one so that it receives the first field rather than the field count.
Code:
while my$line(<REPEATFILE>){
  my($firstcol)= split /,/, $line;
  my$len=length$firstcol;
  if($len>>1<<1!=$len){
      #if length is an odd number skip (i.e. trimers)
    $firstcol=~tr/A-Z//s;
      #bring duplicated characters together
    print OUTFILE $line if length$firstcol==1;
      #prints to outfile if all chars were the same (monomers)
  }else{
    print OUTFILE $line;
  }
}
This should work. You have some problems understanding the use of the special variable $_, so I've suppressed the use of it in the above. Also note that $line already includes the line terminator, so you don't need to add it.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
I get a syntax error near "while my" and near the last "}"?
 
Changed it to while (my$line = <REPEATFILE>). I think that is correct.
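
Putting that corrected while line together with the rest of the loop gives something like the sketch below. It uses lexical, in-memory filehandles so that it runs stand-alone, the sub name filter_lines and the sample data are invented for the example, and $len % 2 is used as an equivalent of the shift-based odd-length test.

```perl
use strict;
use warnings;

# Copy lines from $in to $out, dropping those whose first column has an odd
# length unless that column is a single repeated character.
sub filter_lines {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        my ($firstcol) = split /,/, $line;
        if (length($firstcol) % 2) {
            (my $squashed = $firstcol) =~ tr/A-Z//s;   # AAAAA -> A
            print {$out} $line if length($squashed) == 1;
        }
        else {
            print {$out} $line;    # even length: always kept
        }
    }
    return;
}

my $input  = "AAAAA,1\nATATA,2\nAATAAT,3\nAAAAAA,4\n";
my $output = '';
open(my $in,  '<', \$input)  or die "cannot read: $!";
open(my $out, '>', \$output) or die "cannot write: $!";
filter_lines($in, $out);
close $in;
close $out;
print $output;
```

With this input, the ATATA line is the only one removed. An even-length first column such as AATAAT always passes under these rules.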

However the first column is still not being filtered :(.

For example, the line starting with AATAAT is still there. The rules should remove this, right? Will do some more digging...
 
If I change,

if($len>>1<<1!=$len){ to if($len>>2<<1!=$len){

then I just get all the monomers (all the lines with single character types like AAAAAA, AAAAA, CCCCC etc...). This is half way towards my goal.
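
The effect of that change can be checked by evaluating both expressions for a few lengths. The shift operators bind tighter than != in Perl, so the two tests parse as (($len >> 1) << 1) != $len and (($len >> 2) << 1) != $len. The sub names below are made up for the sketch.

```perl
use strict;
use warnings;

# Original test: clear the lowest bit and compare; true only for odd lengths.
sub odd_test {
    my ($len) = @_;
    return ((($len >> 1) << 1) != $len) ? 1 : 0;
}

# Modified test: the left side is roughly half of $len, so it differs from
# $len for every length greater than zero.
sub modified_test {
    my ($len) = @_;
    return ((($len >> 2) << 1) != $len) ? 1 : 0;
}

printf "len=%d odd_test=%d modified_test=%d\n", $_, odd_test($_), modified_test($_)
    for 3 .. 8;
```

Since the modified test is true for every non-empty first column, every line goes through the single-repeated-character branch, which is why only the monomers are written.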
 
Ok, so I designed a better test file (more sensitive, fewer entries)...and the script works!!!!!! It can even be tweaked to do other things that I might want.

Thank you so so much everyone. I will let this thread die now so other people can get help.
 