Removing entries from a file

Status
Not open for further replies.

Captainrave

Technical User
Nov 16, 2007
GB
Hi everyone. I am completely stuck. I have a CSV file with data like:

First column
AAAAAA more data in adjacent columns <keep
ATATAT more data in adjacent columns <keep
AATAAT more data in adjacent columns <keep

I want to delete any line that has three letters repeated over and over again. This should be possible with a reasonably simple regular expression, right? Does anyone have any ideas, or any experience with the csv Perl module? So far all I have is:

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file

print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE, $repeat_filename);

open(OUTFILE, ">$outfile");

#read the repeats from file and store them
@repeat = <REPEATFILE>;
chomp @repeat;
#close repeat file
close REPEATFILE;

#Split each line of the input file
#IF first column does not equal AAAAAA or ATATAT then delete
for my $line (@repeat) {


exit;

I would appreciate any help/suggestions that you have.
 
Sorry, what I meant was:

First column
AAAAAA more data in adjacent columns <keep
ATATAT more data in adjacent columns <keep
AATAAT more data in adjacent columns <DELETE

Working on the regular expression. Can't come up with anything at the moment.

Also, would I use truncate to delete the line?
 
Since there are only 4 possible letters (a, c, g, t), I could do it the really long way, as in:

If

a+
c+
g+
t+
ac+
ag+

for all possibilities (of which there are 16)

else delete line?

There must be a better way though!
 
It would really help if you could define some rules that make sense. This is too vague:

I want to delete any line that has three letters repeated over and over again.

"Over and over again" is not something you can quantify.

This also makes little sense because AAA is repeated in the first line:

AAAAAA more data in adjacent columns <keep
ATATAT more data in adjacent columns <keep
AATAAT more data in adjacent columns <DELETE

So try to come up with a rule or set of rules that can be applied to your data to decide whether to keep or delete a line.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I agree that the rules are a little confusing. Assuming AAAAAA should be deleted, then how about this.

Code:
#!/usr/bin/perl
use strict;
use warnings;

my @foo = ('AAAAAA', 'ATATAT', 'AATAAT');

# Print the entries whose first three characters equal the next three.
foreach (@foo) {
    print "$_\n" if substr($_,0,3) eq substr($_,3,3);
}

--
 
Firstly, thanks so far for all your responses!

I have worked a little bit more on this. The regular expression itself I have more or less worked out (or at least am working on it). It is very difficult to explain. How do I implement this into my script?

Code:
#!C:/Perl/bin/perl.exe -w

#Opening repeat distribution file

print "please type the filename of the repeatdistribution.csv file:";
$repeat_filename = <STDIN>;
chomp $repeat_filename;

print "please type the filename to save the results to (.csv format !!important!!):";
$outfile = <STDIN>;
chomp $outfile;

open(REPEATFILE, $repeat_filename);

open(OUTFILE, ">$outfile");

#read the repeats from file and store them
@repeat = <REPEATFILE>;
chomp @repeat;
#close repeat file
close REPEATFILE;

###########################
#      3mer deletion      #
###########################

#Split each line of the input file
for my $line (@repeat) {
     my @data1 = split /\s+/, $line;
     
      if (m/([acgt]+)(3,3) {

delete line

     }
     
print OUTFILE ()

exit;

Notice the last bit is incomplete. How would I get it to delete the line and then write out the new file with the lines removed?
 
I think you are trying to write this:

if (m/[acgt]{3,3}/) {

But that's not going to do what you want (I think). Since you haven't defined any rules, I don't know what to suggest.
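For what it's worth, here is a small sketch of why that pattern is probably too broad. It reuses the three sample values from the first post and adds the /i flag (an assumption, since the sample data is upper case). `[acgt]{3}` only asks for three consecutive letters from the set somewhere in the string, so it cannot tell a repeat from any other run of letters.

```perl
use strict;
use warnings;

# Sample values taken from the first post.
my @samples = qw(AAAAAA ATATAT AATAAT);

# Keep the samples that contain three consecutive letters from acgt.
# /i is an assumption here, because the sample data is upper case.
my @hits = grep { /[acgt]{3}/i } @samples;

print scalar(@hits), " of ", scalar(@samples), " samples match\n";
```

All three samples match, so the pattern on its own does not separate the lines to keep from the lines to delete.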

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Yes, that's it. I appreciate your patience. Basically I want anything with a word (repeat unit) length of 3 deleted.

Something like AAAAAAA or CCCCCCC doesn't count, since in its simplest form it can be broken down to single letters. ACACACAC in its simplest form can be broken down to repeats of AC, which we would keep. Anything like ACAACA in its simplest form would be broken down to ACA, ACA, which I want deleted. Like I said, it is difficult to explain.

I *think* I can solve the regular expression. However the next problem is fitting it into my script.

That is: if the regex matches, the line is deleted (for every line). If I can get this working I can easily test it.
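One way to express the "simplest form" rule described above, as a sketch rather than a tested solution: let a lazy capture with a backreference find the shortest unit that rebuilds the whole string, then check the length of that unit. The helper name `smallest_unit` is made up for illustration, and the sketch assumes that a string which is not an exact whole-number repeat of some unit should be kept.

```perl
use strict;
use warnings;

# Hypothetical helper: return the shortest unit that, repeated two or more
# times, rebuilds the whole string; return the string itself if there is none.
sub smallest_unit {
    my ($seq) = @_;
    # (.+?) tries the shortest candidate first; \1+ requires at least one
    # further copy; the anchors force the copies to cover the whole string.
    return $seq =~ /^(.+?)\1+$/ ? $1 : $seq;
}

for my $seq (qw(AAAAAAA ACACACAC ACAACA)) {
    my $unit    = smallest_unit($seq);
    my $verdict = length($unit) == 3 ? 'delete' : 'keep';
    print "$seq -> unit $unit -> $verdict\n";
}
```

With these three inputs the units come out as A, AC and ACA, so only ACAACA is marked for deletion.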
 
As the first column appears to contain 6 characters (otherwise, like Kevin, I'd ask you to specify your conditions), there is only one way to repeat three letters: the first 3 must equal the last 3. Your code could look like this:
Code:
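open(REPEATFILE, $repeat_filename);
open(OUTFILE, ">$outfile");
while (<REPEATFILE>) {
  my @data1 = split;    # split the current line ($_) on whitespace
  # print the line unless the first three characters equal the rest
  print OUTFILE if (substr($data1[0], 0, 3) ne substr($data1[0], 3));
}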




Franco
 
Actually that rule would work well. However the first column can contain any number of characters. But yes, I want to delete any lines where the first column begins and ends with the same three characters, and then output the new file.
 
What about this (assuming my regular expression is correct...feel free to suggest something better!)

Code:
#Split each line of the input file
for my $line (@repeat) {
     my @data1 = split /','/, $line;

foreach $line (@repeat) {
print OUTFILE $line if @data1 (m/([acgt]+)(3,3)xig);
}

close(FILE);
 
Actually it is more likely to be along the lines of:

print OUTFILE $line unless @data1 eq (m/([acgt]+)(3,3)xig);

How do I do that in Perl?
 
Sorry for writing so much.

But this SHOULD work: rather than deleting a line, just look for the right lines (of which there are 16 kinds) and print THOSE to the outfile.

Code:
#Split each line of the input file
for my $line (@repeat) {
     my @data1 = split /','/, $line; }

foreach $line (@repeat) {

#test for acacac...
if (@data1 =~ m/(ac){0,}) { print OUTFILE "$line/n/n" }
#test for agagag...
elseif (@data1 =~ m/(ag){0,}) { print OUTFILE "$line/n/n" }
#test for atatat...
elseif (@data1 =~ m/(ag){0,}) { print OUTFILE "$line/n/n" }

}

exit;

Of course, the elseifs would be repeated 15 or so times. Does that work? The code doesn't seem to be working properly in my Perl editor, but it should, shouldn't it? Are my brackets wrong?
 
Ok, so I missed out xig there! My final script is this:

Code:
#split each line of the input file
for my $line (@repeat) {
     my @data1 = split /','/, $line; }

#test all lines
foreach $line (@repeat) {

#pull out all "a" repeat dinucleotides and mononulceotides
#test for acacac...
if ( @data1 =~ m/(ac){0,}/xig )     { print OUTFILE "$line/n"; }
#test for agagag...
     elseif ( @data1 =~ m/(ag){0,}/xig ) { print OUTFILE "$line/n"; }
#test for atatat...
     elseif ( @data1 =~ m/(at){0,}/xig ) { print OUTFILE "$line/n"; }
#test for aaaaaa...
     elseif ( @data1 =~ m/(aa){0,}/xig ) { print OUTFILE "$line/n"; }
     
#test for cacaca...
     elseif ( @data1 =~ m/(ca){0,}/xig ) { print OUTFILE "$line/n"; }
#test for cgcgcg...
     elseif ( @data1 =~ m/(cg){0,}/xig ) { print OUTFILE "$line/n"; }
#test for ctctct...
     elseif ( @data1 =~ m/(ct){0,}/xig ) { print OUTFILE "$line/n"; }
#test for cccccc...
     elseif ( @data1 =~ m/(cc){0,}/xig ) { print OUTFILE "$line/n"; }
     
#test for gagaga...
     elseif ( @data1 =~ m/(ga){0,}/xig ) { print OUTFILE "$line/n"; }
#test for gcgcgc...
     elseif ( @data1 =~ m/(gc){0,}/xig ) { print OUTFILE "$line/n"; }
#test for gtgtgt...
     elseif ( @data1 =~ m/(gt){0,}/xig ) { print OUTFILE "$line/n"; }
#test for gggggg...
     elseif ( @data1 =~ m/(gg){0,}/xig ) { print OUTFILE "$line/n"; }
     
#test for tatata...
     elseif ( @data1 =~ m/(ta){0,}/xig ) { print OUTFILE "$line/n"; }
#test for tctctc...
     elseif ( @data1 =~ m/(tc){0,}/xig ) { print OUTFILE "$line/n"; }
#test for tgtgtg...
     elseif ( @data1 =~ m/(tg){0,}/xig ) { print OUTFILE "$line/n"; }
#test for tttttt...
     elseif ( @data1 =~ m/(tt){0,}/xig ) { print OUTFILE "$line/n"; }

#anything else skip
          else { next; }

}

exit;

However I am getting syntax errors on the elseifs. Can you see what is wrong?
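Apart from the syntax error, the tests themselves may not behave as intended: `{0,}` allows zero copies and the pattern is not anchored, so it succeeds on any string. A possible sketch of a single anchored pattern that covers all sixteen cases follows. The helper name `is_mono_or_di_repeat` is hypothetical, and the sketch assumes the field must consist of whole repeats and nothing else.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the field is one or two letters from acgt
# repeated at least twice, with nothing before or after.
sub is_mono_or_di_repeat {
    my ($field) = @_;
    return $field =~ /^([acgt]{1,2})\1+$/i ? 1 : 0;
}

# Unanchored with {0,}: zero copies are allowed, so this matches anything.
my $always = 'AATAAT' =~ /(ac){0,}/i ? 1 : 0;

print "AATAAT matches /(ac){0,}/i: $always\n";
print "$_: ", is_mono_or_di_repeat($_), "\n" for qw(AAAAAA ATATAT AATAAT);
```

The backreference `\1` stands in for "the same one or two letters again", which is what the sixteen separate branches were spelling out by hand.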
 
I still don't know what regexp to suggest, but if you look at the Perl documentation you will find there is no "elseif" keyword, only "elsif". I also doubt you can use a regexp on a list:

Code:
use warnings;
@array = qw (foo bar);
if (@array =~ /foo/) {
   print "foo";
}

Perl warns:

Applying pattern match (m//) to @array will act on scalar(@array) at script line 3.

scalar(@array) is the number of elements in the array:

Code:
@array = qw (foo bar);
if (@array =~ /2/) {
   print "foo";
}

The above prints "foo" because scalar(@array) is 2, which matches /2/, so the "if" condition is true.
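To match against the contents rather than the element count, one option (a sketch, using the same two-element array) is to test a single element, or to use grep when the whole list needs checking.

```perl
use strict;
use warnings;

my @array = qw(foo bar);

# Test one element: the pattern is applied to a scalar, as m// expects.
my $first_matches = $array[0] =~ /foo/ ? 1 : 0;

# Test the whole list: grep in scalar context counts the matching elements.
my $match_count = grep { /foo/ } @array;

print "first element matches: $first_matches\n";
print "elements matching: $match_count\n";
```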








------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
CaptainRave said:
I want to delete any lines where the first column begins and ends with the same three characters
so
Code:
  print OUTFILE if(substr($data1[0],0,3)ne substr($data1[0],-3));

Franco
 
That certainly makes it simpler. However, the output is no longer in list form and it doesn't seem to be filtering on the first column. I suspect there is a problem with the split?

I have included my input and output file from my long winded method:

Input -
Output -

Using the following code from prex1 I get a blank output file?

Code:
#split each line of the input file
for my $line (@repeat) {
     my @data1 = split /','/, $line; }

#test all lines
foreach $line (@repeat) {
print OUTFILE if(substr($data1[0],0,3)ne substr($data1[0],-3));

}

exit;

Am I missing something really obvious? It has been a while since I have used Perl!
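A likely cause, offered as a sketch rather than a confirmed fix for this particular file: `my @data1` is declared inside the first loop, so it no longer exists when the second loop runs, and `print OUTFILE` with no argument prints `$_`, which `foreach $line` never sets. Splitting and testing inside the same loop avoids both problems. The sample rows and the helper name `begins_and_ends_alike` are invented for illustration, and the plain `split /,/` assumes there are no quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the first comma-separated field begins and
# ends with the same three characters.
sub begins_and_ends_alike {
    my ($line) = @_;
    my ($first) = split /,/, $line;    # naive split: assumes no quoted commas
    return 0 unless defined $first && length($first) >= 3;
    return substr($first, 0, 3) eq substr($first, -3) ? 1 : 0;
}

# Made-up sample rows standing in for the lines read from the input file.
my @repeat = (
    'AAAAAA,12,0.5',
    'ATATAT,7,0.2',
    'AATAAT,3,0.1',
);

# Split and test in one pass, so each test sees its own line's first field.
my @kept = grep { !begins_and_ends_alike($_) } @repeat;
print "$_\n" for @kept;
```

Under this rule AAAAAA is dropped along with AATAAT, so it would need an extra condition if single-letter repeats are meant to be kept.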
 
I think this part is messed up:

Code:
#split each line of the input file
for my $line (@repeat) {
     my @data1 = split /','/, $line; }

How should it be presented when working with a csv file?
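As a hedged illustration of the difference: the pattern `/','/` looks for the three characters quote, comma, quote, which a plain comma-separated row does not contain. The row below is made up, and splitting on a bare comma only suits simple files without quoted fields; a dedicated CSV parser would be the safer choice for anything more complex.

```perl
use strict;
use warnings;

my $line = 'AATAAT,3,0.1';    # made-up sample row

# /','/ needs the literal sequence ',' (quote, comma, quote); it is absent
# here, so the row comes back as a single field.
my @wrong = split /','/, $line;

# /,/ splits on each comma; fine for simple rows with no quoted fields.
my @right = split /,/, $line;

print scalar(@wrong), " field(s) with /','/\n";
print scalar(@right), " field(s) with /,/\n";
```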
 
OK, I had the newline as /n rather than \n.

So at least the format is correct. However, it still doesn't filter out the lines I want it to :(.

I will play around some more. Suggestions are welcome. I will keep working on the methods suggested here and get back to you.

Many thanks as ever!
 