Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Script not working correctly :( 1

Status
Not open for further replies.

Captainrave

Technical User
Nov 16, 2007
97
GB
Hi all,

I wrote a script that basically finds a "start" coordinate in File 1, goes to the correct position in file 2 (then either jumps forward 1 or 2 positions), then counts the number of positions (in groups of 3) until either TAG, TAA or TGA are located; the number of positions are recorded in a new file. The problem is, it is either stoping at random groups of 3 or completely missing TAG, TAA or TGA. I really don't understand why. Can anyone help?


File 1 -
File 2 -
Script -
 
I don't know what your script is actually doing, or what the data is with which you're working. However, I was able to run your code through Perl::Tidy and then clean it up some.

Maybe this will help you or at the very least it will help other people help you with your code.

If you explain in detail what you're trying to do in analyzing this data, I might be able to give you better advice. But that's probably asking a lot of my time:

Code:
# Polymerase slip prediction

use strict;
use warnings;

# Requirements
my $csv   = "File1.csv";
my $fasta = "File2.fasta";

# Output
open my $out_fh, '>', "new-$csv" or die $!;

# Fasta File
my $seq = do {
	open my $fh, $fasta or die $!;
	local $/;
	<$fh>;
};

open my $in_fh, $csv or die $!;
while (<$in_fh>) {
	chomp;
	my $line = $_;

	# Header Row
	if ($. == 1) {
		print $out_fh "$line,Y,Z\n";
		next;
	}

	my @line = split ',', $line;
	my $start = $line[4];

	my @results;
	for my $y (1..2) {
		my $subseq = substr $seq, $start + 1 + $y;
		my $count = 0;
		while ( $subseq =~ s/^(.{3})//g ) {
			last if (grep {$1 eq $_} qw(TAG TAA TGA));
			$count++;
		}
		print "$y FOR $start - $count\n";
		push @results, $count;
	}

	print $out_fh join(',', $line, @results), "\n";
}
close($in_fh) or die $!;
close($out_fh) or die $!;

- Miller
 
Interesting, that actually changed the result of the output! Still not quite right though and I can still not see why. Have been working all day on this. This is some rough pseudocode -

Code:
- Goto File 1 (csv file) and for each line, store the number in column 4 (lets call it X)

- Goto File 2 (FASTA text file) and count each character until X is found.
- At X, skip forward 1 and count the characters in groups of 3 until TAG, TAA or TGA are reached (call this Y)
- At X, skip forward 2 and count the characters in groups of 3 until TAG, TAA or TGA are reached (call this Z).
- Add Y and Z to new columns in File 1 at the end of the line where X was found.

- Go back to File 1, goto next line and repeat.

I just can't see why it is either terminating before TAG, TAA or TGA are reached OR is counting past them?
 
Ok, I don't yet have enough information to really help you yet, but I have some questions I can ask.

Regarding your Fasta file:

Code:
>gi|51572834|gb|CP000013.1| Borrelia garinii PBi, complete genome
ATAAATTAACTACTTCTCGATGGGAATTCCCTAAAGAAAATATAATTAAAAAAAAAATGAAAATAGGTAT
AATTTACCATAATTATATAAATCCTATCTTTTATAATGAAAACTATAAATATATTGCCTTTATCGGGATA
TTAACAGCTTATAATGAATGGATTGAAATACAGTTCAGTCCCATAAATTTTTTTACTATCCCAATAAATC
AAGATTTTGTTTCAAATGCTTATTTTAATTTAGCTTTCACTATTTACATTGTTAAATATTCAATTTTAGC
CGATACACTTACTATAAAATTCTTTATTGGAACCCAAGTCGATTTAACTCTAAGAACTAGTATATTTATA
GGAAAAACAACTAATGCATTTCTCTATCCAATCCTTCCCCTAATTACCTTAAAATTTGAAATTGATTTCA
TACCTAATAACTATAGCATTTACTATAAATTATCAACTTCTTTTAAAGAATTTATTCTTTTGGATCTAGG
AATTTCTATATTTATACCATAAAAAATTTTAAACATTTTTCTTCTTTTTGAAATATTTTAACAAAAGACT
ATCCAATTGTCTAAATATAAAAAATAATAAACTGACAAAGACTAAAAACAAATTTTTAAGCAATTTTACT
AAGGAATAAGAACTATATTTATACCTGAACTTAAAGACTTTTTAATAGCATTTTCAATATTAGAAAAACT
TCTTGTAATTGCTCCCATATTATAAGAATCTGTTATCATAATACTAGTAATATTTAAATTTTCTCTTATA
ATATTAACAATACTTTTAGACATGCTAGTTATATCTTTAGAAATTTTAGGAACATTTGCATGGCCAATCA
TGATAAATTTAGAAGTCTTGCCAAAAACAAACGGCACAAAATTATTTAACATTAAAAAGCTTTTACTATA
AGGTAAAAAAGCCAAATATTTATGTGTATCTGTGGTTATTCCCCCTAATCCTGGAAAATGCTTAATTGCC
GAAAATACTCCATGATTCTGCATGCCATCAATAAAAGCTTCTACCATAAGCCCAATATTATAAGCAGAAT
...

The first line of your file appears to be a header of some sort. Do you want to actually be searching through that line?

Also, your file appears to have line breaks every 70 characters. Do you want to strip those out before you start searching through the DNA sequence?

Finally, let's take an example of a single search. Say that you were searching from the very beginning of the Fasta file, spot ATA. What would you want the results for X and Y to be?

As you can see, there is a TGA at the beginning of line 14, so in theory your regex would currently give you a value of 13 for both X and Y as it currently stands. Is this correct?

Basically, you should limit your code to a single iteration and make sure that works before you expand it to looping through the entire file.

Just add this line to the end of your loop to end the processing early.

Code:
     last if $. == 2;

- Miller
 
You make a couple of very good points. Firstly, I must remember to delete that header before I run the script. Second, if my script is counting the characters present, to get to position X in my example, would it find anything in the line break that it could also be counting? I have just realised that it probably is identifying the correct sequences (i.e. TAG, TAA and TGA), however, it could be starting the count from the wrong X position (I didn't even think of this!).

Taking your example, assuming I started from the very beginning (ignoring the header), X would be the very beginning (A), my Y value would then be 0 (since it would skip the first A and find a match from "TAA" immediatly) and my Z value would be really long (since it would skip the first A and T; there is no TAG, TAA or TGA for quite a while after that).

Any ideas?
 
Well, starting off with the first issue. If you want the Fasta file to not include the header and not be one continuous string without line breaks, then the following will fix your code:

Code:
# Fasta File
my $seq = do {
	open my $fh, $fasta or die $!;
	<$fh>; # Skip Header
	my @lines = <$fh>;
	chomp @lines;
	join '', @lines;
};

Secondly, for your search loop, I do not believe you want to s///, but instead just want to match.

Code:
open my $in_fh, $csv or die $!;
while (<$in_fh>) {
	chomp;
	my $line = $_;

	# Header Row
	if ($. == 1) {
		print $out_fh "$line,Y,Z\n";
		next;
	}

	my @line = split ',', $line;
	my $start = $line[4];

	my @results;
	for my $y (1..2) {
		my $subseq = substr $seq, $start + 1 + $y;
		
		# Test string for 300 chars.
		print substr($subseq, 0, 300), "\n";
		
		my $count = 0;
		while ( $subseq =~ /(.{3})/g ) {
			last if (grep {$1 eq $_} qw(TAG TAA TGA));
			$count++;
		}
		print "$y FOR $start - $count\n";
		push @results, $count;
	}

	print $out_fh join(',', $line, @results), "\n";
	
	last if $. == 2;
}

The above will output the following:

Code:
$perl pred.pl
TAAAAGATAAAGCTTTTAAAAAGCCCATCTACAACACCCATTAAGCAATCCATAATCTTTTGATAACAAAGTAGAAAGAG
AACAATACAAATCTTTAGTAGCATTAGAATTACCAACTCTCTTTCTACAACAATATTTGTACCTATCTTCAAGATAATTA
CTATTAAGCAACATGTCGATAAATAACTTCGCAGGAAGGAATCTTTTACTTCTGTCAACCGTTAAAATCAATGGATTTTC
CATTAATTTTAATACTTCCCACGTTATTTGCTCAACCGAAATATATTTACCAACTTCTTT
1 FOR 1618 - 0
AAAAGATAAAGCTTTTAAAAAGCCCATCTACAACACCCATTAAGCAATCCATAATCTTTTGATAACAAAGTAGAAAGAGA
ACAATACAAATCTTTAGTAGCATTAGAATTACCAACTCTCTTTCTACAACAATATTTGTACCTATCTTCAAGATAATTAC
TATTAAGCAACATGTCGATAAATAACTTCGCAGGAAGGAATCTTTTACTTCTGTCAACCGTTAAAATCAATGGATTTTCC
ATTAATTTTAATACTTCCCACGTTATTTGCTCAACCGAAATATATTTACCAACTTCTTTA
2 FOR 1618 - 2

As you can see, the X will equal 0 because there is TAA at the very beginning of the string. However, Y will equal 2 because the 7-9th chars are TAA.

Is this what you want? I believe so.

The biggest question to me is whether you really want the 5th element of the csv values, ie. Repeat End. And whether this value will actually correspond to the place in the string that we're referencing after removing all the carriage returns. I just can't answer that question for you though.

Hope this helps.
- Miller
 
Yes, that is exactly what I want it to do. I just need it to locate the 4th element of the csv value i.e. Repeat Start in the FASTA file, then to output X and Y at the end of the row. I still don't think it is locating the correct place in the string that we are referencing though. If it is skipping the header and the line breaks, I don't see what else it could be counting?
 
After further thought, I think the problem is 100% with the Fasta file. The script must be counting characters that it shouldn't be - it should only be counting A, C, G or T and ignoring the header. Is there a way to tell it to skip out anything other than A, C, G or T? For example, I believe there are blank spaces that it is counting too.

 
All sorted now. Many thanks for your help :)

The final error, was that I completely forgot that Perl is 0-indexed, so I was directing it to the wrong column :(
 
Cool, I'm glad I asked whether you really wanted the 5th element of the csv file.

Can I ask that you share your final script, mostly because I'm curious?

Glad you got what you wanted.
- Miller
 
Script is below - I reverted back to the old format to make some of the changes. It now also outputs the sequences that it finds into a new file (makes it easier to verify them). Unfortunately, I realised there is another problem with the approach I am taking, but at least the script now works - hopefully I can modify it to suit my needs.

Code:
use strict;
use warnings;

# REQUIREMENTS
my $csv   = "File1.csv";
my $fasta = "File2.fasta";

# OUTPUT
open(SEQS,">","seqs-$csv") or die $!;
open(OUT,">","new-$csv") or die $!;
print OUT "Composition,CDS Start,CDS END,Repeat Start,Repeat End,Repeat Midpoint,Simulated CDS start,Simulated CDS end,Simulated Repeat Midpoint,% Distribution,Y,Z\n";

my @starts = ();
open(CSV,$csv) or die $!;
while(<CSV>)
	{ next if ($. == 1);
	  my $line = $_; chomp($line);
	  my @line = split(/\,/,$line);
	  push(@starts,[$line[3],\@line]);
	}
close(CSV) or die $!;

my $seq = ""; my $seq_ref = \$seq;
open(FASTA,$fasta) or die $!;
while(<FASTA>)
	{ my $line = $_; chomp($line);
	  next if ($line =~ /^\>/);
	  $seq .= $line;
	}
close(FASTA) or die $!;

for (my $x=0;$x<@starts;$x++)
	{ my $pc = sprintf("%.4f",(($x/$#starts)*100)); print "$pc% COMPLETE...\n";
	  my $start = $starts[$x][0];
	  my $plus_one; my $plus_two;
	  for (my $y=1;$y<3;$y++)
		{ my $pos = ($start-1)+$y;
		  my $subseq = substr(${$seq_ref},$pos);
		  my $final_seq = '';
		  my $count = 0;
		  while($subseq =~ s/^(.{3})//g)
			{ $final_seq .= $1;
			  last if (($1 eq "TAG") or ($1 eq "TAA") or ($1 eq "TGA"));
			  $count++;
			}
		  my $position;
		  if ($y == 2)
			{ $position = '+1';
			  print SEQS "$start,$position,$final_seq\n";
			}
		  else
			{ $position = '+2';
			  print SEQS "$start,$position,$final_seq\n";
			}
		  ($y == 2) ? ($plus_one = $count) : ($plus_two = $count);
		}
	  my @origi_line = @{$starts[$x][1]};
	  my $origi_line = join(",",@origi_line);
	  my $line = "$origi_line,$plus_one,$plus_two";
	  print OUT "$line\n";
	}

close(OUT) or die $!;
close(SEQS) or die $!;
exit 1;

I love this forum, have always have much success here :)
 
Glad I could help. Would like it more if you'd keep some of the other improvements, but truly only matters that it works.

Note, you should at least change this line:

Code:
while($subseq =~ s/^(.{3})//g)

to

Code:
while ($subseq =~ /(.{3})/g)

It will run 100 times faster and give you the same result because it isn't needlessly modifying the string.

- Miller
 
Btw, here is the final script that I would've recommended that you use. It produces the exact same results as the one that you most recently posted, but removes some of the extraneous code and is easier to follow in my opinion.

Use it or not at your discretion. Just didn't want to lose this work if you came back with further questions.

Code:
use strict;
use warnings;

# REQUIREMENTS
my $csv   = 'File1.csv';
my $fasta = 'File2.fasta';

my $seq = '';
open my $fasta_in, $fasta or die $!;
while (<$fasta_in>) {
	chomp;
	if (/[^ATCG]/) {
		die "Unknown data at line $.: $_\n" if $. > 1;
		next;
	}
	$seq .= $_;
}
close $fasta_in or die $!;

# OUTPUT
open my $seqs_out, '>', "seqs-$csv" or die $!;
open my $csv_out, '>', "new-$csv" or die $!;

my @csv = ();
open my $csv_in, $csv or die $!;
while (<$csv_in>) {
	chomp;

	# Header
	if ($. == 1) {
		print $csv_out "$_,Y,Z\n";
		next;
	}

	my @line = split ',';
	push @csv, [$line[3], $_];
}
close $csv_in or die $!;

for my $i (0..$#csv) {
	printf "%.4f%% COMPLETE...\n", ($i / $#csv * 100);
	
	my $start = $csv[$i][0];
	my $line = $csv[$i][1];
	
	my $plus_one;
	my $plus_two;
	for my $y (1..2) {
		my $subseq = substr $seq, $start - 1 + $y;
		
		my $final_seq = '';
		my $count = 0;
		while ($subseq =~ /(.{3})/g) {
			$final_seq .= $1;
			last if $1 eq 'TAG' || $1 eq 'TAA' || $1 eq 'TGA';
			$count++;
		}
		
		my $position = $y == 2 ? '+1' : '+2';
		print $seqs_out "$start,$position,$final_seq\n";

		($y == 2) ? ($plus_one = $count) : ($plus_two = $count);
	}
	
	$line = join(',', $line, $plus_one, $plus_two);
	print $csv_out "$line\n";
}

close $csv_out or die $!;
close $seqs_out or die $!;

Also, yet another way to write the inner loop would be the following:

Code:
	for my $y (1..2) {
		my $subseq = substr $seq, $start - 1 + $y;
		
		if ($subseq =~ /^((?:.{3})*?)(TAG|TAA|TGA)/) {
			my $count = length($1) / 3;
			my $final_seq = $1 . $2;

			my $position = $y == 2 ? '+1' : '+2';
			print $seqs_out "$start,$position,$final_seq\n";

			($y == 2) ? ($plus_one = $count) : ($plus_two = $count);
		} else {
			warn "No match found\n";
		}
	}

This method isn't any faster than the one I already recommended. However, it does have the added benefit that it lets you catch any case where the dna seq TAA,TAG,TGA isn't found during the search. There are no such instances with this file, but might be useful for other data.

- Miller
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top