How to get a FASTA format file in Perl?

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
US
Hi all!

I use code (see below) to generate a bunch of sequences like:

************************************************
46
8
GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTGACATAGAGCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCTCACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTCGAACCTTGGACTAACCACTGTCTGGATA
79
8
AGGATTACCCGCTGGACTTCAAACGCTCGTGAAGCATCGTATTGCGAGGCAACCGAGTCATAGCCCAGTCCGGGGGCCATCGCCATCCCAGCATCTGCGTTGTTCATCGGTCCTCAGTCTCCCATCAACGTGGTCCACACCTAGCATCCTGGTTTTGCATCCGTAACAAAGGACGTTCGAAGTTTTTTGCCGGCGGGAAG
70
8
TGATAATTGGTGCAATATTCTCCATAACAGATCCTCGCCAATACGGATTTGAGGGATCCCTCTGCATTTCTTGACTTAGTGTCACCGATAGAGCAGAAATGCTTTACCGCCGCAGTGATTAGGCGGGTACAGTTGTCCAAACGCACACAACCGAAACCTCCCCATGCGTACTCGTTCGTTTAGTCGCGTACAGAGGGAAC
...................
**************************************************

Please ignore the numbers. I stored this data
in a txt file and transferred it between my laptop and a server
using FileZilla.

When I open the file on the server, there is a "$" sign at
the end of each sequence, and the software I want to test it with fails to read it.

Actually, the software requires its input file to be in FASTA format. Is there a way to generate a FASTA format file directly?

Thank you very much for your help!

Regards,
Alex



The code is :

*********************************
#!/usr/bin/perl

use strict;
use warnings;

# @short holds the short sequences, @long the long sequences
my (@short, @long, $offset);

open(my $short_fh, '<', "short_sequences.txt")
    or die "Can't open short_sequences.txt: $!\n";
while (my $line = <$short_fh>) {
    # Strip the line ending, whether "\n" or "\r\n"; chomp followed by a
    # bare chop would also eat the last base of a line that has no "\r".
    $line =~ s/\s+\z//;
    push(@short, $line);
}
close $short_fh;

open(my $long_fh, '<', "long_sequences.txt")
    or die "Can't open long_sequences.txt: $!\n";
while (my $line = <$long_fh>) {
    $line =~ s/\s+\z//;
    push(@long, $line);
}
close $long_fh;

# Overwrite part of each long sequence with the matching short sequence
for my $i (0 .. $#short) {
    # Random start position 0..192; this appears to assume 200-base long
    # sequences and 8-base short ones
    $offset = int(rand(193));
    print $offset . "\n";
    #print length($short[$i]);
    substr($long[$i], $offset, length($short[$i]), $short[$i]);
    print "\n", $long[$i], "\n";
}
********************************
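
The question of writing FASTA directly can be sketched as follows. This is a minimal sketch, not the poster's method: it assumes the usual FASTA layout of a header line starting with ">" followed by the sequence, and the fasta_record helper, the SeqName1-style names, the 60-character wrap, and the two short sample sequences are all illustrative assumptions rather than anything the target software is known to require.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one FASTA record: a '>' header line, then the sequence
# wrapped to $width characters per line (60 is a common choice).
# fasta_record is a hypothetical helper, not part of the original script.
sub fasta_record {
    my ($name, $seq, $width) = @_;
    $width ||= 60;
    my $out = ">$name\n";
    for (my $pos = 0; $pos < length $seq; $pos += $width) {
        $out .= substr($seq, $pos, $width) . "\n";
    }
    return $out;
}

# Hypothetical stand-ins for the @long array built by the script above.
my @long = ('GAATCATATATTAGTC', 'AGGATTACCCGCTGGA');

# binmode keeps Perl on Windows from turning "\n" into "\r\n" on output.
binmode STDOUT;
for my $i (0 .. $#long) {
    print fasta_record('SeqName' . ($i + 1), $long[$i], 60);
}
```

In the original script, the two print statements inside the loop would be replaced by a single print of fasta_record(...), so that the offset numbers never end up in the output file.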
 
Have you got a definition of the FASTA file format?

Mike

You cannot really appreciate Dilbert unless you've read it in the
original Klingon.


 
Is FileZilla doing any "text conversion" during transfer? I'm wondering whether the $s you see signify a non-printing character. Text conversion from *ix to DOS adds a CR character before each NL.

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
Hi fishiface,

I have no idea whether FileZilla
does something during transfer. But if
I transfer a txt file which contains
short sequences, I do not see the "$" signs.

 
Which OS do you have on the server and which on the laptop? When you say "open" the file, what tool/editor did you use? Do you have access to the od program (standard on *ix and part of the Cygwin suite on Windows)?

f




 
I am using Windows XP on my laptop
and Linux on the server.

I "open" the file on the server using
"Nano", which is almost the same as "Pico".
 
Can you run
Code:
od -xc serverfile | more
on the server copy of the file and look for \n characters to signify line endings. You should be able to see whether the previous character is a '$' - in which case something weird is going on - or something else. My guess is a \r.
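
If od is not to hand, the same check can be sketched in Perl. This is only a rough equivalent under stated assumptions: count_line_endings is a hypothetical helper, and the script expects the file name as its first argument. It reads the file in raw mode so that no layer translates the line endings before they are counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count Windows-style ("\r\n") and Unix-style (bare "\n") line endings
# in a string of raw bytes. count_line_endings is a hypothetical helper.
sub count_line_endings {
    my ($bytes) = @_;
    my $crlf = () = $bytes =~ /\r\n/g;        # "\n" preceded by "\r"
    my $lf   = () = $bytes =~ /(?<!\r)\n/g;   # "\n" not preceded by "\r"
    return (crlf => $crlf, lf => $lf);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    local $/;                 # slurp mode: read the whole file at once
    my $data = <$fh>;
    $data = '' unless defined $data;
    close $fh;
    my %n = count_line_endings($data);
    print "CRLF (Windows) line endings: $n{crlf}\n";
    print "LF (Unix) line endings:      $n{lf}\n";
}
```

A non-zero CRLF count on the server copy would point to the file having been transferred without line-ending conversion.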

f

 
I end up with:

[gaozhang@adenine motiftest]$ od -xc 3.txt | more

0000000 533e 7165 614e 656d 0d31 470a 4141 4354
> S e q N a m e 1 \r \n G A A T C
0000020 5441 5441 5441 4154 5447 5443 4343 4341
A T A T A T T A G T C T C C A C
0000040 5441 4341 4354 4743 5454 4743 4754 4341
A T A C T C C G T T C G T G A C
0000060 4343 5441 4154 4343 5443 4754 4341 5441
C C A T T A C C C T T G A C A T
0000100 4741 4741 4343 4341 4741 4143 4341 4754
A G A G C C A C A G C A A C T G
0000120 4154 4147 4354 4354 4147 4741 5454 4147
T A G A T C T C G A A G T T G A
0000140 4143 4747 4747 4143 4341 4154 4147 4747
C A G G G G C A A C T A G A G G
0000160 4343 4354 4741 4141 5454 5443 4143 5443
C C T C A G A A T T C T C A C T
0000200 5443 4754 4741 4147 4147 4147 4741 4354
C T T G A G G A G A G A A G T C
0000220 4154 4741 4341 5443 4341 4741 4154 4754
T A A G A C C T A C A G T A T G
0000240 5447 4743 4747 5454 4741 5454 5454 4754
G T C G G G T T A G T T T T T G
0000260 5454 4343 5447 4743 4141 4343 5454 4747
T T C C G T C G A A C C T T G G
0000300 4341 4154 4341 4143 5443 5447 5443 4747
A C T A A C C A C T G T C T G G
0000320 5441 0d41 3e0a 6553 4e71 6d61 3265 0a0d
A T A \r \n

..............................

Yes, there are "\r" and "\n". So what is your suggestion?

Thanks,
 
\r\n is the default line-ending sequence in the Windows universe, whereas your Linux box expects just a \n and is presumably rendering the \r as a '$'. You need a way to strip the \r characters from the file, and there are several.

One approach is to strip it during transfer. ftp has an 'ASCII' mode which does just this. (It will also strip a trailing ^Z character if present, which may also cause you a problem. It's the EOF character in Windows but Linux has no such convention.)

You may have a utility on your linux box that would do the trick. Such things are typically called dos2unix or d2u. You could use tr or sed but syntax varies between versions and there are shell escapes to consider so I'd avoid that. If you haven't a suitable tool, this perl one-liner does the trick:
Code:
perl -pi'.bak' -e 's/\r$//' myfile

man perlrun explains the rather arcane syntax.
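
For anyone who would rather not decode the switches, the one-liner can be spelled out longhand. This is a rough sketch of what -p and -i'.bak' amount to, not the exact code Perl generates: strip_cr is a hypothetical helper name, and the script takes the files to convert as its arguments, keeping a .bak copy of each.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove a carriage return that sits directly before a newline or at the
# very end of the text. /m lets $ match before every "\n", not just the last.
# strip_cr is a hypothetical helper name.
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r$//mg;
    return $text;
}

# Roughly what -i'.bak' does: keep the original as a backup,
# then write the cleaned lines back under the original name.
for my $file (@ARGV) {
    rename($file, "$file.bak") or die "Can't rename $file: $!\n";
    open(my $in,  '<:raw', "$file.bak") or die "Can't read $file.bak: $!\n";
    open(my $out, '>:raw', $file)       or die "Can't write $file: $!\n";
    while (my $line = <$in>) {          # roughly what -p does: loop and print
        print {$out} strip_cr($line);
    }
    close $in;
    close $out or die "Can't close $file: $!\n";
}
```

A carriage return in the middle of a line is left alone; only the one immediately before the line ending is removed.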

fish
(actually rebuilding the utility room and not playing computers at all;-)

 
thanks!

I used the Perl one-liner and it works.

 
no problem.

f

 
But the software still cannot open
the data file...

Maybe something else is wrong.
 
Any error messages?

You can watch a linux program's interactions with the system in minute detail using the strace command.
Code:
strace myprog myargs 2>&1 | more
This will help debug a problem with the program opening the file. If it's the contents of the file that's upsetting it, I don't know enough about the fasta format to help. od is your man for inspecting individual bytes in the file.
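
If the contents are the suspect, a quick sanity check of the file can be sketched in Perl. This is a minimal sketch under loud assumptions: it treats the file as plain nucleotide FASTA (header lines starting with ">", sequence lines containing only A, C, G, T or N), which may be stricter or looser than what the software in question actually accepts, and fasta_problems is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of problems found in FASTA text; an empty list means
# nothing suspicious was found. fasta_problems is a hypothetical helper.
sub fasta_problems {
    my ($text) = @_;
    my @problems;
    my $n = 0;
    my $seen_header = 0;
    for my $raw (split /\n/, $text) {
        my $line = $raw;
        $n++;
        if ($line =~ /\r/) {
            push @problems, "line $n: carriage return";
            $line =~ s/\r//g;
        }
        next if $line eq '';
        if ($line =~ /^>/) {
            push @problems, "line $n: empty header" if $line =~ /^>\s*$/;
            $seen_header = 1;
        }
        elsif (!$seen_header) {
            push @problems, "line $n: sequence before any '>' header";
        }
        elsif ($line !~ /^[ACGTN]+$/i) {
            # Assumption: nucleotide data only; stray numbers or ^Z land here.
            push @problems, "line $n: unexpected character in sequence";
        }
    }
    return @problems;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    local $/;                 # slurp mode
    my $data = <$fh>;
    $data = '' unless defined $data;
    close $fh;
    my @problems = fasta_problems($data);
    print @problems ? join("\n", @problems) . "\n" : "No problems found\n";
}
```

Leftover offset numbers from the generating script, for example, would show up as unexpected characters in a sequence line.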

f

 