about how to handle two files simultaneously 2

Status
Not open for further replies.

Everwood

Technical User
Jul 18, 2005
78
US
Hi all,


I have two txt files to handle. One is "short_sequences" and the other
one is "long_sequences". The "short_sequences" file holds
100 short sequences (8 nucleotides long) and the "long_sequences" file holds
100 long sequences (200 nucleotides long).

For example, the first short sequence is "TTGACATA" and the first long sequence
is "GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGA
GCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCT
CACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTC
GAACCTTGGACTAACCACTGTCTGGATA".

Basically, we want to pick a random starting site and replace the substring at that site
in the long sequence with a short sequence. In this example, if the starting site
is the 5th nucleotide of the long sequence, then after replacing with "TTGACATA" the replaced
long sequence is "GAATTTGACATAAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGA
GCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCT
CACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTC
GAACCTTGGACTAACCACTGTCTGGATA".

Then I want to replace a substring of the 2nd long sequence with the 2nd short sequence, and repeat this until the last long sequence has been processed. I think
the only constraint is that the starting site must not be larger than 193. Otherwise, there are
not enough nucleotides left in the long sequence for the replacement.

Furthermore, I want to keep track of the starting replacement site for each long sequence.


My code is copied below.

use strict;
use warnings;

my (@short, @long, $offset); # the 'short' array will hold the short
#sequences while 'long' array the long sequences

open(FILE1, '<', "short_sequences.txt") || die "Can't open short_sequences.txt: $!\n";
while(<FILE1>){
chomp;
push(@short, $_);
}
close FILE1; #Close the file

open(FILE2, '<', "long_sequences.txt") || die "Can't open long_sequences.txt: $!\n";
while(<FILE2>){
chomp;
push(@long, $_);
}
close FILE2; #Close the file


# replacement
foreach my $short(@short){
foreach my $long(@long){
$offset = int(rand(length($long)%193));
substr($long,$offset,length($short),$short);
printf "%3d", $offset+1;
print "\n", $long, "\n";

}
}


But I just realized that there is a problem with the two nested
loops: each short sequence replaces part of every long sequence, not just the corresponding one.

So I seek your suggestions on how to handle the two files
simultaneously in my case.

Thank you very much and look forward to your reply!

Best Regards,
Alex

 
in a similar fashion to the following!?

Code:
#!/usr/bin/perl

@array1 = qw( A B C D E );
@array2 = qw( 1 2 3 4 5 );

for ($x=0; $x<=$#array1; $x++) {
  print "$array1[$x] | $array2[$x]\n";
}

outputs:-

A | 1
B | 2
C | 3
D | 4
E | 5



Kind Regards
Duncan
 
yes, thank you!

I also came up with similar code:

# replacement
for(my $i = 0; $i < $#short; $i++){
$offset = int(rand(length($long)%193));
printf "%3d", $offset+1;
substr($long[$i],$offset,length($short[$i]),$short[$i]);
print "\n", $long, "\n";

}


what is your suggestion?

thanks!

Alex
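
A note on the loop above, offered as a sketch rather than a definitive fix: as posted, "$i < $#short" stops one element early (since $#short is the last index, not the count), and "length($long)" and "print ... $long" refer to a scalar $long that is never declared, where $long[$i] was presumably intended. The version below uses two made-up sequences in place of the files, assumes each short sequence is no longer than its long sequence, and computes the upper bound for the random offset from the two lengths:

```perl
use strict;
use warnings;

# Made-up stand-ins for the contents of the two files.
my @short = ('TTGACATA', 'CCCCGGGG');
my @long  = ('A' x 200, 'T' x 200);

my @offsets;    # 0-based start of each replacement, kept for later use

for my $i (0 .. $#short) {
    # Largest legal 0-based start is length(long) - length(short).
    my $max_start = length($long[$i]) - length($short[$i]);
    my $offset    = int(rand($max_start + 1));

    # Four-argument substr overwrites the chosen window in place.
    substr($long[$i], $offset, length($short[$i]), $short[$i]);
    push @offsets, $offset;

    # Report the 1-based starting site, then the modified sequence.
    printf "%3d\n%s\n", $offset + 1, $long[$i];
}
```

Keeping the offsets in their own array means the starting sites are still available after the loop, for example to write them to a separate file.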
 
I'm not sure why you are using the modulus operator (%) here:

$offset = int(rand(length($long)%193));

I thought you just wanted to make sure the offset was not greater than 193. In which case you would just use that number for your random offset value:

$offset = int(rand(193));

but maybe I am not understanding something. I tried this code and it seems to work fine:

Code:
#build some data
my @letters = qw(A C G T);
my @short = ();
my @long = ();
for (0..99) {
   my ($s,$l);
   for (0..4) {
     $s .= $letters[rand @letters];
   }
   for (0..199) {
     $l .= $letters[rand @letters];
   }
   push(@short,$s);
   push(@long,$l);
}

#do the substituting
for (0..$#short){
substr($long[$_],int(rand(193)),length($short[$_]),$short[$_]);
   print "\n$long[$_]\n";
}

if the length of $short is always the same you could just use that value instead of using length()

Code:
substr($long[$_],int(rand(193)),5,$short[$_]);

but if the length is variable then you have to stick with using length()
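
To make the point about the modulus operator concrete, here is a small sketch using the lengths mentioned in the thread (200 for the long sequences, 8 for the short ones in the original post). The variable names are made up for illustration:

```perl
use strict;
use warnings;

my $long_len  = 200;   # length of each long sequence in the thread
my $short_len = 8;     # length of each short sequence in the original post

# What the original expression passes to rand(): a remainder, not a cap.
my $modulus_bound = $long_len % 193;              # 200 % 193 = 7

# What is wanted: the number of legal 0-based start positions.
my $wanted_bound  = $long_len - $short_len + 1;   # 200 - 8 + 1 = 193

printf "rand(%d) gives offsets 0..%d\n", $modulus_bound, $modulus_bound - 1;
printf "rand(%d) gives offsets 0..%d\n", $wanted_bound,  $wanted_bound - 1;
```

So with the modulus, every replacement would start within the first seven nucleotides, which is probably not what was intended.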


 
i'm very tired - so i'm not sure if this is of use or not!?

Code:
#!/usr/bin/perl

open (SHORT, "< short.txt");
chomp (@short = <SHORT>);
close SHORT;

open (LONG, "< long.txt");
chomp (@long = <LONG>);
close LONG;

open (OUT_HTML, "> output.html");
print OUT_HTML "<pre>";
open (OUT_NORM, "> output.txt");

for ($x=0; $x<=$#short; $x++) {
  $r=int(rand(length ($long[$x]) - length ($short[$x]) + 1));
  print "### $r ###\n";
  
  print "$long[$x]\n";
  
  # this section is for visual purposes only
  $output_norm = substr($long[$x], $r, length $short[$x]);
  print " " x $r;
  print "$output_norm\n";
  
  substr($long[$x], $r, length $short[$x]) = "<font color=red><b>$short[$x]</b></font>";
  print OUT_HTML "$long[$x]\n";
  
  $long[$x] =~ s/<[^>]+>//g;
  print OUT_NORM "$long[$x]\n";
}

close OUT_HTML;
close OUT_NORM;


Kind Regards
Duncan
 
To KevinADC,

Thank you very much first!

(1) Yes, you are right. I want to control the position
of starting site of replacement which must not be greater
than 193. I agree with you. $offset = int(rand(193)) will be simpler.
*******************************************************
I'm not sure why you are using the modulus operator (%) here:

$offset = int(rand(length($long)%193));

I thought you just wanted to make sure the offset was not greater than 193. In which case you would just use that number for your random offset value:

$offset = int(rand(193));
*******************************************************

(2) I guess your code:

for (0..$#short){
substr($long[$_],int(rand(193)),length($short[$_]),$short[$_]);
print "\n$long[$_]\n";
}

is equivalent to:

for(my $i = 0; $i <= $#short; $i++){
$offset = int(rand(193));
print $offset."\n";
#print length($short[$i]);
substr($long[$i],$offset,length($short[$i]),$short[$i]);
print "\n", $long[$i], "\n";

}

right?


Thank you very much!

Regards,
Alex
 
Yes - Kevin's solution is sweet!

And i've learnt something by his response - it had never occurred to me that you could do the following:-

for (0..$#short)

... and then use the default $_ variable for more than one variable inside the loop. i.e. I have always used a temporary variable. I can't believe i overlooked this. I'm always using the default variable - just not quite like this!

I'm glad i took the time to read your solution. Thank you Kevin!


Kind Regards
Duncan
 
Hi Duncdude,

I tried your code and there are two error messages:

"Global symbol "$output_norm" requires explicit package name at replace2.pl line 27.
Global symbol "$output_norm" requires explicit package name at replace2.pl line 29.
Execution of replace2.pl aborted due to compilation errors.
"

I then declared "output_norm" at the beginning of the code,
but it still doesn't work.

My code is:

#!/usr/bin/perl

use strict;
use warnings;

my (@short, @long,$x,$r, @output_norm);

open (SHORT, "< short_sequences16_1.txt");
chomp (@short = <SHORT>);
close SHORT;

open (LONG, "< long_sequences.txt");
chomp (@long = <LONG>);
close LONG;

open (OUT_HTML, "> output16_1.html");
print OUT_HTML "<pre>";
open (OUT_NORM, "> output16_1.txt");

for ($x=0; $x<=$#short; $x++) {
$r=int(rand(length ($long[$x]) - length ($short[$x]) + 1));
print "### $r ###\n";

print "$long[$x]\n";

# this section is for visual purposes only
$output_norm = substr($long[$x], $r, length $short[$x]);
print " " x $r;
print "$output_norm\n";

substr($long[$x], $r, length $short[$x]) = "<font color=red><b>$short[$x]</b></font>";
print OUT_HTML "$long[$x]\n";

$long[$x] =~ s/<[^>]+>//g;
print OUT_NORM "$long[$x]\n";
}

close OUT_HTML;
close OUT_NORM;

I want to get two output files at the same time. The first file should contain the ID of each replaced long sequence, followed by the sequence itself, like this:

>SeqName1
GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTGACATAGAGCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCTCACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTCGAACCTTGGACTAACCACTGTCTGGATA
>SeqName2
AGGATTACCCGCTGGACTTCAAACGCTCGTGAAGCATCGTATTGCGAGGCAACCGAGTCATAGCCCAGTCCGGGGGCCATCGCCATCCCAGCATCTGCGTTGTTCATCGGTCCTCAGTCTCCCATCAACGTGGTCCACACCTAGCATCCTGGTTTTGCATCCGTAACAAAGGACGTTCGAAGTTTTTTGCCGGCGGGAAG


The other file should include the starting site of each replacement (represented by "$r" in your code), followed by each sequence. The replaced string can be highlighted in red as you suggested.

Thanks a lot!

 
i think it is because you have declared it with my @ rather than my $ - i.e. as an array rather than a scalar


Kind Regards
Duncan
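
A tiny sketch of why the declaration matters under "use strict": in Perl, @output_norm and $output_norm are two unrelated variables, so declaring the array does not declare the scalar. The values here are made up for illustration:

```perl
use strict;
use warnings;

my @output_norm = ('array', 'variable');   # declares only the array
my $output_norm = 'scalar variable';       # a separate scalar that shares the name

print "$output_norm\n";       # the scalar
print "$output_norm[0]\n";    # element 0 of the array @output_norm
```

Without the second declaration, any use of plain $output_norm fails to compile with the "requires explicit package name" error quoted above.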
 
Hi Duncan,

Thanks for pointing out the mistake. It works now.
But it seems that the format of the output file is not
ideal.

A quick question about the code you suggested.

"for ($x=0; $x<=$#short; $x++) {
$r=int(rand(length ($long[$x]) - length ($short[$x]) + 1));
print "### $r ###\n";
print "$long[$x]\n";"

I guess "print "$long[$x]\n";" here only prints the original
long sequence. Is that correct?

I am wondering why we need that.


Thanks,
 
Hi Duncan,

How can I modify the code

"print OUT_NORM ">SeqName$x","\n", $long[$x],"\n";"

to print the results in the OUT_NORM file like:

>SeqName0
GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGAGCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCTCACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTCGAACTTCATAATTTGACTGACTGGATA
>SeqName1
AGGATTACCCGCTGGACTTCAAACGCTCGTGAAGCATCGTATTGCGAGGCAATTTACCATTTGTCTGGCCGGGGGCCAACGCAGTGCCAGCATCTGCGTTGTTCATCGGTCCTCAGTCTCCCATCAACGTGGTCCACACCTAGCATCCTGGTTTTGCATCCGTAACAAAGGACGTTCGAAGTTTTTTGCCGGCGGGAAG

thanks!
 
you don't need either:-

print "### $r ###\n";
print "$long[$x]\n";

i'm sorry - i have binned my input files for this but:-

does your print OUT_NORM ">SeqName$x","\n", $long[$x],"\n"; not work? looks o.k. to me!?

this is slightly simplified:-

print OUT_NORM ">SeqName$x\n$long[$x]\n

... i'm trying to figure this out blind so please let me know what is happening


Kind Regards
Duncan
 
First of all,

"print OUT_NORM ">SeqName$x\n$long[$x]\n" prints the results
like:

>SeqName0GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGAGCCACAGCAATTCATAATTTGACTGAGACAGGGGCAACTAGAGGCCTCAGAATTCTCACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTCGAACCTTGGACTAACCACTGTCTGGATA
>SeqName1AGGATTACCCGCTGGACTTCAAACGCTCGTGAAGCATCGTATTGCGAGGCAACCGAGTCATAGCCCAGTCCGGGGGCCAACGCAGTGCCAGCATCTGCGTTGTTCATCGGTCCTTTTACCATTTGTCTGGGGTCCACACCTAGCATCCTGGTTTTGCATCCGTAACAAAGGACGTTCGAAGTTTTTTGCCGGCGGGAAG
>SeqName2TGATAATTGGTGCAATATTCTCCATAACAGATCCTCGCCAATACGGATTTGAGGGATCCCTCTGCATTTCCACGAAGCGTGTCACCGATAGAGCAGAAATGCTTTACCGCCGCAGTGATTAGGCGGGTACAGTTGTCCAAACGCACACAACCGAAACCTCCCCATGCGTACTCGTTTTTATAATTTGACTGAAGGGAAC

I want the sequence to follow the ID on its own line, like:
>SeqName1
ATTTTACCCCCC........

(2)print OUT_NORM ">SeqName$x\n$long[$x]\n
should be changed to
print OUT_NORM ">SeqName$x\n$long[$x]\n";

otherwise, error message appears.......

(3)my code
print OUT_NORM ">SeqName$x","\n", $long[$x],"\n";

does work for some of the sequences, leading to the desired
format like:
>SeqName1
ATTCGGG

but I still get some output like yours:
>SeqName5ATCTGTCT

Thanks,

 
i'm very confused? we are printing to the terminal... correct? i.e. not HTML. sorry if i am taking a little while to get back up to speed...


Kind Regards
Duncan
 
oh, no.

I thought we were printing to a txt file.

See "open (OUT_NORM, "> output16_1.txt");"

right?

 
can you check that the text file is in UNIX format & not DOS


Kind Regards
Duncan
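
One way a DOS/UNIX mismatch can creep in here (only one possible cause, not a diagnosis): if the input files were saved in DOS format, each line ends in "\r\n", and on a UNIX system chomp removes only the "\n", leaving a stray "\r" at the end of every sequence. A minimal sketch of stripping either kind of line ending while reading; strip_eol is a made-up helper name and the two sequences are made-up data standing in for the file contents:

```perl
use strict;
use warnings;

# Hypothetical helper: strip a trailing LF or CRLF, whichever is present.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Simulated lines as they would be read from a DOS-format file.
my @raw  = ("TTGACATA\r\n", "CCGGTTAA\r\n");
my @long = map { strip_eol($_) } @raw;

# Header and sequence now end in a plain "\n" each.
print ">SeqName$_\n$long[$_]\n" for 0 .. $#long;
```

Note that the output written this way has UNIX line endings, so it may still look like one long line in a Windows editor that expects "\r\n".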
 
Hi Duncan,

Interesting, and great! It does look right on a Linux
server. I appreciate it!

But looking back at your code, it prints
"$r", which is the starting site of the replacement,
as well as a long sequence, and I am not sure
whether that is the original long sequence or the replaced
sequence.

Besides that, the code also prints a short
sequence following the long sequence, like:

### 54 ###
GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGAGCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCTCACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTCGAACCTTGGACTAACCACTGTCTGGATA
GAGCCACAGCAACTGTA

I have no idea what "GAGCCACAGCAACTGTA" is or where it
comes from.

BTW, I want to write the starting site of each replacement and
the replaced sequences into a file instead of printing them
on the terminal screen. Can you give some suggestions?

Thanks,
Alex
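
A possible way to get both output files in one pass, sketched under some assumptions: the two sequences are made-up stand-ins for the input files, the output file names are hypothetical, and the layout of the second file (ID and 1-based starting site on one line, sequence on the next) is just one reading of the request above:

```perl
use strict;
use warnings;

# Made-up stand-ins for the contents of the two input files.
my @short = ('TTGACATA', 'CCCCGGGG');
my @long  = ('A' x 200, 'T' x 200);

# Hypothetical output file names; adjust as needed.
my $fasta_file = 'replaced_sequences.txt';
my $sites_file = 'replacement_sites.txt';

open(my $fasta_fh, '>', $fasta_file) or die "Can't write $fasta_file: $!\n";
open(my $sites_fh, '>', $sites_file) or die "Can't write $sites_file: $!\n";

my @sites;    # 1-based starting site of each replacement

for my $i (0 .. $#short) {
    my $len    = length $short[$i];
    my $offset = int(rand(length($long[$i]) - $len + 1));
    substr($long[$i], $offset, $len, $short[$i]);
    push @sites, $offset + 1;

    my $id = 'SeqName' . ($i + 1);
    print {$fasta_fh} ">$id\n$long[$i]\n";               # ID line, then sequence
    print {$sites_fh} "$id\t$sites[$i]\n$long[$i]\n";    # ID and site, then sequence
}

close $fasta_fh or die "Can't close $fasta_file: $!\n";
close $sites_fh or die "Can't close $sites_file: $!\n";
```

Nothing is printed to the terminal here; both the sequences and the starting sites end up only in the two files.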
 
Hi Everwood

the strings are from some sample data i created a while ago

$r is a random number between 0 and the entire string length minus the replacement string length - worked out from the data rather than being explicit about the 'ceiling' of the random choice


Kind Regards
Duncan
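
A small worked sketch of what the "visual" section of the posted code appears to do, using the first example from the start of the thread. Judging from the code, the short line printed beneath the long sequence is the stretch of the original sequence that is about to be overwritten, indented by $r spaces. Here $r is fixed at 4 for illustration (the posted code picks it with rand()), only the start of the first long sequence is used, and $max_r and $original_window are made-up names:

```perl
use strict;
use warnings;

my $long  = 'GAATCATATATTAGTCTCCACATACTCC';   # start of the first long sequence
my $short = 'TTGACATA';
my $r     = 4;    # fixed here for illustration only

# Highest legal 0-based start: the short sequence must still fit.
my $max_r = length($long) - length($short);

# The stretch of the original sequence that is about to be overwritten.
my $original_window = substr($long, $r, length $short);

print "$long\n";
print ' ' x $r, "$original_window\n";   # shown underneath its position

# Four-argument substr performs the replacement in place.
substr($long, $r, length $short, $short);
print "$long\n";
```

With $r = 4 the replacement starts at the 5th nucleotide, which reproduces the beginning of the replaced sequence given in the first post.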
 