Data Extraction

Status
Not open for further replies.

biobrain

MIS
Jun 21, 2007
90
GB
I have data like this:

[quot]
Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be here

This is a fixed line

ABCDEF anything any data any item
AJKLFF anything any data any item
NMCDEF anything any data any item
KMLOPN anything any data any item
ABCDEF anything any data any item

This is also a fixed item

any data any data anything any data any item nything any data any item anything any data any item
[/quot]

In the example above, I want to extract the first six letters of each line that follows the sentence "This is a fixed line". Assume this sentence stays the same in all of my files.

I would like to keep extracting until a line matches the sentence "This is also a fixed item".
 
Where are you stuck? What have you tried?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Well I can write a matching statement like

if (/^This\s/)

but I do not know how to tell the script to read the lines that follow the match, since my required data starts on the next line.

 
Use a variable to flag the next line and read what you need.

[Blue]Blue[/Blue] [Dragon]

If I wasn't Blue, I would just be a Dragon...
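
A minimal sketch of the flag idea, assuming the two marker sentences from the first post. The helper name codes_between_markers and the in-memory sample are made up for illustration; in a real script you would pass your own file handle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the six-letter codes found between the two marker lines.
sub codes_between_markers {
    my ($fh) = @_;
    my @codes;
    my $in_block = 0;    # flag: true while we are between the two fixed lines

    while (my $line = <$fh>) {
        if ($line =~ /This is a fixed line/) {
            $in_block = 1;    # the wanted data starts on the following lines
            next;
        }
        if ($line =~ /This is also a fixed item/) {
            $in_block = 0;    # closing marker reached, stop collecting
            next;
        }
        push @codes, $1 if $in_block and $line =~ /^(\w{6})\b/;
    }
    return @codes;
}

# Sample input modelled on the data in the first post.
my $sample = <<'END_SAMPLE';
Any thing can be here any thing can be here

This is a fixed line

ABCDEF anything any data any item
AJKLFF anything any data any item

This is also a fixed item

any data any data anything any data any item
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for codes_between_markers($fh);    # prints ABCDEF and AJKLFF
close $fh;
```

Because the flag is only set after the start marker has been seen, nothing before it is ever tested against the six-letter pattern.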
 
Something like this might work for you:
Code:
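while (<DATA>) {
	# Start collecting once the opening marker line is seen.
	if (/This is a fixed line/) {
		my $line = "";
		# Read on until the closing marker; the defined() check ends the loop at end of file.
		while (defined $line and $line !~ /This is also a fixed item/) {
			# Print the first six word characters of each data line.
			print "$1\n" if $line =~ /^(\w{6})\b/;
			$line = <DATA>;
		}
	}
}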

__DATA__
Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be here

This is a fixed line

ABCDEF anything any data any item
AJKLFF anything any data any item
NMCDEF anything any data any item
KMLOPN anything any data any item
ABCDEF anything any data any item

This is also a fixed item

any data any data anything any data any item nything any data any item anything any data any item
 
This is where I would use the ".." operator.
In scalar context, the .. operator becomes true when its left operand is true and stays true until after its right operand is true.

So...

while( <FILEHANDLE> ) {
if ( /START/ .. /END/ ) {
print;
}
}

will print all the lines from the first one that matches the /START/ regex up to and including the one that matches the /END/ regex, then from the next /START/ match until the next /END/ match, and so on. This behaviour has earned it the name 'flip-flop' operator.

You don't want to process those 'delimiters', which changes it to...

while( <FILEHANDLE> ) {
    if ( /START/ .. /END/ and !/START/ and !/END/ ) {
        print;
    }
}

and I assume you don't want to process empty lines...

while( <FILEHANDLE> ) {
    # /^\s*$/ matches lines that are empty or contain only whitespace
    if ( /START/ .. /END/ and !/START/ and !/END/ and !/^\s*$/ ) {
        print;
    }
}

That should get you where you want to go....
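
Applied to the data in the first post, that could look roughly like the sketch below. The helper name flip_flop_codes and the in-memory sample are only for illustration, and the two marker sentences are assumed to be exactly as posted.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the six-letter codes between the two markers.
sub flip_flop_codes {
    my ($fh) = @_;
    my @codes;
    while (<$fh>) {
        # True from the start marker line through the end marker line, inclusive.
        next unless /This is a fixed line/ .. /This is also a fixed item/;
        # Both marker lines begin with the four-letter word "This" and blank
        # lines have no word characters, so none of them match this pattern.
        push @codes, $1 if /^(\w{6})\b/;
    }
    return @codes;
}

# Sample input modelled on the data in the first post.
my $sample = <<'END_SAMPLE';
Any thing can be here any thing can be here

This is a fixed line

ABCDEF anything any data any item
KMLOPN anything any data any item

This is also a fixed item

any data any data anything any data any item
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for flip_flop_codes($fh);    # prints ABCDEF and KMLOPN
close $fh;
```

Here the six-letter pattern itself filters out the delimiters and the empty lines, so the extra !/START/ and !/END/ tests are not needed for this particular data.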
 
Thanks all for your valuable posts

I have used

Code:
while (<DATA>) {
    if (/This is a fixed line/) {
        my $line = "";
        while ($line !~ /This is also a fixed item/) {
            print "$1\n" if $line =~ /^(\w{6})\b/;
            $line = <DATA>;
        }
    }
}

__DATA__
Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be hereAny thing can be here any thing can be here Any thing can be here any thing can be here

This is a fixed line

ABCDEF anything any data any item
AJKLFF anything any data any item
NMCDEF anything any data any item
KMLOPN anything any data any item
ABCDEF anything any data any item

This is also a fixed item

any data any data anything any data any item nything any data any item anything any data any item

It really works for me.
 
Code:
use strict;

my $filename;

# allow for commandline option
if ($ARGV[0]) {
   $filename = $ARGV[0];
}
   
else {
   print "Please type the file name you want to reformat\n";
   #user input for file name
   $filename= <STDIN>;
}
open(FILE, $filename) or
     die "Cannot open file $filename\n"; 

open (OUTPUT, ">data");

while (<FILE>) {
    if (/Sequences producing significant alignments/) {
        my $line = "";
        while ($line !~ /\>/) {
            print (OUTPUT "$1\n") if $line =~ /^(\w{5})\b/;
               }
    }
}

exit;


I do not know what is wrong with this code: it reads only about 160 lines of my file and does not collect any data after that.

According to the code above, though, I have set > as the end point for my file.
 
At first glance, I'm surprised it's working. Within the while loop you're not reading any more info from your file. Near the bottom of the loop you need [blue]$line = <FILE>;[/blue]
 
Yes, I forgot to write that.

With $line = <FILE>; added, the script runs, but it still does not read more than about 160 lines.
 
I have noticed that it does not read the last line shown below:

2PTOA 360 XRAY 2.30 0.260 0.303 no Mitogen-activated protein kin... 185 5e-47
2OJJA 380 XRAY 2.40 0.215 0.268 no Mitogen-activated protein kin... 185 5e-47
2OJIA 380 XRAY 2.60 0.239 0.267 no Mitogen-activated protein kin... 185 5e-47
2OJGA 380 XRAY 2.00 0.233 0.260 no Mitogen-activated protein kin... 185 5e-47
1GOLA 364 XRAY 2.80 NA NA no EXTRACELLULAR REGULATED KINASE 2 <U... 185 5e-47
2E14A 368 XRAY 3.00 0.260 0.281 no Mitogen-activated protein kin... 185 5e-47
1PMEA 380 XRAY 2.00 0.212 0.273 no ERK2 <UNP MK01_HUMAN> [HOMO S... 184 7e-47


So it reads all the records up to that point and then stops, because the > symbol also occurs within my own data (for example in <UNP MK01_HUMAN>). That is why it stopped reading.
 
Did you get the problem fixed?

If not, post the code you're using and exactly what you want for output.
 
Here is my code

Code:
use strict;

my $filename;

# allow for commandline option
if ($ARGV[0]) {
   $filename = $ARGV[0];
}
   
else {
   print "Please type the file name you want to reformat\n";
   #user input for file name
   $filename= <STDIN>;
}
open(FILE, $filename) or
     die "Cannot open file $filename\n"; 

open (OUTPUT, ">mylist");

while (<FILE>) {
    if (/Sequences producing significant alignments/) {
        my $line = "";
        while ($line !~ /\>/) {
            print (OUTPUT "$1\n") if $line =~ /^(\w{5})/;
            $line = <FILE>;
        }
    }
}

system ("cp -u mylist  ../results");

exit;


Now, in this code, at the line

Code:
 while ($line !~ /\>/) {
I want to replace > with something else, because > is also found elsewhere in the data.

So I have to replace it with some other matching statement.
 
From your sample code, it looks like you start parsing after the line 'Sequences producing significant alignments' - what does the final delimiter look like?
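
If the list always ends at a line that begins with >, one option is to anchor the end test to the start of the line so that a > further along a data line is ignored. A minimal sketch under that assumption; the helper name hit_ids and the last line of the sample are made up for illustration, and the real closing delimiter may well be different.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the five-character IDs listed after the header line.
# ASSUMPTION: the list ends at the first line that *starts* with '>'.
sub hit_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        next unless $line =~ /Sequences producing significant alignments/;
        # Keep reading until the end delimiter or the end of the file.
        while (defined($line = <$fh>)) {
            last if $line =~ /^>/;    # anchored: a '>' later in the line is ignored
            push @ids, $1 if $line =~ /^(\w{5})\b/;
        }
    }
    return @ids;
}

# Two data lines from the thread, followed by a made-up end-of-list line.
my $sample = <<'END_SAMPLE';
Sequences producing significant alignments

2E14A 368 XRAY 3.00 0.260 0.281 no Mitogen-activated protein kin... 185 5e-47
1PMEA 380 XRAY 2.00 0.212 0.273 no ERK2 <UNP MK01_HUMAN> [HOMO S... 184 7e-47

>hypothetical end-of-list line
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for hit_ids($fh);    # prints 2E14A and 1PMEA
close $fh;
```

The defined() test also keeps the inner loop from running forever if the end delimiter never appears.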
 