
reading large file


ovince

Programmer
Feb 27, 2007
55
FR
hi all,

In the file that I have to process, there are about 50000 spectra. Each of them starts with '#' and has 6000 rows (with 2 columns). Something like:

# spectrum 1
100 23.456
101 23.435
...
# spectrum 2
100 22.456
101 23.435
...
...
...
# spectrum 50000
100 53.456
101 53.435
...

So I have 50000 * 6001 rows altogether in the file. I wrote a program in C that processes each of these spectra. In the loop, I call awk to extract each spectrum from this large file into a separate file, like:

awk 'NR > last {exit} NR >= first && NR <= last {print}' spectra.dat > extractedSpectra.dat

where 'first' and 'last' are loop variables. I use the 'exit' statement in awk to speed up extracting the spectrum. Extraction is fast for, let's say, the first 100 spectra and then slows down. How can I make the extraction more efficient and faster?

thanks in advance
oliver


 
csplit might do what you want. Alternatively, messing about with dd might be useful - have a look at its manual page, especially conv=block, which gets your file into fixed-length records, then use a combination of seek and/or skip. For example, say your file had fixed-length records of 100 bytes (after using conv=block).

I don't have a Unix system to hand, but from memory, to get the first 6001 records use something like

dd if=myfile of=outfile ibs=100 skip=0 count=6001

To get the next 6001 records after that you'd use

dd if=myfile of=outfile2 ibs=100 skip=6001 count=6001
 
Thanks for the reply. I would like to use dd ... it sounds like something I could use.

How do I determine isb for a block that contains one spectrum?

thanks again
oliver
 
... sorry, not isb but ibs in:

dd if=myfile of=outfile ibs=100 skip=0 count=6001

 
It seems your C program already reads your file, since it knows the first and last record numbers for the given spectrum.
So I wonder why it doesn't write extractedSpectra.dat itself while reading spectra.dat?

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
The C program calculates 'first' and 'last', and these values are used by awk to extract the particular spectrum. The problem is that awk slows down rapidly as we read out spectra with larger spectrum numbers.
 
My real question was:
why doesn't the C program extract each particular spectrum itself, thus avoiding opening (i.e. rewinding!) the input file 50000 times?
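
Something along these lines is what I mean - a rough, untested C sketch, where process_spectrum() is just a placeholder for whatever your real program does with each spectrum:

Code:
/* Read spectra.dat once, line by line. Each time a '#' header is seen,
 * hand the spectrum just collected to process_spectrum(). One pass over
 * the file, no rewinding, no awk. */
#include <stdio.h>

static void process_spectrum(const char *filename)
{
    /* placeholder: the real analysis of one spectrum would go here */
    printf("processing %s\n", filename);
}

int main(void)
{
    FILE *in = fopen("spectra.dat", "r");
    FILE *out = NULL;
    char line[256];

    if (in == NULL) { perror("spectra.dat"); return 1; }

    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == '#') {                 /* start of a new spectrum */
            if (out != NULL) {
                fclose(out);
                process_spectrum("extractedSpectra.dat");
            }
            /* reuse a single scratch file for every spectrum */
            out = fopen("extractedSpectra.dat", "w");
            if (out == NULL) { perror("extractedSpectra.dat"); return 1; }
        } else if (out != NULL) {
            fputs(line, out);                 /* data row of the current spectrum */
        }
    }
    if (out != NULL) {                        /* don't forget the last spectrum */
        fclose(out);
        process_spectrum("extractedSpectra.dat");
    }
    fclose(in);
    return 0;
}

One sequential pass over spectra.dat, and the single scratch file extractedSpectra.dat is simply reused for every spectrum.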

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
The C program is just a test program. The final program should be written in IDL. IDL has many good sides, but for, while-do and other loop statements are very slow. So the part of the IDL program that extracts spectra from the large file should be replaced. sed, grep, awk and similar programs are easy to call from IDL.
 
How do I determine isb for a block that contains one spectrum?

It's the number of records in a spectrum * the length of each record.

dd if=myfile of=outfile cbs=100 conv=block

This should give you a file with all fixed length records = 100 bytes.

Now process the outfile

dd if=outfile of=file1 ibs=600100 skip=0 count=1
(outputs your first spectrum file)

dd if=outfile of=file2 ibs=600100 skip=1 count=1
(outputs your second spectrum file)

dd if=outfile of=file3 ibs=600100 skip=2 count=1
(outputs your third spectrum file)

etc ....

This may or may not be faster than your original method, but it's worth a try.

 

Why not just use awk and forget the C program:
Code:
awk '{
if ($1 == "#" && $2 == "spectrum") fn = $3;
print $0 > ("spectra" fn ".dat");
}' spectra.dat
[3eyes]



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
LKBrwnDBA, I don't think oliver wants to create 50000 files.
Furthermore, without using the close function, the above awk program is likely to run out of file descriptors ...

taupirho, as the spectrum number isn't fixed-width, the header lines don't all have the same length, so you can't use a constant bs value.

oliver, is IDL by chance able to call a C function with static variables, or to deal with named pipes, or ...?
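
By "static variables" I mean something like this rough, untested sketch: the big file stays open between calls, so each call just copies the next spectrum into 'outname' without any rewinding. How you would bind it from IDL (CALL_EXTERNAL or similar) is a separate question.

Code:
/* Each call writes one spectrum (header + data rows) to 'outname'.
 * Returns 1 if a spectrum was written, 0 at end of file or on error.
 * The input file handle and the read-ahead header survive between
 * calls thanks to the static variables. */
#include <stdio.h>
#include <string.h>

int next_spectrum(const char *inname, const char *outname)
{
    static FILE *in = NULL;        /* kept open across calls */
    static char header[256] = "";  /* header line read ahead on the previous call */
    char line[256];
    FILE *out;

    if (in == NULL && (in = fopen(inname, "r")) == NULL)
        return 0;

    /* first call only: skip ahead to the first header line */
    while (header[0] == '\0') {
        if (fgets(line, sizeof line, in) == NULL)
            return 0;                       /* no more spectra */
        if (line[0] == '#')
            strcpy(header, line);
    }

    out = fopen(outname, "w");
    if (out == NULL)
        return 0;
    fputs(header, out);
    header[0] = '\0';

    /* copy data rows until the next header (saved for the next call) or EOF */
    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == '#') {
            strcpy(header, line);
            break;
        }
        fputs(line, out);
    }
    fclose(out);
    return 1;
}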

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
thanks for all the replies.

When I posted this question I made a simplification, but I realize now that I made a mistake. The headers for each spectrum look like:

#: spec_ 0.1103_ 0.0279
#: spec_ 0.1889_ 0.0972
#: spec_ 0.1014_ 0.0622
#: spec_ 0.1837_ 0.1168
#: spec_ 0.2295_ 0.0974

So they are of equal length.

taupirho, I tried dd. After:

dd if=myfile of=outfile cbs=100 conv=block

I got a really huge file 'outfile'. Using the second command line:

dd if=outfile of=file1 ibs=600100 skip=0 count=1

'file1' is extracted. It is in a (for me) strange format and I do not know how to use it. Is it possible to write 'file1' back into ASCII?


PHV(MIS), you are absolutely right. I do not want to make 50000 individual files with awk. Besides, the split command would do that faster than awk - I tried both. In IDL I can call C functions. I am not sure about static variables but will look in the manual or ask somebody.







 
While reading the manual, have a look at the ftell and fseek functions.
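
For example, a rough, untested sketch: one pass with ftell() to record the byte offset of every '#' header, then fseek() straight to whichever spectrum you need. The 50000 count, the file names and the spectrum index are just example values.

Code:
/* Pass 1: remember where every spectrum starts.
 * Pass 2 (any time later): jump straight to spectrum 'want' with fseek()
 * and copy it out, without re-reading the rest of the file. */
#include <stdio.h>

#define N_SPECTRA 50000

int main(void)
{
    static long offset[N_SPECTRA];
    long pos;
    int n = 0, want = 12345;              /* 0-based index of the spectrum to extract */
    char line[256];
    FILE *in = fopen("spectra.dat", "r");
    FILE *out;

    if (in == NULL) { perror("spectra.dat"); return 1; }

    /* pass 1: record the start offset of each '#' header */
    pos = ftell(in);
    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == '#' && n < N_SPECTRA)
            offset[n++] = pos;
        pos = ftell(in);
    }

    /* later: jump directly to the spectrum we want */
    out = fopen("extractedSpectra.dat", "w");
    if (out == NULL || want >= n) { fclose(in); return 1; }
    fseek(in, offset[want], SEEK_SET);
    fgets(line, sizeof line, in);         /* header line */
    fputs(line, out);
    while (fgets(line, sizeof line, in) != NULL && line[0] != '#')
        fputs(line, out);                 /* data rows up to the next header */

    fclose(out);
    fclose(in);
    return 0;
}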

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 