parse files to 65000 rows pieces 1

Status
Not open for further replies.

Corwinsw

Programmer
Sep 28, 2005
117
BG
Hi guys. I had a pretty big report, which somebody had to view in Excel. But Excel doesn't support more than about 65,000 rows. My source is:

Code:
#!/usr/bin/perl -w
use strict;

my $input_file   = shift;
my $line_counter = 65000;
my $extention    = 1;
die "Input file not defined!\n"  unless $input_file;

open INPUT,  "< $input_file"
   or die "Couldn't open input file! $! \n";

&open_new_output_file;

while (<INPUT>) {
   unless ($line_counter) {
      &open_new_output_file;
   }
   print OUTPUT $_;
   $line_counter--;
}


sub open_new_output_file {
   close OUTPUT;
   open OUTPUT, "> $input_file.$extention";
   $extention++;
   $line_counter = 65000;
}

It is a pretty simple task and the script works perfectly. So I just want to ask whether there is a better or simpler way to do it, or how the code could be structured better.

Corwin
 
Normally I'm not a fan of using globals in subs, but for a script that short, I think it works better than the added complexity of passing all of the stuff down by reference.

I would, however, encourage you to close your file handles when you're done. Also, the first time you run your sub, you close OUTPUT when it's not open yet. I'm not sure if there's any explicit "is this file handle open" check, but I think if you use a scalar instead of a typeglob as a file handle, a lot of that is taken care of for you, even closing files.
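
A minimal sketch of the scalar-filehandle idea, assuming Perl 5.6 or later. The file name demo.txt and the helper is_open are made up for illustration. Checking defined fileno(...) is one way to ask whether a handle is open, and a lexical handle is closed automatically when its variable goes out of scope.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle is currently open.
sub is_open {
    my ($fh) = @_;
    return defined $fh && defined fileno($fh);
}

my $name = 'demo.txt';    # made-up file name for the demo

{
    open( my $out, '>', $name ) or die "$0: $name: $!";
    print {$out} "hello\n";
    print "open inside block: ", ( is_open($out) ? "yes" : "no" ), "\n";
}   # $out goes out of scope here, so Perl closes the file for us

# The data is on disk even though close was never called by hand.
my $size = -s $name;
print "size after block: $size bytes\n";
unlink $name;
```

An explicit close is still worth having when you want to check for write errors, since the automatic close cannot report them.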

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
I see what you mean. The globals $line_counter and OUTPUT don't make the code easy to read or follow. I'd be happier with something like
Code:
use constant LINE_COUNTER => 65000;
my $output_file = $input_file . '0'; #- watch this space

LOOP: while(1) {
  open( local *OUTPUT, '>', $file++ );
  my $c = LINE_COUNTER;
  while( $c-- ) {
    last LOOP unless defined ( local $_ = <INPUT> );
    print OUTPUT;
  }
}

Firstly note the magic $file++ - this is perl at its sly best and it does exactly what you would want it to do. Note also the localisation of *OUTPUT. This gives us an automatic close on block exit and saves a huge amount of grief.

The only slight irritant is $c but I can't find a good way to get rid of it. The tempting
Code:
  for( 1 .. LINE_COUNTER ) {
    last LOOP unless defined ( local $_ = <INPUT> );
    print OUTPUT;
  }
is horribly inefficient as it builds an anonymous array with 65000 entries before it starts executing the loop. Perl may have some optimistic delayed construction which would help on the last loop but actually make things worse on all others! I think we're stuck with a loop-counting variable.

What do you think?

f

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maur
 
fishiface said:
...is horribly inefficient as it builds an anonymous array with 65000 entries before it starts executing the loop.

I actually thought that for a long time, too, until I re-read that section of perlop earlier this week:

perlop said:
In the current implementation, no temporary array is created when the range operator is used as the expression in foreach loops, but older versions of Perl might burn a lot of memory...

- Andrew
 
is horribly inefficient as it builds an anonymous array with 65000 entries before it starts executing the loop.

Seems a bit exaggerated, no? Maybe if the number of entries was 65_000_000 it would be horribly inefficient, but 65_000 is practically an insignificant number.
 
Well, sure, in its own scale. Generating a 520k byte array when you really wanted an 8 byte counter is excessive. But since modern perl doesn't do that anymore, the point is relatively moot.

- Andrew
 
The documentation specifically says "foreach" loops. I assume that a "for" loop never had this problem, do you know if that's a correct assumption?
 
Actually, both issues came up in this recent thread219-1135351
rharsh said:
for and foreach are the same thing.
The foreach keyword is actually a synonym for the for keyword, so you can use foreach for readability or for for brevity. (Or because the Bourne shell is more familiar to you than csh, so writing for comes more naturally.)
See the docs here for more info.
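
A quick sketch to check both points, using a small made-up constant N as a stand-in for 65000. It counts iterations with each spelling of the loop and also shows that a range starting at 0 runs one time more than a range starting at 1.

```perl
use strict;
use warnings;

use constant N => 5;    # small stand-in for 65000

my ( $for_count, $foreach_count, $zero_based ) = ( 0, 0, 0 );

for     ( 1 .. N ) { $for_count++ }        # "for" over a range
foreach ( 1 .. N ) { $foreach_count++ }    # same loop, other spelling
for     ( 0 .. N ) { $zero_based++ }       # starts at 0: one extra pass

# prints "for: 5, foreach: 5, 0..N: 6"
print "for: $for_count, foreach: $foreach_count, 0..N: $zero_based\n";
```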

- Andrew
 
Typo: I was using $output_file and $file interchangeably in my earlier posts. Friday brain-fade.

The consensus seems to be
Code:
use constant LINE_COUNTER => 65000;
my $output_file = $input_file . '0';
LOOP: while(1) {
  open( local *OUTPUT, '>', $output_file++ );
  for( 1 .. LINE_COUNTER ) {
    last LOOP unless defined ( local $_ = <INPUT> );
    print OUTPUT;
  }
}

I've got one gripe with this, which is that the name actually opened is not available for error messages on open or print, because $output_file has already been advanced by the post-increment.

If you don't mind your first output file having an extension of 1 rather than 0, you might be better with:
Code:
use constant LINE_COUNTER => 65000;
my $output_file = $input_file . '0';
LOOP: while(1) {
  open( local *OUTPUT, '>', ++$output_file )
    or die "$0: $output_file: $!";
  for( 1 .. LINE_COUNTER ) {
    last LOOP unless defined ( local $_ = <INPUT> );
    print OUTPUT or die "$0: $output_file: $!";
  }
}

What do you think?

f

 
Hi guys.
OK, I couldn't really understand the meaning of:
open( local *OUTPUT, '>', ++$output_file )
So ++$output_file should increase the trailing number. Does this only happen because of the concatenation in $input_file . '0'? What would happen if $input_file already had trailing numbers? And why do we define OUTPUT as a glob, and with local?

Second:
I liked the while( $c-- ) loop. What is its problem? Is that while worse than the for?

Corwin
 
I'll build up
Code:
open( local *OUTPUT, '>', ++$output_file )
from bits.

Start with the straight open:
Code:
open( OUTPUT, ">$output_file" )

According to the documentation for open (perlfunc and perlopentut), there is an equivalent three-argument form:
Code:
open( OUTPUT, '>', $output_file )

This is slightly more efficient (we're not building a new string from '>' and $output_file) but, more importantly, it lets us play games with $output_file.

Before we get there, let's look quickly at localisation. local() gets bad press because it doesn't do what people familiar with other languages expect - my() is usually what's required. my(), however, cannot be used on bareword filehandles like OUTPUT and, as it happens, local() restricts the life of the filehandle to the enclosing block (and whatever that block calls). It's not quite lexicalisation (a la my()) but it's preferable to global filehandles. The catch is that you have to localise the entire typeglob (*OUTPUT) rather than just the filehandle.

Digression: a typeglob is how perl keeps track of all global things with the same name, so *OUTPUT refers to the scalar $OUTPUT, the array @OUTPUT, the hash %OUTPUT, the code &OUTPUT as well as the filehandle OUTPUT. Localising the typeglob localises all of these things. If you stick to the handy convention of using all uppercase for filehandles (and for nothing else) then this won't bite you as you won't have a $OUTPUT to worry about.

To explicitly declare a filehandle's typeglob as local, we'd say
Code:
while( something ) {
  local *OUTPUT;
  open( OUTPUT, ...
  ....
  last if condition; # OUTPUT closed here if condition is true
  ....
} # OUTPUT closed here otherwise

This means that, amongst other things, the filehandle OUTPUT is more protected from possible interference from the rest of your code and, as a bonus, it automatically closes as control leaves the enclosing block.
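
Here is a small sketch of that automatic close, assuming a made-up file name local_demo.txt. Nothing closes OUTPUT by hand, so if the data is on disk after the block, it was the end of the block that closed (and flushed) the handle.

```perl
use strict;
use warnings;

my $name = 'local_demo.txt';    # made-up file name for the demo

{
    # Localise the typeglob and open the bareword handle in one step.
    open( local *OUTPUT, '>', $name ) or die "$0: $name: $!";
    print OUTPUT "still in the buffer until the handle is closed\n";
}   # leaving the block restores *OUTPUT, which closes the file

# No explicit close above, yet the data has reached the file.
my $size = -s $name;
print "$name holds $size bytes\n";
unlink $name;
```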

As a shorthand for
Code:
local *OUTPUT;
open( OUTPUT, ...
you can say
Code:
open( local *OUTPUT, ...
just as you would usually condense
Code:
my $i;
foreach $i (@list) {
   ...
to
Code:
foreach my $i (@list) {
   ...

That gets our example to
Code:
open( local *OUTPUT, '>', $outputfile )

Now let's look at $output_file and that wonderful ++ operator. According to perlop,
If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has been used in only string contexts since it was set, and has a value that is not the empty string and matches the pattern /^[a-zA-Z]*[0-9]*\z/ , the increment is done as a string, preserving each character within its range, with carry:
Code:
    print ++($foo = '99');	# prints '100'
    print ++($foo = 'a0');	# prints 'a1'
    print ++($foo = 'Az');	# prints 'Ba'
    print ++($foo = 'zz');	# prints 'aaa'

undef is always treated as numeric, and in particular is changed to 0 before incrementing (so that a post-increment of an undef value will return 0 rather than undef).

The auto-decrement operator is not magical.
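
To see where the magic stops, here is a small sketch. The helper name bump and the file names are made up for illustration. Going by the pattern quoted above, a name containing a dot no longer qualifies for the string increment and falls back to an ordinary numeric one, and a carry out of the last digit moves into the letters.

```perl
use strict;
use warnings;

# Hypothetical helper: increment a copy of the string and return it.
sub bump {
    my ($name) = @_;
    no warnings 'numeric';    # non-matching strings warn when numified
    $name++;
    return $name;
}

print bump('report0'),     "\n";   # prints report1
print bump('report9'),     "\n";   # prints reporu0 (carry into the letters)
print bump('report.txt0'), "\n";   # prints 1 (the dot breaks the pattern)
```

So the trick is only safe when the whole name is letters followed by digits, and when a carry into the name is acceptable.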

Because we split off the '>' into a separate argument to open, it's now easy to apply the magical pre- or post-increment operator to $output_file:
Code:
open( local *OUTPUT, '>', ++$output_file )

So we've now got more protected filehandles, automatic filehandle closure (however we exit the loop) and automatic filename generation in one line. I think it's quite legible in the circumstances - more so if it becomes part of your normal toolkit.

HTH,

fish

 
Thanks fishiface, that was detailed and definitely useful for me. One last thing. If we have $input_file = input1, for example, and we set
Code:
$output_file = $input_file.'0'
then after ten increments we will have input20 instead of the expected input110. So I still think it is better to use $extention.
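
For what it's worth, a rough sketch of how the explicit-counter version might look with lexical filehandles. The names split_file and MAX_LINES are made up for illustration, and this is only one possible arrangement, not the script from the thread. Unlike the versions above, it writes no output file at all for an empty input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant MAX_LINES => 65000;    # lines per output piece

# Hypothetical helper: copy $input_file into numbered pieces of at most
# $max_lines lines each and return the names of the files written.
sub split_file {
    my ( $input_file, $max_lines ) = @_;

    open( my $in, '<', $input_file ) or die "$0: $input_file: $!";

    my $extension = 0;    # plain numeric suffix, no magic increment
    my $out;              # current output handle, undef before line 1
    my @written;

    while ( my $line = <$in> ) {
        # Start a new piece before line 1, max+1, 2*max+1, ...
        if ( ( $. - 1 ) % $max_lines == 0 ) {
            my $output_file = "$input_file." . ++$extension;
            # Re-opening $out implicitly closes the previous piece.
            open( $out, '>', $output_file ) or die "$0: $output_file: $!";
            push @written, $output_file;
        }
        print {$out} $line or die "$0: $input_file.$extension: $!";
    }

    # Explicit close of the last piece so a write error is reported.
    if ($out) {
        close $out or die "$0: $input_file.$extension: $!";
    }
    close $in;
    return @written;
}

# Run as a script only when a file name is given on the command line.
if (@ARGV) {
    print "$_\n" for split_file( $ARGV[0], MAX_LINES );
}
```

With an input called input1 this gives input1.1, input1.2 and so on, so trailing digits in the input name cannot get mixed up with the piece number.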

Corwin
 
Your call. I hate it when the real world gets in the way of slick code!

f

 