25 Files @ 400K lines, Calculate Average of Each Line

Status
Not open for further replies.

jr8rdt

IS-IT--Management
Feb 9, 2006
59
US
I have 25 txt files, each with 400K lines; each file has only one column of numbers. I need to calculate the average of each line across all 25 files (e.g. the average of line 1 from files 1 to 25). I can't use a database because it would be too big, and this calculation will be repeated approximately 20 times a day.

What are my choices? What approach should I pursue?

thanks
 
OK.. what have you done so far?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Open the files and read them sequentially, using a variable to sum and total. So look into Perl's file I/O functions/operators and the basic math operators. If you want help with actual code you have to post the code you have written so far; but if this is school work, don't bother: no student posting is allowed here.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
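
Kevin's pointer can be sketched minimally as follows. This is an editor's illustration, not code from the thread: the sample file name and its contents are invented, and the script writes its own input so it can run standalone.

```perl
use strict;
use warnings;

# Sketch of the suggested approach: file I/O plus basic math operators.
# The sample file and its contents are invented so the example is standalone.
my $file = 'sample_data.txt';
open my $out, '>', $file or die "Can't write $file: $!";
print $out "$_\n" for (10, 20, 30, 40);
close $out;

# Read the file back line by line, keeping a running sum and line count.
open my $in, '<', $file or die "Can't open $file: $!";
my ($sum, $count) = (0, 0);
while (my $line = <$in>) {
    chomp $line;
    $sum   += $line;
    $count += 1;
}
close $in;

my $avg = $count ? $sum / $count : 0;
print "sum=$sum count=$count average=$avg\n";   # sum=100 count=4 average=25
unlink $file;
```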
 
I am still at the design stage. I need ideas. Oh, and in addition, the number of txt files varies: sometimes 25, sometimes 9, 15, or 2.

flat files?
temporary tables?
I need ideas

 
KevinADC,
this is not school work.

The keyword here is *each line*: I have to calculate the average of line 1 across file a, file b, file c, file d, etc.
Same thing for line 2: calculate the average of line 2 across files a, b, c, d, e, ...

And do the same for every line, up to line 400K.
 
Need a few more details; what behaviour do you want if one file has more records than the other? Should they be assumed to be padded with zero records?

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Most probably they will come out the same way (have the same number of lines). But in the rare cases where they don't, then yeah, pad with 0.

-----
Need a few more details; what behaviour do you want if one file has more records than the other? Should they be assumed to be padded with zero records?

Steve
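
To make the agreed padding rule concrete, here is a tiny editor's sketch with invented numbers: when one file runs out of lines, the missing value counts as 0 while the divisor stays at the number of files.

```perl
use strict;
use warnings;

# Invented contents of two input files; "file B" is one line short.
my @file_a = (6, 4, 10);
my @file_b = (2, 8);      # line 3 missing, treated as 0

my $n_files = 2;          # divisor stays fixed at the file count
my @avg;
for my $i (0 .. 2) {
    my $a = defined $file_a[$i] ? $file_a[$i] : 0;  # pad with zero
    my $b = defined $file_b[$i] ? $file_b[$i] : 0;
    push @avg, ($a + $b) / $n_files;
}
print "@avg\n";           # 4 6 5
```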
 
I gave you some ideas:

So look into perls file I/O functions/operators and the basic math operators.

That is as much help as I am willing to give at this point.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
my $file_feed = shift or die "System Error: Missing Arguments ";
open (INDEX, $file_feed) or die "Can't open $file_feed: $!";
@ARGV = <INDEX>;
chomp @ARGV;
close (INDEX);

Now I have the list of files.
So somehow I need to create a multidimensional array to calculate from? Am I on the right path?
 
my $file_feed = shift or die "System Error: Missing Arguments ";
open (INDEX, $file_feed) or die "Can't open $file_feed: $!";
@ARGV = <INDEX>;
chomp @ARGV;
close (INDEX);
$current_file = "Blah";
$i=0;
$j = 0;
my @a;

while(<>){

if ($current_file = $ARGV) {
print $ARGV;
$a[j] = $_;
$j++;
}
else {
$current_file = $ARGV;
$j = 0;
$i++;
}
}


print $a[0][10];


The print doesn't give the correct number.
I figure once the array is correct, the sum/average part is easy.
Can you help?
 
this is wrong (I think)

Code:
if ($current_file = $ARGV) {

the '=' symbol is for assigning values, not for checking one value against another; to compare strings such as filenames you want 'eq'.

What are the contents of $file_feed that you put into @ARGV? Is that a list of filenames?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
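
As an editor's aside, the difference is easy to demonstrate with invented values: '=' assigns, and the whole expression evaluates to the assigned value, so it is almost always true inside an if; string comparison (what a filename check needs) uses 'eq'.

```perl
use strict;
use warnings;

my $current_file = "old.txt";
my $new_file     = "new.txt";

# '=' inside a condition assigns and yields the assigned value,
# so this condition is true and $current_file gets clobbered.
my $assigned_true = ($current_file = $new_file) ? 1 : 0;

# 'eq' is the string comparison a filename check actually needs.
$current_file = "old.txt";
my $names_match = ($current_file eq $new_file) ? 1 : 0;

print "assigned_true=$assigned_true names_match=$names_match\n";
```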
 
As I suppose you'll want to write the averages into a new file, my suggestion is to open all the input files, open the destination file, then loop reading one line at a time from each input file, calculating the average and outputting it. Simple as that.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
Franco's solution also has the advantage of not loading the data into memory, so it should work with files of any size. It also makes it easier to supply the zeroes for short files.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Not really, Chris. More like
Code:
my $number_of_files = @handles;
my $number_to_be_read = $number_of_files;
while ($number_to_be_read) {
  $sum = 0;
  for my $each (@handles) {
    unless (eof($each)) {
      $sum += <$each>;
      $number_to_be_read-- if eof($each);
    }
  }
  print OUTPUT $sum / $number_of_files, "\n";
}
The above code loops forever if any of the input files is empty: such a handle is at eof from the start, so $number_to_be_read is never decremented for it. It can also be improved as far as efficiency is concerned.
It is now time for the OP to provide their choice and code, if they want to go further.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
I am stuck. How do I make the loop read one line at a time, but across all the files? More fundamentally, how do I switch between reading the files?

$count_open = 0;
$count_close = 0;
open (OUTPUT, '>average_realtime_ps.txt');

$file_input = "realtime";
open (DAT, $file_input) || ("Could not open file realtime!");

while(<>) {
$count_open++;
open (LH."$count_open", $_);
push (@handles, LH ."$count_open");
}

$i = $count_open;

do {

$sum = 0;
foreach $var_name (@handles) {
unless(eof($var_name)) {
$sum+=<$handle>;
}

}

print OUTPUT $sum/$count_open, "\n";

$i--;

} while ($i >0);


do {
close (LH."$count_open");
$count_open--;
} while ($count_open > 0);

close(DAT);
close (OUTPUT);

 
To create your list of filehandles ...
Code:
open (LIST, "list_of_files.txt");
while(<LIST>) {
  local *FILE; # this is local to this while construct
  chomp;
  open ( FILE, $_ );
  push ( @handles, *FILE );
}
$count_open = scalar( @handles ); # length() would give the string length of the count
close LIST;
(generalized: list_of_files.txt is your realtime file, right ?)

Now you can simply iterate the array @handles for your filehandles...
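
The same idea works with lexical filehandles, which avoid the glob juggling. This editor's sketch invents two small files and writes them itself so it runs standalone:

```perl
use strict;
use warnings;

# Invented sample files so the sketch is standalone.
my @names = ('part1.txt', 'part2.txt');
my @data  = ([1, 2, 3], [5, 6, 7]);
for my $i (0 .. $#names) {
    open my $out, '>', $names[$i] or die "Can't write $names[$i]: $!";
    print $out "$_\n" for @{ $data[$i] };
    close $out;
}

# One lexical filehandle per file, instead of localized globs.
my @handles;
for my $name (@names) {
    open my $fh, '<', $name or die "Can't open $name: $!";
    push @handles, $fh;
}
my $count_open = scalar @handles;
print "opened $count_open handles\n";

close $_ for @handles;
unlink @names;
```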

 
Going on after the code provided by brigmar
Code:
open (OUTPUT, '>average_realtime_ps.txt');
my $i = $count_open;
my $sum;
while ($i) {
  $sum = 0;
  for my $var_name (@handles) {
    unless (eof($var_name)) {
      $sum += <$var_name>;
      $i-- if eof($var_name);
    }
  }
  print OUTPUT $sum / $count_open, "\n";
}
for my $var_name (@handles) {
  close $var_name;
}
close OUTPUT;
As stated above, this code loops forever if any of the input files is empty, since $i is never decremented for a handle that is at eof from the start.
Also [tt]eof()[/tt] is not very efficient and is called repeatedly. If your numbers were fixed length, you could determine at [tt]open[/tt] time how much data each file holds and stop reading the ones that have been exhausted. An alternative is to close a handle and remove it from [tt]@handles[/tt] at [tt]eof[/tt], but this won't help much if the files are all (more or less) the same length.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
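
The close-and-remove alternative Franco mentions can be sketched like this. It is an editor's illustration with invented files; the divisor stays fixed at the file count, so short files are effectively zero-padded:

```perl
use strict;
use warnings;

# Invented sample files so the sketch runs standalone; b.txt is longer.
my %files = ('a.txt' => [4, 8], 'b.txt' => [2, 8, 6]);
for my $name (sort keys %files) {
    open my $out, '>', $name or die "Can't write $name: $!";
    print $out "$_\n" for @{ $files{$name} };
    close $out;
}

my @handles;
for my $name (sort keys %files) {
    open my $fh, '<', $name or die "Can't open $name: $!";
    push @handles, $fh;
}
my $n_files = scalar @handles;   # divisor stays fixed: zero-padding

my @averages;
while (@handles) {
    my $sum = 0;
    my @still_open;
    for my $fh (@handles) {
        my $line = <$fh>;
        if (defined $line) {
            chomp $line;
            $sum += $line;
        }
        if (eof $fh) { close $fh }        # exhausted: drop the handle
        else         { push @still_open, $fh }
    }
    @handles = @still_open;
    push @averages, $sum / $n_files;
}
print "@averages\n";   # 3 8 3
unlink keys %files;
```

Because an empty file is dropped on the first pass, this version also avoids the infinite loop noted above.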
 
Assuming @ARGV holds a list of filenames, how about:
Code:
my $files = @ARGV;

my @sums;
while(<>) {
   chomp;
   $sums[$.-1] += $_;
}
continue {
   close ARGV if eof;
}

print "Line $_: " . ( $sums[$_-1] / $files ) . "\n" for ( 1 .. @sums );
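
To try that approach standalone, this editor's sketch writes two invented files first. $. is the line number within the file currently being read, and close ARGV at eof resets it at each file boundary, so each line indexes correctly into @sums; a short file simply contributes nothing to later sums, which matches the zero-padding behaviour agreed earlier.

```perl
use strict;
use warnings;

# Invented sample files so the sketch runs standalone.
my %data = ('f1.txt' => [3, 9], 'f2.txt' => [7, 1]);
for my $name (sort keys %data) {
    open my $out, '>', $name or die "Can't write $name: $!";
    print $out "$_\n" for @{ $data{$name} };
    close $out;
}

@ARGV = sort keys %data;
my $n_files = @ARGV;

my @sums;
while (<>) {
    chomp;
    $sums[$. - 1] += $_;      # $. is the line number in the current file
}
continue {
    close ARGV if eof;        # resets $. at each file boundary
}

my @avg = map { $_ / $n_files } @sums;
print "Line $_: $avg[$_ - 1]\n" for (1 .. @avg);
unlink keys %data;
```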
 