Greg,
I already tried various block sizes. Read times only get far longer when reading single bytes, and that effect already vanishes at 256 bytes, so there's no need to go as high as you suggest. Durations vary vastly anyway, sometimes even getting longer with larger block sizes, so other factors have a greater effect.
I got 28s to 96s for the same 1GB file with VFP's FOPEN/FREAD/FCLOSE, all timings gathered after a reboot, reading the file for the first time.
Using FSO I got 105-131s, so I didn't see as large a factor as Mike. In C++ I got about 20s using ifstream, and C# gave similar times using BinaryReader.
Secondary reads go as fast as 10s, but they profit from caching mechanisms in both software and hardware. That's not interesting here, as the later deduplication use case reads every file for the first time.
You can do your own experiments, obviously. If you want to find out the influence of block size, you have to do a lot of tests without confusing cache gains with gains from block size. One thing is for sure, though: as already said, only very small block sizes have a clearly negative effect.
I did reboot for each test and waited for a cloud service login that is usually the last thing popping up when I restart, so the system is quite idle by then. That way of testing is quite time-consuming, but worth it. It's astonishing, but different programming languages' file operation implementations get different results. The record first-read times were ~16.5s from C++ ifstream and ~17.9s from C# BinaryReader, so that's still about twice as fast as VFP's FREAD; FSO is clearly the worst.
I would have expected about the same timings from any implementation, as file reading speed is mainly a hardware feature. No programming language does the low-level work of reading sectors (or the SSD equivalent); that all goes through the OS file system layer. But since I knew Mike Lewis's testing of FSO from a similar thread about importing a large CSV last year, I spent a bit of time implementing file reading with different languages and mechanisms.
I already posted VFP code using FSO vs. FOPEN/FREAD/FCLOSE, didn't I? At least for comparing two files; it can easily be reduced to reading just one file, as in the sketch below.
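Here's a minimal sketch of such a reduced FOPEN/FREAD/FCLOSE timing loop (the 8KB BLOCKSIZE matches the chunk size in the C++/C# code below, and the path is just the same placeholder; vary BLOCKSIZE, with a reboot per run, if you want to test the block size influence yourself):
Code:
#DEFINE BLOCKSIZE 8192
LOCAL lnHandle, lnTotal, lnStart
lnStart = SECONDS()
lnHandle = FOPEN("C:\large\archive.zip")  && opens read-only by default
IF lnHandle < 0
   ? "Error opening file!"
   RETURN
ENDIF
lnTotal = 0
DO WHILE !FEOF(lnHandle)
   * FREAD returns up to BLOCKSIZE bytes as a string
   lnTotal = lnTotal + LEN(FREAD(lnHandle, BLOCKSIZE))
ENDDO
FCLOSE(lnHandle)
? lnTotal, "bytes read"
? SECONDS() - lnStart, "seconds"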
Now here's a C++ implementation:
Code:
#include <iostream>
#include <fstream>
#include <chrono>

using namespace std;

const streamsize chunksize = 0x2000;

int main()
{
    streamsize readBytes = 0;
    ifstream fileStream;

    auto start = chrono::high_resolution_clock::now();
    fileStream.open("C:\\large\\archive.zip", ifstream::binary);
    // File opened successfully
    if (fileStream.is_open()) {
        // Create a buffer to hold each chunk
        char* buffer = new char[chunksize];
        // Keep reading until end of file
        while (!fileStream.eof()) {
            fileStream.read(buffer, chunksize);
            // gcount() returns the number of bytes the last read() got
            readBytes += fileStream.gcount();
        }
        delete[] buffer;
        // Close input file stream.
        fileStream.close();
    }
    else {
        cout << "Error opening file!";
    }
    auto stop = chrono::high_resolution_clock::now();
    auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
    cout << readBytes << " bytes read\n";
    cout << duration.count() / 1000.0 << " seconds\n";
    cout << "press ENTER...";
    cin.ignore(); // wait for ENTER and don't process or store it
    return 0;
}
And here's the C# BinaryReader implementation:
Code:
using System;
using System.Diagnostics;
using System.IO;

const int chunksize = 0x2000;

Stopwatch stopwatch = Stopwatch.StartNew();
BinaryReader br;
try
{
    br = new BinaryReader(new FileStream("C:\\large\\archive.zip", FileMode.Open));
}
catch (IOException e)
{
    Console.WriteLine(e.Message + "\n Cannot open file.");
    return;
}

var chunk = new byte[chunksize];
int readCount;
var readTotal = 0;
try
{
    // Read returns 0 at end of file
    while ((readCount = br.Read(chunk, 0, chunksize)) != 0)
    {
        readTotal += readCount;
    }
}
catch (IOException e)
{
    Console.WriteLine(e.Message + "\n Cannot read from file.");
    return;
}
br.Close();
stopwatch.Stop();

Console.WriteLine(readTotal);
Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000.000);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
I think C# gets to about the same timings as C++ because of JIT compilation. It has one large disadvantage: the startup time of a C# exe is longer than that of a C++ or VFP exe, obviously because of the runtimes that have to start up and the JIT compilation itself. The timing only measures from file opening to closing, not the whole EXE run. If you go as far as outsourcing the file reading to a C++ DLL or C# assembly, that spin-up time is only necessary once, though. And when you process two whole drives, the benefit of a faster file read is still helpful, unless you care more about the easier-to-maintain VFP-only solution and bite the bullet of extremely slow FSO for all larger files.
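Just to illustrate the outsourcing idea in VFP terms, here's a minimal sketch of calling such a C++ helper via DECLARE ... DLL; the DLL name readfile.dll and its export ReadFileFully are hypothetical, only there to show the wiring:
Code:
* Hypothetical helper: readfile.dll exports ReadFileFully(cFilePath)
* and returns the number of bytes read (an INTEGER covers files up to 2GB).
DECLARE INTEGER ReadFileFully IN readfile.dll STRING cFilePath
lnBytes = ReadFileFully("C:\large\archive.zip")
? lnBytes, "bytes read"
Once loaded, the DLL stays in the VFP process, so any spin-up cost is paid only once per session, not per file.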
About the performance, the essence is that both the C# and C++ implementations are still about a factor of 2 faster than VFP. I guess this is due to these implementations making use of overlapped IO and multithreading, even without you having to explicitly program that way. Since VFP's FREAD is also implemented in C++ (VFP itself is written in C++), it may just show that C++ file IO has improved since 2006, too.