
Opening and reading from a binary file larger than 2GB

rahreg

Programmer
Mar 2, 2025
Hi All,

Long time VFP programmer, first time poster on this forum. I have a program that compares two files by opening each using FOPEN(), reading a block of a certain size from each, and comparing the blocks in a loop. This has been working fine for me for decades. Unfortunately I am starting to run into the problem of VFP 9.0 not being able to deal with files larger than 2GB this way. I've looked to see if there is some file object I could use to get around it, but so far no luck.

I did see something suggesting CreateObject('Scripting.FileSystemObject'), but that was for a text file. I tried it anyway, using OpenTextFile and ReadLine(), but it didn't work: at some point the comparison reported a difference even though the two files are identical.

I'm guessing the answer is no, but is there any object that can be used to open a file larger than 2GB at a low level and read data from it in blocks rather than lines?

Thanks
rah
 
Tom,
I also wrote some routines for file deduplication. I worked with file checksums, precomputed at best so no time is taken to compute them per file, but also with checksums of just the first block of a file, only going into a detailed file comparison where file size AND first-block checksum both match. That means far fewer comparisons.
In more detail: I didn't set up this system and don't remember what it was, but it was third party. I thought MS by now had file checksums and/or hashes as part of the file system, but the only thing I find is MsiGetFileHash, which as far as I understand is specific to MS Installer MSI files and not generally usable.

It would be viable to have something like that, though it will cost background processing, just like antivirus checking does, and could also potentially cause concurrency problems when acting on DBFs (and related files).

It is truly helpful to narrow things down to potential matches only, and the MD5 hashes I think we used (even though MD5 is now considered a weak hashing algorithm) are far more discriminating than file sizes.

The one case of same-size files with different content that's most probable for VFP is DBF files. In the lifecycle of data you usually update recent records, which changes the end of the file without changing its size, as the size is essentially HEADER() plus RECCOUNT()*RECSIZE(), and RECSIZE() is a constant value for all records. So only inserts change the file size. DBFs in which you update records more often than you insert them are therefore candidates for same-size files with different content.
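
As a quick cross-check of that size relation in VFP itself (just a sketch; "mytable" is a placeholder table name):
Code:
USE mytable
LOCAL lnExpected, laFile[1]
lnExpected = HEADER() + RECCOUNT() * RECSIZE() + 1   && header + fixed-size records + EOF byte
=ADIR(laFile, DBF())        && size of the DBF as the file system reports it
? lnExpected, laFile[1, 2]  && the two numbers should match
USE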

They'd still have different MD5 hashes, almost guaranteed. If they had the same hash, it's almost not worth comparing the actual content, but such collisions happen so rarely that you could still implement the comparison to be 100% sure. A hash difference, which is what you'll most probably get even for same-size files with different content, is a guarantee that the files differ. Two identical files will always have the same hash, by definition of hashes; that's the basis on which this approach makes most file comparisons unnecessary.

The only problem with that is that you would not necessarily have the hashes available for external drives you normally keep detached. It would require a mechanism that computes a hash every time a file changes, and even if you have a background process doing that automatically, concentrating on changed files with priority, every small change to a file (like 1 record updated in a table) requires reading the whole file to hash it.

That means file hashes maintained by such a mechanism will lag behind the file changes, just like the Windows search index does. If you make it a side process of a backup routine to 1. back up a file and 2. determine its hash and store it in a hash database, as a separate file, or in an alternate data stream (though that's only possible on NTFS), you would have a basis for much faster sorting out of which files are definitely different even at the same size. And then it's your decision whether to still compare files with the same size and hash for differences before deleting one of them.

So after all that, I agree with Tom that having file hashes would speed up deduplication very much, even though keeping the file hashes up to date is a cost you pay in advance of the deduping.
 
md5.fll and vfpencryption71.fll have functions for calculating a file's hash.
md5.fll uses the "modern" API for file access.
vfpencryption71.fll uses the "old" 16-bit API for file access.
 
There's more to do than having a good implementation of a hash algorithm if you aim for hashing all files. One thing is balancing the background processing load with foreground activities and not drowning the performance of the system.

You could of course also make it a "night shift" job and dedicate full foreground processing power to it.
Like this PowerShell script does (only run it if you don't need your system for a while):
Code:
Get-ChildItem C:\ -Recurse -File -Force -ea SilentlyContinue -ev errs |
  Get-FileHash -Algorithm MD5 |
    Out-File C:\test.txt
Taken from https://stackoverflow.com/questions/62625980/get-filehash-in-the-entire-c-drive (read the discussion there for thoughts about user permissions, for example, and about better output formats like CSV).
 
Some more thoughts (and I surely repeat myself in this, too, but for the sake of concentrating the arguments):

You might ask yourself whether it pays to go through all files, every byte of them, just to have a hash and not yet a single file comparison after all that work. Comparing only same-size files seems like less work, because hashing means reading the whole content of every file, whereas with size-based comparison files with a unique size are never read at all.

The pigeonhole principle and the ability of hashes to definitively identify non-duplicates are still helpful, though, especially when you don't need to recompute the hashes of all files for every single deduplication pass.

1. Once you have hashes, ideally store them together with the file's last modification datetime, so you can identify outdated hashes and only recompute them for changed files.
2. Once you have a file list with hashes, the way to find candidates for duplicates is by sorting on a compound of hash and size (see the sketch after this list). Same size but different hash skips a lot of unnecessary comparisons, and that's the major win you'll have.
3. Same size and same hash is not a 100% guarantee of same content, but real duplicates will always have both. And the larger the files are, the more likely it is that same hash AND same size also means duplicate.
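
To make points 1 and 2 concrete, a minimal VFP sketch; the cursor and field names (curHashes, cFile, nSize, cHash, tLastMod) are hypothetical placeholders for whatever your hashing pass produces:
Code:
* Assumed cursor layout, filled by the hashing pass:
* curHashes(cFile C(240), nSize N(12), cHash C(32), tLastMod T)
* tLastMod lets you detect outdated hashes (point 1) by comparing it with the
* file's current modification datetime before trusting the stored hash.
SELECT nSize, cHash, COUNT(*) AS nCandidates ;
   FROM curHashes ;
   GROUP BY nSize, cHash ;
   HAVING COUNT(*) > 1 ;
   ORDER BY nSize DESC ;
   INTO CURSOR curDupeGroups
* Only files landing in the same (nSize, cHash) group are worth a
* byte-by-byte comparison; everything else is skipped (point 2).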

The argument for 3 is: how would two roughly 100 MB files come to exist with exactly the same size to the byte and the same hash, but different content? It's hard to intentionally generate two files with the same hash, even just the same MD5 hash, even if you allowed yourself two files of different sizes, which obviously can't be duplicates. You may entertain yourself for a few hours trying. Just don't forget the files have to differ in content; it's obviously no problem to copy a large file and get the same hash for the copy, that's the nature of hashes.

Same hash and same small size makes for a cheap comparison that you don't need to avoid, just to be sure it's not a hash collision. But the vast majority of comparisons can be skipped because of different hashes, even among files of the same size, i.e. the sizes that fall into the same pigeonholes, and even when more than two files share a size and hash. And these pigeonholes are not just about 0-byte files or files of a few hundred bytes.

I also talked about a principle we used of hashing only the first block of a file, not whole files. That can group together files with the same first-block hash that turn out to be incomplete vs. complete downloads, or other files with the same start where the longer one is obviously the complete version of the shorter one. That also helps to identify incomplete files for which you already have the complete file. It was helpful for MSDN ISO downloads, for example.
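
A minimal sketch of that first-block idea (untested here), going through the Win32 API via DECLARE DLL so it also works past the 2GB FOPEN limit, and using VFP 9's SYS(2007) CRC32 flag as the cheap checksum; the file name and block size are placeholders:
Code:
#DEFINE GENERIC_READ    0x80000000
#DEFINE FILE_SHARE_READ 0x00000001
#DEFINE OPEN_EXISTING   3
#DEFINE FIRSTBLOCK      65536   && size of the "first block" to checksum

DECLARE INTEGER CreateFileA IN kernel32 ;
   STRING lpFileName, INTEGER dwDesiredAccess, INTEGER dwShareMode, ;
   INTEGER lpSecurityAttributes, INTEGER dwCreationDisposition, ;
   INTEGER dwFlagsAndAttributes, INTEGER hTemplateFile
DECLARE INTEGER ReadFile IN kernel32 ;
   INTEGER hFile, STRING @lpBuffer, INTEGER nBytesToRead, ;
   INTEGER @lpBytesRead, INTEGER lpOverlapped
DECLARE INTEGER CloseHandle IN kernel32 INTEGER hObject

LOCAL hFile, cBuffer, nRead, cCRC
hFile = CreateFileA("E:\backup\somefile.bin", GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, 0, 0)
IF hFile > 0
   cBuffer = REPLICATE(CHR(0), FIRSTBLOCK)   && preallocated receive buffer
   nRead = 0
   IF ReadFile(hFile, @cBuffer, FIRSTBLOCK, @nRead, 0) <> 0
      cCRC = SYS(2007, LEFT(cBuffer, nRead), 0, 1)   && 32-bit CRC of the first block (VFP 9)
      ? "First-block checksum:", cCRC
   ENDIF
   =CloseHandle(hFile)
ENDIF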
 
You might be able to speed up the VFP version by increasing the block size value. Instead of 8192, you might try sizes greater than 10 megabytes... This would mean fewer reads performed. I believe that when comparing two string values, VFP will stop comparing when a byte is different between the two strings.
 
Greg,

I already tried various block sizes. You only get far longer read times when reading single bytes; that effect already vanishes at 256 bytes. There's no need to go as high as you suggest. Durations vary vastly anyway, and can also get longer with larger block sizes, so other things have a greater effect.

I got 28s to 96s for the same 1GB file with VFP's FOPEN/FREAD/FCLOSE, and those are all timings gathered after rebooting and reading the file for the first time.
Using FSO I got 105-131s, so I didn't see as large a factor as Mike. In C++ I got about 20s using ifstream, and C# got similar times to C++ using BinaryReader.

Secondary reads go as fast as 10s, but they profit from caching mechanisms in both software and hardware. That's not interesting, as the later use for deduplication means reading every file for the first time.

You can do your own experiments, obviously. If you want to find out the influence of block size, you have to do a lot of tests to avoid confusing cache gains with gains from the block size. One thing is for sure, though: as already said, only very small block sizes have a negative effect.

I did reboot for each test and waited for a cloud service login that usually is the last thing popping up when I restart, so the system is quite idle at that point. That way of testing is quite time consuming, but worth it. It's astonishing, but different programming languages' file operation implementations get different results. The record first-read performance was ~16.5s from C++ ifstream and 17.9s from C# BinaryReader, so that's still about twice as fast as VFP's FREAD; FSO is clearly the worst.

I would have expected about the same timings from any implementation, as file reading speed is mainly a hardware feature. No programming language does the low-level work of reading sectors (or the SSD equivalent); that all goes through the OS file system layer. But since I knew Mike Lewis's testing of FSO from a similar thread about importing a large CSV last year, I spent a bit of time implementing file reading with different languages and mechanisms.

I already posted VFP code using FSO vs FOPEN/FREAD/FCLOSE, or didn't I? At least for comparing two files. It can easily be reduced to just reading one file.
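
For reference next to the C++ and C# versions below, a minimal sketch of the FOPEN/FREAD read loop (the file name is a placeholder; FOPEN of course still has the 2GB limit from the original question):
Code:
#DEFINE BLOCKSIZE 8192
LOCAL nHandle, nTotal, nStart
nStart  = SECONDS()
nTotal  = 0
nHandle = FOPEN("C:\large\archive.zip")   && read-only open; fails on files over 2GB
IF nHandle >= 0
   DO WHILE NOT FEOF(nHandle)
      nTotal = nTotal + LEN(FREAD(nHandle, BLOCKSIZE))
   ENDDO
   =FCLOSE(nHandle)
ENDIF
? nTotal, "bytes read in", SECONDS() - nStart, "seconds"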

Now here's a C++ implementation:
Code:
#include <iostream>
#include <fstream>
#include <chrono>

using namespace std;
const streamsize chunksize = 0x2000;

int main()
{
    streamsize readBytes = 0;
    ifstream fileStream;
    float timeElapsed = 0.0;

    auto start = chrono::high_resolution_clock::now();
    fileStream.open("C:\\large\\archive.zip", ifstream::binary);
    // File open a success
    if (fileStream.is_open()) {
        // Create a buffer to hold each chunk
        char* buffer = new char[chunksize];

        // Keep reading until end of file
        while (!fileStream.eof()) {
            fileStream.read(buffer, chunksize);
            // gcount() returns number of bytes read from stream.
            readBytes += fileStream.gcount();
        }
        // Close input file stream.
        fileStream.close();
        // Free the chunk buffer.
        delete[] buffer;
    }
    else { cout << "Error opening file!"; }
    auto stop = chrono::high_resolution_clock::now();
    auto duration = chrono::duration_cast<chrono::milliseconds>(stop-start);

    cout << readBytes << "Bytes read\n";
    cout << duration.count()/1000.0 << " seconds \n";
    cout << "press ENTER...";
    cin.ignore(); // wait for input (ENTER) and don't process or store it (therefore ignore)

    return 0;
}

And here's the C# binaryreader implementation:
Code:
using System.Diagnostics;

const int chunksize=0x2000;

Stopwatch stopwatch = Stopwatch.StartNew();

BinaryReader br;
try
{
    br = new BinaryReader(new FileStream("C:\\large\\archive.zip", FileMode.Open));
}
catch (IOException e)
{
    Console.WriteLine(e.Message + "\n Cannot open file.");
    return;
}

var chunk = new byte[chunksize];
int readCount;
var readTotal = 0;
try
{
    while ((readCount = br.Read(chunk, 0, chunksize)) != 0)
    {
        readTotal += readCount;
    }
}
catch (IOException e)
{
    Console.WriteLine(e.Message + "\n Cannot read from file.");
    return;
}
br.Close();

stopwatch.Stop();
Console.WriteLine(readTotal);
Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000.000);
Console.WriteLine("Press any key to continue...");
Console.ReadKey();

I think C# gets to about the same timings as C++ because of JIT compilation. It has one large disadvantage: the startup time of a C# exe is longer than that of a C++ or VFP exe, because of the runtimes that have to spin up and the JIT compilation, obviously. The timing only measures from file opening to closing, not the whole EXE. If you go as far as using a C++ DLL or C# assembly for outsourcing the file reading, that spin-up time is only needed once, though, and when you process two whole drives the benefit of a faster file read is still helpful - unless you care more about the easier-to-maintain VFP-only solution and bite the bullet of the extremely slow FSO for all larger files.

About the performance, the essence is that both the C# and C++ implementations are still about a factor of 2 faster than VFP. I guess this is due to these implementations making use of overlapped I/O and multithreading, even without you needing to explicitly program that way. Since VFP's FREAD is also implemented in C++, as VFP itself is, it may just show that C++ has improved since 2006, too.
 
you might try sizes greater than 10 megabytes... This would mean fewer reads performed.
Clearly undeniable
I believe that when comparing two string values, VFP will stop comparing when a byte is different between the two strings.
Also true, clearly.

But the fastest way to find a difference between files would then be to read bytewise, as you expect a difference already in the first few bytes. If you read blocks of 10MB, your first block comparison only starts after reading 2x10MB=20MB, and if the first byte differs you spare comparing 20MB-2 bytes in memory, fine. But you didn't spare reading 20MB-2 bytes into memory first. There is more detail that would speak for reading even unlimited block sizes, if that a) can be done in parallel, b) another parallel process compares what has been read in, and c) the reading can be cancelled, stopped immediately, once the comparing process finds a difference. The time spent reading in parallel is then not wasted. It could theoretically even pay to make multiple file pair comparisons at once, especially with SSD devices, where reading more files doesn't carry the penalty of repositioning a read head - which, by the way, is already a limitation of reading two files in parallel if they are on the same drive.
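
For illustration, a minimal sketch of the plain sequential variant (no parallel reading), where a difference found in one block at least spares reading all the blocks after it; the file names are placeholders and FOPEN keeps its 2GB limit:
Code:
#DEFINE BLOCKSIZE 8192
LOCAL nHandle1, nHandle2, cBlock1, cBlock2, llSame
nHandle1 = FOPEN("C:\files\a.bin")
nHandle2 = FOPEN("C:\files\b.bin")
llSame = (nHandle1 >= 0 AND nHandle2 >= 0)
DO WHILE llSame AND NOT FEOF(nHandle1) AND NOT FEOF(nHandle2)
   cBlock1 = FREAD(nHandle1, BLOCKSIZE)
   cBlock2 = FREAD(nHandle2, BLOCKSIZE)
   IF NOT (cBlock1 == cBlock2)   && == compares exactly and stops at the first differing byte
      llSame = .F.               && early exit: no further blocks are read
   ENDIF
ENDDO
IF llSame
   llSame = (FEOF(nHandle1) AND FEOF(nHandle2))   && both files must end together
ENDIF
IF nHandle1 >= 0
   =FCLOSE(nHandle1)
ENDIF
IF nHandle2 >= 0
   =FCLOSE(nHandle2)
ENDIF
? IIF(llSame, "Files are identical", "Files differ")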

Why 8192? I searched for sample code showing how files are read in different programming languages, and all the samples that operate block-wise (often also called chunk-wise) use sizes that are (or were) comparable to the block sizes of the storage devices. Sample code used sizes ranging from 512 to 4096 in powers of two, as that's how block sizes grew; I even aimed one power of 2 higher with 8192.

Hard drives, and even SSDs, are block storage devices at the lowest level. Blocks (or clusters, whatever the term is) grew larger in the days of FAT, as FAT had a limit on the number of clusters. That changed with FAT32, and NTFS can now address volumes up to 8PB, for which clusters would need to be 2048KB (2MB) - see https://learn.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview. A block size of 10MB has no justification from that point of view.

One thing is still open for further tests: using the block/cluster size you find by querying the formatting information of your specific drives. Since you're already programming something very individual and custom to your specific idea of deduplication, it can pay to optimize for your very specific drives, adapt the code to exactly those two external drives and their performance behaviour, and adapt it again when you go for the next generation of drives. But you will surely also take into account that one hour of your time costs more than a process that takes 8 hours to run once per generation update that happens only every 2 years, for example - even if it could be accelerated to take only 1 hour at the cost of one more hour of work.
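
A small sketch of how that formatting information can be read from VFP (untested; the drive letter is a placeholder), using the Win32 GetDiskFreeSpace call, where sectors per cluster times bytes per sector gives the cluster size:
Code:
DECLARE INTEGER GetDiskFreeSpaceA IN kernel32 ;
   STRING lpRootPathName, ;
   INTEGER @lpSectorsPerCluster, INTEGER @lpBytesPerSector, ;
   INTEGER @lpNumberOfFreeClusters, INTEGER @lpTotalNumberOfClusters

LOCAL nSectorsPerCluster, nBytesPerSector, nFreeClusters, nTotalClusters
STORE 0 TO nSectorsPerCluster, nBytesPerSector, nFreeClusters, nTotalClusters
IF GetDiskFreeSpaceA("E:\", @nSectorsPerCluster, @nBytesPerSector, ;
      @nFreeClusters, @nTotalClusters) <> 0
   ? "Cluster size:", nSectorsPerCluster * nBytesPerSector, "bytes"
ENDIF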
 