Opening and reading from a binary file larger than 2GB

rahreg

Programmer
Mar 2, 2025
Hi All,

Long-time VFP programmer, first-time poster on this forum. I have a program that compares two files by opening each with FOPEN() and reading a block of a certain size from each, comparing the blocks in a loop. This has worked fine for me for decades. Unfortunately, I'm now starting to run into VFP 9.0's inability to deal with files larger than 2GB this way. I've looked for some file object I could use to get around it, but so far no luck.
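For reference, the block-compare loop described above looks roughly like this (a minimal sketch with illustrative names; the low-level file functions work with signed 32-bit offsets, which is where the 2GB limit comes from):

* CompareFiles.prg - minimal sketch of the FOPEN()/FREAD() block compare
LPARAMETERS tcFile1, tcFile2
#DEFINE BLOCKSIZE 65000                  && FREAD() reads at most 65,535 bytes per call
LOCAL lnH1, lnH2, lcBlock1, lcBlock2, llSame
lnH1 = FOPEN(tcFile1)
lnH2 = FOPEN(tcFile2)
llSame = (lnH1 > 0 AND lnH2 > 0)
DO WHILE llSame AND NOT FEOF(lnH1)
   lcBlock1 = FREAD(lnH1, BLOCKSIZE)
   lcBlock2 = FREAD(lnH2, BLOCKSIZE)
   llSame = (lcBlock1 == lcBlock2)       && exact comparison of the raw bytes
ENDDO
llSame = llSame AND FEOF(lnH1) AND FEOF(lnH2)    && both files must end together
IF lnH1 > 0
   = FCLOSE(lnH1)
ENDIF
IF lnH2 > 0
   = FCLOSE(lnH2)
ENDIF
RETURN llSame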

I did see something suggesting CreateObject('Scripting.FileSystemObject'), but that was for a text file. I tried it anyway, using OpenTextFile() and ReadLine(), but it didn't work: at some point the comparison came back false even though the two files are identical.
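For anyone trying the same thing, the FileSystemObject attempt looks roughly like this (a sketch with illustrative file names). OpenTextFile() treats the file as text, so the reading is line-oriented and goes through character interpretation, which is not a reliable way to compare raw binary data:

#DEFINE ForReading 1
LOCAL loFSO, loText1, loText2, llSame
loFSO = CREATEOBJECT("Scripting.FileSystemObject")
loText1 = loFSO.OpenTextFile("c:\temp\big1.dat", ForReading)
loText2 = loFSO.OpenTextFile("c:\temp\big2.dat", ForReading)
llSame = .T.
DO WHILE llSame AND NOT (loText1.AtEndOfStream OR loText2.AtEndOfStream)
   llSame = (loText1.ReadLine() == loText2.ReadLine())   && a "line" in binary data is arbitrary
ENDDO
llSame = llSame AND loText1.AtEndOfStream AND loText2.AtEndOfStream
loText1.Close()
loText2.Close()
? llSame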

I'm guessing the answer is no, but is there any object that can be used to open a file larger than 2GB at a low level and read data from it in blocks rather than lines?

Thanks
rah
 
Tom,
I also wrote some routines for file deduplication, and I worked with file checksums: precomputed where possible, so no time is spent computing them per comparison, and also computed only for the first block of each file, going into a detailed file comparison only where the file size AND the first-block checksum both match. That makes the actual comparisons much sparser.
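As a sketch of that size-plus-first-block filter (assuming a read-write cursor curFiles with illustrative fields fullpath, fsize and blockcrc; SYS(2007, ..., 0, 1) gives a CRC32 rather than MD5, but the principle is the same):

#DEFINE FIRSTBLOCK 65000
LOCAL lnHandle, lcBlock
SELECT curFiles
SCAN
   lnHandle = FOPEN(curFiles.fullpath)
   lcBlock  = IIF(lnHandle > 0, FREAD(lnHandle, FIRSTBLOCK), "")
   IF lnHandle > 0
      = FCLOSE(lnHandle)
   ENDIF
   REPLACE blockcrc WITH SYS(2007, lcBlock, 0, 1)   && checksum of the first block only
ENDSCAN
* Only pairs that agree on BOTH fsize and blockcrc go on to a full byte-by-byte comparison.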
As background: I didn't set up that system myself and don't remember which product it was, but it was third party. I thought MS would by now have file checksums and/or hashes built into the file system, but the only thing I find is MsiGetFileHash, which as far as I understand is very specific to MS Installer MSI files and not generally usable.

It would be viable to have something like that, though it would cost background processing, just like antivirus checking, and it could also potentially cause concurrency problems when acting on DBFs (and related files).

It is truly helpful to identify only the potential matches, and the MD5 hashes I think we used (even though MD5 is now considered a weak hashing algorithm) were far more selective than file sizes alone.

The case of same-size files with different content that's most probable in VFP is DBF files. In the lifecycle of data you usually update recent records, which changes the end of the file without changing its size, since the size is essentially HEADER() + RECCOUNT()*RECSIZE(), and RECSIZE() is a constant for all records. Only inserts change the file size. So DBFs in which you update records more often than you insert them are prime candidates for same-size files with different content.
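To spell that size arithmetic out (a quick check you can run against any table you can open; the +1 is the end-of-file marker byte):

LOCAL lnExpected, laFile[1]
USE c:\data\customers.dbf SHARED                      && hypothetical table name
lnExpected = HEADER() + RECCOUNT() * RECSIZE() + 1    && header + data + EOF byte
ADIR(laFile, DBF())                                   && actual size on disk is in column 2
? lnExpected, laFile[1,2]                             && normally identical; updates move neither number
USE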

They'd still have different MD5 hashes, pretty much guaranteed. If they had the same hash, it's almost not worth comparing the actual content; but since that case is so rare, you could still implement the byte comparison to be 100% sure. A hash difference, which is what you'll most probably get even for same-size files with different content, is a guarantee that the files differ. And two files with exactly the same content will always have the same hash, by definition; that's the basis on which this approach makes most file comparisons unnecessary.
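In other words, the decision logic boils down to this (SameSize(), SameHash() and CompareBytes() are placeholders for your own size lookup, hash lookup and full block-by-block comparison):

* Placeholder routines: SameSize(), SameHash(), CompareBytes()
IF NOT SameSize(tcFile1, tcFile2) OR NOT SameHash(tcFile1, tcFile2)
   llDuplicate = .F.                               && different size or hash => guaranteed different
ELSE
   llDuplicate = CompareBytes(tcFile1, tcFile2)    && same size and hash => almost surely identical,
ENDIF                                              && byte-compare only to be 100% sure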

The only problem with that is that you would not necessarily have the hashes available for external drives you normally keep detached. It would require a mechanism that recomputes a hash every time a file changes, and even if you have a background process doing that automatically, concentrating on changed files first, every small change to a file (like one record updated in a table) requires reading the whole file to rehash it.

That means file hashes maintained by such a mechanism will lag behind the file changes, just like the Windows search index does. If you make it a side step of a backup routine to 1. back up a file and 2. determine its hash and store it in a hash database, in a separate file, or in an alternate data stream (though that's only possible on NTFS), you would have a basis for much faster sorting out of which files are definitely different even when they have the same size. And then it's your decision whether to still byte-compare files with the same size and hash before deleting one of them.
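A minimal sketch of that backup side step, assuming illustrative paths, an NTFS target volume, and a placeholder GetFileHash() standing in for whatever hash routine you use (for example one of the FLLs mentioned below):

LOCAL lcSource, lcBackup, lcHash
lcSource = "c:\data\customers.dbf"              && hypothetical paths
lcBackup = "d:\backup\customers.dbf"
COPY FILE (lcSource) TO (lcBackup)              && 1. back up the file
lcHash = GetFileHash(lcBackup)                  && 2. placeholder for your hash routine
STRTOFILE(lcHash, lcBackup + ":hash")           && store it in an NTFS alternate data stream...
STRTOFILE(lcHash, FORCEEXT(lcBackup, "md5"))    && ...and/or in a plain sidecar file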

So, after all that, I agree with Tom that having file hashes would speed up deduplication very much, even though it comes with the upfront cost of keeping the file hashes up to date.
 
md5.fll and vfpencryption71.fll have functions for calculating a file's hash.
md5.fll uses the "modern" API for file access.
vfpencryption71.fll uses the "old" 16-bit API for file access.
 
