Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Maximum file size when using FGETS 3

Status
Not open for further replies.

Gary Sutherland

Programmer
Apr 18, 2012
31
0
6
GB
Large CSV files is a topic that has cropped up before, I know, but I can't find a reference to this particular problem.

I'm writing a tool import address data from the UK PAF.CSV postcodes/addresses file. This is a CSV file containing (at present) 31,827,747 individual UK postal addresses. I'm opening the file with FOPEN and reading each line with FGETS.

To get around the maximum size limit for a DBF I'm writing these to a set of ten identical tables in blocks of 3,500,000 records which is working well. The problem I'm encountering is when it reaches line 25,214,532 the FGETS generates an error that the file is too big. This is when it's already 714,531 to table 8 which should still have room for another 2,785,468 records. Remember there's still 7,000,000 capacity in tables 9 and 10.

So, this appears to be a limitation of the FOPEN/FGETS functionality in VFP.

Short of using something to split the PAF.CSV file into two files and importing them sequentially, I'm open to suggestions.

Regards
Gary
 
Ah, that probably explains it, then, Chris. The PAF.CSV file is roughly 2.5gb in size.

Thanks. I guess I'm going to have to split the CSV file and import each file sequentially.

Edit: I found a utility called Huge CSV Splitter that appears to have done the job.

Regards
Gary
 
No, you don't need to split the file, just use other code to read it, as said above.

Chriss
 
That would be neater, I agree, Chris.

I'll take a look at doing it that way,

Regards
Gary
 
For using FileSystemObject, see thread184-1811358

Mike Lewis said:
Code:
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile= loFS.OpenTextFile("c:\MyFile.txt", 1)
DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  * code here to add the lines to an array or to 
  * write them out to multiple smaller files
ENDDO 	
loFile.Close()

While you're at the single lines, you can also import them into your DBFs, just as you did with FGETS.

Chriss
 
You might also want to look at my article, http://www.hexcentral.com/articles/foxpro-large-files.htm]Working with large text files in Visual FoxPro[/url] (which contains the code that Chris quoted).

But note that the technique described is very slow. For further information on that, see FileSystemObject performance issues.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
If performance is an issue, you can also use the ReadFile API function I mentioned. It should work just as fast as VFPs FREAD/FGETS.

Chriss
 
For a 32bit process, such as what VFP9.exe is or any EXE you build with it, the OS only allows 2GB of memory accessable, so you don't even have to try. That your PC has more RAM or more memory can be stored to pagefile.sys is of no importance. The moment you read in something into a string we only talk about the addressable RAM of the process limited to 2GB.

On 64bit Windows systems 32bit processes can have 4GB process memory if they are built large memory aware, which VFP9 does not do, so even on any current Windows, which are usually 64bit on PCs/Laptops with 8,16,32GB RAM or even more, VFP processes are limited to 2GB overall of addressable RAM.

Out of curiosity I just zipped up some iso files so the ZIP file becomes larger than 2GB and if you try loading it with FILETOSTR() you encounter what I would have predicted:
Error 43: There is not enough memory to complete this operation.

And if instead you try APPEND MEMO it should be clear by the 2GB dbf limit you get
Error 1190: File c:\users\chrissm\appdata\local\temp\0000ewi100iq.tmp is too large.
(for example).

If something of the VFP language would be able to read larger than 2GB files it would be FREAD or FGETS as it would be able to read it in smaller portions, but even that is something VFP does not support. It could, because the ReadFile Windows API function I now mentioned 3 times is a 32bit function that you can declare and call and get up to (well, theoretically) 2GB blocks of files that are as large as the filesystem used allows. More practically you read in single lines - in other scnearios maybe blocks of a few MB size - one by one.

Chriss
 
Chris Miller said:
If something of the VFP language would be able to read larger than 2GB files it would be FREAD or FGETS as it would be able to read it in smaller portions, but even that is something VFP does not support. It could, because the ReadFile Windows API function I now mentioned 3 times is a 32bit function that you can declare and call and get up to (well, theoretically) 2GB blocks of files that are as large as the filesystem used allows. More practically you read in single lines - in other scnearios maybe blocks of a few MB size - one by one.

A few notes:
1) VFP internaly use "signed int" data type for low level file functions - this is a base limitation.

2) VFPA 10.1 a VFPA 10.2 has not this limitation.

3) WINAPI ReadFile() function can read data 0xFFFFFFFF length in one step only (it's possible in VFPA 10.2), but it's very slow because sending big string buffer from VFP to API is very slow (VFP allocate new buffer).
[affective buffer size for frequent calls API from VFP it's 0xFFFF bytes]

4) WINAPI ReadFile() function can read data from file greather 2GiB, because internal WINAPI structure for file position use two DWORD.





mJindrova
 
Gary et al

Please do note that the PAF ( Postcode Address File ) which you are working with, is the property of the Royal Mail here in the UK, and requires a licence to have a copy and to work with it !

Generally, one purchases a commercial package that would have such licensing by the authors and the relevant parts of the licence passed to their users.

Please do not leave yourself open to a legal challenge

From the Royal Mail


Colin
 
1) It's not the reason VFP can't read from a >2GB file, as you say in 4) the position within a file is maintained outside of VFP, VFP only has a file handle and the filesystem manages information about the file (which file, file position and likely more) internally, you just need to refer to a file handle. So it would have been poassible to make use of that in the implementation of VFPs FREAD/FGETS functions, too. That does just require basing VFPs functions on the Windows API functions that enable large file access. That has nothing to do with VFP only supporting signed integers.

2) Yes, I'm only talking of VFP9, I don't think it's a solution to replace VFP9 with VFPA10.1/10.2 to have that limit removed, if the intent is to write a tool that can be used in VFP9 applications. Solving that with VFP10(A) you'd limit that solution to VFP10(A) developers, who don't need that, as they can simply use APPEND or IMPORT with its extended abilities anyway.

3) I said you could get up to 2GB blocks, I didn't say that's the way you should operate. Getting blocks of 256 Bytes or something of that magnitude isn't much of an overhead of shared memory needing to be copied within memory twice, the bottleneck still is the hdd access, all you need is FGETS like access, all you have is ReadFile() loading blocks, but you can easily turn that into line by line processing with ALINES of read in file blocks of sizes between 256B or perhaps 8kB, depends on record size.

I'm not seeing the big impact of a little slower than VFPs FREAD processing, also when using FileSystemObject, because the main pain is the 2GB file reading limit, and to solve that with file splits like the tool "Huge CSV Splitter" would mean reading the full file in, writing it out in smaller files and then reading these in, which means the base time factor here is 3, even worse as writing a file is usually slower than reading it. You can easily beat that, and prevent the need to split the file, even when ReadFile would only be half or 33% of FREADs performance.

Chriss
 
I just compared FSO FileStream with ReadFile on a 3GB zip file, both reading in blocks of 8KB, and Readfile is taking 20% of the time FSO needs.

To be able to compare that to native VFP function I then created a ~1GB file and read this with VFPs FREAD, FSO FieStream.Read() and Windows API ReadFile and then
VFPs FREAD processing is taking about the same time as WinAPI ReadFile() and FSO takes double the time. No idea why FSO times are so inconsistent.

I would redo the tests with same starting conditions for all measurements by restarting the computer between measurements, as there's a great likelyhood a test running after any other profits from hardware and filesystem/OS caching when reading the same file multiple times. But I'm already sure about FSO being the slowest candidate and timings of ReadFile vs VFPs FREAD being on an equal level. Because I read the same files multiple times so each test profits from the caching the same amount.

The only disadvantage of WinAPI is not giving you a method to read in a file in text mode line by line, you can only read in blocks. But there are easy ways to split blockswith ALINES, even enabling to split depending on how the CSV splits lines with CRLF, CR only or LF only. You just need to think about this lines splitting a bit to not process incomplete lines.





Chriss
 
Colin is perfectly right about the PAF being subject to licensing. But given that Gary already has access to it, I think we can assume it is legitimate. As far as I know, it is not available to be freely downloaded from the internet. If it is, it would almost certainly be a pirate copy.

That said, you do occasionally find copies of the outward postcodes available for download. That file is of course very much smaller. There used to be one on GitHub, for example, but appears to have been taken down - possibly because of the licensing issue.

Interestingly, when Fox Software launched FoxPro 2.0 (the first version that used Rushmore), they used the PAF to demonstrate the speed of index searches. They said that the PAF was the largest file they could find that would fit on a hard drive. That was in 1992.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Yes, the customer for whom I'm developing the software has purchased a license for the PAF file. There is also a set of sample files (much smaller) that are freely available to download.

Any, thanks for all the replies and the interesting background. The software is finished and on-site.

Regards
Gary
 
Hi again

The company i work for has commercial applications for address management. (Simply Postcode)

for which we get a great set of tools, and an API / web service that I can use with VFP or C # depending on what section of the business needs I'm trying to cater for.

and most importantly - the PAF is not sent out on a regular basis, whereas our API is updated Daily and will have new addresses available within hours of being created
and the searches are awesome - by part of address, part of postcode, loads of others too.

Anyway - I am NOT here to sell someone's products for them, there are loads of alternatives.

What I am trying to say - and not very well - is that getting a copy of the PAF and then reading from it / searching for what your after / extracting addresses etc, is not necessarily the right way to go about things, in this case, it seems as if you are trying to reinvent the wheel !

Find out what the customer's needs are then propose a solution ! - in this case it already exists - and in quite a few different products all with different licensing and pricing models... this methodology does not just apply to the PAF

Colin

 
Colin, that's very good advice.

I would add that, if your software or licensing scheme allows, consider a vertical subset of the data file. For example, I once worked on an application that only needed the postcode and the corresponding grid refs from the PAF - not the postal address or any of the other geographic data. (This was for a process that calculated distances between places.)

I don't remember the details, but I think that, by only taking those two items, the data came well into the 2 GB limit.

This applies to any external data of course, not just the PAF.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,

Thanks... Love your website and collection of articles - I often take a look if I need a brain break from being productive ! !

All the commercial packages that I am aware off, encrypt and compress the data and perform their searches on that, and so get around data size limits, that is the course of action that I as a professional would take, but approaches like that require proper funding, research and development and rather a lot of time. and that is why where appropriate I would use a ready made solution. I am sure that we have all done so many times... A simple example being the use of Active-X controls.

Gary - Please do not take any of my comments as any type of criticism. I am just pointing out that as in the old adage, there is more than one way to skin a cat. Also I am not suggesting that skinning cats is a good idea - Other small animals are available !

Colin
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top