Mike Lewis
Programmer
We recently had some discussions here about manipulating large text files, including files larger than 2 GB. In particular:
thread184-1811358
thread184-1810484
I happen to have a text file containing a dictionary of 194,433 English words, each on a separate line. Just for fun, I thought I would explore the various ways of processing this file in VFP, including doing some performance tests.
As a first step, I just wrote some simple code to count the lines, using six different methods. I'll paste this code in below, along with the results of the tests. Here is a summary of the methods:
Test 1
Create a cursor; use APPEND to import the lines. (For this test, we only need to import the first character of each line, as the aim is simply to count the lines.)
Test 2
Use low-level file functions to read and count the lines.
Test 3
Read the file into a memory variable; load the contents of the variable into an array.
Test 4
Read the file into a memory variable; count the number of LFs.
Test 5
Read the file into a memory variable; use MEMLINES() to count the lines. Note that, in the original file, each line ended with a LF, that is, CHR(10). But MEMLINES() only recognises a CR, that is, CHR(13), as a line ending, not a LF on its own. For that reason, I used a modified version of the file for this test, in which the LFs had all been changed to CRs (see the conversion sketch after this list). This issue does not affect any of the other methods used here.
Test 6
Use File System Object (FSO) to read and count the lines.
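One way to produce the modified file for Test 5, just as a sketch (it assumes the whole file fits comfortably in memory, which it easily does at this size):
Code:
* Make a CR-terminated copy of the word list for the MEMLINES() test.
* Assumes the whole file fits in memory (fine at 5.45 MB).
lcText = FILETOSTR("ManyWords.txt")
STRTOFILE(STRTRAN(lcText, CHR(10), CHR(13)), "WordsCR.txt")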
To make it easier to measure the results, I tripled the size of the input file by appending it to itself twice. This resulted in a file of 5.45 MB, with 583,299 lines. That is still a lot smaller than the files we discussed in the above threads, but I hope the results will still be useful.
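The tripling itself only takes a couple of lines. Again just a sketch; "Words.txt" is a placeholder name for the original single-copy file:
Code:
* Triple the word list by appending it to itself twice.
* "Words.txt" is a placeholder for the original 194,433-line file;
* the result is the ManyWords.txt used in the tests below.
lcText = FILETOSTR("Words.txt")
STRTOFILE(lcText + lcText + lcText, "ManyWords.txt")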
Here is the code I used for the tests. Although I am showing it here as a single PRG, I actually ran each test separately, after re-starting VFP each time. This was to avoid any effects of buffering or caching.
Code:
* Timing test for word list.
* Input file contains 583,299 English words, one per line, all lower case, each line terminated
* by a single line-feed, CHR(10). File is 5.45 MB.
* Test 1
* Create a cursor; append from the text file; check the record count.
lnSecs1 = SECONDS()
* For this test, we only need to import the first char. of each line.
CREATE CURSOR Words (F1 C(1))
APPEND FROM ManyWords.txt TYPE SDF
lnCount = RECCOUNT()
lnSecs2 = SECONDS()
? "Test 1 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
* Test 2
* Use low-level file functions to read and count the lines
lnSecs1 = SECONDS()
lnHandle = FOPEN("ManyWords.txt")
lnCount = 0
DO WHILE NOT FEOF(lnHandle)
   x = FGETS(lnHandle, 32)   && 32 bytes is ample for the longest word in the list
   lnCount = lnCount + 1
ENDDO
FCLOSE(lnHandle)
lnSecs2 = SECONDS()
? "Test 2 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
* Test 3
* Read file into memory variable; load the variable into an array
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcWords)
lnSecs2 = SECONDS()
? "Test 3 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
* Test 4
* Read file into a memory variable; count the LFs.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = OCCURS(CHR(10), lcWords)
lnSecs2 = SECONDS()
? "Test 4 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
* Test 5
* Read file into a memory variable; use MEMLINES() to count the lines.
* NOTE: For this test, I used a different version of the text file, in
* which all the LFs were changed to CRs. That's because MEMLINES() does
* not recognise a LF on its own as a line terminator.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("WordsCR.txt") && temp change to input filename
lnCount = MEMLINES(lcWords)
lnSecs2 = SECONDS()
? "Test 5 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
* Test 6
* Use File System Object to read and count the lines
lnSecs1 = SECONDS()
lnCount = 0
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile = loFS.OpenTextFile("ManyWords.txt", 1)   && 1 = ForReading
DO WHILE NOT loFile.AtEndOfStream
   lcLine = loFile.ReadLine()
   lnCount = lnCount + 1
ENDDO
loFile.Close()
lnSecs2 = SECONDS()
? "Test 6 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?
***********************************************************
And here are the results of the tests.
Code:
Test    Time (secs)
 1        0.218
 2        1.529
 3        0.281
 4        0.125
 5        0.063
 6        2.387
Not surprisingly, using the FSO was by far the slowest, mainly because it involved a large number of COM calls. On the other hand, it is the only one of these methods that will work with text files larger than 2 GB. Also not surprisingly, the methods that involved looping were slower than the ones that didn't.
Of course, simply counting the lines in a file isn't particularly useful. So my next step will be to adapt the various methods to import the file into a DBF. But I'll leave that for another time. (Or someone else might like to do it first.)
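If anyone wants a starting point, adapting Test 3 might look something like this. It's a rough, untested sketch; the 30-character field width is a guess at the longest word in the list:
Code:
* Sketch: import the word list into a cursor via ALINES().
* The C(30) width is a guess; a real import would want a checked field width.
CREATE CURSOR Words (Word C(30))
lcText = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcText)
FOR lnI = 1 TO lnCount
   INSERT INTO Words (Word) VALUES (laWords(lnI))
ENDFOR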
Final word: I downloaded the file from [link]. I hope to use it to generate some word games and puzzles. In case anyone is interested, I'll post any of the puzzles that I manage to get working.
Mike
__________________________________
Mike Lewis (Edinburgh, Scotland)
Visual FoxPro articles, tips and downloads