
Large text files revisited


Mike Lewis

Programmer
We recently had some discussions here about manipulating large text files - including files larger than 2 GB. In particular:

thread184-1811358

thread184-1810484

I happen to have a text file containing a dictionary of 194,433 English words, each on a separate line. Just for fun, I thought I would explore the various ways of processing this file in VFP, including doing some performance tests.

As a first step, I just wrote some simple code to count the lines, using six different methods. I'll paste this code in below, along with the results of the tests. Here is a summary of the methods:

Test 1

Create a cursor; use APPEND to import the lines. (For this test, we only need to import the first character of each line, as the aim is simply to count the lines.)

Test 2

Use low-level file functions to read and count the lines.

Test 3


Read the file into a memory variable; load the contents of the variable into an array.

Test 4


Read the file into a memory variable; count the number of LFs.

Test 5


Read the file into a memory variable; use MEMLINES() to count the lines. Note that, in the original file, each line ended with a LF, that is CHR(10). But MEMLINES() only recognises a CR, that is CHR(13), as a line-ending, not a LF on its own. For that reason, I used a modified version of the file for this test, in which the LFs had all been changed to CRs. This issue does not affect any of the other methods used here.

Test 6

Use File System Object (FSO) to read and count the lines.


To make it easier to measure the results, I tripled the size of the input file by appending it to itself twice. This resulted in a file of 5.45 MB, with 583,299 lines. That is still a lot smaller than the files we discussed in the above threads, but I hope the results will still be useful.
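For what it's worth, here is one way the two input files could be prepared (just a sketch, not necessarily the exact commands that were used; "Words.txt" is a stand-in name for the original downloaded file):

Code:
* Sketch only: triple the original word list, and build the CR-terminated copy for Test 5.
* "Words.txt" is a placeholder name for the downloaded 194,433-word file.
lcWords = FILETOSTR("Words.txt")
STRTOFILE(lcWords + lcWords + lcWords, "ManyWords.txt")

* MEMLINES()/MLINE() want CRs, so swap every LF for a CR in a separate copy.
STRTOFILE(STRTRAN(FILETOSTR("ManyWords.txt"), CHR(10), CHR(13)), "WordsCR.txt")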

Here is the code I used for the tests. Although I am showing it here as a single PRG, I actually ran each test separately, after re-starting VFP each time. This was to avoid any effects of buffering or caching.

Code:
* Timing test for word list.

* Input file contains 583,299 English words, one per line, all lower case, each line terminated
* by a single line-feed, CHR(10). File is 5.45 MB.

* Test 1
* Create a cursor; append from the text file; check the record count.
lnSecs1 = SECONDS()
CREATE CURSOR Words (F1 C(1))
    && For this test, we only need to import the first char. of each line.
APPEND FROM ManyWords.txt TYPE SDF 
lnCount = RECCOUNT()
lnSecs2 = SECONDS()
? "Test 1 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?

***********************************************************

* Test 2
* Use low-level file functions to read and count the lines
lnSecs1 = SECONDS()
lnHandle = FOPEN("ManyWords.txt")
lnCount = 0
DO WHILE NOT FEOF(lnHandle)
  x = FGETS(lnHandle, 32)
  lnCount = lnCount + 1 
ENDDO 
FCLOSE(lnHandle)
lnSecs2 = SECONDS()
? "Test 2 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?

***********************************************************

* Test 3
* Read file into memory variable; load the variable into an array
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcWords)
lnSecs2 = SECONDS()
?  TRANSFORM(lnCount) + " lines"
? "Test 3 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
?

***********************************************************

* Test 4
* Read file into a memory variable; count the LFs.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = OCCURS(CHR(10), lcWords)
lnSecs2 = SECONDS()
? "Test 4 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?

***********************************************************

* Test 5
* Read file into a memory variable; use MEMLINES() to count the lines.
* NOTE: For this test, I used a different version of the text file, in 
* which all the LFs were changed to CRs. That's because MEMLINES() does
* not recognise a LF on its own as a line terminator.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("WordsCR.txt")  && temp change to input filename
lnCount = MEMLINES(lcWords)
lnSecs2 = SECONDS()
? "Test 5 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?

***********************************************************

* Test 6
* Use File System Object to read and count the lines
lnSecs1 = SECONDS()
lnCount = 0
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile= loFS.OpenTextFile("ManyWords.txt", 1)
DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  lnCount = lnCount + 1
ENDDO 	
loFile.Close()
lnSecs2 = SECONDS()
? "Test 6 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."
? TRANSFORM(lnCount) + " lines"
?

***********************************************************


And here are the results of the tests.

Code:
[i]Test[/i]      [i]Time (secs)[/i]

1           0.218
2           1.529
3           0.281
4           0.125
5           0.063
6           2.387

Not surprisingly, using the FSO was by far the slowest, mainly because it involved a large number of COM calls. On the other hand, this is the only one of the methods that will work with text files larger than 2 GB. Also not surprisingly, the methods that involved looping were slower than the ones that didn't.

Of course, simply counting the lines in a file isn't particularly useful. So my next step will be to adapt the various methods to import the file into a DBF. But I'll leave that for another time. (Or someone else might like to do it first.)

Final word: I hope to use the downloaded file to generate some word games and puzzles. In case anyone is interested, I'll post any of the puzzles that I manage to get working.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,
in the spirit of XKCD, I used a dictionary file as a password generator... and some of the suggestions are wonderful.

Think my favourite was "SquirtsFecalLookalike".

n
 
Thank you very much, Mike,

this is very useful. On one hand, I wouldn't have thought MEMLINES() was that fast. On the other hand, it really just counts lines, while ALINES() creates an array of those lines, which is more work.

But your goal was only to count lines as fast as possible. One known problem is that text fields delimited with quotes can contain linefeeds that don't start a new record but are just part of a multiline value in a single field of the CSV data. I used Excel for reading in CSV files for that reason. But not all CSV files have multiline texts. And stray commas can be handled too: when you get more fields than expected for a row after splitting at commas, you can look for a value that starts with a quote but doesn't end with one. Then you know the next element completes it, unless there are more commas.
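To illustrate the idea, here is a little sketch of that re-joining step (my own throwaway example, not production CSV parsing; the sample record is made up):

Code:
* Sketch: after splitting one CSV record at commas, re-join the pieces of a quoted
* field that itself contains commas. The sample record is made up for illustration.
LOCAL lcRecord, lnParts, lnI, lcField, llOpen
LOCAL ARRAY laParts[1]
lcRecord = [plain,"a field, with a comma",last]

lnParts = ALINES(laParts, lcRecord, ",")      && naive split at every comma
CREATE CURSOR Fields (cValue C(60))

llOpen  = .F.
lcField = ""
FOR lnI = 1 TO lnParts
   IF NOT llOpen
      lcField = laParts(lnI)
      * a value starting with a quote but not ending with one is incomplete
      llOpen = (LEFT(lcField, 1) == ["]) AND NOT (LEN(lcField) > 1 AND RIGHT(lcField, 1) == ["])
   ELSE
      lcField = lcField + "," + laParts(lnI)   && put the split-away comma back
      llOpen = NOT (RIGHT(lcField, 1) == ["])
   ENDIF
   IF NOT llOpen
      INSERT INTO Fields VALUES (CHRTRAN(lcField, ["], ""))   && store the value without quotes
   ENDIF
ENDFOR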

What I also looked into was not reading files fast but reading them asynchronously, without using multithreading: that doesn't work for large files either.

Earlier I said I was experimenting with overlapped I/O, but I had to put that aside. Maybe I can pick it up again when the workload decreases and the weather gets colder. I also had a completely different idea and found a few nice sidetracks, for example this discussion:
VFP already has a way to access a string in memory by address with SYS(2600), too, so maybe there's no need for a DLL; but an FLL would be nice, to expand the VFP language with new file and string functions working on allocated memory instead of string variables.
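As a quick illustration of what SYS(2600) does (a throwaway sketch, assuming the SYS(2600, nAddress, nBytes) read form; GetCommandLineA is used here only because it hands back a valid pointer to a string in memory):

Code:
* Sketch: read memory by address from inside VFP with SYS(2600).
* GetCommandLineA only serves to obtain a valid pointer for the demo.
DECLARE INTEGER GetCommandLineA IN kernel32
LOCAL lnAddr
lnAddr = GetCommandLineA()      && address of the process command-line string
? SYS(2600, lnAddr, 40)         && show the first 40 bytes stored at that address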

Looking forward to your word puzzles.


Chriss
 
Chris, I take your point about delimited fields that might contain an LF as part of their data. I guess that's the same reason that MEMLINES() didn't recognise the LFs as line terminators (which is why I changed them all to CRs). I can see it would also be an issue if you did an APPEND or an IMPORT from a CSV. But not, it seems, with SDF (which I prefer to use when possible).

You also mentioned Excel. I did think of writing a function to read the file into an Excel worksheet, and to return the highest row number used, but I haven't worked out how to do that yet.
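A rough sketch of one possible approach (untested here; it assumes Excel's standard COM object model, that opening a .txt file puts one line per row, and an Excel version that allows more than 65,536 rows):

Code:
* Sketch: let Excel load the text file and report the highest row number used.
LOCAL loExcel, loBook, lnRows
loExcel = CREATEOBJECT("Excel.Application")
loExcel.Visible = .F.
loBook  = loExcel.Workbooks.Open(FULLPATH("ManyWords.txt"))
lnRows  = loBook.ActiveSheet.UsedRange.Rows.Count   && highest row used = line count
loBook.Close(.F.)                                   && close without saving
loExcel.Quit()
? TRANSFORM(lnRows) + " lines"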

I'm now in the process of doing some timing tests on importing the file into a DBF. I'll post the results soon (although I think I can guess which method will be the fastest).

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
I've now done some performance tests on actually importing the text file into a table. No great surprises in the results, but one or two quirks came to light that might be of interest.

Here is a summary of the tests:

Test 1

Straightforward APPEND FROM ... TYPE SDF. This is the simplest and most obvious method, and the one I would expect to be the fastest (as it was).

Test 2

Use low-level file functions to read each line in turn.

Test 3

My aim here was to use ALINES() to get the text into an array, and then to use APPEND FROM ARRAY to get it into the table. But I hit a snag. ALINES() creates a one-dimensional array. When using a one-dimensional array, APPEND FROM ARRAY only inserts a single row. To load the entire array into the table, the array has to have multiple columns. I tried using AINS() to insert a second, blank column, but that resulted in the text from the second half of the rows being moved up to the second column in the first half of the rows, which mucked the whole thing up. The only solution I could think of was to copy the one-column array to a two-column array in a loop. (Can anyone suggest a better method of dealing with this issue?)

Test 4

As in Test 3, I first got the text into the array. I then looped through the array, doing an INSERT for each individual row.

Test 5


Read the file into a memory variable; use MLINE() to extract the individual lines. This suffered from the same issue as in the first batch of tests: MLINE() doesn't recognise a LF on its own as a line terminator, so I used the modified version of the file for this test, in which the LFs had all been changed to CRs.

Test 6

Use File System Object (FSO) to read the lines. Again, this is the only method that would work with an input file larger than 2 GB. However, that point is theoretical in this case, as the target table would also be subject to the 2 GB limit.

As before, I am showing the code here as a single PRG, but I actually ran each test separately, after re-starting VFP each time. Before each test, I re-created the target table. This was a free table, named Words, with one field: cWord C(32). (The longest word in the dictionary has 31 letters. Does anyone know what it is?)
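For reference, re-creating the empty target table before each run would look something like this (a sketch; the exact statement isn't shown here):

Code:
* Sketch: re-create the empty free table before each test run.
CLOSE DATABASES ALL
IF FILE("Words.dbf")
   ERASE Words.dbf
ENDIF
CREATE TABLE Words FREE (cWord C(32))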

Code:
* Timing test for word list. Part 2: Load the file into a table.

* Input file contains 583,299 English words, one per line, all lower case, each line terminated
* by a single line-feed, CHR(10). File is 5.45 MB.

* Test 1
* Append the text file into the table.
lnSecs1 = SECONDS()
SELECT Words
APPEND FROM ManyWords.txt TYPE SDF 
lnSecs2 = SECONDS()
? "Test 1 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

* Test 2
* Use low-level file functions to import the lines
lnSecs1 = SECONDS()
lnHandle = FOPEN("ManyWords.txt")
DO WHILE NOT FEOF(lnHandle)
  lcLine = FGETS(lnHandle, 32)
  INSERT INTO Words VALUES (lcLine)
ENDDO 
FCLOSE(lnHandle)
lnSecs2 = SECONDS()
? "Test 2 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

* Test 3
* Read file into memory variable; load the variable into a 
* two-dimensional array; append from the array into the table.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcWords)
DIMENSION laWords1(lnCount, 2)
FOR lnI = 1 TO lnCount
  laWords1(lnI, 1) = laWords(lnI)
      && Unfortunately, we need to copy the array to a second array
      && with two columns. With a one-dimension array, APPEND FROM
      && only inserts one row.
ENDFOR   
APPEND FROM ARRAY laWords1 
lnSecs2 = SECONDS()
? "Test 3 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

* Test 4

* Read file into memory variable; load the variable into an array;
* loop through the array to copy to the table
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcWords)
FOR lnI = 1 TO lnCount
	INSERT INTO Words VALUES (laWords(lnI))
ENDFOR   
lnSecs2 = SECONDS()
? "Test 3 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

* Test 5
* Read file into a memory variable; use MLINE() to extract the lines.
* NOTE: For this test, I used a different version of the text file, in 
* which all the LFs were changed to CRs. That's because MEMLINES() and
* MLINE() do not recognise a LF on its own as a line terminator.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("WordsCR.txt") 
lnCount = MEMLINES(lcWords)
_MLINE = 0
FOR lnI = 1 TO lnCount
  INSERT INTO Words VALUES (MLINE(lcWords, 1, _MLINE))
    && Note use of _MLINE here, as explained in MLINE() Help topic.
ENDFOR 
lnSecs2 = SECONDS()
? "Test 5 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

* Test 6
* Use File System Object to import the lines
lnSecs1 = SECONDS()
loFS = CREATEOBJECT("Scripting.FileSystemObject")
loFile= loFS.OpenTextFile("ManyWords.txt", 1)
DO WHILE NOT loFile.AtEndOfStream
  lcLine = loFile.ReadLine()
  INSERT INTO Words VALUES (lcLine)
ENDDO 	
loFile.Close()
lnSecs2 = SECONDS()
? "Test 6 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

***********************************************************

And here are the results of the tests.

Code:
[i]Test      Time (secs)[/i]

1           0.327
2           2.013
3           0.905
4           0.780
5           see below
6           2.808

I abandoned Test 5 (the one that used MLINE()) after about ten seconds. At that point, it had only read a couple of thousand rows. Obviously, this was much too slow to be of any use. I don't know why. Was there some fault in my code?

So, the most obvious method (a straight APPEND FROM) is not only the simplest, but also the fastest. However, it is the only method that does not allow for any kind of intervention, that is, you cannot alter the data in any way before it reaches the target table, for example, to discard any unwanted data in order to bring it down below 2 GB (which was the point of our original discussions).

I hope someone will find this useful. I'd be glad to hear any feedback.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike, I think you could DIMENSION the array created by ALINES itself to a 2D array with 1 column, not 2.

A simple test whether this works is easy:
Code:
Local Array laLines[3]
laLines[1] = "alpha"
laLines[2] = "beta"
laLines[3] = "gamma"

Create Cursor crsLines1(cLine C(254))
Append From Array laLines

? '------------------'

Create Cursor crsLines2(cLine C(254))
Dimension laLines[Alen(laLines),1]
? laLines[1,1], laLines[2,1], laLines[3,1]

Append From Array laLines

I only get one row ("alpha") in crsLines1, but all the lines in crsLines2.

I can imagine a tricky way to turn an APPEND FROM a file into an APPEND FROM ARRAY: replace all commas with CRs, then use ALINES() to create a one-column array, turn that into 2D by DIMENSIONing it with as many columns as you need, and then APPEND FROM ARRAY.

This only trips up when values themselves contain linefeeds or commas, which will skew the data.



Chriss
 
I think you could DIMENSION the array created by ALINES itself to a 2D array with 1 column, not 2

Chris, I must say I was surprised at what you said. But I ran your code, and it does appear to work.

In my test, I originally tried to use DIMENSION to re-dimension the array as 2D. So I did this:

[tt]DIMENSION laWords(lnCount, [highlight #FCE94F]2[/highlight])[/tt]

My aim was to give the array two columns, on the basis that a 2D array is, by definition, an array with two or more columns. But what you are saying is that:

[tt]DIMENSION laWords(lnCount, [highlight #FCE94F]1[/highlight])[/tt]

will create a 2D array with one column. I didn't know that that was possible.

I will now run my Test 3 again on that basis and see what happens.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
OK, I've now revised my Test 3, as per the above posts:

Code:
* Test 3
* Read file into memory variable; load the variable into a 
* two-dimensional array; append from the array into the table.
lnSecs1 = SECONDS()
lcWords = FILETOSTR("ManyWords.txt")
lnCount = ALINES(laWords, lcWords)
DIMENSION laWords(lnCount, 1)
APPEND FROM ARRAY laWords
lnSecs2 = SECONDS()
? "Test 3 time: " + TRANSFORM(lnSecs2 - lnSecs1) + " seconds."

The time is now 0.484 seconds, compared to the previous 0.905 seconds. Not surprising, given that we have now eliminated the loop.

The next job is to try to create some of those word games and puzzles.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Yes, it's 2D, but it still has only one column; the ,1 seems unnecessary, but it changes how APPEND FROM ARRAY uses the array.

It also works with multiple columns:
Code:
Local lcCSV, lnLen

Text To lcCSV NoShow
A,B,C,D,E,F,G
H,I,J,K,L,M,N
O,P,Q,R,S,T,U
V,W,X,Y,Z,0,9
EndText

lcCSV = Strtran(lcCSV, Chr(13)+Chr(10), ",")
lnLen = Alines(aLetters,lcCSV,',')
Dimension aLetters(Int(lnLen/7),7)

Create Cursor Import (c1 C(10), c2 C(10), c3 C(10), c4 C(10), c5 C(10), c6 C(10), c7 C(10))
Append From Array aLetters

The trick here is to turn CRLF to commas too, let ALINES create an array with the single field values and dimension it to 7 columns afterwards.

Chriss
 
Thanks for that, Chris. It's funny, but in all these years I've never come across the concept of a 2D array with one column. Always good to learn something new.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Final word on this topic ....

I originally downloaded the 194,000-word file with the aim of trying to create various word games and puzzles. If anyone is interested, I shall post my efforts here: thread1551-1811770.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,

I also downloaded a (same?) file of all English words, and another one of French words. I created a DBF from it to use in my program to "cheat" on difficult crossword puzzles (e.g. NY Times weekends).

I enter something like ",,ss,o,," and get a list of all 8-character words having those letters in those places.

To do this, I create about 15 indexes, one on each character (1st, 2nd, 3rd etc.).
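For anyone curious, here is a rough sketch of the idea (my own illustration, not Steve's actual code): with VFP's LIKE(), where "?" stands for one unknown letter, a pattern can be matched directly against the Words table built earlier in the thread.

Code:
* Sketch (not Steve's code): find all 8-letter words with "ss" in positions 3-4
* and "o" in position 6. In LIKE(), "?" matches exactly one character.
LOCAL lcPattern
lcPattern = "??ss?o??"

SELECT cWord FROM Words ;
   WHERE LIKE(lcPattern, ALLTRIM(cWord)) ;
   INTO CURSOR Matches
BROWSE

* Steve's per-position indexes would presumably look something like:
* INDEX ON SUBSTR(cWord, 1, 1) TAG pos1   (and so on for each letter position)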

My weird sense of fun.[ponytails]

Steve
 
Steve, that's very interesting. I was also thinking of writing a crossword cheater, but my approach would be quite different from yours. Do you feel like continuing this discussion over in thread1551-1811770, in case any other puzzle solvers want to join in?

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
No idea why you posted your functions.

One is made redundant by TYPE(,1); I wonder where you'd need the second one and all the manual element rearrangement. If you ADEL a column, it will also have an effect on APPEND FROM ARRAY.

So, for example:
Code:
#Define CRLF Chr(13)+Chr(10)

TEXT To cCSV noshow
r1c1,r1c2,r1c3
r2c1,r2c2,r3c3
ENDTEXT

Alines(laCSV,Strtran(cCSV,CRLF,','),',')
Dimension laCSV(2,3)

Create Cursor col1 (cCol1 C(4))
Append From Array laCSV
Adel(laCSV,1,2)

Create Cursor col2 (cCol2 C(4))
Append From Array laCSV
Adel(laCSV,1,2)

Create Cursor col3 (cCol3 C(4))
Append From Array laCSV


I know the Remarks in the ADEL Help topic say: "If you delete a column, values of the elements in the deleted column are set to False (.F.), but trailing elements are not moved." But it won't matter for this usage, for example. It may be nice to have in other cases.

Chriss
 
Well, you just need TYPE("m.Variable",1)="A" or, if you want to adapt your code, this version of X2IsArray.prg:

Code:
LPARAMETERS tuVariable, tcProperty 

LOCAL llRetVal

DO CASE
  ******************************************************
  CASE PCOUNT() = 1 AND NOT VARTYPE(m.tuVariable) = "O"
  ******************************************************
    llRetVal = TYPE("m.tuVariable",1) = "A"
  ******************************************************
  CASE VARTYPE(m.tuVariable) = "O" ;
       AND VARTYPE(m.tcProperty) = "C" ;
       AND NOT EMPTY(m.tcProperty)
  ******************************************************
    llRetVal = TYPE("m.tuVariable." + m.tcProperty,1) = "A"
  ******************************************************
  OTHERWISE 
  ******************************************************
    *
    *  you apparently haven't passed the parameters
    *  properly -- we could have RETURNed .NULL. here,
    *  but then every time you call X2IsArray(), you
    *  would have to check for .NULL, .T., and .F. 
    *  rather than just .T. or .F., so it's up to you
    *  to pass the parameters correctly
    *    Roses are red
    *    Violets are blue
    *    To pass parms correctly
    *    Is all up to you
    *
    llRetVal = .f.
ENDCASE

RETURN m.llRetVal

And for knowing whether a variable is an array, I'd not call a PRG. TYPE("varname",1)="A" is concise already. If you're like me and don't like the parameter ,1 "falling from the sky", you can define #DEFINE ARRAY 1 and then check TYPE("varname",ARRAY)="A".

Checking a property works the same way: TYPE("objectname.property",ARRAY)="A". And I don't see a big benefit in giving the function another name; TYPE() and VARTYPE() are well known for what they do and return.




Chriss
 
Well, you can use it with #IF VERSION(5) = 900.

But you're right in any case, because assuming I had previously used X2IsArray() I could phase that out myself and directly use TYPE() with its new parameter.

Chriss
 