Performance problem 4

jamesrav · Oct 15, 2021

I have about 90,000 dbf files in a single folder. They contain from 0 to 100 records, but generally 5-20. I have a .prg file that cycles thru all the tables and does some processing based on contents. The problem/puzzle is that sometimes this .prg file can go thru all 90,000 in about 1 minute, and other times it can take 15 minutes. No change to the .prg at all. I've checked the task manager and it's not the case that the CPU is occupied by other processes when the .prg is functioning slowly. It just seems that VFP in some situations 'knows' about the tables and can move thru them quickly, and in other cases it is seeing the data for the first time (and I should add that I cycle thru the 90,000 multiple times with different criteria, and when it's operating well, the 1 minute time is consistent , and when slow, it's 15 minutes every time thru - never improves). So is this a VFP issue or Windows in general? I've tried adding some Sys() commands at the start (based on the Help file) but it does not improve things.

Chris Miller · Oct 15, 2021

I am reminded that Windows Exploreriss bad at listing directories with many files. Even if you don't ask for any specific order
What size does the whole directory have?
Do you use ADIR()? Have you tried using SYS(2000,'directory\*.dbf') folloed by calls to SYS(2000,'directory\*.dbf',1) until the result is empty, instead?

And last not least I think using Sys(1104,Alias()) everytime you're done with one dbf could keep memory usage low.
I understand you're processing files multiple times, an approach doing th several processing steps with each dbf would be better, as you then don't need that cached data anymore.
So iterate DBFS, mainley, then processing steps, not the other way around. Or is that not possible for some reason?

Chriss

jamesrav · Oct 16, 2021

thanks for replying. I do use Adir(). The sys(2000) is something I will try the next time it goes 'slow' (I kind of mimicked that in the past by doing a Dir and importing all 90,000 filenames into a different table, and then using that table to get each .dbf - did not help the issue). The sys(1104) mentions "improve performance" so I will certainly try that. But I would think I'd want to keep tables cached so they'd be 'known' in future cycles thru. But I dont know if the cycling thru or the processing each table is causing the slowdown.

the tables are about 180megs total (size on disk is 350megs). Each is quite small. Putting them in multiple folders would be possible, but require new programming to accommodate the multiple folders.

Processing the same tables over and over is inefficient I agree, but it's much easier that way. That the .prg can do this sometimes in 1 minute shows it can be done quickly, it's just the randomness of slow vs. fast. Once in slow mode, it stays stuck there, and then I try it 12 hrs later and it's fast.

Rick C. Hodgin · Oct 16, 2021

How is your PRG being run? It might be that Windows is trying to keep too many recent handles open and you're getting some internal thrashing behavior inside of Windows.

Try using a launcher program to launch new instances of your PRG compiled to a .EXE. If you need to communicate data, I can give you a little DLL that lets you send data back and forth between the two programs.

By launching a new EXE, doing your processing, and then exiting, it may prevent the issue if it's a thrashing issue.

I can also give you a FindFirstFile(), FindNextFile() and FindClose() feature inside of a DLL to let you iterate using native Win32 functions. Let me know if interested. No charge, by the way. Just helping out.

--
Rick C. Hodgin

jamesrav · Oct 16, 2021

thanks for that offer, hopefully not needed with these suggestions - but I will take you up on that if needed. I run the .prg in the Command Window. I could turn it into an .exe of its own, I have several minor variations of this .prg so if creating an .exe did solve the poor performance I could just add them all and choose which one. That would organize things and solve the performance problem.

Rick C. Hodgin · Oct 16, 2021

Something else to consider, you may not need to open every table. If it's possible (I don't know what these 90K DBF files are or are used for or how they're generated), you could keep a list of the files processed. ADIR() returns the date and time, so you could compare what ADIR() reports with the last date and time you had recorded from the last time it was processed on that file. And if you need a more accurate timing, you could use the file's FileTime information, available from a FindFirstFile() API call. It has times down to 100ns intervals in a 64-bit value.

If it's unchanged, it won't need any new processing and you could skip it.

Something else to consider:
Windows has an API that will notify you when a directory changes:

https://docs.microsoft.com/en-us/windows/win32/fileio/obtaining-directory-change-notifications

==> See the FILE_NOTIFY_CHANGE_FILE_NAME and FILE_NOTIFY_CHANGE_LAST_WRITE settings

Also an example:

https://docs.microsoft.com/en-us/windows/win32/fileio/obtaining-directory-change-notifications

A small DLL monitoring that information in its own separate thread could SendMessage() back into a VFP app and notify your app when things change. You could then process in almost real-time (depending on how fast and furious the changed files come in).

--
Rick C. Hodgin

Chris Miller · Oct 17, 2021

jamesrav said:
But I would think I'd want to keep tables cached so they'd be 'known' in future cycles thru.

You miss the important point. If the second use of a dbf comes after using the other 90,000 DBFs, the caching success can be as you experieence. If memory usage was low enough it all fits, otherwise the cache mechanism either swaps, which makes it slow, or it will pruge the oldest used DBFs, and thus you don't have any DBF cached in the next processing steps.

If you do all processing steps on one DBF, then the next, etc. you only need to have one DBF cached. And then can purge it with SYS(1104,ALIAS()). So you never have all DBFs in the cache.

Chriss

jamesrav · Oct 17, 2021

now I have to wait for it to perform slowly so that I can try these suggestions. I forgot to mention an important point: I have a 10-line routine that cycles thru and simply opens each table. The first time thru it is always slow, taking upwards of 9 minutes. But often the 2nd time I run it, it will zip thru the 90,000 in 30 seconds. At that point I know my other .prg files will operate fast. I would say this 'pre check' method fixes the problem about 70% of the time. But sometimes subsequent cycles thru are no faster and I'm done for the day (Logging off Windows does not help ; I may try a full shut down next time). So the pre-check routine would seem to indicate that a one-time Use of each table does keep the table in memory for future references. The next 'slow' occurrence I experience, I will try these things and report back. I appreciate the suggestions.

Chris Miller · Oct 17, 2021

You could be lucky if 180 meg fit into your cache. but opening a dbf all that's read and validates is the dbf header, also to postitojn to record 1. Then if a dbf is browsed, the first few records are loaded, not more.

You would do a favor to every other process on the system, if you process dbf by dbf and purge it afterwards. Then you still profit from caching it, but only one dbf at a time.
And then you could look for other ways to improve the performance.

Your current order of processing proves to be slow in enough cases that you shouldn't aim for getting more lucky, whether you chunk up SYS(3050) and give it more than enough to cache all files and have enough memory for all else, too, or you work file by file. It doesn't take 15 minutes to read 180 MB through a network, unless that has a very bad bandwidth, it would be a net bandwidth of averagely 180 MB/15 minutes = 1440 Mbit/ 900 s = about 1.6 Mbit/s. 1-10GBit/s are the norm, times with 10 Mbit/ networks were in the 90s for home users, you will have 300 Mbit/s or faster Wifi today, to profit from your ISP internet bandwidth.

Nope, there's soething else that takes teh 15 minutes and focusing on the caching alone won't solve it. If you cache all data but part of it goes to the system swap file, then you will swap out dbf and read them back from swap file in every pass, that'll be what takes longer.

You can use process (and file) monitor to find out where io time is spent, there's no guesswork about that, if you do.

I bet you'll save astonishingly much time in even just opening up the dbfs, if you switch from ADIR or data you store in a cursor or dbf about all files, to Sys(2000). Opening the file SYS(2000,1) jut returned is opening the file the file system just had in focus anyway.

So a top-level program should iterate over the dbfs of a directory and then call prgs which each work on just the current workarea. That would only let the first prg take longer to read this dbf first, but since it's just a few KB, it'll be fully in memory without any pressure on RAM usage of your process or other processes. You can the same cache effect as if all files are cached. All dbfs will be processed from cache in any but the first PRG and you only use as much cache RAM as is needed for the largest dbf at any time. This will never topple down and you can get back to your 30 seconds, perhaps 40, but stable. I think the caching profit is exactly the same.

The idea to act when changes appear is good, too, but it all depends what's the directory for. It seems you process all data each time and don't remove some, so this is a repeated process not justdone on changed data. If that was the case you'd not profit from cache anyway, as new data in the sense of an update invalidates the cache and new data in the sene of new records isn't cached yet.

Chriss

tomk3 · Oct 18, 2021

Hello,

maybe you can read all filenames into a dbf and then do multithreading.

I just learned (was remembered

) on VFEST2021 , they are using DMULT.DLL

https://virtualfoxfest.com/sessions.aspx

Another one was mentioned shortly : parallelfox

https://github.com/VFPX/ParallelFox

in vfp2c32 there are functions on multithreading (createprocessex), too.

regards
tom

german12 · Oct 21, 2021

Mike Y

What is the name of that function?
Klaus

Peace worldwide - it starts here...

Mike Lewis · Oct 21, 2021

Klaus said:
What is the name of that function?

I'm not sure, Klaus, but I think the one he was referring to is ADirEx(). It looks pretty similar to the native ADIR(), but with other options including putting the directory information into a cursor.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

german12 · Oct 21, 2021

In the Help F1 of VFP 9.0 SP2 I could not find that function Mike

Klaus

Peace worldwide - it starts here...

Mike Lewis · Oct 22, 2021

In the Help F1 of VFP 9.0 SP2 I could not find that function Mike

Klaus, ADirEx() is not a native VFP function. It is part of a third-party library called VFP2C32. There is a lot more information about it here:

https://saltydogllc.com/wp-content/uploads/Selje_Amazing-VFP2C32-Library.pdf

It is worth investigating this library, as it has a large number of useful functions to do things that you cannot easily do natively in VFP. That said, it would be quite easy to write a small VFP function to load the contents of a directory into a cursor. You just need to use ADIR() to get the information into an array, and then copy that to the cursor. But I suspect ADirEx() would be faster.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

atlopes · Oct 23, 2021

Mike Yearwood said:
If anyone can explain the attrib values that would be appreciated.

I think this is what you're looking for:

https://docs.microsoft.com/en-us/windows/win32/fileio/file-attribute-constants

german12 · Oct 23, 2021

Thanks Mike Yearwood and Mike Lewis for the information about the tremendous amount of information on the amazing VFP2C32 Library.
I didn't know this source, but at least now I know what is being discussed here regarding performance improvement.

Thanks again.
Klaus

Peace worldwide - it starts here...

Chris Miller · Oct 24, 2021

Vfp2c32 has become more or less known as it was the backbone of Carlos Alloattis' ctl32 controls library. Well, Vfp2c32 is less known because mainly the ctl32 library became known. I think at first it didn't have all the functions it has now, but he extended it as needed as it also was the basis of his libcurl binding (see

https://curl.se/libcurl/bindings.html).

Chriss

Rick C. Hodgin · Oct 25, 2021

Mike Yearwood said:
can you alter ADirEx to make it recursive

If he can't I can.

It's in the vfp2cfile.cpp, around lines 231 thru 337. Either the fpFilterFunc() needs to be smarter to recurse, or when a folder is encountered, it needs to recurse there in those loops.

I would recommend re-writing that block to get an internal array of all files specified first, and then iterate through the array and populate into the VFP array, cursor, or do the callback, then dispose the internal array afterward. I do something similar in Visual FreePro for ADIR(), but it does not support recursive searching. Maybe I should add that.

--
Rick C. Hodgin

Chris Miller · Oct 25, 2021

I would also need to brush off my C skills, but Visual Studio makes it easy to clone a github repository and work on it.

Rick, I wouldn't mind if you do it when you're already at it.

In the meantime I prepared a test scenario with 100,000 dbfs and tested Adir() vs AdirEx() vs Sys(2000):

ADIR takes 10 minutes just to create the array of files
AdirEx takes about 2 seconds
and iterating Sys(2000) takes about 1 seoond.

The advantage of AdirEx is you have more than just the file name, on the other side the file name is all you need for USE or SQL-Select.

To be fair, further passes of Adir now also only need 5 seconds. So

Code:

s=SECONDS()
nFiles1 = Adir(laTables,'*.dbf')
t=SECONDS()
nFiles2 = ADirEx("Tables",'*.dbf',0,ADIREX_DEST_CURSOR)
u=SECONDS()
nFiles3=0
cFile = SYS(2000,'*.dbf')
DO WHILE !EMPTY(cFile)
   nFiles3 = nFiles3 + 1
   cFile = SYS(2000,'*.dbf',1)
ENDDO 
v=SECONDS()

? "Adir", nFiles1, t-s
? "AdirEx", nFiles2, u-t
? "Sys2000", nFiles3, v-u

nFiles4 = Adir(laTables,'*.dbf')
w=SECONDS()
? "Adir 2nd pass", nFiles4, w-v

You can also restart VFP, the computer and turn it around. Still AdirEx is fast also in the first pass,

So, jamesrav, to get back to you:
ADIR clearly needs more time the first time to read in the dir. So it's even likely that Sys(2000) alone speeds up the whole process without any further changes.

Chriss

Rick C. Hodgin · Oct 25, 2021

I've sent email to Christian Ehlscheid.

I can modify the base source code as is and get you a DLL. It won't be upstream though unless I hear back from him.

I'll try to work on it tonight. About 7 hours from now.

--
Rick C. Hodgin

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Performance problem 4

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor