I thought that working from cursors was supposed to be faster.


Stella740pl

Hello everybody,

While I solved my own problem, now I would love to have an explanation of what and why actually happened, to get a perspective on things and some knowledge for the future.

Here it is.

The master tables are free tables, one for each year. Each one (for a complete year) has on average 2.5 million records, and the average file size is about 95 MB. That’s why speed actually matters. Each is indexed on all fields and then some. Since the current year’s table gets updated only once a month (with all the records for the month at once), and the previous years are usually set in stone, I keep copies of the last few years locally, for faster reporting and ad-hoc data requests (so network speed is usually not an issue). Reference tables are not big, so I use them from the network.

Now, the reporting application (VFP6) I run on them once a month opens the files for the current year and the previous year, and creates a statistical comparison table with many fields, from which a number of reports are then printed. The process is quite long: it runs in loops for different criteria, JOINing reference tables, using subqueries, and UNIONizing the results. The most important fields that get created are average figures for each entity for the last month, for the same month last year, and also average year-to-date figures for the current year and the same period of the previous year. So the two big tables located on my hard drive (plus a few small ones located on the network) are open while the report generation program runs.

Recently I got a request to get rid of the year-to-date figures and create rolling-average figures instead. That means that the reports for July 2007, instead of comparing January-to-July 2006 and January-to-July 2007 average columns, should compare August 2005-to-July 2006 and August 2006-to-July 2007 average columns.

I decided that I should open the tables for the last three years and create two cursors with, say, August 2005-to-July 2006 and August 2006-to-July 2007 records, respectively. Each of them would be the size of the original one-year table. I created them with the same aliases that the two source tables were using previously, so from that point on it was all the same process as before, and, I expected, it might even be faster, since it would run from cursors.

Apparently not so.

Actually, it started to run about 10 times slower (I didn’t do many timing tests, but instead of the usual 20 to 40 minutes it now took some 5 hours). I found that creating the cursors themselves was not the problem; the problem was somewhere after that, running multiple complex SELECT statements from those cursors.

After some time spent trying to find what I did wrong, I decided to try creating actual tables in the temporary folder instead of cursors, and – oh joy! – it started to run in time comparable to the previous process. (After I added indexes on them, it became even faster, of course, but that was nothing compared to the effect that creating tables instead of cursors had.)
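
In case it helps to picture it, the “tables instead of cursors” variant looked roughly like this (the table, field, and path names below are made up for illustration; ldFrom/ldTo stand for the date range variables):
Code:
* Sketch only -- mastertable, rep_date, entity_id, tmppath, ldFrom and ldTo
* are placeholder names, not the real ones from my application.
tmppath = "C:\Temp\"

SELECT * ;
   FROM mastertable ;
   WHERE rep_date BETWEEN ldFrom AND ldTo ;
   INTO DBF (tmppath + "tour")

* The result table is left open after the SELECT; reopen it exclusively
* and add the tags that later SELECTs can use for Rushmore optimization.
USE (tmppath + "tour") ALIAS tour EXCLUSIVE
INDEX ON rep_date TAG rep_date
INDEX ON entity_id TAG entity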

Now, I was always under the impression that cursors were supposed to be faster than tables. I had to revisit that notion.

I did a search trying to find where it actually says so, but didn’t find that much, except for a few threads briefly mentioning it. For example, in thread1252-1267141, TamarGranor on 17-Aug-06 17:09 notes: “Do you need to be using a table as the result of that query? Could you use a cursor instead? That should speed things up.”

Or this discussion on Foxite, which discusses what’s faster, SELECT INTO CURSOR or SELECT INTO TABLE. While the results don’t seem conclusive to me, most of the people there say they always prefer a cursor anyway.

What I didn’t find is an answer to the question of what is faster, running from a cursor or running from a table, especially when a big body of data is involved.

Was I wrong about the “cursor is faster” assumption? If it depends, then on what? Can someone please explain what was going on here?

Thanks.
Stella.
 
Stella,

This is a very interesting question. Thank you for sharing it.

As I was reading your post, I immediately thought that the reason for the slowdown was that you had not created the same indexes for the cursors as you had for the original table. That would be entirely consistent with the behaviour you saw.

But then I saw that you obtained a significant speed increase by using temporary tables, even before you indexed those tables. So, clearly, in this case at least, cursors are indeed slower.

My own understanding is that a cursor will not necessarily be faster or slower than a temporary table, for the simple reason that a cursor is a temporary table. It physically exists on a disk drive, and suffers the same performance issues as any other physical table. (In fact, it is very very slightly slower, because of the time needed to delete it when it's closed, but that will usually be negligible.)

The only explanation I can suggest for what you are seeing is that, in this case, the cursors were being created on a slower drive than the temporary tables. Is that possible?

Cursors physically reside in the current user's Temp directory. If that directory happens to be on a network server (I can't imagine any reason why it would be, but let's just assume), then the cursor would be very much slower. The easiest way to check would be to create a cursor and then do a DBF() on it.
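
For example, from the Command window (the cursor name here is just an example):
Code:
* Create a small throwaway cursor and see where VFP puts it.
CREATE CURSOR quickcheck (dummy C(1))
? DBF("quickcheck")         && the file name and folder VFP assigned to the cursor
? FILE(DBF("quickcheck"))   && .F. means VFP is still holding the cursor in memory only
USE IN quickcheck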

It will be very interesting to see if you get to the bottom of this.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

My Visual FoxPro site: www.ml-consult.co.uk
 
Stella
Did you create the Cursor and the Temp table with basically the same code?

On large DBF files I have noticed a speed increase when the DBFs are sorted in the same order as the index.

Example #1
USE mytable
COPY TO temp FOR BETWEEN(datefield, date1, date2)
or
SELECT * FROM mytable WHERE BETWEEN(datefield, date1, date2) INTO TABLE temp
USE temp EXCLUSIVE
INDEX ON datefield TAG datefield

Example #2
USE mytable
SORT TO temp ON datefield FOR BETWEEN(datefield, date1, date2)
or
SELECT * FROM mytable WHERE BETWEEN(datefield, date1, date2) ORDER BY datefield INTO TABLE temp
USE temp EXCLUSIVE
INDEX ON datefield TAG datefield

The second way of doing it makes the code that follows run much faster than the first example does.


David W. Grewe Dave
 

Hi Mike,

Thanks for the fast reply.

The only explanation I can suggest for what you are seeing is that, in this case, the cursors were being created on a slower drive than the temporary tables. Is that possible?

I don't think so (even though I didn't run the program again).

Tools->Options->File Locations confirms that my temporary folder is C:\Temp, which is the same folder I used to create the temporary tables (instead of cursors).

I also selected a quick cursor from the Command window just now, and it also ended up there.

Besides, I have some permanent watch expressions set up in my debugger: _TALLY, DBF(), ALIAS() and RECNO(). So, while running step-by-step, I clearly remember DBF being "C:\TEMP\TMPsome_number.DBF" right after the cursor was created.

I was thinking more along the lines of a possibly different treatment that cursors get – maybe processing partly from memory and partly from the hard drive, while loading data into memory in portions (does buffer size, or something like that, have anything to do with this?). Or maybe memory size vs. available hard drive space. I really don't know. Is anything like that possible?

 

David,

Did you create the Cursor and the Temp table with basically the same code?

Yes. Pretty much along these lines:
Code:
SELECT * FROM source1 WHERE date BETWEEN rdate AND edate ;
   UNION ;
SELECT * FROM source2 WHERE date BETWEEN rdate AND edate ;
   INTO CURSOR tour
became
Code:
SELECT * FROM source1 WHERE date BETWEEN rdate AND edate ;
   UNION ;
SELECT * FROM source2 WHERE date BETWEEN rdate AND edate ;
   INTO DBF (tmppath + "tour")

But if I understand your examples correctly, the speed increase should kick in at the stage of actually creating the tables or cursors, while I really needed it when further selecting from those cursors.

I don't really SET ORDER to any of the index tags during the process (it mostly consists of SELECT statements); I just have them so I can benefit from Rushmore technology, since the criteria in the WHERE clauses use different fields.

 
Stella,
No, the creating of the tables/cursors in the SELECT statement took a few seconds longer with the GROUP BY clause.

The speed increase I get is when I use the tables in the code after the SELECT statement, with the indexes set.

David W. Grewe Dave
 
Working with cursors can be faster, particularly when doing so means you're dealing with small tables that can be kept in memory, rather than being written to disk. In your case, because you were creating enormous cursors, you obviously couldn't keep the data in memory. So there was no advantage.

As for why it was slower, I'll guess that VFP tried to keep stuff in memory and ended up doing lots of paging and so forth.

Tamar
 
I think Tamar hit the nail on the head.

You might adjust SYS(3050) to give VFP enough physical memory to keep cursor data in memory. DBF("cursoralias") will show you a file name that won't exist as long as VFP can keep the cursor in memory. But VFP doesn't differentiate between physical memory and the Windows swap file, so it might be better to create the tables if you can't give VFP enough physical RAM.
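
For example (the 512 MB figure is only an illustration, adjust it to the machine's RAM):
Code:
* Raise VFP's foreground memory buffer, here to roughly 512 MB,
* before running the big SELECTs.
= SYS(3050, 1, 512 * 1024 * 1024)

* Optionally let VFP report how Rushmore optimizes the filters and joins
* of the queries that follow.
= SYS(3054, 11)

* ... run the heavy SELECTs here ...

* Reset the buffer size to the default afterwards.
= SYS(3050, 1, 0)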

You say you generate 3 cursors with a size of about 100 MB each. So that's 300 MB. Then I'd add another 300 MB of memory for intermediate results from your SQL SELECTs (maybe check out what SYS(3054) tells you about the creation of temporary results).

The OS needs some RAM too, so it may take a GB of RAM to make this work fast from memory.

Try reading the small tables from the network into cursors too and see if that helps speed things up. VFP should use caching when accessing records from those tables, but as they are opened shared, there can be overhead from automatic refreshes etc. Just because those tables are small doesn't mean they can't cause a performance hit.

Use the NOFILTER keyword to make sure those cursors are not just filters on the actual network DBF files.
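
Something like this (the server path and table names are only placeholders):
Code:
* Pull a small reference table from the network into a genuine local cursor.
* Without NOFILTER a one-table SELECT * can come back as a filtered view of
* the original DBF, so every access would still go over the network.
USE \\server\data\reftable.dbf IN 0 ALIAS ref_remote SHARED
SELECT * FROM ref_remote INTO CURSOR ref_local NOFILTER
USE IN ref_remote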

Bye, Olaf.
 

Tamar, Olaf - thanks!

Makes perfect sense.
Seems like I was thinking more or less in the right direction after all (I mean the last paragraph of my reply to Mike).

Tamar said:
VFP tried to keep stuff in memory and ended up doing lots of paging and so forth.
Olaf said:
You might adjust SYS(3050) to give VFP enough physical memory to keep cursor data in memory.
So the buffer size does matter. Even though I think I will keep running it from tables, I will definitely try playing with the buffer size and see what happens (it may take some significant testing time, though). Possibly I will also try to read the reference tables into cursors or copy them locally. I am also thinking of reviewing the algorithm, but I won't have time for that until a few months from now.

I just tried checking SYS(3050) from the Command window: ?SYS(3050,1) returned 722,206,720, and ?SYS(3050,2) returned 186,617,216. Not that small for a buffer size, but definitely not enough for the size of the cursors I use.

System Information showed that the available physical memory I have is 676.5 MB, and the available virtual memory is 1.96 GB. I would think that should be enough?

Olaf said:
You say you generate 3 cursors with a size of about 100 MB each. So that's 300 MB. Then I'd add another 300 MB of memory for intermediate results from your SQL SELECTs
Actually, 2 cursors of that size are created from 3 source tables of the same size. And then multiple much smaller cursors are selected from them as part of the process, but they also take up noticeable memory. But I get the idea.

Olaf said:
Use the NOFILTER keyword to make sure those cursors are not filter to the actual network dbf files.
Wait, don’t they get filtered only when selected from a single table? Most of what I have there gets created from more than one table.

I am not sure, though: if I use UNION (as in the example above), is it possible that both parts of the cursor would be filtered and not true cursors?

Thanks for the great ideas!

 
Stella,

Wait, don’t they get filtered only when selected from a single table?

That's certainly my understanding. I can't see how a UNION could give rise to a filtered table. If that's right, NOFILTER is not an issue here.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

My Visual FoxPro site: www.ml-consult.co.uk
 

Thanks Mike, this is my understanding, too.

But I re-read it and I got it this time. NOFILTER does not refer to the huge cursors I tried creating with UNION; it refers to the ones that I still should try, where NOFILTER would definitely make a difference. It goes like this:
Try reading the small tables from the network into cursors too... Use the NOFILTER keyword to make sure those cursors are not just filters on the actual network DBF files.
 
Yep, Stella. I meant that mostly for the single tables you might also read into cursors. Nevertheless, I use NOFILTER almost all the time if I want to select from those cursors. If VFP does not make a filter because it's a UNION or JOIN, it doesn't hurt to add NOFILTER to that SELECT, even though it's not needed. It serves as a kind of documentation that this cursor is needed as a NOFILTER cursor.

If you want to review your algorithm, first take a look at the columns you really need. That might do the trick already.

And perhaps you're better off with using MS SQL Server technology and datamining techniques like OLAP Cubes. You can define them to synch with production data every 10 or 15 minutes, so you'll always have those statistics at hand with only one longer run at the start of defining that cube.

Bye, Olaf.
 
You've got lots of ideas about why it's slower, but how about ways to speed up the entire process? Look at keeping totals of previous months so you don't have to total them every time. That should make for significant speed increases.
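
A rough sketch of that idea (all the table and field names below are hypothetical, not the actual ones from this application):
Code:
* Once a month, append that month's pre-aggregated figures to a running
* totals table, so the reports never have to re-total the detail records.
SELECT entity_id, ;
       SUM(amount) AS tot_amount, ;
       COUNT(*)    AS tot_records ;
   FROM master2007 ;
   WHERE rep_date BETWEEN {^2007/07/01} AND {^2007/07/31} ;
   GROUP BY entity_id ;
   INTO CURSOR new_totals

* month_totals is assumed to be an existing free table with a matching structure.
USE month_totals IN 0
SELECT new_totals
SCAN
   INSERT INTO month_totals (entity_id, yr, mon, tot_amount, tot_records) ;
      VALUES (new_totals.entity_id, 2007, 7, ;
              new_totals.tot_amount, new_totals.tot_records)
ENDSCAN
USE IN month_totals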

Craig Berntson
MCSD, Visual FoxPro MVP, Author of "CrysDev: A Developer's Guide to Integrating Crystal Reports"
 

OK, the verdict is in.

I've done some selective time testing of separate SELECT statements and other parts of the program under different conditions.

I made the buffer the largest size it allowed me – 2^30 – and this alone made selection into the initial huge cursors run about as fast as selection into tables. But selection from those cursors into some of the cursors created during the process still took about twice as long as the same selection from tables.

Then I reduced the buffer size to 2^29, and the initial selection into cursors took nearly 4 times longer than into tables, while selection into the cursors created during the process became over 10 times longer than with the 2^30 buffer, and over 20 times longer than the same selection from tables.

So yes, it was a memory and buffer size problem. Thanks for pointing this out to me. But apparently, with cursors this big, I am still better off with tables.

Selecting the reference tables into NOFILTER cursors, even with the maximum buffer size, didn't shave any noticeable time off the process (maybe averaging the times over more extensive testing would show some gain, I am not sure, but not in those few quick tests).

Olaf,
Olaf said:
If you want to review your algorithm, first take a look at the columns you really need. That might do the trick already.
Possibly. Some of the columns are for reports that are no longer printed on a regular basis but are still requested once in a while. But I was thinking about reviewing the looping process, some clauses of the SELECTs, looking for unnecessary use of functions and macro substitutions, etc. (The core logic of the report generation I inherited several years ago, and I have already reviewed the algorithm once. My modifications led to about a 1.5-times speed increase over the entire process. Now it is time to do it again – I think I can make it better now.)

Olaf said:
And perhaps you're better off with using MS SQL Server technology…
We have discussed using SQL Server for these purposes here. We use it for some other things, and might eventually transfer those master tables into SQL Server, but when we do, we will add another field to detail the data more deeply, and the size of the table will increase about 4 times (those details are now stored separately and are pulled out only when requested).

Also, we might need to create a new application to go with it, for reporting and frequently requested data. We have a simple one currently, which our internal clients can access on the intranet to get some of the numbers and selections independently, but for more complicated queries they come to us.

On the other hand, VFP is still much more robust for many purposes, like constant ad-hoc requests, etc., and we don’t really require the high level of data protection MS SQL Server provides, so many things will still stay in VFP.

Olaf said:
…and datamining techniques like OLAP Cubes. You can define them to synch with production data every 10 or 15 minutes, so you'll always have those statistics at hand with only one longer run at the start of defining that cube.
I have to look into this. I am not really that familiar with those techniques. But the thing is, I don’t need the statistics to be in synch with production data every 10 to 15 minutes. They wouldn’t even make sense recalculated every 10 to 15 minutes, or even every 10 to 15 days.

The whole body of data for each month is downloaded from the mainframe once a month. Then I screen it for data corruption (and we sure have a lot of that on so many records), clean out garbage data, check for possible missing/lost data, massage it in many different ways, estimate some other things, etc. We have an application to aid this process, but it cannot run automatically; a programmer familiar with both the databases and the business logic has to man it. This massaged data we append to the main table once a month. It can then be reviewed/updated along with the next month’s data or at the end of the quarter. After that, the data is finalized and doesn’t change any more (except for that one time when a problem in the raw data we download was discovered much later, and I had to rework two years’ worth of data – I hope that doesn’t happen again).

The statistics are not straight-calculated on the whole body of data; they are usually average-weekday, average-Saturday, and average-Sunday calculations on different criteria and for different entities and their groups/subgroups, usually for each month, or a whole year, or some time period separately. Since I am not very familiar with OLAP Cubes, I am not sure whether they can help me with that. But I will have to do some research on this.

Craig,
Craig said:
Look at keeping totals of previous months so you don't have to total them every time. That should make for significant speed increases.
Not sure. I have to look into it. First, as I mentioned above, the monthly data can be updated within the next few months, so the totals would need to be recalculated, too. Second, I might end up with another big table of totals after all (not as big as the main files, of course) that would require some steps to make sure it stays in synch with the data – the totals are a few hundred records and a few dozen fields for each month. But I will look into this – it might work out well after all. Thanks for the idea.
 
Hi Stella,

I thought about an additional 300 MB for the SELECTs from the large cursors because Rushmore might create intermediate results. You also have to think about buffering.

As the cursors are quasi in-memory DBFs (as long as FILE(DBF("cursoralias")) returns .F.), they are equivalent to a DBF file, not to a buffer of the data of a real table. You might get better performance by turning buffering off for those large cursors: as they are already in-memory tables, buffering them in memory is just unneeded overhead.
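
In code, the in-memory check and the current buffering mode look like this (alias name as in the example above):
Code:
* If this prints .F., the cursor still lives only in memory --
* no physical temp file has been written for it.
? FILE(DBF("cursoralias"))

* Current buffering mode of the cursor
* (1 = buffering off, 3 = optimistic row, 5 = optimistic table buffering).
? CURSORGETPROP("Buffering", "cursoralias")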

Your 600 MB of physical RAM might be too little. The OS takes something off of it, other applications too. I'd invest in more RAM. Windows is also not optimal at swapping RAM to its swap file, so you might be better off turning virtual memory off or minimizing its size.

Bye, Olaf.

 

Hi Olaf,

I might be misunderstanding the whole buffering idea and the memory usage.

So an increase in buffer size gave me some performance improvement, but turning it completely off might give me even better results? Or turning off virtual memory?

Olaf said:
Your 600 MB of physical RAM might be too little. The OS takes something off of it, other applications too.
It probably already did. 600 MB of physical memory is just the available part; the whole thing is 1024 MB, and this, I believe, includes the recently installed 512 MB.

Olaf said:
I'd invest in more RAM.
I am not investing in anything. I am a salaried employee, and, as I just mentioned, I recently got some additional memory installed. Of course, if I ask for it and justify the need, I might even get it, but I don’t think anyone will consider shaving some 10 minutes off of a process that runs once a month and takes under 45 minutes a good enough reason.

I mean, I do operate with large amounts of data, but as long as the length of the processes stays reasonable, doesn’t tie up the whole computer, and doesn’t prevent me from doing my job on time, I don’t think an investment is warranted. I might drop a hint, though.


Also, whoever gave me this star, could you please tell me what it is for, or was it by mistake?

 
Hi Stella,

Okay, my misunderstanding. I thought you had about 600 MB of physical RAM (e.g. 512 MB + 128 MB).

Sorry if I confused you with the buffering terms. In my last post I was thinking about turning off both VFP's caching mechanism (CURSORSETPROP("Buffering",0,"cursoralias")) and Windows swapping RAM to disk via the pagefile.sys file.

Because

a) The cursors themselves are in physical RAM and don't take up all the RAM, so buffering will not be helpful at all; it will just take up more RAM, and the chances rise that Windows swaps parts of the cursors to pagefile.sys, making cursor data access even slower than access to a DBF file.

and

b) If you don't allow the OS to swap memory to pagefile.sys, you can prevent the process from slowing down. Windows might swap out parts of the cursor even if it would still be able to hold it fully in RAM.

Bye, Olaf.
 

Olaf,

You probably meant
CURSORSETPROP("Buffering",1,"cursoralias"), not
CURSORSETPROP("Buffering",0,"cursoralias").
0 is not mentioned as a possible value for the Buffering property (and returns an error); 1 is the documented value for turning buffering off.

Well, anyway, I tried it, and I am still better off with tables; even the maximum (2^30) buffer works faster than turned-off buffering. In fact, with buffering turned off for the source tables, selection into cursors takes about the same time as with max buffering, or as selection into tables.

But with buffering then turned off for those huge cursors as well, selection from a cursor takes about the same time as with the 2^29 buffer (which is over 20 times slower than selection from tables).

Thanks for the ideas. I might try them in other apps on smaller cursors, where it might be helpful.

 
Hi Stella,

Interesting observations. I will have to do some testing on working with large cursors in memory too. If VFP really works better from disk, a RAM disk might be a way to trick VFP into using memory anyway.

Bye, Olaf.
 