Stella740pl
Programmer
Hello everybody,
While I solved my own problem, I would now love to get an explanation of what actually happened and why, to gain some perspective and knowledge for the future.
Here it is.
The master tables are free tables, one for each year. Each one (for a complete year) has on average 2.5 million records, and their average size is about 95 MB. That’s why speed actually matters. Each is indexed on all fields and then some. Since the current year’s table gets updated only once a month (with all the records for the month at once), and the previous years are usually set in stone, I keep copies of the last few years locally, for faster reporting and for fulfilling ad-hoc data requests (so network speed is usually not an issue). Reference tables are not big, so I use them from the network.
Now, the reporting application (VFP6) I run on them once a month opens the files for the current year and the previous year, and creates a statistical comparison table with many fields, from which a number of reports are then printed. The process is quite long: it runs in loops for different criteria, JOINing reference tables, using subqueries, and UNIONizing the results. The most important fields that get created are average figures for each entity for the last month and for the same month last year, and also average year-to-date figures for the current year and for the same period of the previous year. So, the two big tables located on my hard drive (plus a few small ones located on the network) are open while the report generation program runs.
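To give an idea of the shape of those queries, one simplified step looks something like this (all table, field, and variable names here are made up for illustration; ldMonthStart and ldMonthEnd would be date variables holding the period being averaged):

* one representative averaging step; "curr" stands for the big local table for
* the current year, "refent" for one of the small reference tables on the network
SELECT c.entity_id, r.entity_name, AVG(c.amount) AS avg_lastmonth ;
    FROM curr c ;
    INNER JOIN refent r ON c.entity_id = r.entity_id ;
    WHERE c.ddate BETWEEN ldMonthStart AND ldMonthEnd ;
    GROUP BY c.entity_id, r.entity_name ;
    INTO CURSOR onecriterion

The real program repeats this for each criterion and UNIONs the pieces together, but that is the general pattern.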
Recently I got a request to get rid of the year-to-date figures and create rolling average figures instead. That means that the reports for July 2007, instead of comparing January-to-July 2006 and January-to-July 2007 average columns, should compare August 2005-to-July 2006 and August 2006-to-July 2007 average columns.
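For reference, the boundaries of those rolling twelve-month windows are easy to derive from the report month with GOMONTH(); a minimal sketch, assuming ldReport holds the first day of the report month (July 2007 in the example):

ldReport   = {^2007-07-01}             && first day of the report month
ldCurEnd   = GOMONTH(ldReport, 1) - 1  && 31-Jul-2007: last day of the report month
ldCurStart = GOMONTH(ldReport, -11)    && 01-Aug-2006: start of the current rolling year
ldPrvEnd   = GOMONTH(ldCurEnd, -12)    && 31-Jul-2006: same window shifted back a year
ldPrvStart = GOMONTH(ldCurStart, -12)  && 01-Aug-2005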
I decided that I should open the tables for the last three years and create two cursors holding, say, the August 2005-to-July 2006 and the August 2006-to-July 2007 records, respectively. Each of them would be about the size of an original one-year table. I created them with the same aliases that the two source tables were previously using, so from that point on it was all the same process as before, and, I expected, it might even be faster, since it would be running from cursors.
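Roughly what I did was this (again a sketch with placeholder names: master2005/2006/2007 stand for the yearly tables, and the cursor aliases are just illustrative). Since each yearly table holds only that year's records, a date filter plus a UNION of two adjacent years gives one rolling year:

* rolling year ending with the report month, drawn from the 2006 and 2007 tables
SELECT * FROM master2006 WHERE ddate >= ldCurStart ;
    UNION ALL ;
    SELECT * FROM master2007 WHERE ddate <= ldCurEnd ;
    INTO CURSOR rollcur

* the same window shifted back twelve months, from the 2005 and 2006 tables
SELECT * FROM master2005 WHERE ddate >= ldPrvStart ;
    UNION ALL ;
    SELECT * FROM master2006 WHERE ddate <= ldPrvEnd ;
    INTO CURSOR rollprv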
Apparently not so.
Actually, it started to run about 10 times slower (I didn’t do many timing tests, but instead of the usual 20 to 40 minutes it now took some 5 hours). I found that creating the cursors themselves was not the problem; the problem was somewhere after that, in running multiple complex SELECT statements from those cursors.
After some time spent trying to figure out what I did wrong, I decided to try creating actual tables in the temporary folder instead of cursors, and – oh joy! – it started to run in a time comparable to the previous process. (After I added indexes on them it became even faster, of course, but that was nothing compared to the effect that creating tables instead of cursors had.)
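The fix was essentially the same statement with a different destination, something along these lines (again a sketch with the same placeholder names; SYS(2023) returns the temporary files folder):

lcRollCur = SYS(2023) + "\rollcur"     && e.g. c:\temp\rollcur.dbf
SELECT * FROM master2006 WHERE ddate >= ldCurStart ;
    UNION ALL ;
    SELECT * FROM master2007 WHERE ddate <= ldCurEnd ;
    INTO TABLE (lcRollCur)
* the new table is left open and selected, so it can be indexed right away
INDEX ON entity_id TAG entity
INDEX ON ddate TAG ddate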
Now, I was always under the impression that cursors were supposed to be faster than tables. I had to revisit that notion.
I searched for where it actually says so, but didn’t find much, except for a few threads briefly mentioning it. For example, in thread1252-1267141, TamarGranor notes (17-Aug-06 17:09): “Do you need to be using a table as the result of that query? could you use a cursor instead? That should speed things up.”
Or this discussion on Foxite, about what’s faster: SELECT INTO CURSOR or SELECT INTO TABLE. While the results don’t seem conclusive to me, most people there say they always prefer a cursor anyway.
What I didn’t find is an answer to which is faster to run queries from, a cursor or a table, especially when a big body of data is involved.
Was I wrong about the “cursor is faster” assumption? If it depends, then on what? Can someone please explain what was going on here?
Thanks.
Stella.