
data query speed difference with SET DELETED ON/OFF 1

Status
Not open for further replies.

AlastairP

Technical User
Feb 8, 2011
I have noticed that there is a speed difference with SET DELETED ON.

I have been working on trying to speed up data queries.
The application is running on Windows Server 2016.
I have made a lot of changes to improve speed, such as dividing tables, optimising queries and indexes as required, and setting the codepage and collation.
I have experimented with the location of the database, looked at temp files, and disabled virus scanning in the users' temp folders to test for any differences.

There is definitely a speed difference between my dev computer and the server: the server is slower for data queries.
The server is running RAID 10 with SSD drives and 256 GB RAM.
File copy is blazing fast, and most other operations are very fast.

The speed difference with data queries is noticeable though.

I have experimented with queries, running each test 20 times to get a result:

SET DELETED ON

Dev computer: select all records into cursor: 0.5 seconds
Server: select all records into cursor: 1.6 seconds

SET DELETED OFF

Dev computer: select all records into cursor: 0.002 seconds
Server: select all records into cursor: 0.004 seconds

The speed difference with SET DELETED ON becomes quite noticeable on the server, but with SET DELETED OFF the speed on both computers is smoking.

Comments welcome.


UPDATE:
I noticed that there is no speed difference when I run the test from the command window in the IDE, but there is a difference when I run the compiled EXE.
I compiled the main EXE to bypass all the startup code and only run the speed test.
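For reference, the timing I do looks roughly like this (mytable stands for the real table name):

Code:
Local nI, tStart, tTotal
tTotal = 0
For nI = 1 To 20
    tStart = Seconds()
    Select * From mytable Into Cursor tmpresult
    tTotal = tTotal + (Seconds() - tStart)
    Use In Select('tmpresult')
EndFor
? 'average:', tTotal / 20, 'seconds'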






 
Take a look at what you got:

After the query into cursor do:
Code:
? DBF('cursoralias')

This will very likely not be a .tmp file, as it should be for a real cursor, but the actual DBF file name, won't it?

You don't get a whole file across a LAN in a split second; what you got is just the DBF opened under the alias cursoralias. This is a "cheat" VFP uses to finish the query fast: it didn't copy any data into a cursor, it just opened the DBF with an alias name.

An index on DELETED() can lead to such results, as Rushmore optimization deems a query fully optimizable even though there is a general filter on DELETED() when you SET DELETED ON. The optimization only gets to such low times because it does not read actual data unless you start reading fields or display the cursor in a grid or browse.

Code:
Clear

Create Table normaldbf (id int autoinc, cData c(240), mData M)
Index on id Tag id Candidate
Index on cData tag cData
Use

Create Table deleteddbf (id int autoinc, cData c(240), mData M)
Index on id Tag id Candidate
Index on cData tag cData
Index on Deleted() tag deleted Binary 
Use

tStart = Seconds()
For nI = 1 to 50000
    Insert into normaldbf (cData, mData) values ('hello','world')
EndFor nI
Delete From normaldbf where id between 2 and 9999
? 'writing normaldbf:', Seconds()-tStart
Use
Flush Force

tStart = Seconds()
For nI = 1 to 50000
    Insert into deleteddbf (cData, mData) values ('hello','world')
EndFor nI
Delete From deleteddbf where id between 2 and 9999
? 'writing deleteddbf:', Seconds()-tStart
Use
Flush Force


Set Deleted On
Sys(3054,11)
tStart = Seconds()
Select * from normaldbf into cursor normalcursor
? 'reading normaldbf:', Seconds()-tStart
? Alias(), Dbf()

tStart = Seconds()
Select * from deleteddbf into cursor deletedcursor
? 'reading deleteddbf:', Seconds()-tStart
? Alias(), Dbf()

Set Deleted Off
Select normalcursor
? 'normalcursor, first two ids:'
? id
Skip 1
? id

Select deletedcursor
? 'deletedcursor, first two ids:'
? id
Skip 1
? id

This won't give you valuable experimental data about deleted-index performance. To understand how Rushmore index usage saves network traffic you'd need to query data from DBFs with different percentages of deleted records. From my experience, it only makes a difference if a substantial portion of your data is deleted, which is not realistic and can be further minimized by recycling deleted records.

This type of result is also called a filter cursor, as it's just a virtual filter on the actual DBF file. And that comes with negative side effects: you can't do further queries on this cursor. If you don't need that, it can be totally okay and viable; this is not built into Rushmore to fool us or to get better benchmark results.

But it only works if you don't switch DELETED OFF after the query. That can easily be tested with the sample data, in which I intentionally deleted a portion of records, but not record 1. The cursor normalcursor will show ids 1 and 10000 as the first two ids even when I SET DELETED OFF after the query, but deletedcursor will show ids 1 and 2 instead. That is the consequence of the filter-cursor mechanism failing if you change the DELETED setting after the query, and it shows the danger and potential problem of this mechanism: the optimization was doing nothing and relying on SET DELETED doing the actual restriction of access to deleted records. Add in a fully optimizable where clause and it can still fail on that.

All in all, it means you can only measure real query times when you run queries with the NOFILTER clause, or with READWRITE if you want a writable result cursor.
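Using the sample tables from above, the difference is easy to see (the second ? line describes what I'd expect, check it on your system):

Code:
Select * from deleteddbf into cursor deletedcursor NOFILTER
? Dbf('deletedcursor')  && now a temp file, not deleteddbf.dbf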

If you comply with the (unwritten/undocumented) rule of keeping SET DELETED ON, then a filter cursor can be an advantage, especially because of how BROWSE or a grid control fetches data only as necessary. You might never fetch all the records, as the user never scrolls down to the last undeleted record in the DBF. Even scrolling past data in the middle can mean the browse or grid skips records from actually being read over the network or, in the case of a local DBF, from the file.

But never think the "query" time is the actual data read time: the Rushmore optimization tells the SQL engine not to fetch data but only to open the DBF; the rest is filtering.

Chriss
 
To follow up on what Chris said about SET DELETED and Rushmore optimisation, there is a very informative explanation of this in Hackfox, especially in regard to whether there should be indexes on DELETED().

Go to and scroll down to the section "To Delete or Not to Delete".

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
To add to what HackFox says: a binary index helps reduce the size of the index tag on such binary data, not only for indexes on DELETED() but also for indexes on logical fields, except when those are also allowed to be .NULL. With that, it could again be used on almost every table, be it large or not. But as the book says, test it out in your production environment.
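For example, a binary tag on a logical field could look like this (the field name is made up; BINARY is only allowed on logical expressions that can't evaluate to .NULL.):

Code:
* compact VFP 9 binary tag on a non-nullable logical field
Index On lArchived Tag lArchived Binary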

Take into account that Rushmore builds up bitmaps whose length is determined by the RECCOUNT() of all tables involved in a query and by the number of single expressions in the join and where clauses you combine with AND or OR.

Fully optimizable queries lead to filter cursors, which you might also want to prevent; the major goal of the optimization should be fetching fewer records, ideally only the ones that matter for building the result. You help reduce the usage of the LAN or WAN bottleneck by selecting only the necessary fields instead of *, but it can suffice to index only as far as is necessary to do the heavy lifting of the optimization.

For example, if you store orders and fetch orderitems, the index on orderid in orderitems already limits the records from, say, millions of orderitems to only 10; then it does not matter at all if 2 of those 10 are deleted and are fetched and then discarded. It would be much worse not to have an index on orderid in orderitems at all. You might say that's still 20% of the data, but you have to take into account that reading the tag of the DELETED() index of a large table, even if only partially, also costs network bandwidth and traffic. So loading two records you discard afterward can mean fewer bytes going over the wires than using the index.
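As a sketch of that scenario (table names orders/orderitems and the order id are hypothetical):

Code:
* the tag that does the heavy lifting:
* Index On orderid Tag orderid   && in orderitems.cdx
Select * from orderitems ;
   where orderid = 4711 ;
   into cursor onesorder NOFILTER
* Rushmore uses the orderid tag to narrow millions of rows to ~10;
* checking DELETED() on those few fetched records is cheap.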

Where the break-even is for that depends on many factors.

Chriss
 
A second demo shows where the name "filter cursor" stems from.

If you query a single table with a fully optimizable where clause, the SQL engine also just uses the dbf and sets a filter:
Code:
Clear

Create Table normaldbf (id int autoinc, cData c(240), mData M, iRand int)
Index on id Tag id Candidate
Index on cData tag cData
Index on iRand tag iRand
Use

Create Table deleteddbf (id int autoinc, cData c(240), mData M, iRand int)
Index on id Tag id Candidate
Index on cData tag cData
Index on iRand tag iRand
Index on Deleted() tag deleted Binary 
Use

=Rand(-1)

tStart = Seconds()
For nI = 1 to 50000
    Insert into normaldbf (cData, mData, iRand) values ('hello','world', Rand()*100)
EndFor nI
Delete From normaldbf where id between 2 and 9999
? 'writing normaldbf:', Seconds()-tStart
Use
Flush Force

tStart = Seconds()
For nI = 1 to 50000
    Insert into deleteddbf (cData, mData, iRand) values ('hello','world', Rand()*100)
EndFor nI
Delete From deleteddbf where id between 2 and 9999
? 'writing deleteddbf:', Seconds()-tStart
Use
Flush Force

Set Deleted On
Sys(3054,11)
tStart = Seconds()
Select * from normaldbf where iRand<10 into cursor normalcursor
? 'reading normaldbf:', Seconds()-tStart
? Alias(), Dbf()
? _tally, Reccount()
? 'Filter:', Set("Filter")
tStart = Seconds()
Select * from deleteddbf where iRand<10 into cursor deletedcursor
? 'reading deleteddbf:', Seconds()-tStart
? Alias(), Dbf()
? _tally, Reccount()
? 'Filter:', Set("Filter")
Count To nReccount
? nReccount

When the query extends to joins of other tables, this mechanism obviously cannot work, so it only really matters with single-table queries.

Update: I added a section to display _TALLY vs. RECCOUNT() vs. counting with COUNT TO. You can see that directly after the filter cursor creation (which, I repeat, is just opening the DBF and setting a filter) _TALLY already is the exact count, whereas RECCOUNT() of the deletedcursor alias is the DBF's record count, as it's actually just the DBF.

If you just USE the DBF yourself and SET FILTER on it, you don't have a _TALLY, as _TALLY is only set by SQL queries anyway. But that _TALLY is already the number of records you actually count with COUNT TO, taking the FILTER condition into account. That means Rushmore already determined the record numbers and their count, so it does a bit more than just setting the filter.

Still, I'd say if you want a query result, you don't want a filter cursor: you want the actual records copied into the result cursor's workarea, numbered 1 to _TALLY. The RECNO() values of deletedcursor are actually the record numbers within the DBF, and a SCAN...ENDSCAN of deletedcursor effectively does SCAN FOR NOT DELETED() AND SET('FILTER'), except when you SET DELETED OFF, in which case only the SET('FILTER') condition still limits the scan loop, as I have shown above. So you get a few disadvantages and misbehaviors if you don't expect a filter cursor to be created. It also extends to seeing changes made to the DBF after the query, as the cursor is what it is: just the DBF used again with another alias name and a filter, not the result records copied from the DBF, as a normal query result is in all databases other than VFP, especially database servers. Even if you consider concepts like server-side cursors, they are never mere filters on the actual table.

You can try out a lot of things and see how timings change when you create the DBF on the network. You can also use network monitoring tools that record traffic to see what happens, including the phase of the query itself reading the CDX file of the table to determine the optimization for fetching the result records, or deciding to provide a filter cursor.

Chriss
 
Maybe I should have put one thing first:

If you SET DELETED OFF, your query will not need an index on DELETED(), but you get records marked deleted in your results. That is why you usually need to work with SET DELETED ON despite the negative effect on query time, and to counteract that, an index on DELETED() can help. But as I said, and as the Hackfox book Mike Lewis referenced says: it's to be taken with a pinch of salt, and you should check what fits the actual application and database in production in the users' LAN or WAN.

And to refer back to the topic of filter cursors: if you have SET DELETED OFF, the virtual global filter on NOT DELETED() doesn't exist, so any SELECT * FROM table query is again answered with a filter cursor, which, as you should know by now, is just the DBF used under the given result cursor name, no more, no less. So you still don't deal with an actual data-retrieving query, and the "query" time is deceiving.

What also is true: the filter cursor you get with SET DELETED OFF could be used to retrieve only undeleted data if you then SET DELETED ON, just like in the opposite case. As you don't really have a query result, the setting affects the result, which means you could consider it a trick to SET DELETED OFF, then SELECT * FROM some.dbf, and then SET DELETED ON. But in the end you could also keep DELETED ON and just USE some.dbf instead.
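In code, that trick and its simpler equivalent would look roughly like this (some.dbf as in the text above, crsAll is a made-up alias name):

Code:
* the roundabout way:
Set Deleted Off
Select * from some.dbf into cursor crsAll  && filter cursor = the DBF itself
Set Deleted On  && now the filter cursor hides deleted records

* the straightforward equivalent:
Set Deleted On
Use some.dbf Again Alias crsAll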

Rushmore optimization really pays off in more complex situations, but also in the case of very selective single-table queries you can fetch the few matching records much faster with an index than without.

Chriss
 
A binary tag for DELETED() can seriously reduce the amount of space such a tag takes up, but if only a very few records are deleted (as is the case for most tables, in my experience), then it still may be better not to index on DELETED() and just let whatever records meet the other conditions of a query be checked sequentially for deletion.

Tamar
 
Mike that is a very good article, thank you.

When I, rarely, feel the need for a Deleted() index, I always compound it with something that makes it unique, like the key field.

Code:
index on iif(deleted(),"T","F")+UniqueID tag DelMarker

I used to have to employ tricks like that with Clipper databases, where the index files would sometimes fail if there were many identical keys... long time ago, mind.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.

There is no place like G28 X0 Y0 Z0
 
AlastairP,

you never responded to your own thread and gave feedback on the answers. Maybe you've already taken what you need, but do you clearly understand the implications?
You are not free in the choice of the DELETED setting: if you don't want to access deleted records, the only viable setting is SET DELETED ON.

You can make queries fully optimizable with an index on DELETED(), but you also have to understand the implications of a filter cursor, which in corner cases can go wrong, see above for detail.

To condense all I wrote: a query fully optimized into a filter cursor has...
a) ...the DBF itself open under the alias name given by INTO CURSOR aliasname.
b) ...a _TALLY that is the actual undeleted record count. A small benefit, as that can differ from the RECCOUNT() of the DBF, and the filter cursor has that same RECCOUNT().
c) ...a filter set to the where clause, if you have a where clause. That's what gives the concept the name filter cursor.
And on top of that:
d) the alias can't be queried with further SQL.
e) you can encounter corner cases like access to deleted records if you change back to SET DELETED OFF after the query.
f) the RECNO()s of the result are not simply 1, 2, 3; they are the RECNO()s of the original DBF.
g) you get access to modifications of the DBF made after the query, which is not normal for a SQL result.

The last bullet point sounds like a bonus. But it's not how SQL results should work, and if you want to access the full DBF, or a filtered view of it, you can USE and SET FILTER yourself, too.
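These points can be checked directly in the command window, using the deleteddbf sample table from my earlier posts (aliasname is a placeholder, and I assume the where clause is fully optimizable via the id tag):

Code:
Set Deleted On
Select * from deleteddbf where id > 0 into cursor aliasname
? Dbf('aliasname')    && (a) the DBF file itself, not a temp file
? _Tally, Reccount()  && (b) _TALLY is the undeleted count, RECCOUNT() the DBF's
? Set('Filter')       && (c) the where clause, set as a filter expression
? Recno()             && (f) a record number of the original DBF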

If you think it's worthwhile to keep SET DELETED OFF and instead do SELECT * FROM some.dbf WHERE NOT DELETED() AND whatever other conditions: that only works with single-table queries.

The essence of this is that there's nothing better than SET DELETED ON, which makes DBFs behave like normal tables, with deleted records no longer available, just like any other database handles deletions. And you can easily live with its effects on query performance, deciding case by case per table, and experimenting, whether an index on DELETED() helps Rushmore or not.

Take this thought into consideration: a partially optimized result still knows the series of record numbers it has to read from the DBFs; that's always the actual optimization result, knowing which records to fetch. The DBF file access done by the VFP runtime can always read the deletion-mark byte first to decide to skip a record. So building up a result cursor at worst has to visit some deletion-mark bytes that are true and make VFP skip that record's data. The time lost is just examining the deleted byte of deleted records and the time it takes to get to those byte positions in the DBF files, which is extremely low in the case of SSDs with no moving parts.

I mean, it would be a bummer if VFP still read the full records only to check the deletion mark as an afterthought. So partially optimized queries aren't a showstopper for performance if the reason for partial vs. full optimization is only the global NOT DELETED() condition.

Chriss
 