
Slow query when joining 2 tables

Status
Not open for further replies.

Dennis76

Technical User
Jul 17, 2016
6
NL
Hello,

I have 2 tables: invoices (1M records) and invoice_details (5M records)

I want to select all invoices which contain artiklenr "1234":
SELECT customernr from invoice_details d left join invoices i on i.invoicenr = d.invoicenr where artnr = 1234

This query takes about 3 seconds on a remote computer to fetch 3000 results.
There is an index on invoicenr at both tables. Also artnr is indexed.

If I query only invoice_details (SELECT * from invoice_details where artnr = 1234) then it takes about 0.1 sec.
I tried different joins (left,right, etc...) but nothing helped.

I also copied both tables to a MySQL database server and did exactly the same query. There it takes about 0.0003s for the results.

Maybe someone has an idea how to speed things up.

Dennis
 
You wouldn't want a left join, because you don't want to query all 5M details even when there is no match, do you?
Well, you filter details by artnr, and an invoice detail should always have an invoice head record, so it may not matter much, unless Rushmore doesn't really kick in.

To which table does artnr belong? invoice details?
Does rushmore make use of the indexes?

Code:
LOCAL cmemvar
=SYS(3054,11,"cmemvar")
SELECT customernr from invoice_details d inner join invoices i on i.invoicenr = d.invoicenr where artnr = 1234
? cmemvar
_cliptext = cmemvar

After you do this you'll have the result rushmore shows in cliptext, so you may simply paste it here.

Bye, Olaf.
 
This query takes about 3 seconds on a remote computer to fetch 3000 results.

What exactly do you mean by this? The location of the data in relation to the computer executing the query plays a role. Whether you do that via TeamViewer/Remote Desktop or anything like that won't play much of a role, unless there is a long lag before you see the result. Measure t0 = SECONDS() before, t1 = SECONDS() after, and ? t1-t0.
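A minimal timing sketch along those lines, using the query from the original post (the aliases and the literal artnr are the poster's own):

```foxpro
* Measure pure query time, independent of any remote-desktop display lag
t0 = SECONDS()
SELECT customernr FROM invoice_details d ;
	INNER JOIN invoices i ON i.invoicenr = d.invoicenr ;
	WHERE artnr = 1234 ;
	INTO CURSOR curResult
t1 = SECONDS()
? t1 - t0  && elapsed seconds for the query itself
```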

VFP, of course, like any database, works best with data stored locally. A slow LAN, or even worse a WAN, slows down everything. The MySQL server will run on, and have the data on, the same machine, and then only serve the (hopefully few) result rows to the caller. That always was, is, and will be the advantage of DB servers over using DBFs in a LAN or, worse, a WAN.

Bye, Olaf.
 
I agree with Olaf. If your article number is in the details table, then there is no point in doing a left outer join. Doing a (default) inner join should speed things up.

In fact, regardless of the fact that you are looking for an article number, a left outer join probably doesn't make much sense for an invoice table. The only reason to use it is if you want to include invoice headers that have no detail lines, which I imagine will never apply.

For another possibility, you could try this:

Code:
SELECT * FROM Invoices ;
  WHERE InvoiceNumber IN (SELECT InvoiceNumber FROM InvoiceDetails WHERE Artiklenr = 1234)

There is no way of knowing in advance whether that will be faster than a join (it might well be slower), but it is worth giving it a try.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Thanks for the answers.

@OlafDoschke:
You are right that a left join is not what I want, but an inner join was about 1 second slower.
The query is executed directly in foxpro and the time is what foxpro presents to me: "xxxx records selected in x.xx seconds".

Here is the output you asked for:

Using index tag Artnr to rushmore optimize table invoice_details
Rushmore optimization level for table invoice_details: full
Rushmore optimization level for table invoices: full
Joining table invoice_details and table invoices using index tag Invoicenr

The database files are on a server and I am connected with 1GbE. If I copy them to my local computer the query takes about 0.7s (inner join) and 0.08 (left join).


@MikeLewis:
I already did this in my test and this was horrible... 28 seconds.

I think there is no solution to make this faster.
 
What data is in each record? Do you perhaps have large memo field values?
Rushmore has already done full optimization, but actually that means you also have an index on DELETED(), something I'd NOT recommend anymore.
With huge amounts of data, any index on binary-type data (booleans) slows things down rather than helping optimization. Also, your share of deleted rows should be rather low anyway.
Try deleting the indexes on DELETED().
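Dropping such a tag is a one-liner; a sketch, assuming the tag is actually named "deleted" (check the existing tag names first, e.g. via TAG() or the Table Designer):

```foxpro
USE invoice_details EXCLUSIVE  && exclusive access is required to change indexes
FOR i = 1 TO 254
	IF EMPTY(TAG(i))
		EXIT
	ENDIF
	? TAG(i)  && list the existing tag names first
ENDFOR
DELETE TAG deleted  && assumed name of the tag on DELETED()
USE
```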

How much other traffic is on the server? Your bandwidth doesn't tell you how much the server can really serve. If you do too much on one server, you also get problems with DBFs. Besides, is opportunistic locking in effect?

Bye, Olaf.
 
Ah, and what are your foreground/background memory settings? With large tables your rushmore bitmaps get large.
? SYS(3050,1)
? SYS(3050,2)

Bye, Olaf.
 
Because both tables have a large number of fields, I created 2 new tables with only the 2 necessary fields. I then filled them with the data from the original tables. It is a bit faster, but not by more than 10%.

SYS(3054,11,"cmemvar"):
Using index tag Artnr to rushmore optimize table d
Rushmore optimization level for table d: partial
Rushmore optimization level for table i: none
Joining table d and table i using index tag Invoicenr

? SYS(3050,1): 728498176
? SYS(3050,2): 182190080

At the moment there is no traffic on the server and the database files are on separate SSD's.

I don't know if OpLocks are enabled or disabled. I know we had problems with corrupt indexes with server 2008, and did something with oplocks. But after we installed W2008R2 I think the problem was gone.
I checked the registry on the server and cannot find anything about Oplocks.
Should I set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\EnableOplocks to 0 ?

Thanks a lot for helping.
 
You have lots of memory, maybe even too much. Well, you actually can have too much, especially if it's not really physically available due to other processes running.
The main point will be deleting the index on DELETED().

You can easily calculate when that rather slows things down:
First, the Rushmore bitmap for optimizing the implicit AND DELETED()=.F. needs one bit per row, so the larger a table gets, the more memory is needed per single WHERE or join condition. For 5M rows that's 5M/8 bytes, roughly 625 KB. Set your foreground/background buffers to about 25MB perhaps, not >600MB.

Building those bitmaps requires the index, so it is loaded from the CDX. Since Rushmore doesn't use a "real" binary index, you have a normal index tag on DELETED(), which bloats quite a lot. Even if only the part of the index tree for the DELETED()=.F. rows is loaded, that means loading most of that index if most data is undeleted. That can easily mean loading 5MB, even if the matching data itself is only a few rows. That's why optimization of the DELETED() clause is a bad idea for larger tables.

As the condition on a certain artnr is already very restrictive, the condition that data must be non-deleted can also be checked after the rows are loaded. In the worst case you load 10 more rows that turn out to be unneeded, but that's far less than 5MB.
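Adjusting the buffers to the suggested ~25MB can be done with SYS(3050); a sketch:

```foxpro
* Set foreground (1) and background (2) memory buffers to roughly 25 MB.
* VFP rounds the value; pass 0 to reset to the default.
=SYS(3050, 1, 25000000)
=SYS(3050, 2, 25000000)
? SYS(3050, 1), SYS(3050, 2)  && show the values actually in effect
```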

In regard to opLocks: handling them correctly is a bit more complicated than just changing one registry key. Especially W2008R2 has a problem with that: it is the first version with SMB3, which doesn't allow opLocks to be turned off without first reverting to SMB2, so that regkey does nothing for you.

Bye, Olaf.
 
I don't have an index on DELETED().
I only have two tables with each 2 fields:
table Invoices: fields: invoicenr, customernr Index: invoicenr
table invoice_details: fields: invoicenr, artnr Index: invoicenr, artnr

I tried to reduce the foreground/background, but still the same.

 
What are the types of the fields of both tables?

What is the expression of the artnr index?
 
OK, your new table doesn't have that, but your old one should, otherwise the Rushmore optimization level would never be full, unless you had SET DELETED OFF.
Indeed DELETED OFF is the default, so that's okay.

Atlopes has a good question; the types could also matter, especially if they are not simply integers but char fields and differ in length.

How long does this run with some existing artnr and with some not existing: [tt]SELECT recno() FROM Invoice_Details WHERE artnr = X[/tt]

What is your VFP version?

And since you said the data is on SSDs, what are their specs, especially for random access? IOPS matters more than throughput. And good IOPS with bad latency per IO operation still means slow access compared to reading/streaming large files.
With modern drives, the AS SSD Benchmark typically measures better IOPS for 4K blocks, while DBF and CDX files are organised in 512B blocks. From that perspective, DBFs have a bad legacy structure of small blocks, and there's nothing you can do about that.

Bye, Olaf.
 
I use VFP 9 SP2.
Field invoicenr is char(9) in both tables. I didn't know that it would matter. In my test database I changed it to Integer, and the query, inner join as well as left join, now takes about 1 second. So it is a lot faster.
But unfortunately I cannot change the field types in the real database. The invoice number has a letter in front of it to represent the kind of invoice.

The expressions of the indexes are always the same as the field names.

testdatabase with artnr as char:
SELECT recno() FROM Invoice_Details WHERE artnr = 'A 38014'
6000 records -> 3 sec
SELECT recno() FROM Invoice_Details WHERE artnr = 'A 00000'
0 records -> 0.02 sec

testdatabase with artnr as integer:
SELECT recno() FROM Invoice_Details WHERE artnr = 38014
6000 records -> 0.7 sec
SELECT recno() FROM Invoice_Details WHERE artnr = 0
0 records -> 0.0 sec

Maybe a little information about what my goal is. The program we are using is written in VFP, and we use it on a terminal server. So everybody uses the files locally, which is fast enough.
But we also have PHP scripts that use data from that database. These scripts run on a different computer. Most of the queries are fast, but this is the only one we have problems with. Because the scripts don't need the information in real time, we copy the new records every 30 minutes to a MySQL database and use that one for the query.
But now we have a new script where we need the information faster. I could copy the new records every minute, but it would be great to use the real database.
 
Ints are 4 bytes, 9 chars are 9 bytes, so int keys obviously make for smaller and more efficient indexes.

Since you query from the detail to the head table, the 6000 rows you find mean 6000 lookups in the 1M rows; that's not that bad with any type of field. Are they indexed with MACHINE collation? They most probably are, as Rushmore said it is using the invoicenr index to join. With other collations, every char of a string takes 2 bytes in the index, blowing up index keys to 18 bytes. Integers don't depend on any collation.
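You can check the collation of a tag with IDXCOLLATE(), and recreate the tag with MACHINE collation if needed; a sketch, assuming the tag is named invoicenr:

```foxpro
USE invoices EXCLUSIVE
? IDXCOLLATE("invoicenr")  && shows e.g. "MACHINE" or "GENERAL"
IF IDXCOLLATE("invoicenr") != "MACHINE"
	SET COLLATE TO "MACHINE"
	DELETE TAG invoicenr
	INDEX ON invoicenr TAG invoicenr  && rebuilt with MACHINE collation
ENDIF
USE
```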

Is your data in a DBC, or are these free tables? How old is the database? Did you ever rebuild the tables with VFP9, or were they inherited from an older VFP version? Your new table should be VFP9-made, so it shouldn't matter much, unless you took a copy and only removed unneeded columns.

Does a REINDEX help for a moment?

Bye, Olaf.


 
Does this code serve as a demonstration of the issue you're facing? The highlighted query on local data, returning approximately 5.7k records, takes 0.05 seconds and is fully optimizable.

Code:
SET DEFAULT TO (GETDIR())

CREATE TABLE invoices (invoicenr C(9), customernr I)
INDEX ON invoicenr TAG invoicenr
CREATE TABLE  invoice_details (invoicenr C(9), artnr C(7))
INDEX ON invoicenr TAG invoicenr
INDEX ON artnr TAG artnr

LOCAL LoopInvoice AS Integer
LOCAL LoopDetails AS Integer

FOR m.LoopInvoice = 1 TO 1000000
	INSERT INTO invoices (invoicenr, customernr) VALUES ('A' + STR(m.LoopInvoice,8,0), INT(RAND() * 65000))
	FOR m.LoopDetails = 1 TO INT(RAND()*10) + 1
		INSERT INTO invoice_details (invoicenr, artnr) VALUES ('A' + STR(m.LoopInvoice,8,0), "A" + STR(INT(RAND() * 999),6,0))
	ENDFOR
ENDFOR

CLEAR

SYS(3054,0)

SELECT TOP 1 artnr, COUNT(*) FROM invoice_details GROUP BY artnr ORDER BY 2 DESC INTO CURSOR curArtnr

LOCAL Artnr AS Integer

m.Artnr = curArtnr.artnr

USE IN invoices
USE IN invoice_details

LOCAL RushmoreReport

SYS(3054,11,"RushmoreReport")

[highlight #FCE94F]SELECT customernr FROM invoices INNER JOIN invoice_details ON invoices.invoicenr = invoice_details.invoicenr WHERE artnr = ?m.Artnr INTO CURSOR curTest[/highlight]

? m.RushmoreReport
 
How are SET EXACT and SET ANSI set? I vaguely remember there being an issue of a slowdown when they're different.

Tamar
 
SET ANSI determines how strings are compared, which is essential for a join on a string column.

In general there's nothing wrong with using char fields for ids; I also use and recommend GUIDs, which in VFP terms are strings, even longer than char(9). In your situation, with that much data, it would pay to add a table dedicated to answering the question "who ordered some article?" with extra data, instead of going through two large tables. Actually you'd need to aggregate the customernr, wouldn't you?

I'd take atlopes' nice test code, put the DBFs on a 1Gbit connected drive, and see how that performs. If you can't get it any better, you can create an extra table with customernr/artnr and add to it at each order/invoice, only adding a record if the combination does not yet exist. It'll obviously be much faster to get the result from there.

Code:
SET DEFAULT TO (GETDIR())
Set Deleted Off
Set Ansi Off
Set Exact Off

CREATE TABLE invoices (invoicenr C(9), customernr I)
INDEX ON invoicenr TAG invoicenr
CREATE TABLE invoice_details (invoicenr C(9), artnr C(7))
INDEX ON invoicenr TAG invoicenr
INDEX ON artnr TAG artnr
CREATE TABLE article_customers (artnr C(7), customernr I)
Index on artnr+BinToC(customernr) TAG artcus

LOCAL LoopInvoice AS Integer
LOCAL LoopDetails AS Integer

FOR m.LoopInvoice = 1 TO 1000000
	INSERT INTO invoices (invoicenr, customernr) VALUES ('A' + STR(m.LoopInvoice,8,0), INT(RAND() * 65000))
	FOR m.LoopDetails = 1 TO INT(RAND()*10) + 1
		INSERT INTO invoice_details (invoicenr, artnr) VALUES ('A' + STR(m.LoopInvoice,8,0), "A" + STR(INT(RAND() * 999),6,0))
		
		If !Indexseek(invoice_details.artnr+BinToC(invoices.customernr),.f.,"article_customers","artcus") 
		   Insert into article_customers values (invoice_details.artnr, invoices.customernr)
		Endif
	ENDFOR
ENDFOR

CLEAR

SYS(3054,0)
SELECT TOP 1 artnr, COUNT(*) FROM invoice_details GROUP BY artnr ORDER BY 2 DESC INTO CURSOR curArtnr

LOCAL Artnr AS Integer

m.Artnr = curArtnr.artnr
USE IN Select("invoices")
USE IN Select("invoice_details")
USE IN Select("article_customers")

LOCAL RushmoreReport
SYS(3054,11,"RushmoreReport")

t0 = Seconds()
SELECT customernr FROM invoices INNER JOIN invoice_details ON invoices.invoicenr = invoice_details.invoicenr WHERE artnr = ?m.Artnr INTO CURSOR curTest
t1 = Seconds()
? t1-t0
? m.RushmoreReport 

t0 = Seconds()
SELECT customernr FROM article_customers WHERE artnr+BinToC(customernr) = ?m.Artnr INTO CURSOR curTest2
t1 = Seconds()
? t1-t0
? m.RushmoreReport

Bye, Olaf.
 
Besides, it's quite easy to make any query with a low result volume on large datasets run locally by creating a kind of SQL server yourself:

Code:
LPARAMETERS tcQuery, tcResultFile
&tcQuery INTO CURSOR curResult
COPY TO (tcResultFile) && or "upload" to a MySQL table

You'd let the web server request the query and have this create a result file. As this is a separate machine, you'd need to do this in a remote fashion, e.g. using VFPy to make it accessible via an HTTP request.

Bye, Olaf.

PS: or of course [tt]&tcQuery INTO TABLE (tcResultFile)[/tt], but you'd rather want VFP to send this data to the MySQL database so the web server can query it from there. It may turn out that the overall time is the same anyway; the transfer of the data may take the larger part of the query time. You'll only see if you try.
 
Thanks for all the answers and tips. I was very busy yesterday. Today the weather is so nice that I will take the day off to go wakeboarding.
I will test everything and come back.

Olaf, I already use ActiveVFP for serving data to our Android devices, so that would also be an option.
And of course another table with only the data needed is also possible. The tables are in a DBC, but I don't have the source code of the program, so I can only access the database. Is it possible to use a stored procedure for this action?

Have a nice Day,
Dennis
 
Have a nice day!

A stored procedure is not a solution, as VFP does not in itself work as a server. The ODBC driver or the OLE DB provider you use for retrieving data would also pull the stored procedure source and/or object code and execute it on the calling machine, so the query would still not run where the data is.

Use of ActiveVFP would rather be a solution if it runs on the DBC server. You'd create an ActiveVFP script doing the query and serving the result somehow, maybe returning it as XML.

We already know the local execution is very fast. The question is whether the time needed to transfer the result in any form will make this faster overall than querying directly from the web server side.

Bye, Olaf.
 