
Making a join more efficient

Status
Not open for further replies.

azzazzello

Technical User
Jan 25, 2005
297
US
Here are the details.

1) Here are 2 tables

Code:
mysql> show create table globalPnL;
| globalPnL | CREATE TABLE `globalPnL` (
  `date` date NOT NULL,
  `symbol` varchar(32) NOT NULL,
  `account_id` int(16) NOT NULL,
  `currency` varchar(8) NOT NULL default 'usd',
  `pnl` float(13,4) NOT NULL default '0.0000',
  `update_time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`date`,`symbol`,`account_id`),
  KEY `date_acct_symbol_inx` (`date`,`account_id`,`symbol`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Code:
mysql> show create table currency_rates;
| currency_rates | CREATE TABLE `currency_rates` (
  `rate_date` date NOT NULL,
  `rate_currency` char(3) NOT NULL,
  `rate` float(13,4) NOT NULL default '0.0000',
  PRIMARY KEY  (`rate_date`,`rate_currency`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |

2) There are about 150 entries per date in the currency_rates table

3) There are around 300,000-400,000 entries per date in the globalPnL table

When I run the following query, it runs 2-3 times slower than the same query without the join to the currency table. Can someone suggest what may be done to alleviate this?

Code:
SELECT g.date,g.account_id,sum(pnl * (case when c.rate is not NULL then c.rate else 1 end)) as convertedpnl from globalPnL g 
   LEFT JOIN currency_rates c on 
        g.date = c.rate_date and 
        g.currency = c.rate_currency 
where date between 20110401 and 20110419 
group by date,account_id

I realize that creating an index on g.date,g.currency would help, but the table is rather large, this is just part of the select (the other tables being joined have nothing to do with currency), and I am trying to stay away from creating massive indexes.
I suspect that the mismatch between c.rate_currency being a char(3) and g.currency being a varchar(8) might be contributing to this, but it's a guess.
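For reference, the index I am trying to avoid would be something along these lines (the index name is just a placeholder):

Code:
ALTER TABLE globalPnL
  ADD KEY date_currency_inx (`date`, `currency`);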

Also, can someone suggest something other than "explain" that actually breaks the query down into its components and shows how much time each part took after running it?

Thank you!
 
at the risk of annoying you, an EXPLAIN would probably shed a lot of light on the query

i suspect the EXPLAIN will show indexes being used properly
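as for timing the pieces, the session profiler (available since 5.0.37, i believe) breaks a query down into stages after it runs -- something like this:

Code:
SET profiling = 1;
-- run your query here
SHOW PROFILES;
SHOW PROFILE FOR QUERY 1;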

you might also try

SUM(pnl * COALESCE(c.rate,1)) AS convertedpnl
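in context, the whole thing would be something like this (untested):

Code:
SELECT g.date
     , g.account_id
     , SUM(pnl * COALESCE(c.rate,1)) AS convertedpnl
  FROM globalPnL g
LEFT
  JOIN currency_rates c
    ON c.rate_date = g.date
   AND c.rate_currency = g.currency
 WHERE g.date BETWEEN '2011-04-01' AND '2011-04-19'
GROUP
    BY g.date
     , g.account_id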

r937.com | rudy.ca
Buy my new book Simply SQL from Amazon
 
Rudy,

My bad, I actually did mean to post the explain and forgot. Unfortunately, I cannot see the issue from it :/ but here it is

Code:
mysql> explain SELECT g.date,g.account_id,sum(pnl) from globalPnL g where date between 20110401 and 20110419 group by date,account_id;
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type  | possible_keys                | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | g     | range | PRIMARY,date_acct_symbol_inx | PRIMARY | 3       | NULL | 3143880 | Using where; Using temporary; Using filesort |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
1 row in set (0.01 sec)
mysql> SELECT SQL_NO_CACHE g.date,g.account_id,sum(pnl * (case when c.rate is not NULL then c.rate else 1 end)) as convertedpnl from globalPnL g    
    LEFT JOIN currency_rates c on         
        g.date = c.rate_date and         
        g.currency = c.rate_currency 
    where date between 20110401 and 20110419 
    group by date,account_id 
    limit 1;
    ...
    1 row in set (5.79 sec)

And here is the faster query without the join
Code:
    mysql> explain SELECT g.date,g.account_id,sum(pnl) from globalPnL g where date between 20110401 and 20110419 group by date,account_id;
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type  | possible_keys                | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | g     | range | PRIMARY,date_acct_symbol_inx | PRIMARY | 3       | NULL | 3143880 | Using where; Using temporary; Using filesort |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+----------------------------------------------+
1 row in set (0.01 sec)    

    SELECT SQL_NO_CACHE g.date,g.account_id,sum(pnl) from globalPnL g 
    where date between 20110401 and 20110419 
    group by date,account_id 
    limit 1;
    ...
    1 row in set (1.64 sec)
 
that's very weird

in the single-table query, it's using the PRIMARY index, which is date/symbol/account, rather than the secondary KEY, which is date/account/symbol

this despite the fact that the WHERE clause can use either of them, and that the GROUP BY clause should actually be using the secondary KEY

you can see the inefficiency in "Using where; Using temporary; Using filesort" -- it's retrieving the rows with an index, but storing the retrieved rows in a temporary file, in order to sort them for the GROUP BY

i am ~not~ an internals expert, so i'm not sure i can take this any further

does changing the WHERE clause make a difference --

WHERE `date` BETWEEN '2011-04-01' AND '2011-04-19'
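in other words, the full single-table test would be (untested):

Code:
SELECT g.date
     , g.account_id
     , SUM(pnl)
  FROM globalPnL g
 WHERE g.date BETWEEN '2011-04-01' AND '2011-04-19'
GROUP
    BY g.date
     , g.account_id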

 
Rudy,

This is baffling me as well. Here is something even more baffling: when I EXPLICITLY specify an account_id, it still does not use the date_acct_symbol_inx index!!

Code:
mysql> explain SELECT SQL_NO_CACHE g.date,g.account_id,sum(pnl) from globalPnL g where `date` between '20110401' and '20110419' and account_id = 3000 group by date,account_id limit 1 ;
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys                | key     | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+-------------+
|  1 | SIMPLE      | g     | range | PRIMARY,date_acct_symbol_inx | PRIMARY | 3       | NULL | 3143880 | Using where |
+----+-------------+-------+-------+------------------------------+---------+---------+------+---------+-------------+

Code:
mysql> explain SELECT SQL_NO_CACHE g.date,g.account_id,sum(pnl) from globalPnL g use index (date_acct_symbol_inx) where date between 20110401 and 20110419 and account_id = 3000 group by date,account_id limit 1 ;
+----+-------------+-------+-------+----------------------+----------------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys        | key                  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+----------------------+----------------------+---------+------+---------+-------------+
|  1 | SIMPLE      | g     | range | date_acct_symbol_inx | date_acct_symbol_inx | 7       | NULL | 2245376 | Using where |
+----+-------------+-------+-------+----------------------+----------------------+---------+------+---------+-------------+

In fact, when I FORCE it to use that index, it actually takes longer! Twice as long, despite the fact that EXPLAIN shows there are a full million fewer rows to go through!! How can this be? Is the fact that one is PRIMARY affecting the search? I was under the impression all indexes are equal...
 
Could the fact that I am using "between" for date range be the culprit?
 
Hmm. It is in fact date between x and y that is the culprit. When I specify dates explicitly (date in (x,x+1,x+2,...,y)), the right index DOES get used and the query is much faster. 5 times faster.
I am investigating why that is, and whether changing my date column, which is ALWAYS an 8-digit integer, from date to int (or something like that) will improve performance.
 
BTW, to show what I mean, here is what I am doing. Now mysql IS using the right index, and the query is 5 times faster than the best one I had so far. I would REALLY appreciate it if someone could tell me how to store DATE better, make my index more efficient, or construct a better query. It is not an option to expand, say, all dates for 2 years into an "in" clause.

Code:
explain SELECT SQL_NO_CACHE g.date,g.account_id,sum(pnl) from globalPnL g where date in (20110401,20110402,20110403,20110404,20110405,20110406,20110407,20110408,20110409,20110410,20110411,20110412,20110413,20110414,20110415,20110416,20110417,20110418,20110419) 
and account_id = 3000 
group by date,account_id limit 1;
+----+-------------+-------+-------+------------------------------+--------------------+---------+------+-------+-------------+
| id | select_type | table | type  | possible_keys                | key                | key_len | ref  | rows  | Extra       |
+----+-------------+-------+-------+------------------------------+--------------------+---------+------+-------+-------------+
|  1 | SIMPLE      | g     | range | PRIMARY,date_acct_symbol_inx | date_acct_apid_inx | 7       | NULL | 53934 | Using where |
+----+-------------+-------+-------+------------------------------+--------------------+---------+------+-------+-------------+
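One workaround I am considering is a small helper table of dates, so the optimizer sees the range as discrete values without me hand-writing the "in" list. Just a sketch -- the calendar table and its name are made up, and it would need to be populated with every date I care about:

Code:
CREATE TABLE calendar (
  `d` date NOT NULL,
  PRIMARY KEY (`d`)
) ENGINE=InnoDB;

-- after populating calendar with each date:
SELECT g.date, g.account_id, sum(pnl)
  FROM calendar cal
  JOIN globalPnL g ON g.date = cal.d
 WHERE cal.d BETWEEN '2011-04-01' AND '2011-04-19'
   AND g.account_id = 3000
 GROUP BY g.date, g.account_id;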
 
actually, your date column is not (at present) an integer, but an actual DATE

mysql does not store dates internally in any recognizable format

it would be interesting to see whether performance improves if you did use an actual INTEGER column instead

i suggest using UNIX_TIMESTAMP to convert existing dates to their unix epoch equivalents (seconds since 1970-01-01; a signed 32-bit timestamp runs out in early 2038)
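the migration would be roughly this (column and index names made up, untested):

Code:
ALTER TABLE globalPnL
  ADD COLUMN date_int INT UNSIGNED NOT NULL DEFAULT 0;

UPDATE globalPnL
   SET date_int = UNIX_TIMESTAMP(`date`);

ALTER TABLE globalPnL
  ADD KEY date_int_acct_inx (date_int, account_id);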

 