How to index (or just speed up) a CONCAT of columns in two joined tables

OsakaWebbie
I have a structure like this:
Code:
CREATE TABLE person (
  PersonID int(11) unsigned NOT NULL auto_increment,
  FullName varchar(100) NOT NULL default '',
  HouseholdID int(11) unsigned NOT NULL default '0',
  (...other fields...)
  PRIMARY KEY  (PersonID),
  KEY Furigana (Furigana),
  KEY FullName (FullName),
  KEY Email (Email),
  KEY Organization (Organization,Furigana)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

CREATE TABLE household (
  HouseholdID int(11) unsigned NOT NULL auto_increment,
  NonJapan tinyint(1) NOT NULL default '0',
  PostalCode varchar(8) NOT NULL default '',
  Address varchar(200) NOT NULL default '',
  RomajiAddress varchar(200) NOT NULL default '',
  (...other fields...)
  PRIMARY KEY  (HouseholdID)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

CREATE TABLE postalcode (
  PostalCode varchar(8) character set ascii collate ascii_bin NOT NULL default '',
  Prefecture varchar(12) NOT NULL default '',
  ShiKuCho varchar(54) NOT NULL default '',
  Romaji varchar(200) NOT NULL default '',
  PRIMARY KEY  (PostalCode)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Queries for people info almost always join these three tables thus:
Code:
SELECT yada, yada FROM person
  LEFT JOIN household h ON person.HouseholdID=h.HouseholdID
  LEFT JOIN postalcode pc ON h.PostalCode=pc.PostalCode
If a person is in Japan, their address is not just the Address field, but:
Code:
CONCAT(h.PostalCode,pc.Prefecture,pc.ShiKuCho,h.Address)
And the romanized writing of their address is:
Code:
CONCAT(h.RomajiAddress,' ',pc.Romaji,' ',h.PostalCode)
There are lots of other tables that hang off PersonID and are often involved in queries as well, plus one main search page that combines data from almost all the tables and helps build a variety of WHERE criteria.

Well, I learned today that multiple users using that main search to look for people based on some piece of text being in their address ("WHERE CONCAT(h.PostalCode,pc.Prefecture,pc.ShiKuCho,h.Address) LIKE '%yadayada%'") can cripple the server. I got to thinking about it and realized that I should try to figure out how to get an index on that constructed text. I assumed I could make a view that includes those full addresses (which I should have done a long time ago anyway, to simplify my code) and then put an index on them. But apparently views can't have indexes. I could put indexes on the individual fields, but since I always search and/or retrieve them in the concatenated form, would that even help? (Sorry, I don't know how to do performance tests when I'm using a remote VPS.) Does someone have a suggestion as to what I can do to make this kind of query less of a CPU drain?
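The view I was imagining would have been something like this (untested, and the view/column names are just placeholders):
Code:
CREATE VIEW personaddress AS
SELECT person.PersonID,
       CONCAT(h.PostalCode,pc.Prefecture,pc.ShiKuCho,h.Address) AS FullAddress,
       CONCAT(h.RomajiAddress,' ',pc.Romaji,' ',h.PostalCode) AS RomajiFullAddress
  FROM person
  LEFT JOIN household h ON person.HouseholdID=h.HouseholdID
  LEFT JOIN postalcode pc ON h.PostalCode=pc.PostalCode;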
 
I had never heard of full-text search, but I just read about it, and I don't understand how it would apply. The text to be searched is not human speech but Japanese addresses, so the idea of "natural language" is irrelevant and even detrimental. For example, the docs say, "In addition, words that are present in 50% or more of the rows are considered common and do not match." I don't know what it would consider a "word" in a Japanese address that has no spaces, but it is entirely likely that the prefecture name that is local to the user could occur in over 50% of the addresses, and I would definitely not want to prevent them from searching on it. Plus, I see nothing in the docs about combining multiple fields, especially across tables (I want to be able to search a string that is constructed from postalcode.Prefecture, postalcode.ShiKuCho, and household.Address put together as one string).

I'll give an example - I'll romanize the text to make it easier for you to read, but the real text is Japanese (UTF-8, in case that matters). Suppose I have a household record with:
[ul]
[li]PostalCode='540-0004', Address='2-26-47'[/li]
[/ul]
And a postalcode record with:
[ul]
[li]PostalCode='540-0004', Prefecture='Oosakafu', ShiKuCho='OosakashiChuuoukuTamatsukuri'[/li]
[/ul]
(The text in Japanese is far shorter - that ShiKuCho value is only 8 characters in Japanese.) If a user searches for "Tamatsukuri2", I want them to find this record (and any others that are in section #2 of the community called Tamatsukuri). Note that "Tamatsukuri" is in the postalcode table and the "2" is in the household table.
 
Hi

OsakaWebbie said:
In addition, words that are present in 50% or more of the rows are considered common and do not match.
That applies to natural language search. You could use boolean search instead. Boolean mode also has the benefit of allowing [tt]match()[/tt] against fields from multiple tables. However, partial matches must be at the beginning of fields.
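Something along these lines is what I mean ( untested; the index names are made up, and I do not know how well the built-in parser tokenizes Japanese text ) :
Code:
ALTER TABLE postalcode ADD FULLTEXT INDEX ft_addr (Prefecture,ShiKuCho);
ALTER TABLE household ADD FULLTEXT INDEX ft_addr (Address);

SELECT person.PersonID FROM person
  LEFT JOIN household ON person.HouseholdID=household.HouseholdID
  LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode
  WHERE MATCH(postalcode.Prefecture,postalcode.ShiKuCho) AGAINST('+Tamatsukuri*' IN BOOLEAN MODE)
     OR MATCH(household.Address) AGAINST('+Tamatsukuri*' IN BOOLEAN MODE);
Note each [tt]match()[/tt] can only name columns from one FULLTEXT index, so the two tables get separate [tt]match()[/tt] calls combined with [tt]or[/tt].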

That separatorless format may be a problem, but I would still give it a try.

Sorry, I can not test it; I have almost nothing with UTF-8 support here.

Feherke.
 
Feherke said:
However partial matches must be at the beginning of fields.
Hmm, that's a showstopper - I want to allow a search for any portion the user wants.

Related question: As I try to test things, is there a way to see how much CPU time the query took, or something else that can give me a relative comparison? I don't normally use "explain" because I don't really understand how to interpret it, but could that be of assistance in this case? Just timing a search from my browser isn't helpful, as it's on a live server, and internet delays are also possible - too many variables. By the way, it's a VPS and I can SSH as root, so if there is something I can do at the command line that would include some sort of reporting of the query time, let me know.
 
Hi

Feherke said:
However partial matches must be at the beginning of fields.
Sorry, wrong word. That should be words, not fields. But I guess it makes no difference here.

I prefer [tt]time[/tt] for performance tests :
Code:
[blue]master #[/blue] time mysql database < script.sql > /dev/null

real    0m0.109s
user    0m0.005s
sys     0m0.005s
But for better precision there are some additional things :
[ul]
[li]You should execute the SQL statements multiple times and calculate the average run time.[/li]
[li]You should discard the first run as it may take longer. ( On subsequent runs some of the data will already be in memory. )[/li]
[li]You should use [tt]optimize table[/tt] before/between [tt]update[/tt] commands. ( Not 100% sure, I never measured [tt]update[/tt] on MySQL. But on PostgreSQL you certainly have to use [tt]vacuum[/tt] in such cases. )[/li]
[/ul]
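If you prefer to stay inside the [tt]mysql[/tt] client, the query profiler may help too ( assuming MySQL 5.0.37 or newer; it reports server-side times only ) :
Code:
SET profiling = 1;

-- run the statement to be measured, then :
SHOW PROFILES;              -- recent statements with their durations
SHOW PROFILE FOR QUERY 1;   -- per-stage breakdown of one statement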

Feherke.
 
Okay, I did various tests, starting with a real query that my PHP interface constructed. The way the code is written now, even if you ask for only a list of IDs of the results (for other processing) instead of a bunch of information, my code leaves a couple extra joins in place that are technically unneeded. I can see now that one thing I should do is make those conditional (it's not as easy as it sounds because of other factors, but I'll manage). And yeah, you're definitely right that data goes into memory, making the second identical query really fast. But for my purposes, the first query is critical - a user will probably only ask for it once.

First I ran it with no indexes on the underlying columns:
Code:
# time mysql kizuna_crashdonors -e "SELECT person.PersonID FROM person LEFT JOIN household ON person.HouseholdID=household.HouseholdID LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode LEFT JOIN percat ON person.PersonID=percat.PersonID LEFT JOIN category ON percat.CategoryID=category.CategoryID WHERE (CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%京都府%' ) GROUP BY person.PersonID ORDER BY Furigana" > /dev/null

real    3m48.865s
user    0m0.004s
sys     0m0.006s
Then I added indexes on household.Address and on the combination of postalcode.Prefecture and postalcode.ShiKuCho. It seemed to make no difference:
Code:
# time mysql kizuna_crashdonors -e "SELECT person.PersonID FROM person LEFT JOIN household ON person.HouseholdID=household.HouseholdID LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode LEFT JOIN percat ON person.PersonID=percat.PersonID LEFT JOIN category ON percat.CategoryID=category.CategoryID WHERE (CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%京都府%' ) GROUP BY person.PersonID ORDER BY Furigana" > /dev/null

real    3m52.346s
user    0m0.003s
sys     0m0.005s
Then I removed the joins and "order by" that are unnecessary for an ID-only query:
Code:
# time mysql kizuna_crashdonors -e "SELECT person.PersonID FROM person LEFT JOIN household ON person.HouseholdID=household.HouseholdID LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode WHERE (CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%京都府%' ) GROUP BY person.PersonID" > /dev/null

real    1m7.423s
user    0m0.003s
sys     0m0.005s
A big improvement, but still too slow. There are 6503 person records, 10,100 household records (there really shouldn't be more households than people, so apparently some cleanup needs to happen), and 4002 postalcode records - that shouldn't be too much for MySQL, should it?

Then I repeated the same two queries with no change to the database structure, and it clearly used memory cache only:
Code:
# time mysql kizuna_crashdonors -e "SELECT person.PersonID FROM person LEFT JOIN household ON person.HouseholdID=household.HouseholdID LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode WHERE (CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%京都府%' ) GROUP BY person.PersonID" > /dev/null

real    0m0.010s
user    0m0.002s
sys     0m0.007s
# time mysql kizuna_crashdonors -e "SELECT person.PersonID FROM person LEFT JOIN household ON person.HouseholdID=household.HouseholdID LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode LEFT JOIN percat ON person.PersonID=percat.PersonID LEFT JOIN category ON percat.CategoryID=category.CategoryID WHERE (CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%京都府%' ) GROUP BY person.PersonID ORDER BY Furigana" > /dev/null

real    0m0.011s
user    0m0.004s
sys     0m0.007s
But like I said, repeating the same query is not very relevant, because users normally won't do that.

The shell was patient enough to wait for a four-minute query to finish, but even if the users were that patient (which they probably wouldn't be), PHP gives up and delivers just an Error 500 to the browser (and the query is probably still running, so if the user tries again he just compounds the problem!).

After that I did some other experiments to try to determine in more detail what MySQL doesn't like. Starting with the query that took 1:07, I changed only the WHERE clause to try concatenations with different tables involved. Nothing gave a medium-length response - it was either less than 0.1 seconds or more than 20 seconds (at which point I would interrupt it):
[ul]
[li]CONCAT(person.FullName,household.LabelName) LIKE '%教会%' = quick[/li]
[li]CONCAT(postalcode.ShiKuCho,household.LabelName) LIKE '%教会%' = slow[/li]
[li]CONCAT(postalcode.Prefecture,postalcode.ShiKuCho) LIKE '%大阪市%' = quick[/li]
[li]CONCAT(postalcode.ShiKuCho,postalcode.PostalCode) LIKE '%横浜市%' = quick (I tried this one because I thought the previous one might have used my new index)[/li]
[li]CONCAT(postalcode.ShiKuCho,household.PostalCode) LIKE '%横浜市%' = slow[/li]
[li]CONCAT(household.Phone,household.PostalCode) LIKE '%540%' = quick[/li]
[li]CONCAT(household.Phone,postalcode.PostalCode) LIKE '%540%' = slow[/li]
[/ul]
Then I started using fields that are just ASCII instead of multibyte, and the slow cases got somewhat faster:
[ul][li]CONCAT(person.CellPhone,postalcode.PostalCode) LIKE '%540%' = 10 sec[/li]
[li]CONCAT(person.CellPhone,postalcode.PostalCode) LIKE '%530%' = 15 sec (I did this just to see if the first one was a fluke)[/li]
[li]CONCAT(person.CellPhone,household.PostalCode) LIKE '%530%' = 0.1 sec (notice the difference in table name)[/li]
[li]CONCAT(household.Phone,postalcode.PostalCode) LIKE '%540%' = 44 sec[/li][/ul]
So it's happy with concats involving any one table or a combination of person and household, but if I concat anything in postalcode with one of the other tables it gets slower, and household+postalcode is the worst. Since my tables are joined person->household->postalcode, I suppose that makes sense.
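A back-of-the-envelope guess at why (I haven't confirmed this with EXPLAIN): if the join between household and postalcode can't use an index for some reason, MySQL would have to test every combination - on the order of 10,100 x 4,002 ≈ 40 million CONCAT/LIKE evaluations - which would go a long way toward explaining minutes instead of milliseconds.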

I need to solve this, but I don't want to make my tables less normalized - this particular client of this database structure is not using the normalization very well, but others really do have multiple households in the same postalcode and multiple people in the same household (families), and there are other reasons besides data volume for keeping them separate (data consistency and ease of entry).

I had really hoped that a view would be the answer, but although queries of the view look nice and pretty (I might keep the view idea just to clean up my PHP code a little), it's not any faster. I look forward to any other ideas.
 
Hi

I mentioned the speed differences caused by data cached in memory just for the relevance of comparisons.

One thought. [tt]outer join[/tt]s ( including the [tt]left join[/tt] you used ) are generally slower than [tt]inner join[/tt]s. As
MySQL Documentation said:
CONCAT() returns NULL if any argument is NULL
( MySQL Reference Manual | Functions and Operators | String Functions | CONCAT(str1,str2,...) )
your [tt]like[/tt] is unlikely to match anyway if there was no match in the joined table. So from that point of view you should use [tt]inner join[/tt]. Or take care of [tt]null[/tt]s if you need those records too.
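For example ( untested sketch, reusing your query shape ) :
Code:
SELECT person.PersonID FROM person
  INNER JOIN household ON person.HouseholdID=household.HouseholdID
  INNER JOIN postalcode ON household.PostalCode=postalcode.PostalCode
  WHERE CONCAT(household.PostalCode,Prefecture,ShiKuCho,Address) LIKE '%yadayada%';

-- or keep the outer joins and neutralize the nulls instead :
SELECT person.PersonID FROM person
  LEFT JOIN household ON person.HouseholdID=household.HouseholdID
  LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode
  WHERE CONCAT(household.PostalCode,IFNULL(Prefecture,''),IFNULL(ShiKuCho,''),Address)
        LIKE '%yadayada%';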

Another thought. Even though [tt]like[/tt] and [tt]instr()[/tt] usually have similar performance, maybe there is a difference in case of multibyte characters. Try to change the condition to [tt]instr(CONCAT(postalcode.ShiKuCho,household.LabelName),'教会')[/tt].

Sorry, no proper idea for solving this, except the use of a dedicated field with the already concatenated values. Note that you do not necessarily have to maintain it "manually". You can associate triggers with the person, household and postalcode tables to keep the concatenated value up to date in that dedicated field.
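A rough sketch of what I mean ( untested; the column and trigger names are made up, and I put the field in household since that is where the address parts join - you would need similar triggers for [tt]update[/tt] on household and for changes to postalcode ) :
Code:
ALTER TABLE household ADD COLUMN FullAddress varchar(500) NOT NULL default '';

CREATE TRIGGER household_bi BEFORE INSERT ON household
FOR EACH ROW
  SET NEW.FullAddress = CONCAT(
    NEW.PostalCode,
    IFNULL((SELECT CONCAT(Prefecture,ShiKuCho)
              FROM postalcode
             WHERE PostalCode = NEW.PostalCode), ''),
    NEW.Address);
Then the search only has to scan a single column in a single table ( [tt]WHERE household.FullAddress LIKE '%...%'[/tt] ) - still a full scan, but with no join work while filtering.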

Regarding the view, you would need a materialized view, which is not supported in MySQL. But a search for "mysql materialized view" shows a couple of workarounds.

Feherke.
 
Thanks for trying to help me brainstorm.

Feherke said:
I mentioned the speed differences caused by data cached in memory just for the relevance of comparisons.
And I appreciate it, because I was not really aware of that kind of caching and would have been fooled by it. But my response to that new knowledge was to continually change the criteria, because I'm researching the performance of getting the data the first time, not of getting it from cache.

Feherke said:
your like is unlikely to match anyway if there was no match in the joined table. So from that point of view you should use inner join.
True for that criterion, but my code assembles a big query from pieces based on a variety of criteria that the user may or may not use. If that had been the magic bullet, I would have happily had my code use different joins depending on whether the address was involved, but alas, there was no dramatic speed difference in my test.

Feherke said:
Try to change the condition to instr(CONCAT(postalcode.ShiKuCho,household.LabelName),'教会').
Nice thought, but if anything, that was slightly slower.

Feherke said:
Sorry, no proper idea for solving this, except the use of a dedicated field with the already concatenated values. Note that you do not necessarily have to maintain it "manually". You can associate triggers with the person, household and postalcode tables to keep the concatenated value up to date in that dedicated field.
Yeah, unless someone else reading this comes up with something you and I haven't thought of, that would appear to be the only way to get a tolerable response. As long as it's just a duplicate that I only use in data retrieval, it won't ruin my logic the way un-normalizing the tables would have. I'm not as concerned about the data size - one more field in the person table isn't that big a deal.

The trigger idea is nice, but I don't think it will completely free me from manual maintenance. For example, when a new person is entered, first I add the postalcode record if necessary, then the household record if necessary, and then the person record - at the time the postalcode and household records are inserted, there would be no person record for the trigger code to work on, so when I add the person record I'll need to make sure to fill that field too. In fact, come to think of it, there are only two files in my code that would need to change to do the update manually: the file that responds to the form for a new or edited person, and a maintenance page that allows edits on postalcode records. I'll look at the triggers, but they might not be worth the bother. Also, this application is set up with a single codebase but multiple "clients" with their own databases - it's complex to make a change to everyone's database structure but trivial to edit the code, so for ease of future maintenance, the more that can be in the code the better. Hmm, as I'm writing this I think I'm talking myself out of triggers and into just doing it in code...[talk]
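For the record, the statement my code would run after saving a household would be something like this ( using the hypothetical FullAddress column from Feherke's sketch above; 1234 stands in for the household just saved ) :
Code:
UPDATE household
  LEFT JOIN postalcode ON household.PostalCode=postalcode.PostalCode
   SET household.FullAddress = CONCAT(household.PostalCode,
         IFNULL(postalcode.Prefecture,''),
         IFNULL(postalcode.ShiKuCho,''),
         household.Address)
 WHERE household.HouseholdID = 1234;
The postalcode maintenance page would do the same thing but with [tt]WHERE household.PostalCode = '...'[/tt] so all affected households get refreshed.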
 