
5 years mysql experience but can't figure out good architecture...

Status
Not open for further replies.

Thanasus

Programmer
Nov 25, 2003
13
US
I am wracking my brain trying to figure out the best way to design the database. The database will store real estate data. My biggest concern is speed of searching.

Let's say I have a home that has 150 different attributes. Example attributes would be price, year built, square footage, number of stories, number of bedrooms, the high school the home is zoned for, roofing description, or landscape description. I am making a separate table for each of the 150 attributes and one table to contain property numbers.

CREATE TABLE `property` (
`id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT ,
`timeadded` TIMESTAMP NOT NULL ,
`timeupdated` TIMESTAMP NOT NULL ,
PRIMARY KEY ( `id` )
);

CREATE TABLE `bedrooms` (
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`bedrooms` TINYINT UNSIGNED NOT NULL ,
INDEX ( `fkey_prop` )
);

CREATE TABLE `bathrooms` (
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`bathrooms` FLOAT UNSIGNED NOT NULL ,
INDEX ( `fkey_prop` )
);

CREATE TABLE `landscaping` (
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`code` TINYINT UNSIGNED NOT NULL ,
INDEX ( `fkey_prop` )
);

So the 3 attribute tables given as examples above (bedrooms, bathrooms, and landscaping) have one-to-one and one-to-many relationships with the property table. Bedrooms and bathrooms are a simple 1-1: a home will always have one set value for number of bedrooms, whether that's 2, 3, 4, or whatever.

However, a home could have multiple landscaping entries. Now, you'll note that the landscaping table has a field called code. Code is a numeric code for a string which is stored in another table called 'landscape_codes'. Perhaps landscape_codes contains the values "sprinklers, grass, desert landscaping, fruit trees, chain fence, brick wall, wood fence". A home might have sprinklers and be grassy with a wooden fence. Therefore it would have 3 landscaping codes.

I want the user to be able to search homes by criteria they specify and sort by criteria they specify. They may search for "all homes containing at least 3 bedrooms that have a brick wall or chain fence".

So, I am going to have > comparisons, < comparisons, BETWEEN comparisons, and possibly IN clauses.

Also, every property search will be able to specify a timestamp range for the home: "Show me homes meeting my criteria which were added to the system between Nov 4 2003 and Nov 5 2003 AND that had a last update of Dec 28 2003".

I am trying to not do filesorts when executing the SQL query but cannot figure out a way around it. Any ideas?
 
figure out a way around it? around what?

what exactly is the problem?

complexity? create one table for attributes rather than 150

performance? index your search fields

by the way, to sort results by input criteria is tricky, you have to have some way for the user to rank the input criteria, and the user interface (form) is likely to get insanely complex -- better just to return only those properties that meet all criteria in some standard fashion, e.g. by price descending, or by zip code alpha, or something


rudy
 
why not make 3 tables

hometable
hmlineid int key autoincrement
hmdescr varchar

homepropertylines
hmplineid int key autoincrement
hmphmlineid int
hmpprlineid int
hmpvalue varchar

propertytable
prlineid int key autoincrement
prdescr varchar

now you put your 150 properties in the propertytable. when you want to enter the properties for a selected home, you can generate a form with input fields from the propertytable (so you get a form with 150 input fields). on submit, you insert all the properties that have a value into homepropertylines: the hmlineid goes into hmphmlineid, the property id (the prlineid) into hmpprlineid, and the value into hmpvalue

you are always free to make a script which allows you to enter the same property more than one time
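hos2's three-table flow can be tried end-to-end. Here's a minimal sketch in Python/SQLite (table and column names from the post above; the submitted form is simulated with a dict, and only two of the 150 properties are shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hometable (hmlineid INTEGER PRIMARY KEY AUTOINCREMENT, hmdescr TEXT);
CREATE TABLE propertytable (prlineid INTEGER PRIMARY KEY AUTOINCREMENT, prdescr TEXT);
CREATE TABLE homepropertylines (
    hmplineid INTEGER PRIMARY KEY AUTOINCREMENT,
    hmphmlineid INTEGER,   -- which home
    hmpprlineid INTEGER,   -- which property (attribute)
    hmpvalue TEXT          -- the value entered on the form
);
""")

# the 150 properties live in propertytable; two shown here
conn.executemany("INSERT INTO propertytable (prdescr) VALUES (?)",
                 [("bedrooms",), ("bathrooms",)])
conn.execute("INSERT INTO hometable (hmdescr) VALUES ('123 Main St')")
home_id = conn.execute("SELECT last_insert_rowid()").fetchone()[0]

# simulate the submitted form: only fields with a value get inserted
form = {"bedrooms": "4", "bathrooms": "2.5"}
for prdescr, value in form.items():
    (prlineid,) = conn.execute(
        "SELECT prlineid FROM propertytable WHERE prdescr = ?", (prdescr,)
    ).fetchone()
    conn.execute(
        "INSERT INTO homepropertylines (hmphmlineid, hmpprlineid, hmpvalue)"
        " VALUES (?, ?, ?)", (home_id, prlineid, value))

rows = conn.execute(
    "SELECT hmpprlineid, hmpvalue FROM homepropertylines"
    " WHERE hmphmlineid = ? ORDER BY hmpprlineid", (home_id,)).fetchall()
```

The same loop scales to all 150 properties, since only filled-in fields generate rows.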
 
Rudy,

The problem with 150 attributes all in one table is that the attributes can have various data types. "Number of bedrooms" will never be more than 255, so TINYINT will work just fine. "Price" requires a MEDIUMINT. "Number of bathrooms", in the USA, often includes a decimal such as 2.5 or 3.75, so it would require a FLOAT(3,2). "Master bedroom dimensions" is a string such as 15x20, which means the master bedroom is 15 feet by 20 feet in size. I can't just store the string "15x20", so I break it into two TINYINT values and store them in an attribute table that has 3 fields:

CREATE TABLE `masterbeddim` (
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`x` TINYINT UNSIGNED NOT NULL ,
`y` TINYINT UNSIGNED NOT NULL ,
INDEX ( `fkey_prop` )
);

So, as you can see, I cannot store all the attributes in one table because of varying data types. Now, I have thought about a separate table for each datatype, such as:

CREATE TABLE `dimension_attributes` (
`attribute_id` TINYINT UNSIGNED NOT NULL ,
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`x` TINYINT UNSIGNED NOT NULL ,
`y` TINYINT UNSIGNED NOT NULL ,
INDEX ap( `attribute_id` , `fkey_prop` )
);


The reason I stayed away from that, however, is the integer data types. "Number of bedrooms" would be TINYINT while "Price" would be MEDIUMINT. I would have to make the integer_attributes value column MEDIUMINT, causing myself to store an extra 2 bytes for every TINYINT entry, since it would have to be stored as a MEDIUMINT.

CREATE TABLE `integer_attributes` (
`attribute_id` TINYINT UNSIGNED NOT NULL ,
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`value` MEDIUMINT UNSIGNED NOT NULL ,
INDEX ap( `attribute_id` , `fkey_prop` )
);

I thought this wasteful of space. I also figured this would really slow down queries in the case of multiple self-joins. Let's say there were 40 attributes which were integer types and there were 100,000 properties - so you'd be looking at 4,000,000 rows of data in the "integer_attributes" table. Let's say someone wanted to do a search of all homes "> 3 bedrooms", "< $800,000 in price", "> 2.5 bathrooms". The query would have to self-join the "integer_attributes" table 3 times. With 4 million rows of data (even indexed properly) that could get hefty.
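For illustration, the three-way self-join described above would look roughly like this. A sketch in Python/SQLite against the hypothetical integer_attributes table, with made-up attribute ids (1 = bedrooms, 2 = price, 3 = bathrooms) and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE integer_attributes (
    attribute_id INTEGER NOT NULL,
    fkey_prop INTEGER NOT NULL,
    value INTEGER NOT NULL)""")

# hypothetical data: property 100 matches all three criteria, 200 does not
rows = [(1, 100, 4), (2, 100, 650000), (3, 100, 3),
        (1, 200, 2), (2, 200, 500000), (3, 200, 2)]
conn.executemany("INSERT INTO integer_attributes VALUES (?, ?, ?)", rows)

# one self-join per search criterion
matches = conn.execute("""
    SELECT a.fkey_prop
      FROM integer_attributes a
      JOIN integer_attributes b ON b.fkey_prop = a.fkey_prop
      JOIN integer_attributes c ON c.fkey_prop = a.fkey_prop
     WHERE a.attribute_id = 1 AND a.value > 3        -- bedrooms
       AND b.attribute_id = 2 AND b.value < 800000   -- price
       AND c.attribute_id = 3 AND c.value > 2        -- bathrooms
""").fetchall()
```

Each extra criterion adds another join against the same table, which is the cost being weighed here.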

By contrast, if each of the attributes had its own separate table, then the above query would involve 3 joins but each table in the join would only be 100,000 rows (assuming 100,000 properties in the system). Additionally, the bedrooms and bathrooms tables in the join would be TINYINT instead of MEDIUMINT, making them even smaller and faster.

While I am not fond of 150 different tables, it's the most efficient design I can see. While it may be difficult for a human to manage, it will allow for quick and efficient joins, sorts, and storage. Do you disagree?

To answer the last part of your statement regarding the sorting of results... The user will be able to select one attribute to sort by and then be able to specify either ascending or descending. The attribute the user can sort by is limited to integer attributes (such as price or square footage).

Lastly, results will be paginated, showing 20 homes at a time with "Next" and "Previous" buttons. So, I'd also throw a LIMIT clause onto the end, such as "LIMIT 80,20" to show results 81-100.
 
your comments about the datatypes of the attributes show that you are thinking

however, there's no real reason to have "number of bedrooms" as an integer, instead of, say, a varchar, because you aren't going to do any arithmetic on it, e.g. you aren't going to find the average number of bedrooms in all the homes in santa monica

same for bedroom dimensions, you're not going to do arithmetic, such as "all homes with a master bedroom where the long dimension is at least twice the short dimension and the total square footage is no less than 4500 sq ft"

all your selection criteria can, it seems to me, be handled by varchar with simple search...

... but then, i haven't really sat down to study the application, this is just what comes to mind right away

;-)

as for space, dude, the last thing you think about is space

space is cheap

when i bought my first personal computer, i paid over US$2500, and it had 16megs of main memory and a whopping huge 2gig hard drive!!!!!

i can't sell that computer today for ten bucks, heck, i can't even give it away

the technical term for this is "boat anchor"

i'll tell you what is still very pricey today -- the cost of debugging complex queries

if you're doing this for a client, and that client changes the specs on you, and you come back with "gee, mister client, that's a new one, i need to add a 151st table, and of course all the queries in the application will have to be rewritten, that'll cost you probably a dozen hours of my time" then the client should really kick you in the @ss, no offence, and the client will be right

and if you're doing this for yourself, do yourself a favour, forget file size, and concentrate on query complexity

yes, a multiple-table join can involve gazillions of records, but not if you have defined good indexes...

... and especially not if you don't even need to do multiple-table joins in the first place

rudy
 
I completely agree with you about not worrying about the cost of space. That isn't a concern of mine. This is going to run on a dedicated dual-processor UNIX MySQL server with 4GB RAM and 15,000 RPM Ultra SCSI single-platter drives. I'm not trying to save a couple bucks here.

My space concern comes from the fact that the more space something occupies, the more memory it takes up and the longer it takes to sort through.

Here is my reasoning. The number of criteria a person puts in will be directly related to how many joins there are. So if someone wanted to specify "bedrooms", "bathrooms", and "price" in their search, the SQL query would contain at least 3 joins. Now, whether all the data was in the same table or broken into 150 different tables, there would still be 3 joins. I figured since I'm going to do the same number of joins, I might as well keep the tables in those joins as small as possible. By using 150 total tables, I'll cut down the size of the tables in the joins by at least 100-fold - which means MySQL will have indexes 100 times smaller, and have to sort through 100 times less data.

However, I can see the &quot;all in one table&quot; idea being attractive because the mysql key buffer can load just one index and sort using that.

If I do the 150 tables, I don't mind managing all those tables and adding new tables for new attributes. I'll write a web based admin panel to allow me to make changes on the fly. My only concern is speed of MySQL.

Also, I do understand your point about not performing arithmetic on attributes such as "number of bedrooms" or "master bedroom dimensions". However, would it not be faster to say "WHERE bedrooms > 2" instead of "WHERE bedrooms IN ('2','3','4','5','6','7','8','9','10','11','12','17')"? (Yes, there is a house here that has 17 bedrooms.)
 
who says you *must* do joins when searching for multiple attributes?
 
Now you have me intrigued. Perhaps because I have programmed with joins for so long, I am brainwashed into using them. So let's say all attributes are in one big table...

CREATE TABLE `homes` (
`fkey_prop` MEDIUMINT UNSIGNED NOT NULL ,
`fkey_attribute` TINYINT UNSIGNED NOT NULL ,
`value` VARCHAR( 255 ) NOT NULL ,
`timeadded` TIMESTAMP NOT NULL ,
`timeupdated` TIMESTAMP NOT NULL ,
INDEX av( `fkey_attribute` , `value`(10) )
);


So, let's say the following attribute ids existed:
bedrooms: 42
bathrooms: 43
landscaping: 89

Let's say someone wanted a home with > 3 bedrooms, > 2.5 bathrooms, and landscaping of "sprinklers" AND "brick wall". How could that be done without joins?

 

assuming your front end captures the attribute codes, so that we don't have to look up 'bedrooms' to find out that it's attribute 42, basically what you do is filter the homes table by the selected attribute values, then count how many were found for each property

so if you have 4 conditions, and you found 4 rows for a given property, then that property must have satisfied all 4 conditions

[tt]select fkey_prop
from homes
where fkey_attribute = 42
and value > '003' -- bedrooms
or fkey_attribute = 43
and value > '002.5' -- bathrooms
or fkey_attribute = 89
and value = 'sprinklers' -- landscaping
or fkey_attribute = 89
and value = 'brick wall' -- landscaping
group
by fkey_prop
having count(*) = 4[/tt]

this will be very efficient if you have an index on attribute, which you probably would, right?
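The counting trick is easy to verify. A sketch in Python/SQLite with the thread's attribute ids (42 = bedrooms, 43 = bathrooms, 89 = landscaping) and hypothetical sample rows; the landscaping conditions are equality matches, and the numeric values are stored as zero-padded strings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE homes (
    fkey_prop INTEGER, fkey_attribute INTEGER, value TEXT)""")
conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", [
    # property 1: 4 bedrooms, 3 bathrooms, sprinklers + brick wall -> matches
    (1, 42, '004'), (1, 43, '003.0'), (1, 89, 'sprinklers'), (1, 89, 'brick wall'),
    # property 2: only 2 bedrooms -> fails one condition, so only 3 hits
    (2, 42, '002'), (2, 43, '003.0'), (2, 89, 'sprinklers'), (2, 89, 'brick wall'),
])

# one row per satisfied condition; a property satisfying all 4 yields 4 rows
found = conn.execute("""
    SELECT fkey_prop
      FROM homes
     WHERE fkey_attribute = 42 AND value > '003'       -- bedrooms
        OR fkey_attribute = 43 AND value > '002.5'     -- bathrooms
        OR fkey_attribute = 89 AND value = 'sprinklers'
        OR fkey_attribute = 89 AND value = 'brick wall'
     GROUP BY fkey_prop
    HAVING COUNT(*) = 4
""").fetchall()
```

AND binds tighter than OR in both SQLite and MySQL, which is why the query works without extra parentheses.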
 
I STRONGLY disagree that the 150-table solution would work very well.
This would not be fun to administer, for one thing, and joining all the tables
would be murder on your CPU(s), not to mention cruel to the poor programmer
who has to write (and maintain) the joins.

I agree with you that data that is naturally considered integer should be stored as
an INT (or similar) field in the database (as opposed to VARCHAR). If you ever decide to
run arithmetic queries on the field it will be easier and less stressful on the CPU
(not having to convert to INT when arithmetic queries are performed).

The way I would structure the tables:

- put ALL fields that have a 1-1 relationship with the house in one table.
(You could call this table 'house'.)
- put ALL fields that have either a 0-1 OR 1-1 relationship with your house
in another table.
- create a separate table for all fields that have a 1-Many relationship.

What this means is that the bedrooms, bathrooms and price fields are part of
the main 'house' table, and landscaping (as you already mentioned) is
its own table.

I would probably create a separate 'house_room' table to store:
room_name_id (dining room, living room, etc. go in a room_name table)
dimension_x
dimension_y

I don't know what you mean by 'The problem with 150 attributes all in one
table is the attributes can have various data types.'
So what?
Tables can have a field (of various data types) for each attribute.
There is no penalty for having a VARCHAR field and an INT field and a
MEDIUMINT field in the same table. None.

Using your 3-joins argument you state above, I can beat that:
by sticking all fields that have a 1-1 relationship in the same table you
have ZERO joins and this will beat your 3-join query all day long.

MySQL has a few weaknesses, and creating 150 tables at least gives you
the one advantage of being able to distribute your data across a number
of spindles, but I think you've taken it to the extreme and it will cause
more problems than it's worth.
 
i also agree with vanekl and Rudy - create normalized tables. Besides, 150 tables require opening time and other OS overhead to handle them. And queries can be optimised using indexes.






[ponder]
----------------
ur feedback is a very welcome desire
 
Assuming there are 2 million records and the primary key is a
MEDIUMINT and you configure the machine as recommended by
MySQL AB, all the indexes will not even fit into RAM on a typical PC.
150 indexes of 20.5MB/each = 3GB just for indexes.
And if you can't keep the indexes in RAM your DB speed degrades
by (at least) two orders of magnitude every time you do a join on the
tables.
 
I think you still need joins, but coming back to your example

Let's say someone wanted a home with > 3 bedrooms, > 2.5 bathrooms, and landscaping of "sprinklers" AND "brick wall". How could that be done without joins?

and I take my db example: you can build a fill-in form from all the properties in the property table. when someone fills in the above values, you can make a loop which adds every filled-in field to a string. further on, you count how many search conditions there are

the loop should generate 1 string like this
(hmprlineid=42 and hmpvalue='3') or (hmprlineid=43 and hmpvalue='2.5') or
(hmprlineid=89 and hmpvalue = 'sprinklers') or
(hmprlineid=89 and hmpvalue = 'brick wall')

so you filter out all homepropertylines which are true to the search conditions.

"select max(hmphmlineid),count(hmphmlineid) as total from homepropertylines
where ". $condition ." group by hmphmlineid order by total desc"

you get a result with the most 'hits' on top, from which you can make a url that has the homeid in it. the advantage is that you can also go down to lesser hits on a customer's demand. if you would only show the records where all search conditions are true, homes that are close - perhaps with 3 bathrooms instead of 2.5 - would be missed, and the customer might not mind those
 
well, hos2, that sure looks like my suggestion

now let me pick some nits ;-)

if you want a home with more than 3 bedrooms and more than 2.5 bathrooms, you can't say (hmprlineid=42 and hmpvalue='3') or (hmprlineid=43 and hmpvalue='2.5')

for these situations, i recommend storing the number as a character string with three digits for the integer part, so that you can say

[tt]where fkey_attribute = 42
and value > '003' -- bedrooms
or fkey_attribute = 43
and value > '002.5' -- bathrooms[/tt]
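The reason the three-digit padding works is that the comparison is plain character-by-character string order. A quick Python check of the difference (string comparison here behaves like the database's byte-wise compare):

```python
# unpadded: lexicographic compare gets numbers wrong
assert '3' > '12'            # compares '3' against '1' at the first character

# padded to a fixed width, string order matches numeric order
assert '003' < '012'
assert '004' > '003'

# and sorting padded values gives numeric order for free
bedrooms = ['012', '002', '004', '003']
assert sorted(bedrooms) == ['002', '003', '004', '012']
```

Both sides of the comparison have to use the same fixed width, or the trick breaks again.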


your GROUP BY is wrong -- you should not try to group by hmphmlineid while at the same time trying to get the MAX of it

but your idea of sorting by total "hits" descending is a good one


rudy
 
oops, sorry, didn't quite read all the reactions thoroughly. you are right that the MAX is unnecessary in this case since it's the group by field. (I mostly use MAX to force a value, otherwise the group by doesn't work or gives just 'one' of the possible values)
 
so it's perhaps also a good possibility for the user to change the condition where they don't mind if there is more (otherwise all castles will show up if you are looking for a simple apartment with 1 bathroom)

you can default it to = and the user can change it to > or <, which is also accepted
(so it's another field in your form, behind or before the value field)
 
Wow, so much good information.

Ok, now I am kind of understanding this from a different angle. One of the big problems I had with the 150 tables was running out of file descriptors on the OS, and that is why I did not want to implement this concept before getting some additional ideas.

Rudy, everything in one table is so radically different from how I was taught to do things, it's difficult for me to grasp/believe...

Would you mind explaining how your sample query above works? One thing I don't understand is how you can perform a greater-than comparison on a varchar efficiently. According to the MySQL manual, "Note that when you are comparing a string column with a number, MySQL can't use index to quickly look up the value."
Also, I've always been leery of HAVING clauses since they use no optimization.
Your query and layout do excite me though, and I am now getting worried that my training in MySQL has been very one-dimensional and way off base!

One thing I haven't really described that much is the need for two timestamp fields. I have them in some of my previous posts, so let me explain what they are for. Throughout the day, people can add homes into the system. I store when they are added via a TIMESTAMP field called
"time_added". Throughout the day people can also remove homes from the system. They aren't actually removed; I just set a second TIMESTAMP field, "time_removed". When performing any home search, I have it done within the scope of the present time. This also allows me to perform a search through any previous time as well. I can simply add a clause

"WHERE time_added < [timestamp] AND time_removed > [timestamp]"
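That "active at time T" window clause can be sketched like this in Python/SQLite (hypothetical listings table and sample dates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER, time_added TEXT, time_removed TEXT)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", [
    (1, '2003-11-04 09:00:00', '2003-12-28 00:00:00'),  # still active on Dec 25
    (2, '2003-11-04 09:00:00', '2003-11-05 00:00:00'),  # removed before Dec 25
])

# a listing is "active at t" if it was added before t and removed after t
t = '2003-12-25 00:00:00'
active = conn.execute(
    "SELECT id FROM listings WHERE time_added <= ? AND time_removed > ?",
    (t, t)).fetchall()
```

ISO-formatted date strings compare correctly as plain strings, which is what makes the clause index-friendly.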

Problem is, by putting all the property data into one table, I would also have to put the timestamps into that same table. Before, I only had two timestamp occurrences per home. Now, I would have two timestamp occurrences for every attribute value.

Am I just too brainwashed regarding conserving space? Does it matter if the timestamps are in there multiple times as long as the table is indexed well? I always figured that would be poorly normalized. Your query would be turned into something like this for homes active on December 25, 2003...


select fkey_prop
from homes
where fkey_attribute = 42
and value > '003' -- bedrooms
or fkey_attribute = 43
and value > '002.5' -- bathrooms
or fkey_attribute = 89
and value = 'sprinklers' -- landscaping
or fkey_attribute = 89
and value = 'brick wall' -- landscaping
and (
time_added <= TIMESTAMP('2003-12-25 00:00:00')
and
time_removed > TIMESTAMP('2003-12-25 00:00:00')
)
group
by fkey_prop
having count(*) = 4
 
TIMESTAMP columns are for automatic date-time generation when the record is inserted or updated. when you, for instance, update all the records, all the timestamps end up nearly the same (ok, some microseconds in between). you'd better use the normal DATETIME type if you want to store date-times that you insert yourself.


 
your last query is parenthesized incorrectly

it should be

where (
fkey_attribute = 42
and value > '003' -- bedrooms
or fkey_attribute = 43
and value > '002.5' -- bathrooms
or fkey_attribute = 89
and value = 'sprinklers' -- landscaping
or fkey_attribute = 89
and value = 'brick wall' -- landscaping
)
and time_added <= TIMESTAMP('2003-12-25 00:00:00')
and time_removed > TIMESTAMP('2003-12-25 00:00:00')

HAVING is not optimized in mysql? perhaps you should consider a real database, then ;-)

but seriously, i didn't see anything on that page to suggest HAVING is not optimized

the comparison of varchars with numerics may not be efficient, but varchar to varchar is, and that's what i was suggesting, store number of bedrooms as a varchar, then compare it to a string like '003'

as far as the timestamps are concerned, they would go on the property table, not the attributes table, unless you care about when a particular attribute was added or removed (highly unlikely, in my opinion)

i'm not sure i can explain how my query works, it seems pretty obvious to me -- select all the rows that pass any of the 4 conditions, group by property, and thus keep only the properties that satisfy all of them

hos2's suggestion was good: remove the HAVING clause (in effect, you then want HAVING COUNT(*) > 0, which is redundant), thus you will accept any property that has even one of the conditions, but sort them descending by hits
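Dropping the HAVING and sorting by hit count descending looks like this. A sketch in Python/SQLite using the thread's EAV layout (42 = bedrooms, 43 = bathrooms, 89 = landscaping) and hypothetical rows; the best match comes first and partial matches follow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE homes (fkey_prop INTEGER, fkey_attribute INTEGER, value TEXT)")
conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", [
    (1, 42, '004'), (1, 43, '003.0'), (1, 89, 'sprinklers'), (1, 89, 'brick wall'),
    (2, 42, '002'), (2, 43, '003.0'), (2, 89, 'sprinklers'), (2, 89, 'brick wall'),
])

# no HAVING: every home with at least one hit comes back, best matches first
ranked = conn.execute("""
    SELECT fkey_prop, COUNT(*) AS hits
      FROM homes
     WHERE fkey_attribute = 42 AND value > '003'
        OR fkey_attribute = 43 AND value > '002.5'
        OR fkey_attribute = 89 AND value = 'sprinklers'
        OR fkey_attribute = 89 AND value = 'brick wall'
     GROUP BY fkey_prop
     ORDER BY hits DESC
     LIMIT 100
""").fetchall()
```

Here property 1 satisfies all four conditions and property 2 only three, so the result is ranked accordingly.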

rudy
 
and also put a limit on it then so you only return something like the first 100 records

"group by field limit 0,100"



 