I am creating a document database that will be used as the backend for a website. The organization itself is cross-relational in nature, the documents are added to certain "areas" and these areas can be combined as filters to give the users precisely what they are looking for. The design works great for the current active sample (about 1100 documents). However, I began testing with a larger sample (about 25,000 documents) and the query runs way too slowly. I actually need the database to perform operations on entries for approximately 2 MILLION documents, so the current state is unacceptable. I've placed indices on all of the tables being used, but it's still not producing the kind of speed I want. I believe the problem is in my query, but very well may be in the structure. Any help in optimizing this would be greatly appreciated.
Here are the three tables, documents, areas, and area_assignments. Documents and areas are related through the foreign key table area_assignments, and both have a one to many relationship with this table.
CREATE TABLE `areas` (
`area` int(11) NOT NULL auto_increment,
`parent_area` int(11) NOT NULL default '0',
`name` varchar(255) default NULL,
`description` text NOT NULL,
`priority` int(1) NOT NULL default '0',
`archive` int(1) NOT NULL default '0',
`hold` int(1) NOT NULL default '0',
PRIMARY KEY (`area`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `documents` (
`document` int(11) NOT NULL auto_increment,
`originator` varchar(9) default NULL,
`volume` varchar(10) default NULL,
`title` varchar(255) default NULL,
`file_type` varchar(20) NOT NULL default 'document',
`href` varchar(255) default NULL,
`draft_folder` varchar(255) default NULL,
`classification` varchar(255) default NULL,
`proponent` int(11) default NULL,
`person` int(11) default NULL,
`complete` int(1) NOT NULL default '0',
`origin_division` int(11) default NULL,
`division` int(11) default NULL,
`entered_on` datetime default '0000-00-00 00:00:00',
`published_on` date default '0000-00-00',
`expires_on` date default NULL,
`synopsis` text,
`hold` int(1) NOT NULL default '0',
`web` int(1) NOT NULL default '0',
`priority` int(1) NOT NULL default '0',
`archive` int(1) NOT NULL default '0',
PRIMARY KEY (`document`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `area_assignments` (
`area_assignment` int(11) NOT NULL auto_increment,
`area` int(11) default NULL,
`document` int(11) default NULL,
PRIMARY KEY (`area_assignment`),
KEY `Area` (`area`),
KEY `Document` (`document`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Here is the query I use when I pull up documents which apply to two areas.
SELECT documents.document,documents.volume,
documents.title,documents.file_type,documents.href,
documents.synopsis,DATE_FORMAT(published_on,'%M %Y') AS format_published_on
FROM documents
WHERE documents.web = 1
AND documents.hold = 0
AND documents.archive = 0
AND EXISTS (SELECT area_assignment
FROM area_assignments
WHERE area_assignments.area = '126'
AND area_assignments.document = documents.document)
AND EXISTS (SELECT area_assignment
FROM area_assignments
WHERE area_assignments.area = '22'
AND area_assignments.document = documents.document)
ORDER BY documents.published_on DESC
To better explain what is going on with this query, the user has selected two different areas, such as "Cars" and "Trucks" and he/she should see a list of documents which apply only to BOTH "Cars" and "Trucks." The query pulls up the documents and checks to ensure that there are entries in the area_assignments table for both areas for each document. I only want the query to return documents which apply to BOTH of the areas, not all of the documents for EITHER of the two areas. Currently the query takes about 10-15 seconds to run, which is unacceptable for a website. I believe the subqueries are the biggest hold-up, but I can't come up with a JOIN that will give me the same results.
Here are the three tables, documents, areas, and area_assignments. Documents and areas are related through the foreign key table area_assignments, and both have a one to many relationship with this table.
CREATE TABLE `areas` (
`area` int(11) NOT NULL auto_increment,
`parent_area` int(11) NOT NULL default '0',
`name` varchar(255) default NULL,
`description` text NOT NULL,
`priority` int(1) NOT NULL default '0',
`archive` int(1) NOT NULL default '0',
`hold` int(1) NOT NULL default '0',
PRIMARY KEY (`area`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `documents` (
`document` int(11) NOT NULL auto_increment,
`originator` varchar(9) default NULL,
`volume` varchar(10) default NULL,
`title` varchar(255) default NULL,
`file_type` varchar(20) NOT NULL default 'document',
`href` varchar(255) default NULL,
`draft_folder` varchar(255) default NULL,
`classification` varchar(255) default NULL,
`proponent` int(11) default NULL,
`person` int(11) default NULL,
`complete` int(1) NOT NULL default '0',
`origin_division` int(11) default NULL,
`division` int(11) default NULL,
`entered_on` datetime default '0000-00-00 00:00:00',
`published_on` date default '0000-00-00',
`expires_on` date default NULL,
`synopsis` text,
`hold` int(1) NOT NULL default '0',
`web` int(1) NOT NULL default '0',
`priority` int(1) NOT NULL default '0',
`archive` int(1) NOT NULL default '0',
PRIMARY KEY (`document`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `area_assignments` (
`area_assignment` int(11) NOT NULL auto_increment,
`area` int(11) default NULL,
`document` int(11) default NULL,
PRIMARY KEY (`area_assignment`),
KEY `Area` (`area`),
KEY `Document` (`document`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Here is the query I use when I pull up documents which apply to two areas.
SELECT documents.document,documents.volume,
documents.title,documents.file_type,documents.href,
documents.synopsis,DATE_FORMAT(published_on,'%M %Y') AS format_published_on
FROM documents
WHERE documents.web = 1
AND documents.hold = 0
AND documents.archive = 0
AND EXISTS (SELECT area_assignment
FROM area_assignments
WHERE area_assignments.area = '126'
AND area_assignments.document = documents.document)
AND EXISTS (SELECT area_assignment
FROM area_assignments
WHERE area_assignments.area = '22'
AND area_assignments.document = documents.document)
ORDER BY documents.published_on DESC
To better explain what is going on with this query, the user has selected two different areas, such as "Cars" and "Trucks" and he/she should see a list of documents which apply only to BOTH "Cars" and "Trucks." The query pulls up the documents and checks to ensure that there are entries in the area_assignments table for both areas for each document. I only want the query to return documents which apply to BOTH of the areas, not all of the documents for EITHER of the two areas. Currently the query takes about 10-15 seconds to run, which is unacceptable for a website. I believe the subqueries are the biggest hold-up, but I can't come up with a JOIN that will give me the same results.