Guest_imported
Hi Everyone,
I've newly installed MySQL Ver 3.23 and am in the process of populating it with data from flat files. Things were going fine until I saw some *serious* slowdown in record insertion speed. I want to gather opinions on what solutions to try.
First, some measurements.
Num    Records        Total    Time(s)   Rate(rec/s)
---  ---------  -----------  ---------  ------------
001         20           21       0.01        1333.3
002        363          384       0.17        2110.5
003       2699         3083       0.72        3753.8
004       8241        11324       2.14        3850.9
005      18911        30235       9.58        1974.4
006      35642        65877      13.67        2606.9
007      60424       126301      23.52        2569.5
008      96109       222410      42.39        2267.2
009     143606       366016      72.70        1975.2
010     203779       569795      90.86        2242.8
Num refers to the text file from which the data is read. The Records column shows how many records that file contains, and Total is the cumulative number read in up to that point. Time is the number of seconds taken to insert that file's records, and the rightmost column is the average insertion rate in records per second. So far the speed is pretty impressive. But I have a LOT of data to index. Look at how the performance nosedives.
020    1353759      8377979    1172.17        1154.9
030    2147433     27184976    3859.09         556.5
040    2170491     49124176    6384.61         340.0
050    1964395     69808672    8363.19         234.9
060    1675533     87925330    9651.06         173.6
070    1359195    102944321   10037.75         135.4
080    1053478    114868539    9628.03         109.4
090     747348    123503148    8278.75          90.3
100     553939    129876045   13612.53          40.7
At this point, with 130 million records loaded, the data table plus indexes came to about 28GB. (This is all on my home PC.) Because I knew the data file would exceed 4GB, I set the table up for big tables ahead of time with an ALTER TABLE command, specifying AVG_ROW_LENGTH = 512 and MAX_ROWS = 64x2^30. That provides ample room for growth, but maybe it is a cause of the slowdown in insert speed?
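For reference, the statement I ran was along these lines (with 64x2^30 written out as a literal; "table" stands in for the real table name, as in the simplified schema below):

ALTER TABLE table MAX_ROWS = 68719476736, AVG_ROW_LENGTH = 512;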
To populate the database I wrote a Python script that reads each line of the input file and issues this SQL command:
INSERT IGNORE INTO table (name, num, val) VALUES (a,b,c)
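In case the mechanics matter, the loading loop is essentially the sketch below. The MySQLdb module, the connection parameters, and the tab-separated line format are just my shorthand here, not necessarily the interesting details; the real script also handles columns I've left out.

import MySQLdb

# Connection details are placeholders.
conn = MySQLdb.connect(host="localhost", user="john", passwd="...", db="testdb")
cur = conn.cursor()

f = open("datafile_001.txt")
while 1:
    line = f.readline()
    if not line:
        break
    # Assume three tab-separated fields per line: name, num, val.
    a, b, c = line.strip().split("\t")
    cur.execute("INSERT IGNORE INTO table (name, num, val) VALUES (%s, %s, %s)",
                (a, int(b), int(c)))

f.close()
cur.close()
conn.close()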
And the corresponding table definition is as follows. (I've simplified this example; there are other columns defined that are not updated in this processing stage.)
CREATE TABLE table
(id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(120) NOT NULL,
num SMALLINT NOT NULL,
val TINYINT UNSIGNED NOT NULL,
time TIMESTAMP NOT NULL,
INDEX(name), INDEX(num), UNIQUE(name))
I'm indexing two columns in addition to the primary key, which is defined as a BIGINT. The name string averages 80 characters and is constrained to be unique. In contrast, the num column is densely packed -- its values range from 1 to 100 over 130 million rows.
Given all this, where do you think the bottleneck lies? Clearly the btree indexes are getting large, with many disk hits required to find a record. But more specifically, is there an alternative layout that could alleviate the problem? For example, not declaring UNIQUE(name) and instead placing responsibility for that guarantee outside of MySQL. Or does creating tables capable of growing past 4GB carry an inherent penalty? Not having much experience with the MySQL product, it's hard for me to finger the likely suspect.
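For a rough sense of scale, my own back-of-envelope on the name index alone (ignoring any key compression):

    130,000,000 rows x (~80-byte key + row pointer)  ~=  11 GB

and that structure is declared twice, since the schema has both INDEX(name) and UNIQUE(name), before counting the num and primary-key indexes. That is far more than fits in memory on my PC, so I assume the later inserts end up dominated by random disk seeks while updating the btrees.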
As another idea, would it be advisable to insert all the data into an un-indexed table (except for the primary key) and then perform one large indexing operation after all the data is loaded?
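Concretely, I picture that going something like the sketch below. The index names are my guesses at what MySQL auto-generates (SHOW INDEX FROM table would give the real ones), and I realize that with UNIQUE(name) gone during the load, INSERT IGNORE would no longer filter out duplicate names, so the final ADD UNIQUE could fail if any slipped in.

-- before loading: drop the secondary indexes, keep the primary key
ALTER TABLE table DROP INDEX name, DROP INDEX name_2, DROP INDEX num;

-- ... load all the flat files ...

-- after loading: rebuild the indexes in one pass
ALTER TABLE table ADD INDEX (name), ADD UNIQUE (name), ADD INDEX (num);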
Any suggestions from the voice of experience?
john