dramatic slowdown on insertion speed

Hi Everyone,

I've just installed MySQL 3.23 and am in the process of populating it with data from flat files. Things were going fine until I saw a *serious* slowdown in record insertion speed. I want to gather opinions on what solutions to try.

First, some measurements.

Num   Records    Total    Time(s)   Rate(rec/s)
---   -------    -----    -------   -----------
001        20       21       0.01        1333.3
002       363      384       0.17        2110.5
003      2699     3083       0.72        3753.8
004      8241    11324       2.14        3850.9
005     18911    30235       9.58        1974.4
006     35642    65877      13.67        2606.9
007     60424   126301      23.52        2569.5
008     96109   222410      42.39        2267.2
009    143606   366016      72.70        1975.2
010    203779   569795      90.86        2242.8

Num identifies the text file the data is read from. The Records column gives the number of records in that file, and Total gives the cumulative number loaded up to that point. Time is the number of seconds taken to insert that file's records, and the rightmost column is the resulting average insertion rate in records per second. So far the speed is pretty impressive. But I have a LOT of data to load and index. Look at how the performance nosedives.

Num   Records      Total      Time(s)   Rate(rec/s)
---   -------      -----      -------   -----------
020   1353759    8377979      1172.17        1154.9
030   2147433   27184976      3859.09         556.5
040   2170491   49124176      6384.61         340.0
050   1964395   69808672      8363.19         234.9
060   1675533   87925330      9651.06         173.6
070   1359195  102944321     10037.75         135.4
080   1053478  114868539      9628.03         109.4
090    747348  123503148      8278.75          90.3
100    553939  129876045     13612.53          40.7

At this point, with 130 million records loaded, the data table and index together ran about 28GB. (This is all on my home PC.) Because I knew the files would exceed 4GB, I set the table up for the "big time" with an ALTER TABLE command, using AVG_ROW_LENGTH = 512 and MAX_ROWS = 64x2^30. That provides ample room for growth, but maybe it is a cause of the slow insertion speeds?
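
For concreteness, the setup looked roughly like this. It's a sketch from memory, written the way I'd issue it from Python with the MySQLdb driver; the connection details are placeholders and "table" stands in for the real table name:

# sketch: the "big table" options, 64 x 2^30 rows of headroom
import MySQLdb                       # assuming the MySQLdb DB-API driver

conn = MySQLdb.connect(host="localhost", user="john", passwd="xxx", db="mydb")
cur = conn.cursor()
cur.execute("ALTER TABLE `table` AVG_ROW_LENGTH = 512 MAX_ROWS = %d" % (64 * 2 ** 30))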

To populate the database I wrote a python script that reads each line of the input file and issues this SQL command:

INSERT IGNORE INTO table (name, num, val) VALUES (a,b,c)
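
In outline the script does something like the following (a simplified sketch, again assuming the MySQLdb driver, with a made-up file name and field layout):

# simplified sketch of the loader: one INSERT IGNORE per input line
import MySQLdb                       # assuming the MySQLdb DB-API driver

conn = MySQLdb.connect(host="localhost", user="john", passwd="xxx", db="mydb")
cur = conn.cursor()

for line in open("batch_001.txt"):               # made-up file name
    fields = line.rstrip("\n").split("|")        # the raw files are '|' separated
    name, num, val = fields[0], fields[1], fields[2]
    cur.execute("INSERT IGNORE INTO `table` (name, num, val) VALUES (%s, %s, %s)",
                (name, num, val))

conn.commit()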

And the corresponding table definition is as follows. (I've simplified this example. There are other columns defined that are not updated in this processing stage.)

CREATE TABLE table (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(120) NOT NULL,
    num  SMALLINT NOT NULL,
    val  TINYINT UNSIGNED NOT NULL,
    time TIMESTAMP NOT NULL,
    INDEX(name), INDEX(num), UNIQUE(name)
)

I'm indexing two columns in addition to the primary key, which is defined as a BIGINT. The name string averages 80 characters and is constrained to be unique. In contrast, the num column is densely packed -- its values range from 1-100 over 130 million rows.

Given all this, where do you think the bottleneck lies? Clearly the B-tree indexes are getting large, with many disk hits required to find a record. But more specifically, is there an alternative layout that could alleviate the problem? For example, not declaring UNIQUE(name) and instead placing responsibility for that guarantee outside of MySQL. Or does creating tables capable of growing past 4GB carry an inherent penalty? Not having much experience with the MySQL product, it's hard for me to finger the likely suspect.

As another idea, would it be advisable to insert all the data into an un-indexed table (except for the primary key) and then perform one large indexing operation after all the data is loaded?
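
Concretely, I imagine that would mean creating the table with only the primary key, loading everything, and then finishing with one big statement along these lines (an untested sketch):

# untested sketch: build the secondary indexes in one pass after the bulk load
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="john", passwd="xxx", db="mydb")
cur = conn.cursor()
# note: ADD UNIQUE will fail if any duplicate names slipped in during the load
cur.execute("ALTER TABLE `table` ADD UNIQUE (name), ADD INDEX (num)")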

Any suggestions from the voice of experience?

john


 
I'd like to know what your home PC is; it must be pretty darn fast. MySQL themselves claim to be using tables with 50 million records, and they know of others with 60 thousand tables and 5 billion rows. You've really piqued my interest in what you are storing, too.

As far as your speed problems go, with a solution that huge you are going to want to work with MySQL themselves. They know how to do tons of performance tweaking. Your problem is far beyond the comparatively tiny tables I work with.

Note: You'll want to look out for the maximum file size on the operating system you are using.
 
28GB!?! What operating system are you using? Are you using MySQL's RAID table method to beat the 4GB OS file constraint, or are you running a 64-bit OS? What hardware? What MySQL table type?

Honestly, these times you are showing me are no surprise at all, considering the size. You are pushing MySQL to its limits. I have no experience dealing with that amount of data in MySQL, and I doubt anyone here has, but here are some things to think about:

1. Remember that an index slows down inserts in favor of fast SELECTs. If I were in your position, I would attempt to do all the inserts first and index the columns later. (Are your "name" column values already guaranteed to be unique, or do you need the INSERTs to enforce that?) It would take a long time to index the columns once the data is in there, but I still think it would be quicker than inserting with indices, which (I think) update the indexes after every single insert as it happens.

Or do you need constant interaction with the database, retrieving and inputting new data? In that case, your indices and constraints should always be present.

2. I would not recommend leaving data integrity up to the external code. Your UNIQUE constraint should be put into the database, whether after population or before, but definitely before trying to do something meaningful with the data.

3. You are doing an explicit INSERT for every row, rather than a much faster import operation. What do the text files themselves look like? Do they need to be processed as an INSERT for every line? I recommend parsing your data into some sort of comma-separated or tab-separated format and then running mysqlimport. This should be much faster.
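
mysqlimport is essentially a command-line front end for the LOAD DATA INFILE statement, so the same bulk path can also be driven from a script. A rough sketch, with a made-up file path and assuming a MySQLdb-style driver:

# rough sketch: bulk load via LOAD DATA INFILE, which is what mysqlimport runs
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="john", passwd="xxx", db="mydb")
cur = conn.cursor()
# a server-side path is fine here, since the server runs on the same machine
cur.execute("LOAD DATA INFILE 'C:/data/table.txt' IGNORE INTO TABLE `table` "
            "FIELDS TERMINATED BY '\\t' (name, num, val)")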

Also, have you thought about using PostgreSQL? It has a few advantages for very large databases; one is that there are fewer constraints on table size, row size, etc. While I think MySQL is faster for small/medium-size databases, I don't think it scales as well as PostgreSQL.

-------------------------------------------

"Now, this might cause some discomfort..."
 
> What operating system are you using? Are you using MySQL's RAID table method to beat the 4GB OS file constraint, or are you running a 64-bit OS? What hardware? What MySQL table type?

Somehow I forgot to mention the background specs, though I shouldn't have. I'm running version 3.23 on Win2000. The computer is brand new, less than a month old: a dual Athlon MP2000+ with 2GB of main memory and a 360GB ATA RAID array made up of four 120GB drives in a RAID 5 configuration. Each drive is connected to the controller on a separate channel, and the controller carries 64MB of onboard cache. The dual-headed ATI video card is pretty nice too. Hot machine, I bet you're thinking, but I'm not actually "using" it in the normal sense. It's running around the clock in preparation for a competition. I thought I'd have plenty of time to prepare my database, but now I'm a tad concerned, shall we say.

Right. I hit the 4GB limit in my first attempt. Then I read up and dug deeper into the ALTER TABLE operation to circumvent the OS limits. That said, I didn't even notice the RAID_TYPE option on CREATE TABLE until you mentioned it. It no longer appears necessary for getting past the 4GB limit, but it does sound like it can increase IO throughput. I'll put this on my list of things to try.
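
From what I can tell the option goes on the table definition itself, something like the sketch below. This is untested; if I'm reading the manual right the server binary has to be built with RAID support, and the chunk values here are guesses.

# untested sketch: the old MyISAM RAID table options, which stripe the data file
# across several chunk files; the RAID_CHUNKS / RAID_CHUNKSIZE values are guesses
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="john", passwd="xxx", db="mydb")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE `table` (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(120) NOT NULL,
        num  SMALLINT NOT NULL,
        val  TINYINT UNSIGNED NOT NULL,
        time TIMESTAMP NOT NULL,
        INDEX(name), INDEX(num), UNIQUE(name)
    ) RAID_TYPE=STRIPED RAID_CHUNKS=4 RAID_CHUNKSIZE=256
""")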

> If I were in your position, I would attempt to do all inserts first, and index the columns later.

I'm thinking along similar lines. At this stage I need to upload large amounts of data into 4 tables - as a giant batch operation - then link the things together. Only afterwards do I need them indexed in order to support fast queries and subsequent incremental updates.

The worrisome part is that while I can monitor the insertion progress, once the massive indexing effort starts I have no way of estimating how long it will take (a day, a week?) because MySQL doesn't display completion percentages.

> 3. You are doing an explicit INSERT for every row, rather than a much faster import operation. What do the text files themselves look like? Do they need to be processed as an INSERT for every line? I recommend parsing your data into some sort of comma-separated or tab-separated format and then running mysqlimport. This should be much faster.

I looked at the mysqlimport program initially, but got stopped by small details. One is that the text files contain extra fields that I didn't want to populate. Another is that the field separator is a vertical bar ('|') instead of a comma. Since preparing the set of files takes about a week, modifying them is not a minor undertaking. Let me recheck the command-line options for mysqlimport to see if I can configure it to do what I want.
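
If mysqlimport can't be coaxed into it directly, a quick preprocessing pass over each file would get the data into a shape it likes. A rough sketch (the input field positions are made up):

# rough sketch: keep only the wanted fields and re-delimit with tabs
# (input field positions are made up; the output name matches the table name,
#  since mysqlimport loads each file into the table its name matches)
out = open("table.txt", "w")
for line in open("batch_001.txt"):
    fields = line.rstrip("\n").split("|")
    name, num, val = fields[0], fields[3], fields[5]   # skip the unused fields
    out.write("%s\t%s\t%s\n" % (name, num, val))
out.close()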

And yes, I have considered PostgreSQL but haven't set it up for evaluation yet. I'm giving MySQL a shot first. At work we tend to run either Oracle or Sybase -- which of course are beyond my personal budget.

> As far as your speed problems, with a solution that huge, you are going to want to work with MySQL themselves. They know how to do tons of performance tweaking.

No doubt, but big tables translate to an "advanced customer support contract" with a $2500 minimum entry fee.

> 28GB!?!

Indeed. And by the end of the year, 10 times that amount. I'm a believer in Big Science in the Basement (TM)!

Thanks for your prompt replies.

john
 
Hi John,

Nice machine!!

I clicked on the link for the competition, but the site was down. Some sort of scientific computation contest? I love the idea of science in the basement too ;-).

The mysqlimport program is very easily configurable to use any column delimiter, text delimiter, and record delimiter you want.
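
For example, to feed it your '|'-separated files directly, something along these lines should be close (a sketch only; I haven't run it against your data, and the user/database names are placeholders):

# sketch: calling mysqlimport with a custom field delimiter
import subprocess

subprocess.call([
    "mysqlimport", "--local", "--ignore",     # --ignore skips duplicate keys
    "--fields-terminated-by=|",
    "--user=john", "--password=xxx",
    "mydb", "table.txt",                      # loads the file into table `table`
])
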
If you are using Windows 2000, then you are probably better off NOT using PostgreSQL, because it is really built more for Unix, and just doesn't run as fast in Windows.

If I were you, though, I would be running FreeBSD or Slackware Linux on that super box of yours, and it would probably provide twice the performance for your database.

One of my database development systems is a junky old AMD K6 350, 128 MB RAM, 6 GB hard drive, etc... But with FreeBSD, I run Apache, PHP, Perl, Tomcat(Java servlets), PostgreSQL, MySQL, NAT firewall services, etc... and the system is quick enough that it has served me fine for prototyping all kinds of databases and web applications.

Please tell us what the competition is about. I am intrigued. And good luck; hope you win.

-------------------------------------------

"Now, this might cause some discomfort..."
 