How put 900,000,000 rows of data into mysql fast!!!


wecc (Programmer, US) - Aug 3, 2009
I need to get data from text files and put them into a mysql table. The problem is there are about 900,000,000 million rows of data and I want to do it as fast as possible. Here is a script that works but it just takes a very long time to load the data. Is it possible to do it any faster? Thank you very much!!

#!/usr/local/bin/perl
#opening narr files

use warnings;
use strict;
use Math::Trig;
use DBI;

my ($id, $variance_speed, $variance_angle, $stnd_dev, @x, @y, $date, $hour, $dbh, $sth, $x, $y, $lat, $longi, $theta, $rho, $R, $pi, $rads, $degs, $phizero, $phi1, $phi2, $lambda0, $N, $F, $rhozero, $i, @files1, $file1, $filename, @uwind, $uwind, @vwind, $vwind, $resultant, $angle, $uw, $vw);

$R=6370.997;
$pi = (atan2 1,1) * 4;
$rads = $pi / 180;
$degs = 180 / $pi;
$phizero = 39 * $rads;
$phi1 = 33 * $rads;
$phi2 = 45 * $rads;
$lambda0 = -96;
$x = -5632.668;
$y = -4612.566;
$id = 0;

$N = log(cos($phi1) / cos($phi2)) / log(tan($pi / 4 + (($phi2) / 2)) / tan($pi / 4 + (($phi1)/ 2)));
$F = cos($phi1) * (tan($pi/4 + $phi1/2))**($N) / $N;
$rhozero = ($R * $F) / ((tan($pi/4 + $phizero/2))**$N);

###Open Directory to where the NARR text files are located
opendir (DIR, '1980\\1') or die "$!";

###Grabs all the files names that have "10m_u" and end in ".txt" and place them into @files1

$dbh=DBI->connect('dbi:mysql:narr_project','usrname','passwrd') || die "Error opening database: $DBI::errstr\n";

@files1 = grep {/30m_u.txt*?/} readdir DIR;
foreach my $filename (@files1) {
$i = 1;
print "Uploading " . $filename . " into mysql 10m table\n";
print "---------------------------------------------------\n";

#Takes the filename from $filename and skips the first 11 characters
#then grabs the next 8 characters and puts them into $1. next
#space in filename is an underscore and then grabs the next
#four characters and puts them into $2 and ignores the rest

if ($filename =~ /.{11}(\d\d\d\d\d\d\d\d)_(\d\d\d\d)/) {
$date = $1;
$hour = $2;
}

#$filename now equals Directory to open one of text files (This
#will change to get the other text files one at a time as the foreach
#loop (above) cycles through. All below is inside that loop and
#repeats as the foreach loop cycles through each time

$filename = "1980\\1\\" . $filename;

#Opens the file with the data. First will be the U wind data

open(FILE, $filename) or die "$!";
@uwind = <FILE>;

$filename =~ s/u.txt/v.txt/;

open(FILE, $filename) or die "$!";
@vwind = <FILE>;
$sth=$dbh->prepare
("INSERT INTO 10m (ID, DATE, TIME, LATITUDE, LONGITUDE, X, Y, U, V, RESULTANT, VARIANCE_RESULTANT, ANGLE, VARIANCE_ANGLE)
VALUES ($id, '$date', '$hour', ?, ?, ?, ?, $uwind[$i], $vwind[$i], ?, ?, ?, ?)");
for ($y=-4612.566; $y<=4300; $y=$y+32.463) {
for ($x=-5632.668; $x<=5600; $x=$x+32.463) {

$id= $id + 1;
$i = $i + 1;



if ($uwind[$i] < 100 && $vwind[$i] < 100) {

$rho = sqrt($x*$x + ($rhozero - $y)*($rhozero - $y));
$theta = atan2($x,($rhozero - $y));
$lat = (2*atan(($R*$F/$rho)**(1/$N)) - $pi/2) * $degs;
$longi = (($theta/$N) * $degs) + $lambda0;

$resultant = sqrt($uwind[$i]**2 + $vwind[$i]**2);
$angle = (atan2($vwind[$i],$uwind[$i])) * $degs;
if ($angle<0) {
$angle=$angle + 360;
}

$variance_speed = 0;
$variance_angle = 0;

$sth->execute ($lat, $longi, $x, $y, $resultant, $variance_speed, $angle, $variance_angle) || die "Couldn't insert record : $DBI::errstr";
}
}
}
}
close(FILE);
close DIR;
 
Sorry, this one below is the working script; the one above does not work.

#!/usr/local/bin/perl
#opening narr files

use warnings;
use strict;
use Math::Trig;
use DBI;

my ($id, $variance_speed, $variance_angle, $stnd_dev, @x, @y, $date, $hour, $dbh, $sth, $x, $y, $lat, $longi, $theta, $rho, $R, $pi, $rads, $degs, $phizero, $phi1, $phi2, $lambda0, $N, $F, $rhozero, $i, @files1, $file1, $filename, @uwind, $uwind, @vwind, $vwind, $resultant, $angle, $uw, $vw);

$R=6370.997;
$pi = (atan2 1,1) * 4;
$rads = $pi / 180;
$degs = 180 / $pi;
$phizero = 39 * $rads;
$phi1 = 33 * $rads;
$phi2 = 45 * $rads;
$lambda0 = -96;
$x = -5632.668;
$y = -4612.566;
$id = 0;

$N = log(cos($phi1) / cos($phi2)) / log(tan($pi / 4 + (($phi2) / 2)) / tan($pi / 4 + (($phi1)/ 2)));
$F = cos($phi1) * (tan($pi/4 + $phi1/2))**($N) / $N;
$rhozero = ($R * $F) / ((tan($pi/4 + $phizero/2))**$N);

###Open Directory to where the NARR text files are located
opendir (DIR, '1980\\1') or die "$!";

###Grabs all the files names that have "10m_u" and end in ".txt" and place them into @files1

$dbh=DBI->connect('dbi:mysql:narr_project','root','indent0') || die "Error opening database: $DBI::errstr\n";

@files1 = grep {/30m_u.txt*?/} readdir DIR;
foreach my $filename (@files1) {
$i = 1;
print "Uploading " . $filename . " into mysql 10m table\n";
print "---------------------------------------------------\n";

#Takes the filename from $filename and skips the first 11 characters
#then grabs the next 8 characters and puts them into $1. next
#space in filename is an underscore and then grabs the next
#four characters and puts them into $2 and ignores the rest

if ($filename =~ /.{11}(\d\d\d\d\d\d\d\d)_(\d\d\d\d)/) {
$date = $1;
$hour = $2;
}

#$filename now equals Directory to open one of text files (This
#will change to get the other text files one at a time as the foreach
#loop (above) cycles through. All below is inside that loop and
#repeats as the foreach loop cycles through each time

$filename = "1980\\1\\" . $filename;

#Opens the file with the data. First will be the U wind data

open(FILE, $filename) or die "$!";
@uwind = <FILE>;

$filename =~ s/u.txt/v.txt/;

open(FILE, $filename) or die "$!";
@vwind = <FILE>;

for ($y=-4612.566; $y<=4300; $y=$y+32.463) {
for ($x=-5632.668; $x<=5600; $x=$x+32.463) {

$id= $id + 1;
$i = $i + 1;

$sth=$dbh->prepare
("INSERT INTO 10m (ID, DATE, TIME, LATITUDE, LONGITUDE, X, Y, U, V, RESULTANT, VARIANCE_RESULTANT, ANGLE, VARIANCE_ANGLE)
VALUES ($id, '$date', '$hour', ?, ?, ?, ?, $uwind[$i], $vwind[$i], ?, ?, ?, ?)");

if ($uwind[$i] < 100 && $vwind[$i] < 100) {

$rho = sqrt($x*$x + ($rhozero - $y)*($rhozero - $y));
$theta = atan2($x,($rhozero - $y));
$lat = (2*atan(($R*$F/$rho)**(1/$N)) - $pi/2) * $degs;
$longi = (($theta/$N) * $degs) + $lambda0;

$resultant = sqrt($uwind[$i]**2 + $vwind[$i]**2);
$angle = (atan2($vwind[$i],$uwind[$i])) * $degs;
if ($angle<0) {
$angle=$angle + 360;
}

$variance_speed = 0;
$variance_angle = 0;

$sth->execute ($lat, $longi, $x, $y, $resultant, $variance_speed, $angle, $variance_angle) || die "Couldn't insert record : $DBI::errstr";
}
}
}
}
close(FILE);
close DIR;
 
You've posted 4 threads about the same script now, do you think you can keep it to one thread, please?

Also, try posting your code between [code] ... [/code] tags please to make it easier to read.

Annihilannic.
 
How put 900,000,000 rows of data into mysql fast!!!
The problem is there are about 900,000,000 million rows
So which is it? 900 million or 900 (US-style) trillion?

Even if it's the smaller number, and you're able to insert 1,000 rows a second, it'll still take over 10 days (900,000,000 / 1,000 = 900,000 seconds, about 10.4 days). If it's the larger number, it'll take nearly 30,000 years!

Plus, I'm not sure how well mySQL will cope with a table with 900 million/trillion rows. It doesn't always scale well.

Maybe you need to re-think how to do whatever it is you're trying to do.


-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
How many lines are in each of these files?

Code:
                open(FILE, $filename) or die "$!";
                @uwind = <FILE>;

What you're doing here is slurping all the lines of the file into @uwind at once. If your file has 100 characters on each line and 100,000 lines, this means 10,000,000 bytes are loaded into memory with the "@uwind = <FILE>" line - about 10 MB. This may not be very efficient either and will slow down your code a bit (or at least slow down your system in general -- depends how much RAM is available).

If you slurp a file like this, you have to consider that the size of the file you're slurping is now going to be entirely in your system's RAM (try slurping the data out of a DVD-quality AVI that is about 1 GB in size... unless you have > 2GB in your system, your system would prolly crash or at least become so unresponsive that it'd require a hard reboot).
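
For what it's worth, a rough sketch of reading the two files line by line instead of slurping them might look like this (the filenames here are placeholders, not the paths the script actually builds):
Code:
# Placeholder filenames -- substitute the paths the script builds
my ($u_file, $v_file) = ('narr_10m_u.txt', 'narr_10m_v.txt');

open(my $ufh, '<', $u_file) or die "$!";
open(my $vfh, '<', $v_file) or die "$!";

# Walk both files in parallel, one line at a time, so only the
# current pair of lines is ever held in memory.
while (defined(my $u = <$ufh>)) {
    my $v = <$vfh>;
    last unless defined $v;
    chomp($u, $v);
    # ... compute lat/long, resultant, angle and do the INSERT here ...
}

close $ufh;
close $vfh;
Memory use stays flat that way, no matter how big the files get.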

Just food for thought.

Cuvou.com | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
Two things you could do to improve things slightly:
[ol]
[li]Move the [tt]$dbh->prepare[/tt] statement out of the loop, with placeholders instead of [tt]$uwind[$i], $vwind[$i][/tt] (see the sketch after this list)[/li]
[li]Remove all the calculations of derived columns. All you need to insert are id, date, time, x, y, u and v.

You can have a separate table of x,y -> lat,long conversions, or have an extra loop where you work out the lat/long of a given x/y and update all matching rows in your table.

Resultant and angle can be calculated from u,v when you need them (which is what I'd do), or you could do an update statement to do the whole table once you're out of the loop.[/li]
[/ol]
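Something along these lines (a sketch only, reusing the table and column names from the posted script, and assuming the derived columns are nullable or get filled in by a later UPDATE):
Code:
# Prepare once, before the loops, with placeholders for every value;
# only the base columns are inserted, derived ones can be computed later
my $sth = $dbh->prepare(
    "INSERT INTO 10m (ID, DATE, TIME, X, Y, U, V)
     VALUES (?, ?, ?, ?, ?, ?, ?)"
) or die $dbh->errstr;

# ... then inside the x/y loops, just execute it ...
$sth->execute($id, $date, $hour, $x, $y, $uwind[$i], $vwind[$i])
    or die "Couldn't insert record : $DBI::errstr";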
I still think, though, that the amount of data you're talking about is gonna overwhelm a mySQL database; and that snipping a microsecond off here or there in the loop isn't gonna bring the runtime down to manageable proportions.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
I'd still go with the idea of writing this data to a flat file and then bulk loading it although I have to agree with ChrisHunt here that MySQL might not be the best solution (even if using an RDBMS is a must).
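
For example, a rough sketch of that approach (the /tmp/narr_10m.txt path is just a placeholder, and LOAD DATA INFILE needs the FILE privilege plus a path the MySQL server can read):
Code:
# Write the rows to a tab-delimited flat file instead of INSERTing them
open(my $out, '>', '/tmp/narr_10m.txt') or die "$!";

# ... inside the loops, in place of the execute ...
print $out join("\t", $id, $date, $hour, $x, $y, $uwind[$i], $vwind[$i]), "\n";

# ... after the loops, bulk load the whole file in one statement ...
close $out;
$dbh->do("LOAD DATA INFILE '/tmp/narr_10m.txt'
          INTO TABLE 10m (ID, DATE, TIME, X, Y, U, V)");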


Trojan.
 
So which is it? 900 million or 900 (US-style) trillion?

900,000,000

In the US(A) that's 900 million. Still a lot of whatever.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
In the title he says "900,000,000"; in the first post he says "900,000,000 million". So is it a very big number, or is it a very big number multiplied by a million?

OP seems to have wandered off, so we may never know.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
To start off, i didn't take a look at your script...these are just mysql-hints.


First, start a transaction and lock the table in question in exclusive mode.

Code:
SET autocommit = 0;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
LOCK TABLE `myTable`  WRITE;
-- or LOCK TABLE `myTable`  LOW_PRIORITY WRITE;

then, you load your data:
i would begin with creating "batches" of data. you could try
Code:
INSERT INTO `myTable`
(`a`, `b`, `c`) VALUES
(1,2,3),
(3,4,5),
(6,7,8),
(9,10,11);
maybe this reduces IPC overhead.
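
in Perl/DBI terms, a sketch of that batching idea could look something like this (column names borrowed from the posted script; the 1000-row batch size is just a guess to tune):
Code:
my $batch_size = 1000;
my @rows;

sub flush_rows {
    my ($dbh, $rows) = @_;
    return unless @$rows;
    # build one multi-row INSERT: (?,?,...),(?,?,...),...
    my $tuples = join(',', ('(?, ?, ?, ?, ?, ?, ?)') x @$rows);
    $dbh->do("INSERT INTO 10m (ID, DATE, TIME, X, Y, U, V) VALUES $tuples",
             undef, map { @$_ } @$rows);
    @$rows = ();
}

# inside the loops:
push @rows, [$id, $date, $hour, $x, $y, $uwind[$i], $vwind[$i]];
flush_rows($dbh, \@rows) if @rows >= $batch_size;

# after the loops, flush whatever is left:
flush_rows($dbh, \@rows);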

you could also try loading the data file as is or after processing using the
Code:
LOAD DATA INFILE
statement.

after you're done:
Code:
COMMIT;
SET autocommit = 1;

maybe this helps.
 
Really?
So you want the DB to parse thousands or even millions of SQL INSERT statements?



Trojan.
 
well, not exactly. this is what the script would do in the first place. for each record there is one insert-statement.
if he's taking the trouble of making it a prepared statement, he might as well insert multiple rows at a time.

personally, i would first try the LOAD DATA INFILE approach.


i ran some tests. with one insert per row i crammed 1000000 rows into a table with one index column in 90 seconds without locking and in 80 seconds with locking. i then grouped the data in blocks of 1000 records. so i had 1000 inserts each inserting 1000 records (second code-snippet in my last post). guess what, it takes 5 to 10 seconds. within a transaction.

so, start off by grouping the inserts. you could also try writing a c program to write a single data file containing one huge insert-statement. then just source that script into the db. should be quite fast. by the way, this is how mysqldump stores table data.

so in summary: group the records, so that you insert thousands of records with just one insert. (still, also try load data infile!)

cheers and happy tuning :)
michael
 
I have to agree with MikeLacey.

The times you have for a 1M dataset do not necessarily extrapolate linearly for 900M records.

It'd be interesting to hear what happens in the end here, what solutions are trialled and how long they take to run.



Trojan.
 
I'd be really interested to hear not just the insertion method used but the hardware, O/S, cpu, amount of ram, disk configuration.

When you start working with larger amounts of data these start to become significant. You're not, I presume, just going to insert this data into a table and walk away from it thinking how pretty it looks; you're probably going to search through it, probably quite a bit.

I've not looked at your data. Is it normalised? Does it need to be? Should it really all be in one table? How are you going to search through it? (what will be in the WHERE clause of an SQL statement?) and what indexes are you going to build that will make these searches complete in any useful timescales?
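
If it does end up in one big table, one common trick (just a sketch, with made-up column choices) is to load the data first and only then build the indexes you'll need for those WHERE clauses, since maintaining indexes during a 900M-row load is expensive:
Code:
# Hypothetical indexes -- pick columns that match your actual WHERE clauses
$dbh->do("ALTER TABLE 10m ADD INDEX idx_date_time (DATE, TIME)");
$dbh->do("ALTER TABLE 10m ADD INDEX idx_lat_long (LATITUDE, LONGITUDE)");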

Mike

 