
Data Structure and Memory


perln00b (Programmer), Oct 4, 2005

Thank you, Trojan and Kevin, for giving me a good lesson on comparison using hashes. However, I have some concerns about memory and hashes. I want to use the same structure shown below to load all my data into hashes, so that I can compare the incoming data before I insert or update it.
Code:
$statement = "select studentID, studentName, courseID from student_table";
if ($db -> Sql($statement))
{
 # ..doing error handling
}
while ($db -> FetchRow())
{
    @fields = $db -> Data();
    $tempids{$fields[1]}{$fields[0]}{$fields[1]} = 1;
    $rev_tempids{$fields[1]}{$fields[1]}{$fields[0]} = 1;
    $tempidstrings{$fields[1]}{$fields[0]} .= "$fields[1],";
}

I have 15,000 rows of incoming data to compare against 1,000,000 rows of existing data, which are loaded into memory the same way for the comparison. I have 8 GB of memory on my box. How come I get "Out of Memory!" while running the code? Does anyone know the best way to handle this huge amount of data comparison in Perl? Or which language is the best choice for this case?

Thanks,
Lucas
 
The language is not the problem.

I don't really understand your code, why your data structures are so deep, or even what the names mean (temp id this, temp id that), but if you explain what you are trying to do then maybe I (or others) can help.

I noticed that you are concatenating strings here and if there are massive numbers of fields then that might be a bad idea.

You might find that Kevin can be more help here since he may well understand your code better than I.


Trojan.
 
Maybe the array should be declared lexically (with my) on each pass through the loop:

Code:
while ($db -> FetchRow())
{
    my @fields = $db -> Data();

That might save some memory, but 8 GB of RAM seems like quite a lot for most applications anyway.
 
I think we might need to see more of your code to understand what is going on here.
I have written Perl code that has processed tens of millions of records per run with no trouble at all, so I'm sure we can find a solution.


Trojan.
 
Can you use an external tool to check your script's memory usage and see how much it's actually taking? You could probably do it from Perl with the GTop module, if that's available.

But in reality, most problems can be solved with much less memory, so some more detail on what you're doing could shave an order of magnitude off.
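
For what it's worth, a minimal sketch of that idea, assuming libgtop and the GTop module are actually installed on your box (method names as per its docs):
Code:
use GTop ();

my $gtop = GTop->new;

# Print this process's memory usage (in MB) at interesting points,
# e.g. before and after loading each big hash.
sub report_mem {
    my ($label) = @_;
    my $mem = $gtop->proc_mem($$);
    printf "%s: size=%dMB resident=%dMB\n",
           $label, $mem->size / 2**20, $mem->resident / 2**20;
}

report_mem('before load');
# ... load the hashes here ...
report_mem('after load');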

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
We have about 8 million rows of data loaded into memory to do the comparisons.
Code:
                                            (3 million rows) 
                                              |--------|
|----------|        |------------|  load to   |database|
| incoming |load to |    Data    |<-----------|    1   | 
|   data   |------->|Comparisons |   memory   |--------|
|----------| memory |(Using Perl)|<-----------|--------|
(14k rows)          |------------|            |database|
                                              |    2   |
                                              |--------|   
                                            (5 million rows)
As the diagram above shows, the total amount of data loaded into memory is 8-9 million rows at the same time. The comparison strategy is exactly the same as what Trojan said: using hashes (the fastest way). And the Perl scripts are "smart" enough to update/insert data into two different databases. I have no idea what causes the memory leak. Any clues?

Thanks,
Lucas
 
A hefty design change, but do your database(s) support bulk copy operations and stored procedures? The fastest way would be to bulk insert your 14k incoming records into a temp table and then use a stored procedure to process the data. It moves the heavy lifting out of your script, and the database handles the memory management. This assumes your processing can feasibly be written in a stored-procedure-compatible language for your db. If you were using PostgreSQL or EnterpriseDB, you'd be all set already.
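
Just as a rough sketch of what I mean with plain DBI (the DSN, table, column, and procedure names here are all placeholders):
Code:
use DBI;

# Placeholder connection details -- adjust to your environment.
my $dbh = DBI->connect('dbi:YourDriver:yourdb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 0 });

my $ins = $dbh->prepare(
    'INSERT INTO incoming_tmp (studentID, studentName, courseID)
     VALUES (?, ?, ?)'
);

# @incoming holds the ~14k incoming rows as [id, name, course] arrayrefs.
$ins->execute(@$_) for @incoming;
$dbh->commit;

# From here, a stored procedure (or a handful of set-based queries) can
# do the comparisons inside the database instead of in Perl.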

If you keep your current system, just be careful whenever you save a reference to an anonymous structure that it drops out of scope or is overwritten once it's no longer needed. That's about the only kind of "memory leak" I think you can create in Perl; Perl's pretty good about everything else. Be sure you're not duplicating any more data than you have to, and try to simplify the structure of the many millions of records. The code you posted above seems to store a lot of duplicated data.
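
The classic way to really leak in Perl is a reference cycle that never gets broken; a tiny illustration using the core Scalar::Util module:
Code:
use Scalar::Util qw(weaken);

{
    my $node = { name => 'a' };
    $node->{self} = $node;    # circular reference: refcount never reaches zero
    weaken($node->{self});    # break the cycle so the memory can be reclaimed
}   # $node drops out of scope here and is freed

# The earlier "my @fields" suggestion is in the same spirit: keep working
# data lexically scoped so it can be released as soon as possible.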

Just trying to think the process through.

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
Quote:
The code you posted above seems to store a lot of duplicated data.

OMG, Andrew, you are great, man!!!
I just wonder how you predicted the duplicated data without looking at the real code/database. For real, we HAVE a lot of duplicated data, and it's a pain in the butt to get rid of those dups. Using stored procedures is also a very good idea for this case. However, the logic in the Perl scripts is very complicated; I just wonder whether a stored procedure can handle it.

Thank you so much,
Lucas
 
I wonder whether you actually need to read all the data into Perl at once.
As Andrew suggests, you could load the data into a temporary table and either perform a join to select the data you want, or run a select against each table and then process it from there, record by record. That way you only have one record from each table in memory at a time, and you get to keep your Perl. Either approach would be acceptable, so I guess it comes down to what you consider most appropriate.
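
A sketch of the record-by-record idea with plain DBI (assuming a handle in $dbh and reusing the column names from your query above):
Code:
# Stream the big table one row at a time instead of loading it into hashes.
my $sth = $dbh->prepare(
    'SELECT studentID, studentName, courseID FROM student_table'
);
$sth->execute;

while (my ($id, $name, $course) = $sth->fetchrow_array) {
    # Compare this single existing row against the (small) 14k-row
    # incoming data, then insert/update as needed. Only this one
    # database row is held in memory at any time.
}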


Trojan.
 
Well, sarcasm aside, how are you using %rev_tempids that you can't use %tempidstrings? Unless you're loading data into those hashes from somewhere else in a different fashion, I don't see how you wouldn't be able to use %tempidstrings for both. Does $rev_tempids{key1} ever include anything other than $rev_tempids{key1}{key1}? Non-null strings will always test as true, I assume that's all you were using the 1 for.

Do you need the second $fields[1] key on %tempids? If you populate it by "$tempids{$fields[1]}{$fields[0]}{$fields[1]}" then you know that key's going to exist because you only got to that hashref because you used the same key earlier. I've got to be missing something yet.

Since it looks like most of these are various "if exists" kinds of tests, that's exactly the sort of thing a couple of left/right joins to a temp table are designed for. Let your database handle the data comparisons; it's much better suited for it. Just keep the logic/decision-making in code.
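
For example, the "does this incoming row already exist?" question can be answered for all 14k rows in a single query (a sketch, using the same illustrative incoming_tmp table as above):
Code:
# Incoming rows with no match in student_table, i.e. the ones to INSERT.
my $to_insert = $dbh->selectall_arrayref(q{
    SELECT i.studentID, i.studentName, i.courseID
      FROM incoming_tmp i
      LEFT JOIN student_table s
        ON  s.studentID = i.studentID
       AND  s.courseID  = i.courseID
     WHERE s.studentID IS NULL
});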

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 

Andrew, I made a mistake in my post. The source code should look like this.
Code:
$statement = "select studentID, studentName, courseID from student_table";
if ($db -> Sql($statement))
{
 # ..doing error handling
}
while ($db -> FetchRow())
{
    @fields = $db -> Data();
    $tempids{$fields[1]}{$fields[0]}{$fields[2]} = 1;
    $rev_tempids{$fields[1]}{$fields[2]}{$fields[0]} = 1;
    $tempidstrings{$fields[1]}{$fields[0]} .= "$fields[2],";
}

Do you think the application will still create duplicated data?

Thanks,
Lucas
 
Ah yes, I knew I was missing something. To be honest, I didn't notice the query up above; that should have clued me in.

No, that removed most of the duplicate data. Depending on how you're using it (frequency, insert-order-sensitive), you may want to consider generating the tempidstrings when you need them via something like:
Code:
join ',', keys %{$tempids{$fields[1]}{$fields[0]}}
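
Wrapped up as a helper, just to show the shape of it (note that sorting gives a deterministic order, not the original insertion order):
Code:
# Build the comma-joined course list only when it's needed, instead of
# keeping a second copy of the data in %tempidstrings.
sub course_list {
    my ($name, $id) = @_;
    return join ',', sort keys %{ $tempids{$name}{$id} || {} };
}
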
Still, from looking at the structure, you should strive to do most of your data comparison work in joined queries rather than Perl hashes; it's much more efficient. What databases are you working with?

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 
Quote:
What databases are you working with?

We use Oracle 9i. You suggested that I create a temp table to hold all the incoming data; locally, I can use a stored procedure to loop through all the data. I have some concerns about Oracle's performance, though. Since that's 7-8 million existing rows against 14-15k incoming rows, there would be a huge number of comparisons. I don't know how long it will take.
Any clues?

Thanks,
Lucas
 
I assume that you would start with the temp table and left join to each of the databases, ending up with 14-15k rows of joined data? In such a case, I'm sure you'd be better off. Yes, that's a lot of comparisons, but so long as you have indexes on the columns you're joining, the database can do that awfully quickly. If you can cut down the number of results returned to Perl, I think you'll end up saving a lot of time and memory when running the process.
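
If those columns aren't already indexed, that's a one-time piece of DDL (index and table names here are illustrative; check with your DBA before adding indexes to production tables):
Code:
# Composite index on the columns used in the join.
$dbh->do(
    'CREATE INDEX student_course_idx ON student_table (studentID, courseID)'
);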

As I'm still finding out, the grand SQL mindset is not to "loop through all the data" so much as it is to break data up into sets, join tables and structure queries to let the database do the work for you. I've often been surprised at what can be accomplished.

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo
 